Wayback Machine Site Harvesting

Posted: Sun, Feb 25, 2024


I love me a bit of web spelunking. Online archaeology. Internet dumpster-diving. Whatever term fits best. The act of trawling old and abandoned websites, whether they’re hanging on by a thread via Lycos' Angelfire or Tripod, they’ve been salvaged via one of the various GeoCities archival projects, or they’re just buried in the Wayback Machine. When I was an avid Tumblr user back in the day I used to harvest all kinds of graphics, interviews, and videos from bands, films, and TV shows that I loved. Not done it for a while now, but being grown up with a full-time job and hardly any free time will do that to you. The joys!

One part of this web history shit I love is rebuilding old websites that have been lost to time. Sure, you can click about in the Wayback Machine and wait 10 years for each page to load if it doesn’t time out, or flick through specific timestamps for a page cos an image is broken in the May 2001 timestamp but not the June 2001 one, and so on, but I have to admit that it does my fucking head in sometimes. It’s very satisfying to have it all harvested, all links and images working, the works. I host them here to highlight what beauties have been lost, and hopefully give them a new lease of life. I also enjoy reworking the site itself under the hood - replacing 16 nested tables with a div or two and a spot of CSS scratches a wonderful itch in my brain.

So, there’s a few ways to go about scraping files that I’ve tried, and then one which worked about as well as I think it’s ever going to. A TL;DR is at the bottom if you want to skip me ramblings.


manually

The first site I rebuilt was the official site for ABC’s ‘Two Guys and a Girl’, formerly ‘Two Guys, a Girl, and a Pizza Place’, a short-lived knock-off of Friends that ran from 1998 to 2001. It was a bit of a comfort show for me, but more importantly I really loved their website’s design.

I pulled all of their files from the Wayback Machine individually (right click -> Save Page As...), ripped out all the Wayback shite, then cleaned up all the markup, fixed the links, and rewrote all the hard-coded styling into way-more-manageable CSS. Manually saving pages is fine for really basic sites with a few pages, another example being the Chu Ishikawa site I did more recently. However, I have a much fatter project in the works, a site that sprawls out with tons of pages and a shit ton of barely-preserved files, and that manual shit just won’t cut it.

httrack

I remember using this once, yonks ago, to harvest the official The Matrix site, which wasn’t me most successful mission. There’s actually a great write-up on the Super User Stack Exchange site by Cecil Curry which I’d recommend taking a look at, where they use httrack with various flags to harvest a site. Their work builds off user mpy’s answer on the same page, also a really insightful read!
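For reference, the kind of invocation they’re on about looks roughly like this. It’s only a sketch from memory with a placeholder URL and filters, so check the httrack docs and that Super User answer rather than trusting my flags:

  httrack "https://web.archive.org/web/20010601000000/http://example.com/" \
    -O ./example-mirror \
    "+*web.archive.org/web/*/http://example.com/*" \
    -s0 -n -%P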

This… half-worked for me, but there were a few snags:

  • It flattened the entire file tree, appending -{x} to each index.html file it found
  • All links on every page led to the Wayback Machine, rather than being relative
  • It pulled 50+ pages (out of maybe a hundred), but didn’t make a strong attempt to grab images, so many HTML pages are blank
  • It found and created files but said files all had a size of 0 bytes, so it clearly shat itself somewhere
  • It wouldn’t check different timestamps of pages, so it would save a 404 page even when an earlier-timestamped capture was full of content
  • Cherry on top, all pages were stuffed with the Wayback Machine’s bullshit – all the scripts, markup wrangling, analytics – which takes an age to rip out

However, it’s important to note that their answer is from 2014, edited in 2015. I know the Internet Archive has, since at least 2019, introduced extensive rate limiting which can interfere if you’re sending lots of requests for files from somewhere other than your browser. I’m struggling to find solid information on exactly which limits were introduced when, but I can see various projects on GitHub suffering from related issues as early as 2017.

wayback machine downloader - my winner, for now

I thought about tinkering with the flags I used for httrack in case I could get it to work, but not before giving this thing a whirl, since it looks to be exactly what’s needed. You’ll need Ruby installed.
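Getting it going is dead simple, something like this (swap in whatever site you’re after for the example URL):

  gem install wayback_machine_downloader
  wayback_machine_downloader https://example.com

If I remember right, it dumps everything into a websites/ folder wherever you run it from.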

Out the box, I ran into rate limiting issues, as did many other users. That last link has some fixes, but they felt a bit disgusting to me. Any code where you’re throwing in a sleep(x) to fix a problem is suspicious. A much cleaner fix can be found in an open pull request here! I’m not sure if it’s spot on, like, as I’ll be honest I couldn’t be arsed to pull the source code and run any tests, but it looks far more promising. Apologies to all QA testers reading.

So we’re going from this:

  # connection 1
  retrieve index.html
  # close connection
  # connection 2
  retrieve about.html
  # close connection

to something more like this:

  # connection 1
  retrieve index.html
  retrieve about.html
  retrieve goatse.jpg
  ...
  # close connection

i.e. opening a single connection and fetching everything we can over it, rather than opening a fresh connection per file, which will very quickly get you a temp block.
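To put that in actual Ruby terms, here’s a rough sketch of the idea with plain Net::HTTP. To be clear, this is my own illustration rather than the code from the PR, and the paths are made up:

  require 'net/http'

  # a couple of made-up archived file paths
  files = [
    '/web/20010601000000/http://example.com/index.html',
    '/web/20010601000000/http://example.com/about.html'
  ]

  # one connection to web.archive.org, reused for every file we fetch,
  # rather than a fresh connection (and handshake) per request
  Net::HTTP.start('web.archive.org', 443, use_ssl: true) do |http|
    files.each do |path|
      response = http.get(path)
      File.binwrite(File.basename(path), response.body) if response.code == '200'
    end
  end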

I went and applied the changes directly to the .rb files, which, if you’re on some flavour of Linux, should be around here:

/var/lib/gems/X/gems/wayback_machine_downloader-X/lib

(Of course, replacing X with whatever version of Ruby/the project you have.)
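If it’s not there, gem which should point you at the right file:

  gem which wayback_machine_downloader
  # => /var/lib/gems/X/gems/wayback_machine_downloader-X/lib/wayback_machine_downloader.rb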

And then I gave it another go. It fetched the lot, and about 90% of the links were adjusted to relative ones!

Another major thing is the inclusion of id_ after the timestamp in each URL. See here for what I mean. This modifier means the original file, exactly as it was archived, is returned, not the bloated crap that IA normally delivers. So there’s a shit ton of work that doesn’t need cleaning up either!
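In other words, it’s the difference between these two (timestamp and site are placeholders):

  # wrapped in all the Wayback Machine chrome (toolbar, scripts, rewritten links):
  https://web.archive.org/web/20010601000000/http://example.com/index.html

  # the raw file exactly as it was archived:
  https://web.archive.org/web/20010601000000id_/http://example.com/index.html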

Of course, this is all only half the battle. I now have a folder full of clean, neatly indexed source files; I just have to rework every single page!


tl;dr

Have a site in the Wayback Machine you want to preserve, whether it’s for offline viewing or you want to rehost it?

Check out Wayback Machine Downloader on GitHub. Once you’ve installed the gem, but before you use it, apply the changes from pull request #280. You can just use your favourite text editor to do this.

The README for the downloader is very self-explanatory. I used it with no flags, but there’s a lot of options if you need them.
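If memory serves, you can narrow things down with flags along these lines, but double-check the README for the exact names:

  # e.g. only grab files whose URL matches a filter, within a rough date range
  wayback_machine_downloader https://example.com --only gallery --from 20010101 --to 20021231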

Happy scraping!