March 4, 2024

A visit to the physical Internet Archive

While in San Francisco for the AI Engineers Summit earlier this month, I took the opportunity to visit the Internet Archive: the actual physical archive in the Californian city of Richmond, about a twenty-minute drive from San Francisco.

I purchased a ticket to “go behind the scenes at the physical archive” on Wednesday, October 11, and arrived just before the 6 pm start time. I was glad I hadn’t arrived sooner, as the location of the physical archive was (unsurprisingly) a warehouse in an industrial area of Richmond. There didn’t seem to be anything else to do in the area.

I had asked the Uber driver to drop me off in a parking lot with an Internet Archive sign. But when I looked around, I couldn’t see any public entrance to the warehouse. A few other internet history nerds were standing around looking just as confused, so we awkwardly introduced ourselves and discussed whether we were in the right place. Finally, a couple of people at the end of the street, about 200 meters away, saw us and waved us over.

Internet Archive physical archive, Richmond, CA.

It turned out that a group of people had already settled inside the main building, drinking complimentary colas, beers or mineral water and eating snacks. The crowd was a mix of older people (perhaps from the generation that worked in Silicon Valley during the 1960s and 1970s) and younger geeks (I assume many were librarians or professional webheads; I’m an example of the latter).

When the tour began about half an hour later, thirty or forty people gathered in front of an enthusiastic man with thinning gray hair and a red shirt. Of course, it was Internet Archive founder Brewster Kahle. At first, I was surprised that he led the tour himself, but it soon became clear that Kahle lives and breathes the Internet Archive’s mission. He began by showing us the shipping containers filled with old books and other materials, while rattling off some facts (“Internet Archive is a nonprofit library; we started it 27 years ago, 1996”).

Brewster Kahle vs. (real, physical) containers.

Later in the tour, Kahle enthusiastically showed off the book-scanning machine, pointed out the stacks of boxes donated to the archive (full of books, videos, records, discs, cassettes and other media) and stood proudly to the side as the film archivists told us how they convert old home videos into high-resolution digital files. It was a fascinating look at the daily operations of the Internet Archive, whose staff is made up of friendly and probably liberal-minded Californians, including Brewster’s son, Caslon.

What the Internet Archive stores

The Internet Archive is perhaps best known for its Wayback Machine, which debuted in 2001 and has been archiving web pages since 1996. “We collect about a billion URLs every day, a surprisingly large number,” Kahle said during the tour. “And now there are two and a half billion URLs in the Wayback Machine collection: these old web pages. And it is consulted about six or seven thousand times per second.”

But the physical archive, as its informal name suggests, is a repository of physical media: books, catalogs, old computer disks, films, records, cassette tapes and much more. When new media arrives, Internet Archive staff first decide whether it duplicates something they already have, a process they call “deduplication.” If it is a duplicate, it is discarded or given away. Otherwise, the physical item is digitized and then stored. (As an aside, the Internet Archive says it only makes a digital copy of a book available if it owns the physical copy.)

A specially designed vintage film scanner at the Internet Archive.

“We’ve been digitizing books since the early 2000s,” Kahle said, “and we ended up building our own book scanners.” He added that the Internet Archive digitizes “around a million books a year” and has digitized around 7 or 8 million books in total (on its About page, the Internet Archive says it has “41 million books and texts,” so most of those must be texts other than books).

As for music, it is a medium that has historically come in multiple formats: LP, CD, cassette, MP3 and so on. Kahle was particularly enthusiastic about 78 RPM records, which he said existed from about 1900 to 1950. “[…] or 3 million of them,” he said, “[and] we have digitized around 450,000.”

Boxes of multimedia items, overseen by a cardboard cutout of Darth Vader.

“We basically tried to cover all types of media,” Kahle continued. “And what I have discovered is that the moment […] things have become obsolete, it’s happening faster and faster. […] Not only do you not have access to the same things; even if you have access, it is not presented to you in such a way that you will actually use it.”

Note: If you are interested in donating items to the Internet Archive, please see this web page for a list of the media types it currently accepts.

How the Internet Archive keeps going

Someone in the tour group asked Kahle how often the Internet Archive needs to buy new servers to store this constant stream of new media.

“Continuously,” he replied. “We buy a new pair of racks, because they always come in a pair, every two [or] three months. […] In a rack, you can now fit about five petabytes.”
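To get a feel for what that means per year, here is a rough back-of-the-envelope sketch in Python. It uses only the figures Kahle quoted (racks bought in pairs, every two to three months, about five petabytes per rack). The function name is my own, and I am assuming every new rack is filled to that quoted capacity, with no allowance for redundancy or replacement of old hardware, so treat the result as an upper-bound illustration rather than an official number.

```python
# Back-of-the-envelope estimate based only on the figures quoted on the tour.
# Assumption: each new rack holds the full ~5 PB Kahle mentioned.
PETABYTES_PER_RACK = 5
RACKS_PER_PURCHASE = 2  # "they always come in a pair"


def yearly_capacity_added(months_between_purchases: float) -> float:
    """Return the petabytes of raw capacity added per year (hypothetical helper)."""
    purchases_per_year = 12 / months_between_purchases
    return purchases_per_year * RACKS_PER_PURCHASE * PETABYTES_PER_RACK


low = yearly_capacity_added(3)   # a pair every three months
high = yearly_capacity_added(2)  # a pair every two months
print(f"Roughly {low:.0f} to {high:.0f} PB of new raw capacity per year")
```

In other words, on those assumptions the quoted purchase rate works out to somewhere between 40 and 60 petabytes of new raw capacity a year.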

Two previous generations of Internet Archive storage machines: on the left, the StorageTek 9710 from the 1990s, and on the right, the first-generation PetaBox (2004).

Of course, the Internet Archive has been in the news this year due to legal attacks from both the book publishing industry and the music industry (the latter in relation to the 78 RPM record project). Kahle made several critical comments about these legal challenges during the tour, and it was clear that they had taken a toll on the organization. “That’s still in court,” he sighed, referring to the book publishers’ lawsuit, “and it’s incredibly expensive.”

So how does the Internet Archive survive? Kahle said it runs primarily on donations, from 110,000 people giving about $5 each on average, as well as from “foundations that give us large amounts of money.” It also offers subscription services to libraries and other organizations.

“We also survive by, well, not spending much,” he added. “I mean, you notice that the servers don’t have air conditioning, right? If it’s hot, we just open the windows. So, it’s green. But it is also economical.”

Outside the physical Internet Archive in Richmond, CA. A fun night for an Internet history nerd!
