Library of Congress grapples with problem of making Twitter archive accessible

Working with Twitter, the Library of Congress has created an archive of approximately 170 billion tweets organized by date, says an LOC report released this month. Now, the technological challenge is how to make the archive accessible to researchers and policymakers in a comprehensive and useful way.

"It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data," states the LOC report. "Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task."

Although the Library has received about 400 inquiries from researchers around the world, it has not yet given them access to the Twitter archive. Executing a single search of just the fixed 2006-2010 archive on the Library's systems could take 24 hours, which "severely limits" the number of possible searches.

LOC could significantly reduce the search time with an extensive infrastructure of hundreds, if not thousands, of servers. However, the Library concluded that this is a cost-prohibitive and impractical solution. As a near-term fix, the Library is working to develop a basic level of access that can be implemented while archival access technologies catch up.

In April 2010, LOC and Twitter signed an agreement giving the Library all public tweets from the company's inception in 2006 through the date of the agreement. The Library is currently processing data from that original 2006-2010 archive and organizing the material chronologically into hourly files, a project slated for completion this month.
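As a rough illustration of what organizing tweets into hourly files can look like, the following minimal Python sketch groups timestamped records by the UTC hour in which they were posted. The report does not describe the Library's actual file layout or tooling, so the path pattern, the function names `hourly_key` and `bucket_by_hour`, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_key(created_at: datetime) -> str:
    """Return a hypothetical file path for the UTC hour containing the timestamp."""
    utc = created_at.astimezone(timezone.utc)
    # Assumed layout: one file per hour, nested under year/month/day.
    return utc.strftime("%Y/%m/%d/tweets-%H.jsonl")


def bucket_by_hour(tweets):
    """Group (timestamp, text) pairs by their hourly file path."""
    buckets = defaultdict(list)
    for created_at, text in tweets:
        buckets[hourly_key(created_at)].append(text)
    return dict(buckets)


# Made-up sample records, not real archive data.
sample = [
    (datetime(2010, 4, 14, 9, 5, tzinfo=timezone.utc), "first"),
    (datetime(2010, 4, 14, 9, 59, tzinfo=timezone.utc), "second"),
    (datetime(2010, 4, 14, 10, 0, tzinfo=timezone.utc), "third"),
]
buckets = bucket_by_hour(sample)
```

A layout like this makes it cheap to fetch a known time window, but a keyword search still has to scan every file, which is consistent with the long search times the report describes.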

Under the same agreement with the Library, Twitter provides all public tweets on an ongoing basis through Colorado-based Gnip, the delivery agent for moving data to LOC. According to the Library's report, the volume of tweets LOC receives each day has grown from 140 million in February 2011 to nearly half a billion tweets each day as of October 2012.

The report says senior Library officials recently met with Gnip senior management in Washington to explore the possibility of developing a research- and scholarship-focused interface to the archive using Gnip's existing historical Twitter product offerings.

For more:
- read the LOC report
