We have made our input files (the lists of sites we used in our measurements), the data from our live and archival measurement runs, and the code we used for our analysis available for re-use. Please let us know if you do anything with them!
Input files for our live runs in 2015 and 2016 are available here; they were generated from the top 500 websites on the Alexa top sites list.
For our archival input files, which are available here, we did the following (as described in Section 5 of our paper):
2003-2016: Alexa. For 2010-2016, we use Wayback Machine archives of Alexa’s top million sites list (a CSV file). For 2003-2009, we approximate the top 500 by scraping Alexa’s own historical API (when available) and archives of individual Alexa top 100 pages. Because of inconsistencies in those sources, our final lists for 2003-2009 contain between 459 and 500 top sites.
1996-2002: Popular Links from Homepages. For 2002, only the Alexa top 100 is available; for years before 2002, we have only ComScore’s list of 20 top sites. Thus, to build a list of 500 popular sites for the years 1996-2002, we took advantage of the then-standard practice of publishing links to popular domains on personal websites. Specifically, we located archives of the People pages of the Computer Science (or similar) departments at the top 10 U.S. CS research universities as of 1999, as reported in that year by U.S. News Online. We identified the top 500 domains linked to from the homepages accessible from those People pages, and added any ComScore domains that this process did not find. We ran this process on People pages archived in 1996 and 1999; these personal pages were not updated or archived frequently enough to allow finer granularity. We used the 1996 list as input to our 1996-1998 measurements, and the 1999 list as input to our 1999-2002 measurements.
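As a rough illustration of the 1996-2002 list-building step, the sketch below ranks domains by how many homepages link to them, keeps the top N, and then appends any ComScore domains that were not found. It is not our actual analysis code: the function names `link_domain` and `build_popular_list`, the choice to count each domain once per homepage, the stripping of a leading `www.`, and the alphabetical tie-breaking are all assumptions made for this example.

```python
from urllib.parse import urlparse
from collections import Counter


def link_domain(url):
    """Return the lowercased host of a URL without a leading 'www.'.

    Returns an empty string for links with no host (mailto:, relative paths).
    """
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_popular_list(homepage_links, comscore_domains, n=500):
    """Build a popular-sites list from links found on personal homepages.

    homepage_links: dict mapping a homepage identifier to the list of URLs
        it links to (hypothetical input format for this sketch).
    comscore_domains: iterable of domains to add if they were not found.
    """
    counts = Counter()
    for links in homepage_links.values():
        # Assumption: count each domain once per homepage, so a single
        # link-heavy page cannot dominate the ranking.
        domains = {link_domain(u) for u in links}
        domains.discard("")
        counts.update(domains)

    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    result = [domain for domain, _ in ranked[:n]]

    # Append ComScore domains that the link ranking did not surface.
    seen = set(result)
    for domain in comscore_domains:
        if domain not in seen:
            result.append(domain)
            seen.add(domain)
    return result
```

Called with the links extracted from the 1996 or 1999 People-page archives and that year's ComScore list, a function like this would return one input list per archive year. The result can be slightly longer than N when ComScore domains are appended.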
Download Python .pickle files of our collected data: