As users browse the web, their browsing behavior may be observed and aggregated by third-party websites ("trackers") that they don't visit directly. These trackers are generally embedded by host websites in the form of advertisements, social media widgets (e.g., the Facebook "Like" button), or web analytics platforms (e.g., Google Analytics).
Though web tracking and its privacy implications have received much attention in recent years, that attention arrived relatively late in the history of the web and lacks full historical context. In this work, we conduct a longitudinal archaeological study of tracking on the web from 1996 to 2016. Our key insight is that the Internet Archive’s Wayback Machine enables a retrospective analysis of properties of the web, even though researchers did not anticipate in advance the need to study these properties over time. We evaluate the potential and limitations of the Wayback Machine for this purpose and offer strategies to overcome several challenges we encountered when using its data to study tracking.
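To make the retrospective idea concrete, the sketch below builds one Wayback Machine address per year for a given site, using the archive's public URL scheme of `https://web.archive.org/web/<timestamp>/<original URL>`, which resolves to the capture nearest the requested timestamp. This is only an illustration under stated assumptions: the function names `wayback_url` and `yearly_snapshots`, the choice of January 1 as the sample date, and the example site are all hypothetical, and none of this is taken from our measurement tool.

```python
from datetime import date

# Public Wayback Machine prefix; the archive serves the capture closest
# to the timestamp that follows it.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(site: str, year: int, month: int = 1, day: int = 1) -> str:
    """Return a Wayback Machine URL for `site` near the given date.

    Hypothetical helper: sampling on January 1 is an arbitrary choice
    made for this example.
    """
    stamp = date(year, month, day).strftime("%Y%m%d")
    return f"{WAYBACK_PREFIX}/{stamp}/{site}"


def yearly_snapshots(site: str, start: int = 1996, end: int = 2016) -> list[str]:
    """Return one archived URL per year across the study period, inclusive."""
    return [wayback_url(site, year) for year in range(start, end + 1)]


if __name__ == "__main__":
    for url in yearly_snapshots("http://example.com"):
        print(url)
```

The sketch only constructs addresses. Loading the archived pages and observing which third parties they contact is a separate step and is not shown here.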
We built a new tool, Tracking Excavator, which enables us to provide, for the first time, concrete evidence of web tracking behaviors dating back to 1996. We find that tracking has become more widespread, that tracking behaviors have become more complex (and thus harder for privacy-conscious users to evade), and that the top trackers are able to capture increasing fractions of users’ browsing behaviors.
This work builds on our previous work systematically studying web tracking in the present (see associated software) and developing a new defense for social media trackers (now integrated into the Electronic Frontier Foundation's Privacy Badger tool).