Tracking Excavator: Uncovering Tracking in the Web's Past

Ada Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner

Security & Privacy Research Lab, Paul G. Allen School of Computer Science & Engineering, University of Washington

FAQ

What is web tracking?
When websites other than the ones you directly visit gather information about you as you browse the web, this is third-party web tracking. For example, when you visit websiteA.com, the site may embed third-party content, like advertisements from advertisers, buttons from social networks, or code from web analytics engines. These third parties are invisible to you when you visit websiteA.com; the web page looks just like a regular web page to you. But these third parties gather information about your visit to websiteA.com, and to any other websites on which they are embedded.

Web tracking has some benefits: it lets advertisers show you more relevant advertisements, it lets websites and social media sites personalize content for you, and it lets website developers learn about how people use their websites. But web tracking also raises privacy concerns (see below).

Why do some people consider web tracking a privacy concern?
Though web tracking has some benefits (see above), it also raises potential privacy concerns. By tracking you across the web, a tracker can compile a record of your browsing behavior and use it to build an accurate profile of you, even if it is tracking you anonymously. Some people are uncomfortable with third parties compiling this much information about them, often without their knowledge. For example, this Wall Street Journal article discusses the privacy concerns surrounding web tracking.

If a website can track me, does that mean it is developing a profile of me?
Different websites have different privacy policies with respect to the data they collect about you. Some explicitly use the data they collect to track you (for instance, for the purpose of targeted advertising). Others state that they use the data only for diagnostics and discard all data after a short period of time (e.g., Google's +1 button). Our measurements do not distinguish between trackers with different privacy policies. Rather, we aim to detect all trackers who have the ability to track users, regardless of what they actually do with the data they collect.

How does web tracking work?
Cookie-based tracking is the most common form of web tracking. These trackers generally set a cookie in your browser, which is simply a small file that contains text, often including a unique identifier. This identifier might identify you anonymously or, if you are logged in to the tracker’s site, as a specific user of that site. Whenever a request is made to a website, including the embedded third-party trackers, your browser automatically attaches the cookies to the request. This mechanism lets the tracker that set the cookie identify you again and thus link your visits between websites.
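The cookie mechanism described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the behavior of any real tracker or browser: the `Tracker` and `Browser` classes, the `uid` cookie name, and the `tracker.example` domain are all hypothetical names chosen for illustration, and no network requests are made.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Hypothetical third-party tracker embedded on many websites."""

    def __init__(self):
        # Maps each cookie identifier to the sites on which it was seen.
        self.visits = defaultdict(list)

    def handle_request(self, cookies, embedding_site):
        """Process one request and return the cookies to set in the browser."""
        uid = cookies.get("uid")
        if uid is None:
            # First time this browser is seen: assign a unique identifier.
            uid = uuid.uuid4().hex
        self.visits[uid].append(embedding_site)
        return {"uid": uid}


class Browser:
    """Hypothetical browser that keeps a separate cookie jar per domain."""

    def __init__(self):
        self.cookie_jar = defaultdict(dict)

    def visit(self, site, embedded):
        """Load `site`, which embeds content from the given third-party domains."""
        for domain, tracker in embedded.items():
            # The browser automatically attaches the cookies it holds for
            # the third-party domain to the request.
            sent = dict(self.cookie_jar[domain])
            received = tracker.handle_request(sent, site)
            self.cookie_jar[domain].update(received)


tracker = Tracker()
browser = Browser()
browser.visit("websiteA.com", {"tracker.example": tracker})
browser.visit("websiteB.com", {"tracker.example": tracker})

# The tracker has linked both visits to a single identifier.
for uid, sites in tracker.visits.items():
    print(sites)  # ['websiteA.com', 'websiteB.com']
```

Because the browser sends the same `uid` cookie on both visits, the tracker ends up with one identifier associated with both websites, which is the linking step the answer above describes.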

Tracking may also use other mechanisms. Some trackers store unique identifiers in places other than cookies, such as HTML5 Local Storage. Others do not need to store unique identifiers at all but instead use browser and machine fingerprinting techniques. Fingerprinting takes advantage of the fact that your machine and browser configuration (for example, the browser version, fonts, and plugins you have installed) is often unique among users.
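As a rough illustration of the fingerprinting idea, the sketch below derives an identifier from configuration attributes alone, without storing anything in the browser. It is a simplified example under stated assumptions: the attribute names, the sample values, and the choice of a SHA-256 hash are all hypothetical, and real fingerprinting techniques may use different signals and different methods.

```python
import hashlib
import json


def normalize(config):
    """Sort list-valued attributes so their ordering does not affect the result."""
    return {
        key: sorted(value) if isinstance(value, list) else value
        for key, value in config.items()
    }


def fingerprint(config):
    """Hash a browser and machine configuration into a stable identifier."""
    canonical = json.dumps(normalize(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configurations; all values are made up for illustration.
first_visit = {
    "browser_version": "ExampleBrowser 12.3",
    "fonts": ["Arial", "Courier New", "Helvetica"],
    "plugins": ["PDF Viewer"],
}
second_visit = {
    "plugins": ["PDF Viewer"],
    "fonts": ["Helvetica", "Arial", "Courier New"],
    "browser_version": "ExampleBrowser 12.3",
}
other_user = {
    "browser_version": "ExampleBrowser 12.3",
    "fonts": ["Arial", "Times New Roman"],
    "plugins": [],
}

print(fingerprint(first_visit) == fingerprint(second_visit))  # True
print(fingerprint(first_visit) == fingerprint(other_user))    # False
```

The same configuration yields the same identifier on every visit, so no cookie is needed to recognize a returning browser, while a browser with different fonts or plugins yields a different identifier.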

What can I do if I'm worried about web tracking?
Your browser has settings that can help reduce tracking: enable the "Do Not Track" option and block third-party cookies. There are also several browser extensions that can help. For example, consider (in no particular order) Privacy Badger, Ghostery, Crumble, or uBlock Origin.

Why did you do this research?
There have been numerous studies on web tracking, but they have only been able to examine a relatively brief, and relatively recent, point in the web’s history. We wanted to know whether it’s possible to investigate the evolution of web tracking from the beginning of the web, both to put existing studies in historical context and to better understand how tracking has changed over time.

How did you do an archaeological study of web tracking?
We used the Internet Archive's Wayback Machine. The Wayback Machine keeps snapshots of websites from the past (for example: here is what our department website http://cs.washington.edu looked like in 1996). We built a tool called Tracking Excavator, which automatically browses through the Internet Archive and investigates tracking behaviors over time.
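To give a sense of how archived pages can be addressed programmatically, the sketch below builds Wayback Machine URLs for one site across a range of years. It relies on the public Wayback Machine URL pattern, in which a 14-digit timestamp precedes the original URL and the service serves the snapshot closest to that time. The helper names and the choice of June 1 as the sample date are assumptions made for illustration; this is not the implementation of Tracking Excavator.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build the archive URL for the snapshot of `original_url` nearest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def yearly_snapshot_urls(original_url, first_year, last_year):
    """Return one archive URL per year, sampled on June 1 (an arbitrary choice)."""
    return [
        wayback_url(original_url, datetime(year, 6, 1))
        for year in range(first_year, last_year + 1)
    ]


for url in yearly_snapshot_urls("http://cs.washington.edu", 1996, 1998):
    print(url)
# https://web.archive.org/web/19960601000000/http://cs.washington.edu
# https://web.archive.org/web/19970601000000/http://cs.washington.edu
# https://web.archive.org/web/19980601000000/http://cs.washington.edu
```

A measurement tool could load each of these URLs in an instrumented browser and record the third-party requests and cookies it observes for that year.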

How does this research help us understand web tracking?
First, we demonstrate that it is possible to use the Wayback Machine to study properties of web tracking over time -- including for times before the first known studies of web tracking. Our research investigates the strengths and limitations of the Wayback Machine for conducting such a retroactive study and develops techniques for effectively using archived data.

Second, our results provide, for the first time, concrete evidence of web tracking behaviors as early as 1996, and illustrate how web tracking has become steadily more prevalent and more complex since then.

What are some of your most interesting findings?
We have published our results in a peer-reviewed research paper, which is available for download here: [PDF]
Here are some of our most interesting findings:

It is possible to retroactively study web tracking from 1996-present using the Wayback Machine.
Web tracking has become steadily more widespread, despite concerns in recent years about its privacy implications. Though this finding may confirm intuition, our work provides the first concrete evidence of this trend over the last 20 years of the web.
Tracking has become more complex. First, more trackers exhibit more types of behaviors, which has the effect of making it harder for privacy-conscious users to evade these behaviors. (See here for a description of different types of tracking behaviors.) Second, we see increasing use of fingerprinting-related and other browser APIs.
Though most types of cookie-based tracking show increases over time, trackers that used pop-ups to force users to “visit” their webpage had their peak in the early 2000s and then disappeared after major browsers started blocking pop-ups by default. This peak happened before any known studies on web tracking and thus could only have been detected by archaeological measurements like ours.
The percentage of top sites (and therefore the percentage of user traffic) observed by the most pervasive trackers has also increased over time.