My advisor gave me a great bit of wisdom many years ago: 20 minutes of student perspiration is worth a week of advisor intuition. Looks like that’s the case here. Hat tip to Don, Sara, and Morgan for posting a link to the New York Times article that identified someone from the AOL data: one Thelma Arnold. It isn’t clear whether the NYT used only the data from AOL or also some of their own, via the method suggested by MC in the comments of the original article I posted. Coincidentally, I was talking with someone from the NYT yesterday here at SIGIR, and I got the impression that all the click logs the NYT has been keeping are unused and not easily available to reporters. So I suspect Ms. Arnold was tracked down using high-tech means such as calling all the Arnolds in Lilburn, GA. (DexOnline has 25.)
So, what does this show?
- Yes, you can identify a person via their queries. Despite my intuition and protestations to the contrary, the NYT article and some of the other examples make it pretty clear that you can identify people without too much difficulty.
- People search using private data. I never thought people would search using SSNs and driver’s license numbers… but I guess they do.
It also means that the AOL data is likely going to be the last public release of search engine query logs. While some data can be anonymized (for example, the MSN Search data we licensed to academics had nearly all numbers turned into ### prior to release), you can’t anonymize everything, and thus it will always be likely that someone can be tracked down using the query logs. So we’re done.
Certainly, it is possible that someone will set up some kind of opt-in program through which customers can explicitly make their queries available to academics. However, the problem is that the self-selecting set of users who opt in is going to differ from the general public, and thus the data isn’t as useful to academics (although again, this is just my intuition here).
So, I’m wrong. AOL did release PII, which is clearly an error on their part. And thus I’d call out to everyone that they should delete the data and certainly not use it for any purpose whatsoever — even research. Fields of science that routinely use data collected from human subjects, such as psychology or social science, have clear guidelines requiring prior consent of the subjects. Thus, even though it may be possible to get the consent of people like Ms. Arnold and some others, it’s inconceivable that AOL will get consent from everyone (even should they send e-mail to all of them, apologize, and ask for consent). And consent after the fact isn’t the same as prior consent. So there’s no other option — the AOL data, tempting as it may be, cannot be used ethically. It’s as simple as that.
While many privacy advocates would argue that you can’t ever anonymise data, and after the bad press AOL received probably no one else will release any, I think that it is possible to anonymise search logs for academic research by ensuring:
* The queries are not linked to a unique user. That link is the only way someone’s search *history* can be determined. There is an argument that you could still identify an individual from one search, but this is a hell of a lot harder, almost impossible, and would reveal no more than the HTTP headers given to whatever website you click on (which most people probably don’t realise they give out anyway)
* Make sure the queries aren’t time/date stamped, so a webmaster can’t link a query to personal information submitted to them, as I suggested in the comments to your previous article. Even without this, given the point above they still would have no more information on you anyway
* Strip SSNs, credit card numbers, and email addresses, as Microsoft did, and drop queries of more than, say, 5 words, where people are just pasting an email or whatever
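To make the three points above concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions, not how AOL or Microsoft actually processed their logs: the record fields `user_id`, `timestamp`, and `query` are hypothetical, masking every digit with `#` is a blunt stand-in for detecting SSNs and card numbers, and the email pattern is deliberately loose.

```python
import re

MAX_WORDS = 5  # "say 5 words" from the list above; tune as needed

# Loose, illustrative patterns (assumptions, not a vetted PII detector).
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
DIGIT_RE = re.compile(r"\d")


def scrub_query(query):
    """Return a masked copy of the query, or None if it should be dropped."""
    if len(query.split()) > MAX_WORDS:
        return None  # long queries are likely pasted emails or similar text
    query = EMAIL_RE.sub("<email>", query)  # mask emails before digits
    return DIGIT_RE.sub("#", query)  # masks SSNs, card numbers, any digits


def anonymise(records):
    """Keep only scrubbed query text; user IDs and timestamps are discarded."""
    scrubbed = (scrub_query(record["query"]) for record in records)
    # Sorting is an extra precaution not in the list above: it stops the
    # output order from revealing which queries came from the same user.
    return sorted(q for q in scrubbed if q is not None)


if __name__ == "__main__":
    sample = [
        {"user_id": 1, "timestamp": "t1", "query": "landscapers in lilburn ga"},
        {"user_id": 1, "timestamp": "t2", "query": "my ssn 123-45-6789"},
        {"user_id": 2, "timestamp": "t3", "query": "mail jane@example.com"},
        {"user_id": 2, "timestamp": "t4",
         "query": "hi bob thanks for the note about friday"},
    ]
    for line in anonymise(sample):
        print(line)
```

Note that the first sample query passes through untouched: nothing here stops a query from naming a place or a person.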
I don’t know if such data would be useful to researchers, but if AOL had released data in this format, I think your challenge would have been almost impossible to meet. The question is then whether queries about identifiable people (you couldn’t know who made the search, but you could identify the person being searched for) are acceptable - i.e. the query “Arnold, Lilburn, GA, gay” for example.