For civic policy wonks and anyone interested in open data, New York City's taxi trip logs are a fascinating specimen – over 20 gigabytes worth of trip and fare data comprising more than 173 million individual rides. And that's just from 2013.
The data, obtained via freedom of information request, was anonymized to protect the identities of the NYC Taxi and Limousine Commission's drivers. Or so the people who released the data set thought. The reality is, it didn't take long before a software developer found that, with a bit of code, those identities were actually quite easy to ascertain.
"This anonymization is so poor that anyone could, with less than 2 hours work, figure which driver drove every single trip in this entire data set," wrote Vijay Pandurangan, a former Google engineer and founder and CEO of a company called Mitro. "It would even be easy to calculate drivers' gross income, or infer where they live."
This isn't the first time that big data sets full of personal information – supposedly obscured, or de-identified, as the process is called – have been reverse engineered to reveal some or even all of the identities contained within. It makes you wonder: Is there really such a thing as a truly anonymous data set in the age of big data?
That's a question privacy advocates and researchers have been fighting over, for, well, years. On the one hand, some argue that you can never really anonymize a data set – that, with enough secondary information, it's not difficult to corroborate, cross-reference, and thus re-identify at least some of the people catalogued within. On the other, Ontario's Information and Privacy Commissioner Ann Cavoukian, along with the Information Technology and Innovation Foundation's senior analyst Daniel Castro, published a report last month attempting to put this "myth" to rest. "Setting the record straight," the report was titled. "De-identification *does* work."
That might sound like a bore, but think about it this way: there's more than taxi cab data at stake here. Pretty much everything you do on the Internet these days is a potential data set. And data has value. The posts you like on Facebook, your spending habits as tracked by Mint, the searches you make on Google – the argument goes that the social, economic and academic potential of sharing these immensely detailed so-called "high dimensional" data sets with third parties is too great to ignore.
If that's the case, we'd better hope there's a pretty surefire way to scrub data sets of our personal information before release. Cavoukian and Castro worry that we'll be so scared off by incidents where data has been poorly de-identified – and mistake those scenarios for examples of why de-identification doesn't work – that we'll decide not to share our data at all.
De-identification techniques recommended by a European advisory body on data protection and privacy include the randomization or removal of direct identifiers – name, address and the like – normalizing data by removing outlying values, adding noise to the data to obscure exact values, and generalizing data by, say, only including a person's birth month, but not the day, or the province a person lives in but not the city. When combined, these techniques are said to give identifying data a high degree of anonymity. But not everyone believes they work.
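To make those four techniques concrete, here is a minimal Python sketch that applies them to a toy record. The field names (`name`, `address`, `birth_date`, `city`, `province`, `fare`), the fare cap and the noise level are all assumptions made for illustration; nothing here is taken from the advisory body's guidance or from any real data set.

```python
import random

# Hypothetical field names; the article does not specify a schema.
DIRECT_IDENTIFIERS = {"name", "address"}


def deidentify(records, fare_cap=100.0, noise_sd=1.0, rng=None):
    """Return de-identified copies of the records, skipping outliers."""
    rng = rng or random.Random()
    cleaned = []
    for record in records:
        # Normalize: drop records whose fare is an outlying value.
        if record["fare"] > fare_cap:
            continue
        # Remove direct identifiers outright.
        out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        # Generalize: keep birth year and month ("1980-04"), drop the day.
        out["birth_month"] = out.pop("birth_date")[:7]
        # Generalize: keep the province, drop the city.
        out.pop("city")
        # Add noise so the exact fare is obscured.
        out["fare"] = round(out["fare"] + rng.gauss(0, noise_sd), 2)
        cleaned.append(out)
    return cleaned
```

A sketch like this only shows what the steps look like in practice. Whether the result is anonymous enough depends on the data and on what other information an attacker can bring to it, which is exactly what the two camps disagree about.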
University of Colorado Law School associate professor Paul Ohm's 2009 paper on the topic made the bold claim that "data can be either useful or perfectly anonymous but never both." To prove his point, Ohm cited three well-known data sets that had supposedly been de-identified – a set of AOL search queries, a Netflix viewing-habits data set and a data set of ZIP codes, gender and birth date. In each case, when researchers correlated the data with another data set, they found that some identities could still be revealed.
A similar situation was cited by Princeton University researchers Arvind Narayanan and Edward W. Felten in a recent response to Cavoukian and Castro. The pair wrote that, in one data set where location data had supposedly been anonymized, it was still possible in 95 per cent of test cases to re-identify users "given four random spatio-temporal points" – and 50 per cent if the researchers only had two. In other words, de-identifying location data is moot if you know where a target lives, where they work and have two other co-ordinates they visit with regularity.
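The logic behind that kind of result can be sketched in a few lines of Python. This is a toy illustration, not the researchers' method: the user labels, the `(place, hour)` representation of a spatio-temporal point and the function name `matching_users` are all invented for the example.

```python
def matching_users(traces, known_points):
    """Return the pseudonymous users whose trace contains every known point.

    traces: dict mapping a user label to a set of (place, hour) tuples.
    known_points: iterable of (place, hour) tuples the attacker already knows.
    """
    known = set(known_points)
    return [user for user, points in traces.items() if known <= points]


# Toy data: two users share a neighbourhood and an office district.
traces = {
    "user_1": {("cell_12", 8), ("cell_40", 9), ("cell_07", 18)},
    "user_2": {("cell_12", 8), ("cell_40", 9), ("cell_33", 12)},
}

# Two known points leave both users as candidates.
two_points = matching_users(traces, [("cell_12", 8), ("cell_40", 9)])

# One more known point narrows the match to a single user.
three_points = matching_users(
    traces, [("cell_12", 8), ("cell_40", 9), ("cell_07", 18)]
)
```

Each extra point an attacker knows shrinks the list of candidates, and once it shrinks to one, the "anonymous" label on that trace no longer protects anyone.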
Cavoukian and Castro argue in their paper that many of these supposedly successful re-identification attacks – including the ones mentioned by Ohm, Narayanan and Felten – are actually based on data sets that were poorly de-identified in the first place (and, in the case of the location data study, they question how easy it would be to actually get "four random spatio-temporal points"). Case in point: the New York City taxi data set wasn't actually anonymous data, but pseudonymous data. All the people preparing the data did was replace the drivers' identities – a direct identifier – with anonymous values that were trivial for Pandurangan to reverse engineer. The creators didn't add any noise to the data set, and didn't make any values more vague. The basic tenets of de-identification just weren't there.
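To see why a pseudonym can be trivial to reverse, consider the following Python sketch. It rests on assumptions this article does not confirm: that each pseudonym is an unsalted MD5 hash of a short, rigidly formatted ID, and that the ID format is one digit, one letter and two digits (a hypothetical format chosen for the example). Under those assumptions an attacker can simply hash every possible ID and build a lookup table.

```python
import hashlib
import itertools
import string


def pseudonymize(identifier):
    """Assumed scheme: an unsalted MD5 hash of the ID, as a hex string."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()


def all_candidate_ids():
    """Yield every ID in the hypothetical format digit-letter-digit-digit."""
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup():
    """Map each possible pseudonym back to the ID that produced it."""
    return {pseudonymize(candidate): candidate for candidate in all_candidate_ids()}


lookup = build_lookup()          # only 10 * 26 * 10 * 10 = 26,000 entries
released_value = pseudonymize("5X55")
recovered = lookup[released_value]
```

With so few possible inputs, the whole table takes well under a second to compute on an ordinary laptop. Replacing an identifier with a scrambled version of itself hides nothing if the set of possible identifiers is small enough to enumerate.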
"Could the NYC taxi dataset have been de-identified reliably?" asked Narayanan and Felten. "Nobody knows." Cavoukian and Castro still think the answer is clear, but so much for setting the record straight.