While our anxiety around how this data will be used has grown considerably in recent years, culminating in the launch of a federal probe by the DOJ in recent weeks, it’s done little to stop the flow of information from individuals to companies, or from one company to another. According to some experts, the data trade has overtaken oil as the world’s fastest-growing commodity market.

We’re assuaged by the thought of our data being anonymized: crucial data points stored as individual blips in a massive database, one so large, with so many of these markers, that it’s nearly impossible to trace any of them back to a single human. Or that’s what we were told, anyway. It has never been true. We’ve known as much since the mid-1990s, when Dr. Latanya Sweeney, Professor of Government and Technology in Residence at Harvard University, blew that notion to pieces by identifying the medical records of William Weld (then the Governor of Massachusetts) in an anonymous database. Dr. Sweeney, who also heads the Data Privacy Lab at Harvard’s Institute for Quantitative Social Science, needed only three data points (Weld’s ZIP code, date of birth, and gender) to correctly identify him among countless others.

Pressed by NGOs and legislators to truly anonymize data before sharing it, companies began to rely on a new method called sampling. In a sampled database, any individual or company has access to only a small piece of an anonymous database, never the entire thing. In theory, splitting the data into several smaller samples lowers the risk of re-identification: the anonymous data points describing any one person are spread across several databases, and no company or individual can access all of them. According to the Office of the Australian Information Commissioner, sampling “[creates] uncertainty that any particular person is even included in the dataset.” Or, to put it simply: sampling is supposed to prevent the re-identification of anonymous individuals.

But this, too, is false. According to a trio of European researchers, individuals in a sampled database can be re-identified 83 percent of the time using just three data points: gender, date of birth, and ZIP code. They created a handy tool (one that doesn’t store the data you enter) that you can use to find out how likely you are to be re-identified from those three data points. For me it’s 45 percent of the time, much better than average but still shockingly high. And in an article published in Nature Communications, the team describes a statistical model that could correctly identify 99.98 percent of Americans in an anonymized dataset using 15 characteristics, including age, gender, and marital status.

And what companies aren’t tracking, they’re buying. Data brokers are big business, and they exist solely to provide competitive insights into everything from your household income to who you voted for in the last election.
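Whether the data is tracked or bought, the mechanics of re-identification remain disarmingly simple. Below is a minimal sketch of a linkage attack in the style of Dr. Sweeney’s, written in Python with pandas; the two CSV files and every column name are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical inputs: an "anonymized" hospital extract (no names, but it
# keeps the three quasi-identifiers) and a public voter roll (with names).
medical = pd.read_csv("anonymized_hospital_records.csv")  # zip, dob, sex, diagnosis
voters = pd.read_csv("public_voter_roll.csv")             # name, zip, dob, sex

# Join the two datasets on the quasi-identifiers.
linked = medical.merge(voters, on=["zip", "dob", "sex"])

# Any (zip, dob, sex) combination that maps to exactly one voter is,
# in effect, a re-identified medical record.
names_per_combo = linked.groupby(["zip", "dob", "sex"])["name"].transform("nunique")
reidentified = linked[names_per_combo == 1]

print(f"{len(reidentified)} medical records linked to a unique name")
```

No machine learning, no special access: just an ordinary database join on three innocuous-looking columns.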
According to the researchers: “We believe that, in general, it is time to move away from de-identification and tighten the rules for what constitutes truly anonymized data. Making sure data can be used statistically, e.g., for medical research, is extremely important but cannot happen at the expense of people’s privacy.” Datasets such as the NIGMS and NIH genetic data, the Washington State health data, the NYC taxicab dataset, the Transport for London bike-sharing dataset, and the Australian de-identified Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Schedule (PBS) datasets have all been shown to be easily re-identifiable.

Anonymized data is better than the alternative, but it’s clear that we have some work to do in increasing our understanding of what’s collected and how it may be used against us.
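For some intuition about that 99.98 percent figure, a toy simulation helps. The sketch below is purely illustrative (the attributes, their counts, and their distributions are all made up, and this is not the researchers’ model): it draws 100,000 synthetic people, assigns each of them 15 categorical attributes, and measures how quickly records become unique as more attributes are considered.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_people = 100_000

# Fifteen made-up categorical attributes of varying granularity, loosely
# standing in for things like gender, age bracket, marital status, ZIP code.
cardinalities = [2, 5, 10, 20, 50, 100, 3, 7, 12, 30, 4, 9, 25, 60, 15]
people = pd.DataFrame({
    f"attr_{i}": rng.integers(0, c, n_people)
    for i, c in enumerate(cardinalities)
})

# For each prefix of attributes, measure the share of records that are
# unique, i.e. re-identifiable by an attacker who knows those attributes.
for k in range(1, len(cardinalities) + 1):
    cols = list(people.columns[:k])
    unique = ~people.duplicated(subset=cols, keep=False)
    print(f"{k:2d} attributes known: {unique.mean():7.2%} of records are unique")
```

Even with uniformly random, independent attributes, the share of unique records races toward 100 percent within a handful of columns; real attributes are correlated and unevenly distributed, which is why the researchers needed a proper statistical model to estimate the risk.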