Maybe I’m the last one to know about this, but just in case… Did you know that the Census Bureau is changing what data it’ll release from the 2020 Census, and how it’ll release it? I mentioned this in passing in my last post, but here’s a little more, most of which you can find by reading the posts and presentations on the Census’ Disclosure Avoidance and the 2020 Census page.
The data scientists at the Census Bureau have been experimenting for the last few years, and it turns out that they were able to reconstruct private information about individual citizens using a process not unlike solving a Sudoku puzzle or one of those grid logic puzzles. The Bureau only releases de-identified data, but if you know what you’re doing you can figure out a goodly percentage of the identifications. This is called Re-Identification or Database Reconstruction.
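To make the Sudoku comparison concrete, here’s a toy sketch of the idea in Python. Everything in it is made up for illustration: the block of three people, the published statistics, and the assumption that ages are whole numbers from 0 to 100. It is not the Census Bureau’s data or their actual reconstruction method, just a brute-force search for every set of ages that fits all the published numbers at once.

```python
from itertools import combinations_with_replacement

# Made-up published statistics for one tiny, hypothetical census block.
PUBLISHED = {
    "count": 3,
    "mean_age": 44,
    "median_age": 30,
    "count_under_18": 1,
    "mean_age_adults": 61,
}


def consistent_with_published(ages):
    """Return True if a sorted tuple of ages matches every published statistic."""
    adults = [a for a in ages if a >= 18]
    minors = [a for a in ages if a < 18]
    return (
        len(ages) == PUBLISHED["count"]
        # Compare sums instead of means to stay in whole numbers.
        and sum(ages) == PUBLISHED["mean_age"] * len(ages)
        and ages[len(ages) // 2] == PUBLISHED["median_age"]
        and len(minors) == PUBLISHED["count_under_18"]
        and len(adults) > 0
        and sum(adults) == PUBLISHED["mean_age_adults"] * len(adults)
    )


# Assumption: ages are integers from 0 to 100. The tuples come out sorted,
# so the middle element of each tuple is its median.
candidates = combinations_with_replacement(range(0, 101), PUBLISHED["count"])
solutions = [ages for ages in candidates if consistent_with_published(ages)]

print(solutions)
```

With these five made-up statistics, only one set of ages survives: 10, 30, and 92. No single statistic gives anyone away, but together they pin down every person in the block, and that is the grid-logic-puzzle part.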
One way to solve this problem is to never provide 100% accurate results to queries on the database. Without 100% accurate query results, no clever data scientists can take the answers from query 1 and fit them next to the answers from queries 2 through x, or put them next to their dataset of bank loan applications or whatever, and know for sure that they have accurate answers to those queries and therefore reconstruct the database. Adjusting results with a bit of random noise, governed by a privacy parameter (which will probably be a very, very tiny parameter), is at the heart of what’s called Differential Privacy (and here a big shout out to Dr. Steven Wu from the University of Minnesota, whose recent talk at Carleton helped me understand the math behind this process). The Census is presenting this parameter or set of parameters as their “Privacy Budget.” Once they’ve disclosed everything in the budget, they can’t disclose more without doing harm.
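For the code-inclined, here’s a minimal sketch of how noisy answers and a privacy budget can fit together. It uses the textbook Laplace mechanism for counting queries, which is one standard way to do differential privacy and not necessarily what the Census Bureau will run in production. The names `PrivacyBudget`, `laplace_noise`, and `noisy_count` are my own, the true count of 1,200 is invented, and I’m assuming each person can change a count by at most 1 (a “sensitivity” of 1).

```python
import random


class PrivacyBudget:
    """Tracks how much of a total privacy parameter (epsilon) is left to spend."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count, epsilon, budget, rng, sensitivity=1.0):
    """Answer a counting query with Laplace noise, charging epsilon to the budget."""
    budget.spend(epsilon)
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2020)               # seeded only so the demo is repeatable
budget = PrivacyBudget(total_epsilon=1.0)

print(noisy_count(1200, 0.5, budget, rng))   # first query, spends half the budget
print(noisy_count(1200, 0.5, budget, rng))   # second query, spends the rest
print(budget.remaining)                       # nothing left for a third query
```

A third call to `noisy_count` would raise an error instead of answering, which is the code version of “once they’ve disclosed everything in the budget, they can’t disclose more.”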
As of the presentation from October 1st, 2019, that’s posted to the Disclosure Avoidance and the 2020 Census page, the Census hasn’t quite decided yet what this privacy parameter will be. It sounds like they’re thinking it’ll vary depending on exactly how sensitive the information is or the amount of other information that’s been requested. But of course the main question is exactly how accurate the disclosed results have to be in order to allow people to do the work they need to do to run the country and distribute aid and know about our population. The more accurate the disclosures, the more risk of privacy problems. The more obscured the results, the more risk that we’ll struggle to know what we need to know in order to do what we need to do.
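You can see that accuracy-versus-privacy tension in a few lines of Python. This is a rough simulation under my own assumptions: it uses the textbook Laplace mechanism on a simple count (not the Census Bureau’s actual algorithm), the epsilon values are arbitrary ones I picked, and `mean_absolute_error` is a hypothetical helper of mine. For this mechanism, the typical size of the noise works out to the sensitivity divided by epsilon, so a smaller epsilon (more privacy) means a fuzzier answer.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def mean_absolute_error(epsilon, trials=20000, seed=0, sensitivity=1.0):
    """Estimate how far off a noisy count is, on average, for a given epsilon."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    total = sum(abs(laplace_noise(scale, rng)) for _ in range(trials))
    return total / trials


# Arbitrary epsilon values, from very private to not very private.
errors = {eps: mean_absolute_error(eps) for eps in (0.01, 0.1, 1.0, 10.0)}

for eps, err in errors.items():
    print(f"epsilon={eps:<5} average error of about {err:.2f} people")
```

An error of a hundred people barely matters for a state total, but it would swamp the count for a single block of a few dozen residents, which is why the choice of parameter matters so much to anyone working with small geographies.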
If this is a topic you’re interested in or care about, there are a bunch of things published by the Census statistician John Abowd that I found while exploring this topic with data-interested faculty here at Carleton.
- Abowd, John. 2016. “Why Statistical Agencies Need to Take Privacy-Loss Budgets Seriously, and What It Means When They Do.” Labor Dynamics Institute, December. https://digitalcommons.ilr.cornell.edu/ldi/32.
- Abowd, John M. 2016. “How Will Statistical Agencies Operate When All Data Are Private?” Journal of Privacy and Confidentiality 7 (3). https://doi.org/10.29012/jpc.v7i3.404.
- Abowd, John M., and Ian Schmutte. 2017. “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods.” Labor Dynamics Institute, April. https://digitalcommons.ilr.cornell.edu/ldi/37.
- Abowd, John M., and Ian M. Schmutte. 2019. “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices.” American Economic Review 109 (1): 171–202. https://doi.org/10.1257/aer.20170627.
I’m really curious to see how this plays out, not least because it’ll impact all the researchers I work with who use census data in so very many of their research projects. But I’m also curious to see how this and similar efforts bleed out into other Big Data conversations in my world, like Learning Analytics for example.