2020 Census: the demo software, data, and more on the debate about differential privacy

Since last I wrote, I’ve been able to learn more about a question that had plagued me: what of IPUMS, that unparalleled resource for census micro-data? For one thing, I was sure they must be thinking about privacy already – micro-data must be handled with care. For another, the 2020 Census “Privacy Budget” work is likely to make IPUMS’s work pretty complicated or even impossible.

Turns out, they are of course way ahead of me. Here’s their page on the topic, which also links to the Census’ released software and documentation, as well as to data available from IPUMS that researchers can use to test that software.

I also happen to know someone who works at IPUMS, and he says they’ve already been doing record swapping and injecting “noise” into their data to protect unique cases.

I certainly hope that everyone involved can figure out how to protect people’s privacy while still allowing vital research. This seems like an impossibly tangled knot to me.

2020 Census and the “Privacy Budget” (aka Differential Privacy)

Maybe I’m the last one to know about this, but just in case… Did you know that the Census Bureau is changing what data it will release from the 2020 Census, and how? I mentioned this in passing in my last post, but here’s a little more, most of which you can find by reading the posts and presentations on the Census’ Disclosure Avoidance and the 2020 Census page.

CC-BY-SA image by Salix alba posted at Wikimedia Commons

The data scientists at the Census Bureau have been experimenting for the last few years, and it turns out that they were able to reconstruct private information about individual respondents using a process not unlike solving a Sudoku puzzle or one of those grid logic puzzles. The Bureau only releases de-identified aggregate tables, but if you know what you’re doing you can rebuild a goodly percentage of the underlying individual records and then match them to named people. These attacks are called Database Reconstruction and Re-Identification.
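To make the logic-puzzle analogy concrete, here’s a toy sketch of a database reconstruction attack. This is my own illustration in Python, not the Census Bureau’s methodology; the census block and every published statistic in it are hypothetical. Real attacks apply the same idea at scale with constraint solvers.

```python
# Toy database reconstruction: given only published aggregate statistics
# for a hypothetical census block, brute-force every possible set of
# records and keep the ones consistent with all the aggregates.
# Every number here is made up for illustration.
from itertools import combinations_with_replacement, product

AGES = range(100)        # assume integer ages 0-99
SEXES = ("F", "M")

# Hypothetical published aggregates for one block:
N = 3                    # total persons
MEAN_AGE = 30            # mean age (exact)
MEDIAN_AGE = 25          # median age (exact)
N_FEMALE = 1             # count of females
MEAN_AGE_FEMALE = 20     # mean age of females (exact)

solutions = []
for people in combinations_with_replacement(product(AGES, SEXES), N):
    ages = sorted(age for age, _ in people)
    female_ages = [age for age, sex in people if sex == "F"]
    if (sum(ages) == MEAN_AGE * N
            and ages[N // 2] == MEDIAN_AGE
            and len(female_ages) == N_FEMALE
            and sum(female_ages) == MEAN_AGE_FEMALE * N_FEMALE):
        solutions.append(people)

print(len(solutions), "consistent database(s):", solutions)
```

Run it and exactly one database survives: a 20-year-old female and two males aged 25 and 45. Four innocuous-looking exact statistics gave away every person’s age and sex.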

One way to solve this problem is to never provide 100% accurate results to queries on the database. Without 100% accurate query results, no clever data scientist can take the answers from query 1, fit them next to the answers from queries 2 through x, or put them next to their dataset of bank loan applications or whatever, and know for sure that they have accurate answers to those queries and can therefore reconstruct the database. Adding carefully calibrated random noise to results, governed by a privacy parameter (usually called epsilon), is at the heart of what’s called Differential Privacy (and here a big shout out to Dr Steven Wu from the University of Minnesota, whose recent talk at Carleton helped me understand the math behind this process). The smaller the parameter, the more noise and the more privacy. The Census is presenting this parameter, or set of parameters, as their “Privacy Budget”: once they’ve disclosed everything in the budget, they can’t disclose more without doing harm.
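If you want to see the shape of the math, here’s a minimal sketch of the classic Laplace mechanism applied to a counting query. This is my illustration of the general technique, emphatically not the Census Bureau’s production system, which is far more elaborate.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# epsilon is the privacy parameter: smaller epsilon means more noise,
# more privacy, and less accuracy. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float) -> float:
    """Return an epsilon-differentially-private version of a count.

    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so Laplace noise with scale 1/epsilon
    is enough.
    """
    sensitivity = 1.0
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 1234  # made-up block population
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: reported count ~ {dp_count(true_count, eps):.1f}")
```

With epsilon at 10 the reported count is usually within a fraction of a person of the truth; at 0.1 it’s routinely off by ten or more. That one dial is the whole privacy-versus-accuracy trade-off.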

As of the presentation from October 1st, 2019 that’s posted to the Disclosure Avoidance and the 2020 Census page, the Census hasn’t yet quite decided what this privacy parameter will be. It sounds like they’re thinking it’ll vary depending on exactly how sensitive the information is or the amount of other information that’s been requested. But of course the main question is exactly how accurate the disclosed results have to be in order to allow people to do the work they need to do to run the country and distribute aid and know about our population. The more accurate the disclosures, the more risk of privacy problems. The more obscured the results, the more risk that we’ll struggle to know what we need to know in order to do what we need to do.
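The budget arithmetic itself comes from a composition property of differential privacy: roughly, if each released statistic consumes some epsilon, the total privacy loss is at most the sum of those epsilons. Here’s a small sketch, again mine and with made-up numbers, of what spending one fixed budget across more and more tabulations does to each one.

```python
# Sketch of sequential composition: a fixed total epsilon split across
# n releases leaves epsilon/n per release, so each individual release
# gets noisier as n grows. All values are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
TOTAL_EPSILON = 1.0                 # hypothetical overall privacy budget

def noisy_count(true_value: float, epsilon: float) -> float:
    # Laplace mechanism for a sensitivity-1 counting query.
    return true_value + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_counts = [500, 320, 180, 75]   # made-up tabulations for one block

for n in (1, 2, 4):
    eps_each = TOTAL_EPSILON / n    # simplest policy: an even split
    released = [noisy_count(c, eps_each) for c in true_counts[:n]]
    print(f"{n} release(s) at epsilon={eps_each:.2f} each:",
          [f"{r:.1f}" for r in released])
# Once the whole budget is spent, answering more queries accurately
# would push the total privacy loss past the chosen bound.
```

That’s why the Bureau can’t just keep answering questions, and the even split above is only the simplest policy; weighting more sensitive tables differently is exactly the kind of decision they’re still working out.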

If this is a topic you’re interested in or care about, there are a bunch of things published by the Census statistician John Abowd that I found while exploring this topic with data-interested faculty here at Carleton.

  • Abowd, John M. 2016. “Why Statistical Agencies Need to Take Privacy-Loss Budgets Seriously, and What It Means When They Do.” Labor Dynamics Institute, December. https://digitalcommons.ilr.cornell.edu/ldi/32.
  • Abowd, John M. 2016. “How Will Statistical Agencies Operate When All Data Are Private?” Journal of Privacy and Confidentiality 7 (3). https://doi.org/10.29012/jpc.v7i3.404.
  • Abowd, John M., and Ian Schmutte. 2017. “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods.” Labor Dynamics Institute, April. https://digitalcommons.ilr.cornell.edu/ldi/37.
  • Abowd, John M., and Ian M. Schmutte. 2019. “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices.” American Economic Review 109 (1): 171–202. https://doi.org/10.1257/aer.20170627.

I’m really curious to see how this plays out, not least because it’ll impact all the researchers I work with who use census data in so very many of their research projects. But I’m also curious to see how this and similar efforts bleed out into other Big Data conversations in my world, like Learning Analytics for example.

It’s all coming up data

Image by Gerd Altmann from Pixabay

My roots are about as humanistic/artsy as they come. I majored in English and minored in Art. Then I got a master’s in literary studies. Then I got my master’s in library and information science. That last degree was the first time the word “science” was part of my life in any way more substantial than checking the box next to required credits for graduation. (It took me a long time, actually, to figure out why “library science” was a pair of words that go together, but be that as it may, my official degree is Master of Library and Information Science, which sounds very scientific to me.) From there I became the Librarian for Languages and Literature here at Carleton. All my previous passions neatly packaged in a single job.

This week I’ve been thinking a lot about an almost-decade-old paper by danah boyd and Kate Crawford. “Six Provocations for Big Data” made a big impression on me back in 2011 purely because epistemology and ways of knowing are my stock in trade, and this paper felt somehow Very True to me at the time. (Plus I’d just heard danah boyd talk at a library conference and was pretty much determined to listen to anything she ever said from then on.) So I pulled it out of my Zotero library for a refresher, and it felt even more Very True to me now.

Part of my current re-fascination with this piece is that my liaison departments have taken a decided turn toward data. It has become increasingly clear to me that a major role I can play for my new liaison department (Computer Science — hey look! a second science in my life!) is to become a Data Librarian Lite(TM). Not a huge surprise there, and something I’m greatly enjoying learning. But they’re definitely not the only department I serve that’s turning to data. I’ve gotten more and more requests for linguistics corpora (spawning a new page on my Linguistics Research Guide just this weekend). And multiple faculty in the English department are working with digital textual analysis and literature corpora.

So yes, the phrases that stood out to me 8 years ago from boyd and Crawford’s piece stood out again: “Big Data is no longer just the domain of actuaries and scientists. … Big Data creates a radical shift in how we think about research … about the constitution of knowledge, the process of research, how we should engage with information, and the nature and categorization of reality” (pages 2-3).

But then there was this gem: “Claims to objectivity and accuracy are misleading” (boyd and Crawford, page 4). That meant one thing to me in 2011, but a lot has happened since then that has made me understand this statement in new ways. First there’s all the research into bias in algorithms (e.g. Safiya Noble). Then last week there was a talk presented in the Computer Science department here about “differential privacy” and (tangentially in the talk, but centrally to my world) the 2020 Census’ plan to add deliberate small inaccuracies into reported results in order to protect respondents’ privacy. So not only are claims to objectivity and accuracy misleading, but too much accuracy has become harmful enough that we’re backing away from it in key areas.

Meanwhile, as epistemologies shift and the world of research continues to remake itself, I’ll be over here learning to be a librarian who navigates the various worlds of data in addition to being a librarian who absolutely values close reading and minute observation. And I’m loving it.

Enabling, in a good way

One of the things I love about my job is that my overarching function is to make things possible. I love making things possible.

Sometimes this means pointing people toward a resource that fits their information need, but more often it means helping them think about what would make their work possible. Helping them translate their questions into the language and mechanisms of search systems and information pathways, helping them think about what part of their overwhelmingly large research question might make for a manageable project while still feeling meaningful, helping them think about what broader concepts might give context to a frustratingly specific question, validating their curiosities, validating their sense that the process isn’t necessarily easy or straightforward, and on and on. At the very least, we’re always looking for concrete next steps while keeping our eyes on some (hopefully) meaningful and interesting goal. Honestly, a lot of the work is making things that feel scary and uncertain and anxiety-provoking feel manageable and actionable.

I’m really lucky that this kind of enabling can be my role in life.