
Category: Random Thoughts

2020 Census and the “Privacy Budget” (aka Differential Privacy)

Maybe I’m the last one to know about this, but just in case… Did you know that the Census Bureau is changing what data it’ll release from the 2020 Census, and how? I mentioned this in passing in my last post, but here’s a little more, most of which you can find by reading the posts and presentations on the Census’ Disclosure Avoidance and the 2020 Census page.

CC-BY-SA image by Salix alba posted at Wikimedia Commons

The data scientists at the Census Bureau have been experimenting for the last few years, and it turns out that they were able to reconstruct private information about individual citizens using a process not unlike solving a Sudoku puzzle or one of those grid logic puzzles. They only release de-identified data, but if you know what you’re doing you can figure out a goodly percentage of the identifications. It’s called Re-Identification or Database Reconstruction.
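To make the Sudoku comparison concrete, here’s a toy sketch in Python. Everything in it is made up for illustration: the three-person block, the published statistics, and the `reconstruct` helper are my own assumptions, not Census data or Census code. It simply tries every possible set of ages and keeps the ones that match what was “published.”

```python
from itertools import combinations_with_replacement
from statistics import median


def reconstruct(count, mean_age, median_age, youngest=None, max_age=110):
    """Return every sorted tuple of ages consistent with the published stats.

    Hypothetical helper: brute-forces all multisets of `count` ages between
    0 and `max_age` and keeps those matching the mean, the median, and
    (if given) the age of the youngest person.
    """
    matches = []
    for ages in combinations_with_replacement(range(max_age + 1), count):
        if sum(ages) != mean_age * count:
            continue
        if median(ages) != median_age:
            continue
        if youngest is not None and ages[0] != youngest:
            continue
        matches.append(ages)
    return matches


# Two published statistics still leave some uncertainty...
print(len(reconstruct(count=3, mean_age=30, median_age=30)))  # 31

# ...but one more exact statistic pins down everyone's age.
print(reconstruct(count=3, mean_age=30, median_age=30, youngest=8))  # [(8, 30, 52)]
```

Each extra exact answer rules out more candidates, which is why a pile of individually harmless tables can add up to a reconstruction.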

One way to solve this problem is to never provide 100% accurate results to queries on the database. Without 100% accurate query results, no clever data scientists can take the answers from query 1 and fit them next to the answers from queries 2 through x, or put them next to their dataset of bank loan applications or whatever, and know for sure that they have accurate answers to those queries and can therefore reconstruct the database. Adding a small amount of random noise to each result, with the amount controlled by a privacy parameter (which will probably be a very very tiny parameter), is at the heart of what’s called Differential Privacy (and here a big shout out to Dr Steven Wu from the University of Minnesota, whose recent talk at Carleton helped me understand the math behind this process). The Census is presenting this parameter or set of parameters as their “Privacy Budget.” Once they’ve disclosed everything in the budget, they can’t disclose more without doing harm.
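Here’s a minimal sketch of the general idea, assuming the textbook Laplace mechanism for a simple count. This is not the Census Bureau’s actual system; the `PrivacyBudget` class, the `noisy_count` function, and the epsilon values are hypothetical names and numbers I picked for illustration. Each query adds random noise scaled by 1/epsilon and subtracts its epsilon from the budget, and once the budget is gone no more answers come out.

```python
import random


class PrivacyBudget:
    """Hypothetical tracker for a total privacy-loss budget (epsilon)."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered on 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, budget, rng):
    """Answer a counting query with Laplace noise, charging the budget.

    Assumes adding or removing one person changes the count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    budget.spend(epsilon)
    return true_count + laplace_noise(1 / epsilon, rng)


budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random()

print(noisy_count(100, epsilon=0.5, budget=budget, rng=rng))  # close to 100, not exact
print(noisy_count(100, epsilon=0.5, budget=budget, rng=rng))  # a different noisy answer
print(budget.remaining)  # 0.0, so a third query would raise RuntimeError
```

A smaller epsilon means more noise per answer (more privacy, less accuracy) but lets the same budget stretch across more queries, which is exactly the tradeoff the next paragraph is about.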

As of the presentation from October 1st, 2019 that’s posted to the Disclosure Avoidance and the 2020 Census page, the Census hasn’t yet quite decided what this privacy parameter will be. It sounds like they’re thinking it’ll vary depending on exactly how sensitive the information is or the amount of other information that’s been requested. But of course the main question is exactly how accurate the disclosed results have to be in order to allow people to do the work they need to do to run the country and distribute aid and know about our population. The more accurate the disclosures, the more risk of privacy problems. The more obscured the results, the more risk that we’ll struggle to know what we need to know in order to do what we need to do.

If this is a topic you’re interested in or care about, there are a bunch of things published by the Census statistician John Abowd that I found while exploring this topic with data-interested faculty here at Carleton.

  • Abowd, John. 2016. “Why Statistical Agencies Need to Take Privacy-Loss Budgets Seriously, and What It Means When They Do.” Labor Dynamics Institute, December. https://digitalcommons.ilr.cornell.edu/ldi/32.
  • Abowd, John M. 2016. “How Will Statistical Agencies Operate When All Data Are Private?” Journal of Privacy and Confidentiality 7 (3). https://doi.org/10.29012/jpc.v7i3.404.
  • Abowd, John M., and Ian Schmutte. 2017. “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods.” Labor Dynamics Institute, April. https://digitalcommons.ilr.cornell.edu/ldi/37.
  • Abowd, John M., and Ian M. Schmutte. 2019. “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices.” American Economic Review 109 (1): 171–202. https://doi.org/10.1257/aer.20170627.

I’m really curious to see how this plays out, not least because it’ll impact all the researchers I work with who use census data in so very many of their research projects. But I’m also curious to see how this and similar efforts bleed out into other Big Data conversations in my world, like Learning Analytics for example.


Citation Style Alignment

Good friend and Professor Extraordinaire Adriana Estill shared this alignment chart with me this morning. Happy Friday!

(Not sure what this is? Here’s more about alignment charts. I think this must be the original, by Jonathon Owen.)


Edited to Add: If you’ve ever supported legal citation, you’ll know why Friend of the Blog, Pete Smith of Sheffield Hallam University, suggested this addition, which I have added to the alignment chart:


Information Has Value – Computer Science edition

After yesterday’s post I had a fascinating discussion with someone who codes for a living about whether patents were a viable research resource in CS. First off, they’re extremely hard to understand. And yes, I definitely agree, and it’s a good reminder that when I talk about this with students I also talk explicitly about what I expect they’ll be able to learn from the exercise.

Hensel, Otto A. 1900. Rocking or oscillating bath-tub. United States US643094A, filed January 6, 1899, and issued February 6, 1900.
  1. If you find a patent that you think is related to your topic, look at other similarly classified patents to see what problems people are tackling in the field and who is tackling them.
  2. As you look through similarly classified patents, collect vocabulary that you can use in future searches. After all, most search systems simply match letters in a row rather than semantics, so if people are talking about the same thing but using different words to do so, you won’t find that whole side of the conversation.

While reading in order to understand the patented process is probably not feasible for most people, reading instrumentally has been super useful for me when exploring CS topics.

So far so good, but what really set me thinking was this industry coder’s take on the disadvantages of reading patents. Apparently he’s told not to read patents because knowingly infringing on someone else’s IP brings worse penalties than unknowingly infringing. So, in order to mitigate penalties, he and his colleagues don’t look at patents. Now I’m wondering how to guide students as they prepare for a world in which, at least some of the time, lack of information has value. And how do I square that with the very real costs involved in having a bunch of people reinventing wheels and falling into the same pitfalls, all so that if they get sued it won’t be quite so bad? And how do I square it with the way this upends the progress narrative of the sciences in general, a set of disciplines which so carefully finds gaps in knowledge and then fills them, or finds the limits of current knowledge and then pushes those limits back bit by bit?

I wonder if it matters what sector you’re in, or even what specific companies you’re working for. And I wonder how liberal arts students might engage with this conundrum in a way that prepares them for life after graduation, whether that life involves CS careers or not.


A change is as good as a rest?

Things have been pretty intense around here for mumble-mumble years, and for most of the last 3-4 years I’ve been doing more than just my own job while we cover for open positions and run searches and stuff. (There was a lovely 6 months in there while we were fully staffed, but that didn’t last.) But now that my department is fully staffed again I find that my brain isn’t braining very well. Even after a lovely long visit to my family, apparently my brain has more resting to do before it’ll kick back into gear after such a marathon.

While I can’t seem to grapple with programs and ideas and other “big” things, I’m irresistibly drawn to the mundane, the minutiae, the system code, and yes, the spreadsheet. Goodbye forest; hello trees.

I find myself looking for things to do that involve spreadsheets or CSS — big things, tiny things, it doesn’t matter. That’s all I want to do these days. Can I make you a pretty button to go around your website link? Please? (Ironically, my own blog is still looking pretty 2012… maybe sometime I should turn my attention to modernizing things around here… and get an SSL certificate.)

Maybe I’m so drawn to these projects because unlike my normal work you can always see change and/or progress while working on CSS or a spreadsheet. Or maybe it’s just that I don’t do a ton with these kinds of things during the school year, so it’s just a chance to change up what I’m thinking about. Whatever it is, I’m basically useless for normal stuff right now, but I have a whole lot of patience and energy for spreadsheets.
