Preparing for your database lists to malfunction on occasion

If you’re like us, you have a LibGuides-based list of databases for your library, and it went down for a bit this morning. If you’re like me, this pretty much cripples you until the list is back up and running. This kind of downtime could happen no matter your platform, so if you’re like me you might also want to have an option available to get around your database list during downtime. Here’s what I came up with about a year ago and was happy to have in place this morning.

Each month I download a CSV file of our databases – LibGuides has a nifty “export all” option and I click that. Then I paste that data dump into a hidden sheet on a Google Sheet that I’ve made available to anyone with a link. From that messy data dump, I built formulae onto a visible sheet in that Sheet (really, Google has GOT to get better about its names for apps so that I don’t have to refer to a sheet on a Sheet). These formulae help me display a full alphabetical list of all of our databases, their descriptions, their base URLs (a note on that in a second), vendor, and whether or not they need proxy access from off campus. Basically all the formulae say is “if the corresponding cell on that base sheet is blank, don’t put anything here, but if there’s information there, then display it here.” It looks like this:

=if(isblank(Raw!B2),"",Raw!B2)

Then I use a slightly fancier formula to build a proxified version of that base URL into a column dedicated to off-campus access. It goes like this:

=if(isblank(F2),"",if(F2="yes",CONCATENATE("http://ezproxy.carleton.edu/login?url=",C2),C2))

Translated, that formula means “If the cell in Column F that says whether this database needs a proxy string is blank, leave this cell blank (this just makes for a cleaner spreadsheet without a lot of error cells where the formula is there even if there’s no database listed). If that proxy cell is set to YES, then put together our proxy string and base URL, and put the resulting URL here. If the proxy cell is set to NO, just put the base URL here.”
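For anyone who finds the logic easier to read as code, here is a minimal JavaScript sketch of the same two formulas. The function names `displayValue` and `offCampusUrl` and the example.com URL are made up for illustration; only the proxy prefix comes from the formula above. It is a restatement of the spreadsheet logic, not something the spreadsheet itself uses.

```javascript
// Proxy prefix copied from the spreadsheet formula above.
const PROXY_PREFIX = "http://ezproxy.carleton.edu/login?url=";

// Hypothetical helper mirroring =if(isblank(x),"",x):
// pass the value through, or return "" when there is nothing there.
function displayValue(raw) {
  return raw === undefined || raw === null || raw === "" ? "" : raw;
}

// Hypothetical helper mirroring the off-campus formula:
// blank flag -> "", "yes" -> proxy prefix + base URL, anything else -> base URL.
function offCampusUrl(needsProxy, baseUrl) {
  if (displayValue(needsProxy) === "") return "";
  // Sheets compares text case-insensitively, so "YES" and "yes" both match.
  const flag = String(needsProxy).toLowerCase();
  return flag === "yes" ? PROXY_PREFIX + baseUrl : baseUrl;
}

// Example with a placeholder URL (not a real database link).
console.log(offCampusUrl("yes", "https://www.example.com/db"));
console.log(offCampusUrl("no", "https://www.example.com/db"));
```

The first call prints the proxied version of the placeholder URL and the second prints the URL unchanged, which is the same yes/no behavior the formula gives in the off-campus column.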

Then I hid the Proxy check column (column F) because nobody really needs to see that if they’re using the spreadsheet. I just needed it for calculation purposes. (Sure I could have referred to that proxy check cell on the base sheet, rather than bring it to the visible sheet and then hide it, but sometimes I just feel like doing things the easiest way that occurs to me in the moment. Don’t judge!)

Finally, I gave this back-up spreadsheet a nicer URL: http://bit.ly/libedatabases. Then I posted this URL in places where librarians can find it when needed (such as our documentation for QuestionPoint cooperative librarians, our intranet, etc).

So now if our proxy server goes down (rendering our database list mostly useless), we can use the base URLs from this spreadsheet, at least from on campus or in combination with a VPN. It’s better than nothing. And if the whole database list goes down, we have access to all of our databases and their URLs from this list.


Edited to add: I forgot to mention that I’ve also built a script into the Google Sheet to unmerge cells. For whatever reason, exports from web-based products like Springshare tend to have random merged cells, which I don’t want. The only way I know of to get rid of these (other than looking for them all and then unmerging them individually) is via script.

So! In Google Sheets, click on the “Tools” menu and then “Script Editor” and then paste in the following:

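function myFunction() {
  var breakRange = SpreadsheetApp.getActiveSheet().getRange('A:T');
  // Keep a handle on the sheet itself (the original referred to the
  // undefined names mySheet and Sheet1 here).
  var sheet = breakRange.getSheet();
  for (;;) {
    try {
      // Unmerge every merged cell inside the range.
      breakRange.breakApart();
      break;
    } catch (e) {
      // breakApart() can fail when a merged cell extends past the edge of
      // the range, so grow the range (within the sheet's limits) and retry.
      var newHeight = Math.min(
        breakRange.getHeight() + 5,
        sheet.getMaxRows() - breakRange.getRowIndex() + 1
      );
      var newWidth = Math.min(
        breakRange.getWidth() + 5,
        sheet.getMaxColumns() - breakRange.getColumnIndex() + 1
      );
      // If the range cannot grow any further, give up instead of looping forever.
      if (newHeight === breakRange.getHeight() && newWidth === breakRange.getWidth()) {
        throw e;
      }
      breakRange = sheet.getRange(
        breakRange.getRowIndex(),
        breakRange.getColumnIndex(),
        newHeight,
        newWidth
      );
    }
  }
}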

This will look at columns A through T (you can edit that in that second line if you need to) and unmerge any merged cells.

When you paste in a new export, just click “Tools” and “Script editor” and then the little “play” arrow to run the script, and it’ll unmerge all those merged cells for you.

2020 Census: the demo software, data, and more on the debate about differential privacy

Since last I wrote, I’ve been able to learn more about a question that had plagued me: what of IPUMS, that unparalleled resource for census micro-data? For one thing, I was sure they must be thinking about privacy already – micro-data must be handled with care. For another, the 2020 Census “Privacy Budget” work is likely to make IPUMS’s work pretty complicated or even impossible.

Turns out, they are of course way ahead of me. Here’s their page on the topic, which also links to the Census’ released software and documentation, as well as to data available from IPUMS that researchers can use to test that software.

I also happen to know someone who works at IPUMS, and he says they’ve already been doing record swapping and injecting “noise” into their data to protect unique cases.

I certainly hope that everyone involved can figure out how to protect people’s privacy while still allowing vital research. This seems like an impossibly tangled knot to me.

2020 Census and the “Privacy Budget” (aka Differential Privacy)

Maybe I’m the last one to know about this, but just in case… Did you know that the Census Bureau is changing what data it’ll release from the 2020 Census, and how? I mentioned this in passing in my last post, but here’s a little more, most of which you can find by reading the posts and presentations on the Census’ Disclosure Avoidance and the 2020 Census page.

CC-BY-SA image by Salix alba posted at Wikimedia Commons

The data scientists at the Census Bureau have been experimenting for the last few years, and it turns out that they were able to reconstruct private information about individual citizens using a process not unlike solving a Sudoku puzzle or one of those grid logic puzzles. They only release de-identified data, but if you know what you’re doing you can figure out a goodly percentage of the identifications. It’s called Re-Identification or Database Reconstruction.

One way to solve this problem is to never provide 100% accurate results to queries on the database. Without 100% accurate query results, no clever data scientists can take the answers from query 1 and fit them next to the answers from queries 2 through x, or put them next to their dataset of bank loan applications or whatever, and know for sure that they have accurate answers to those queries and therefore reconstruct the database. Adjusting results by a privacy parameter, which will probably be a very very tiny parameter, is at the heart of what’s called Differential Privacy (and here a big shout out to Dr Steven Wu from the University of Minnesota, whose recent talk at Carleton helped me understand the math behind this process). The Census is presenting this parameter or set of parameters as their “Privacy Budget.” Once they’ve disclosed everything in the budget, they can’t disclose more without doing harm.
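To make “adjusting results by a privacy parameter” a bit more concrete, here is a small JavaScript sketch of the textbook Laplace mechanism, a standard way of adding noise in the differential privacy literature. This is an illustration only and not the Census Bureau’s actual algorithm. The function names, the count of 12, and the epsilon value are all made up for the example.

```javascript
// Textbook Laplace mechanism (illustrative; NOT the Census Bureau's method).
// epsilon is the privacy parameter: a smaller epsilon means more noise.

// Draw Laplace-distributed noise with the given scale via inverse-CDF sampling.
// rng is injectable so the function can be checked with fixed values.
function laplaceNoise(scale, rng = Math.random) {
  let r = rng();
  while (r === 0) r = rng(); // avoid log(0) at the edge of the interval
  const u = r - 0.5; // uniform on (-0.5, 0.5)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// A counting query changes by at most 1 when one person is added or removed,
// so its sensitivity is 1 and the noise scale is 1 / epsilon.
function noisyCount(trueCount, epsilon, rng = Math.random) {
  return trueCount + laplaceNoise(1 / epsilon, rng);
}

// Hypothetical example: a block with 12 residents in some category.
console.log(noisyCount(12, 0.5));
```

Each run prints a slightly different number near 12. In the standard accounting, every released answer spends some epsilon and the amounts add up across queries, which is one way to read the “budget” language: once the total is spent, further releases would weaken the guarantee.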

As of the presentation from October 1st, 2019 that’s posted to the Disclosure Avoidance and the 2020 Census page, the Census hasn’t yet quite decided what this privacy parameter will be. It sounds like they’re thinking it’ll vary depending on exactly how sensitive the information is or the amount of other information that’s been requested. But of course the main question is exactly how accurate the disclosed results have to be in order to allow people to do the work they need to do to run the country and distribute aid and know about our population. The more accurate the disclosures, the more risk of privacy problems. The more obscured the results, the more risk that we’ll struggle to know what we need to know in order to do what we need to do.

If this is a topic you’re interested in or care about, there are a bunch of things published by the Census statistician John Abowd that I found while exploring this topic with data-interested faculty here at Carleton.

  • Abowd, John. 2016. “Why Statistical Agencies Need to Take Privacy-Loss Budgets Seriously, and What It Means When They Do.” Labor Dynamics Institute, December. https://digitalcommons.ilr.cornell.edu/ldi/32.
  • Abowd, John M. 2016. “How Will Statistical Agencies Operate When All Data Are Private?” Journal of Privacy and Confidentiality 7 (3). https://doi.org/10.29012/jpc.v7i3.404.
  • Abowd, John M., and Ian Schmutte. 2017. “Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods.” Labor Dynamics Institute, April. https://digitalcommons.ilr.cornell.edu/ldi/37.
  • Abowd, John M., and Ian M. Schmutte. 2019. “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices.” American Economic Review 109 (1): 171–202. https://doi.org/10.1257/aer.20170627.

I’m really curious to see how this plays out, not least because it’ll impact all the researchers I work with who use census data in so very many of their research projects. But I’m also curious to see how this and similar efforts bleed out into other Big Data conversations in my world, like Learning Analytics for example.

It’s all coming up data

Image by Gerd Altmann from Pixabay

My roots are about as humanistic/artsy as they come. I majored in English and minored in Art. Then I got a masters in literary studies. Then I got my masters in library and information science. That last degree was the first time the word “science” was part of my life in any way more substantial than checking the box next to required credits for graduation. (It took me a long time, actually, to figure out why “library science” was a pair of words that go together, but be that as it may, my official degree is Master of Library and Information Science, which sounds very scientific to me.) From there I became the Librarian for Languages and Literature here at Carleton. All my previous passions neatly packaged in a single job.

This week I’ve been thinking a lot about an almost-decade-old paper by danah boyd and Kate Crawford. “Six Provocations for Big Data” made a big impression on me back in 2011 purely because epistemology and ways of knowing are my stock in trade, and this paper felt somehow Very True to me at the time. (Plus I’d just heard danah boyd talk at a library conference and was pretty much determined to listen to anything she ever said from then on.) This week I pulled it out of my Zotero library for a refresher, and it felt even more Very True to me now.

Part of my current re-fascination with this piece is that my liaison departments have taken a decided turn toward data. It has become increasingly clear to me that a major role I can play for my new liaison department (Computer Science — hey look! a second science in my life!) is to become a Data Librarian Lite(TM). Not a huge surprise there, and something I’m greatly enjoying learning. But they’re definitely not the only department I serve that’s turning to data. I’ve gotten more and more requests for linguistics corpora (spawning a new page on my Linguistics Research Guide just this weekend). And multiple faculty in the English department are working with digital textual analysis and literature corpora.

So yes, the phrases that stood out to me 8 years ago from boyd and Crawford’s piece stood out again: “Big Data is no longer just the domain of actuaries and scientists. … Big Data creates a radical shift in how we think about research … about the constitution of knowledge, the process of research, how we should engage with information, and the nature and categorization of reality” (pages 2-3).

But then there was this gem: “Claims to objectivity and accuracy are misleading” (boyd and Crawford, page 4). That meant one thing to me in 2011, but a lot has happened since then that has made me understand this statement in new ways. First there’s all the research into bias in algorithms (e.g. Safiya Noble). Then last week there was a talk presented in the Computer Science department here about “differential privacy” and (tangentially in the talk, but centrally to my world) the 2020 census’ plan to add deliberate small inaccuracies into reported results in order to protect respondents’ privacy. So not only are claims to objectivity and accuracy misleading, but too much accuracy has become harmful enough that we’re backing away from it in key areas.

Meanwhile, as epistemologies shift and the world of research continues to remake itself, I’ll be over here learning to be a librarian who navigates the various worlds of data in addition to being a librarian who absolutely values close reading and minute observation. And I’m loving it.