After yesterday’s post I had a fascinating discussion with someone who codes for a living about whether patents were a viable research resource in CS. His first objection: they’re extremely hard to understand. And yes, I definitely agree, and it’s a good reminder that when I talk about this with students I also talk explicitly about what I expect they’ll be able to learn from the exercise.
If you find a patent that you think is related to your topic, look at other similarly classified patents to see what problems people are tackling in the field and who is tackling them.
As you look through similarly classified patents, collect vocabulary that you can use in future searches. After all, most search systems simply match letters in a row rather than semantics, so if people are talking about the same thing but using different words to do so, you won’t find that whole side of the conversation.
While reading in order to understand the patented process is probably not feasible for most people, reading instrumentally has been super useful for me when exploring CS topics.
So far so good, but what really set me thinking was this industry coder’s take on the disadvantages of reading patents. Apparently he’s told not to read patents, because knowingly infringing on someone else’s IP brings worse penalties than unknowingly infringing, so to mitigate potential penalties the developers simply don’t look at patents. So now I’m wondering how to guide students as they prepare for a world in which, at least some of the time, lack of information has value. And how do I square that with the very real costs involved in having a bunch of people reinventing wheels and falling into the same pitfalls, all so that if they get sued it won’t be quite so bad? And how do I square that with how this upends the progress narrative of the sciences in general, a set of disciplines which so carefully finds gaps in knowledge and then fills them, or finds the limits of current knowledge and then pushes those limits back bit by bit?
I wonder if it matters what sector you’re in, or even what specific companies you’re working for. And I wonder how liberal arts students might engage with this conundrum in a way that prepares them for life after graduation, whether that life involves CS careers or not.
For 14 years, I’ve been a librarian for a pretty cohesive set of language and literature departments. My BA and MA are both in literary criticism, and I studied a few languages (not fluent in any of them any more, sadly), so my core departments have felt very much like home to me.
As you probably know, I also love computer stuff. I’ve never been formally trained in any of it, but I’m a huge fan and an intrepid experimenter. Plus the CS faculty here are awesome and many of them were friends of mine already, so when the chance came for me to be their liaison I said YES. Besides, I could draw parallels from some of the strategies of language research to the strategies of CS research.
But there’s also a lot that’s very very new to me, starting with exactly how information literacy works in CS… You know, just a small thing. Where does information literacy fit into a curriculum that’s full of coding and not a whole lot of traditional literature searching?
Thankfully the faculty here and the absolutely outstanding CS and STEM librarians at the Library Society of the World have been great partners and resources for me in my first year of being the CS librarian. I’ve also made a point of attending as many presentations and functions in that department as I can, listening for how information literacy works in CS. Here’s what I’ve found so far.
Information literacy in CS – Early observations
You’re going to need a good, well-evaluated corpus to train your AI. You kind of have to know what gets included in a corpus, and how, and where that stuff originated from in order to understand what your AI can or should do with the stuff, or to interpret what it spits out. Misunderstanding your corpus can result in wonky AI results. Luckily, librarians happen to have a long history of working with the kinds of things that get included in large text or metadata corpus-type-thingies — finding, evaluating, and using them!
You’re going to need good data to develop your visualizations. I’m learning a lot from our data librarian here. The thing I found most interesting this past year is that CS students here have high confidence that they can knit datasets together to get what they want, but low levels of experience in determining whether the datasets in question are built on compatible methodologies and variables. Next year I’ll spend a lot more time emphasizing that I’m not cautioning against combining datasets because the combining is hard — I’m cautioning against it because the thing you create might be the worst kind of chimera.
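To make that chimera concrete, here’s a minimal sketch in plain Python. The town names, numbers, and field names are all made up for illustration: two little datasets that combine without a single error, even though they measure the “same” variable on incompatible scales.

```python
# Hypothetical illustration: two datasets that look combinable but aren't.
# Survey A reports median income in thousands of dollars; Survey B reports
# it in raw dollars. The keys line up, so a naive merge "works" quietly.

survey_a = {"Springfield": 52.3, "Shelbyville": 48.9}        # thousands of USD
survey_b = {"Ogdenville": 51200, "North Haverbrook": 47800}  # raw USD

# Naive combination: no errors raised, every town gets a number.
combined = {**survey_a, **survey_b}

# The result is a chimera: values for the "same" variable differ by
# three orders of magnitude.
print(min(combined.values()), max(combined.values()))  # 48.9 51200

# A safer habit: carry the methodology/units along as metadata and
# refuse to combine until the sources actually match.
def combine_checked(sources):
    """Merge (data, metadata) pairs only if their units agree."""
    units = {meta["units"] for _, meta in sources}
    if len(units) > 1:
        raise ValueError(f"incompatible units: {units}")
    merged = {}
    for data, _ in sources:
        merged.update(data)
    return merged

try:
    combine_checked([(survey_a, {"units": "thousands USD"}),
                     (survey_b, {"units": "USD"})])
except ValueError as err:
    print(err)  # the mismatch now fails loudly instead of silently
```

The point isn’t the code; it’s that nothing in the naive merge warns you. Recording units and methodology alongside each source, and checking them before combining, at least turns silent chimeras into loud errors.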
You’re going to need to think about license agreements and copyright if you’re using stuff that other people built, including APIs. Luckily, librarians have a long history of working with intellectual property topics!
You’re probably going to need to find libraries (the code kind, not the institution kind) or algorithms or code bases to work with. I haven’t really dipped my toes into this water yet, but what I have noticed is that students talk about this process differently than faculty do. Students talk about “looking online” and evaluating for speed, memory needs, and functions. Faculty talk about finding something that will be stable over time, with good documentation and a track record. There are undertones of publisher/author credibility, reliability, and stability threaded throughout. Definitely something for me to think about.
If you want to build something new, you’ll have to know the state of the art, past and present. This is where I’m learning more… and it needs more than a sentence or two, so I’ll give it a couple whole sections.
Finding The Current State of the Art
How do you know that what you’re building is new? And how do you make sure you’re building constructively on what’s already known? Translated into library-speak: What’s the conversation on this topic, and how does this project move that conversation forward? The information need is familiar to me, but the places to find that information are … not. CS has traditional scholarly publication venues, sure, but unlike my other fields, CS draws heavily on conference papers, research and technical reports, and patents. Not only that, but a bunch of stuff is proprietary — decidedly not the case for the latest interpretations of Hamlet.
So I’ve been trying to build up my skills in the grey literature area. Current strategies include using more familiar library databases to find out the names of people, associations, or institutions that are active in an area, and taking that knowledge over to Google for some advanced googling. I’m curious to see if Inspec Analytics turns out to be helpful with this, too, to help me figure out which institutions are active in an area and might have repositories of research and technical reports.
Patents are playing a larger and larger role in my work because they’re one of the only ways I’ve found of peeking into proprietary research. That’s where the desire to keep company secrets comes right up against the desire to protect IP for future profit. So I’ve been exploring ways of navigating patents and analyzing publication and citation patterns to help me figure out the past and present of a process or topic. Are there key people or companies at play in a particular area? Do those people or companies have other reports available to the public?
Delving into the past to improve the future
There was a fascinating talk here last spring by an engineer working on Non-Volatile Memory. One of her many useful insights during the talk was that back in the 1960s people were working on Mmap, and in the 1980s “Bubble Memory” was set to be the memory of the future. It didn’t become the memory of the future, so most people now don’t know the term or remember the concept, but there are a lot of things about Bubble Memory that are the same as NVM. There’s also a nearly 40-year conversation about developing persistent languages (apparently called “persistent foo,” which is awesome) vs persistent databases. One of the speaker’s points was that finding out these kinds of histories can save people from reinventing wheels, falling into the old pitfalls, and basically repeating history in the worst way.
Of course this set me to wondering how a librarian could coach students in a research strategy for finding things that are similar but not necessarily the same, and that don’t share a lot of keywords. And how would you map out and synthesize what you find in meaningful ways, but as efficiently as possible? So next I think I’ll explore the literature around persistent memory, starting with the specifics this speaker mentioned in her talk, and see which search tools give students a good way to discover this kind of overlap with historical avenues of research. Strategy suggestions welcome!
So much more to learn
Soon we’ll launch into my second school year as the CS liaison, and I have a long way to go before I’ll feel like I really know how information works in this field. What do YOU think I should know in order to be the best librarian I can be for this field?
Things have been pretty intense around here for mumble-mumble years, and for most of the last 3-4 years I’ve been doing more than just my own job while we cover for open positions and run searches and stuff. (There was a lovely 6 months in there while we were fully staffed, but that didn’t last.) But now that my department is fully staffed again I find that my brain isn’t braining very well. Even after a lovely long visit to my family, apparently my brain has more resting to do before it’ll kick back into gear after such a marathon.
While I can’t seem to grapple with programs and ideas and other “big” things, I’m irresistibly drawn to the mundane, the minutia, the system code, and yes, the spreadsheet. Goodbye forest; hello trees.
I find myself looking for things to do that involve spreadsheets or CSS — big things, tiny things, it doesn’t matter. That’s all I want to do these days. Can I make you a pretty button to go around your website link? Please? (Ironically, my own blog is still looking pretty 2012… maybe sometime I should turn my attention to modernizing things around here… and get an SSL certificate.)
Maybe I’m so drawn to these projects because unlike my normal work you can always see change and/or progress while working on CSS or a spreadsheet. Or maybe it’s just that I don’t do a ton with these kinds of things during the school year, so it’s just a chance to change up what I’m thinking about. Whatever it is, I’m basically useless for normal stuff right now, but I have a whole lot of patience and energy for spreadsheets.
If you like to be able to re-use assets easily, you have to be pretty careful not to develop dozens and dozens of nearly identical assets in the Libguides Asset database. Otherwise it’s basically impossible to know which one is THE one you want to reuse. Plus, if a website changes, and therefore the instructions you write into a description here or there change, it pays to be able to update that kind of thing all at once in one place rather than go through every version of the asset and see if it needs to be updated individually.
Unfortunately, the Libguides system makes it very very easy for duplicate assets to multiply like rabbits. If you copy a box or start from another guide as a template, any asset in that box or guide that you don’t own will automatically duplicate itself. And sure, there are some good reasons to have that as an option, but you don’t have an option in this case — it just happens. Plus as people make guides for last-minute classes or in the middle of working on 16 other things, mistakes happen and people make new assets when they could probably have re-used one. Life happens.
Anyway, all of this means that every summer for the last several years I’ve gone through and done a database clean-up project. I figure out which assets are possibly duplicates of each other, and then I knit the actual duplicates back together into a single “parent” asset. And every summer this means that we go from about 7,000–8,000 assets in our system down to about 6,000 assets. And every Fall term we start out with a nice clean database, and sharing is super easy, and it’s a veritable asset utopia… for about 30 seconds. But imagine what it would be like without that reset! The messier the assets get, the harder it is to reuse them, and so the messier the assets get.
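For anyone curious what the “figure out which assets are possibly duplicates” step could look like in code, here’s a hypothetical sketch in Python. The field names and sample records are illustrative only, not the real LibGuides export schema: it just groups assets by a normalized name-and-URL key so near-identical copies surface together.

```python
# Hypothetical sketch of duplicate-candidate detection for a LibGuides-style
# asset export. Field names ("id", "name", "url") are made up for
# illustration; a real export would need its own column mapping.
from collections import defaultdict

assets = [
    {"id": 101, "name": "JSTOR",  "url": "https://www.jstor.org/"},
    {"id": 202, "name": "jstor ", "url": "https://www.jstor.org"},
    {"id": 303, "name": "PubMed", "url": "https://pubmed.ncbi.nlm.nih.gov/"},
]

def normalize(asset):
    """Build a comparison key that ignores case, stray spaces, and trailing slashes."""
    name = asset["name"].strip().lower()
    url = asset["url"].strip().lower().rstrip("/")
    return (name, url)

groups = defaultdict(list)
for asset in assets:
    groups[normalize(asset)].append(asset["id"])

# Any group with more than one id is a candidate for knitting back
# into a single "parent" asset (after a human double-checks it).
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)  # {('jstor', 'https://www.jstor.org'): [101, 202]}
```

A script like this only finds *candidates*; the actual knitting-together still wants human judgment, since two assets can share a name and URL but carry genuinely different descriptions.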
A few people have expressed interest in replicating or building on what I do, so here are some documents to look over if you’re interested: Our Asset clean-up process, an example of this year’s working spreadsheet, and the rules we’ve made for ourselves in a large part to keep the assets as clean as possible throughout the year. (The Local Practices rules are linked from the main editor interface in Libguides so that they’re handy whenever people are editing.)