Q&A: Chris Musialek on data.gov's metadata and semantic projects
In a July blog post, Chief Software Architect for data.gov Chris Musialek announced the platform's support for draft metadata standards proposed by schema.org. Should the standards gain the blessing of schema.org, data.gov would apply the vocabularies to all federal datasets.
FierceGovernmentIT spoke with Musialek to learn how standardized vocabularies would benefit the data.gov community and what other semantic projects data.gov has underway.
FierceGovernmentIT: What do you mean when you say data.gov would 'add support' for the proposed schemas?
Chris Musialek: We're going to add a couple of machine-readable tags to the metadata that is described on each of our deep dataset details pages. When you go on to data.gov and search for something, it provides a listing, a results page. When you click on any of those results, you get to a details page. That details page describes the metadata, so it's data about data: something like the title, the description, or the date something was added. This is essentially the stuff that we collect that points users to the actual data itself on a website.
So, the schema.org support is really adding some of these standardized tags into these pages so that they can be read and understood by the major search engines. That way the search engines can do a better job of understanding what our pages contain, and hopefully they can do more interesting things to try to rank results better.
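To make the idea of machine-readable tags on a details page concrete, here is a minimal sketch of how a page's title, description, and date could be expressed with the schema.org Dataset vocabulary. It is an illustration only, not data.gov's implementation: the interview does not say which markup syntax data.gov uses, so the choice of JSON-LD, the function names, the mapping of "date added" to datePublished, and the sample record are all assumptions.

```python
import json


def dataset_jsonld(title, description, date_added, download_url):
    """Describe one dataset with schema.org terms (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        # Assumption: the "date added" field is mapped to datePublished.
        "datePublished": date_added,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
        },
    }


def as_script_tag(doc):
    """Wrap the description in a script tag for embedding in an HTML page."""
    # Escape "</" so the payload cannot close the script element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


if __name__ == "__main__":
    # Made-up sample record; the URL is a placeholder, not a real dataset.
    record = dataset_jsonld(
        title="Example employment statistics",
        description="Illustrative record, not a real data.gov entry.",
        date_added="2012-07-01",
        download_url="https://example.org/data/employment.csv",
    )
    print(as_script_tag(record))
```

A search engine crawler that understands schema.org could read the embedded block and learn that the page describes a dataset, rather than inferring that from the visible text.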
FGIT: So, schema.org acts sort of like a governance body that sets these overarching standards of tags that you would apply to the metadata?
CM: Yes. There is a group that helps with the process. Right now it's a more informal way of working out different sets of support for each of these different vocabularies.
Essentially the top three search engines came together and they said, 'Hey, we really need to better understand all the pages on the Internet and how can we do this?' So, they kind of banded together saying they would support working on a collection of vocabularies or schemas together.
Schema.org is, again, this collection of vocabularies, ways of describing data, and it continues to expand. Different folks can actually propose a different schema for different things. They kind of go through a draft process and eventually, if enough people agree on, let's say, a vocabulary, then schema.org, the project itself, will sort of bless it. That's sort of where we're at with this one particular schema that we're interested in. It's called the data set schema.
FGIT: So, what exactly is the data set proposal that you're endorsing, should it be agreed upon by schema.org?
CM: So part of the process of schema.org adopting an official vocabulary is that there also must be folks who are using it or are willing to use it on the Internet. So, there need to be customers who want to use it. We thought it was important for us to say 'Yes, we're very much interested in this and we will use this,' and schema.org considers us a big enough customer that we're one of the first folks to come out publicly and say 'We're happy with where things stand right now as far as the draft is concerned.'
FGIT: How will data.gov benefit from using the schema?
CM: We benefit from the fact that these search engines can help bring people to our site and rank some of the data sets that we have on data.gov higher when someone searches for something. Because the whole schema effort is about trying to understand what's on the different web pages, they can then use that information to rank their results better. So we just think that for us it will help to connect more consumers with the data sets that exist in the federal government.
Most of the searches start really at Google (NASDAQ: GOOG). When people are searching for information they don't think 'Well, maybe I should check out data.gov and search on data.gov and see if it has some information about the employment rate.' So, that's the real benefit. We're hoping that it will help connect people to the government data that's already out there on data.gov.
FGIT: I guess now you're sort of in a holding pattern until schema.org finalizes it and then you can apply that vocabulary. Are there other things that you're doing at data.gov around semantics?
CM: Exactly. Of course we can implement it today as a draft, but our worry is just that--and I think we have the approval to do that--but we just worry that it won't ever get supported and it may not gain traction until there's some sort of blessing by the body itself. That's a huge step. We're also just worried about changes too.
So there are a couple of things we're working on. We set up a site called vocab.data.gov and you can check it out. Right now it's sort of a prototype, but the idea with vocab.data.gov is that government also has a whole bunch of existing schemas and vocabularies that we currently use to describe our data sets.
We want other agencies to look at it and realize that there are existing vocabularies that other agencies use that they can potentially re-use rather than creating their own.
Many different government agencies describe similar types of data. The Census Bureau is a statistical agency, the Department of Labor is another one, and of course they talk about similar types of data, statistical data, so they may find it useful to use some description or some vocabulary that another agency has already used rather than create their own.
The other thing that we're working on with respect to semantics is sort of a machine-readable list of agencies and we're really excited about this. We had an intern work on it from RPI [Rensselaer Polytechnic Institute] who has done an excellent job and we're close to kind of publishing a prototype, but there really exists no current machine-readable list of agencies.
When someone comes to our site, they have to tell us which agency a data set belongs to. Luckily with data.gov we kind of have this canonical listing; it's a drop-down box on a form that you fill out, describing what agencies we have. But there are other data sets that come in that are not using this canonical listing.
In semantics, in a linked data way, one of the ways that you describe an instance of something is through a URI [uniform resource identifier].
FGIT: What's a URI?
CM: It's kind of like a number, but it's essentially a permanent representation of some real-world thing, done in a way that leverages the World Wide Web and the way the web has evolved, which is that you can just link to something else. So it becomes very easy if we establish identifiers as just URLs. Again, URIs are very much like URLs, but they're a little bit distinct.
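As an illustration of why a machine-readable agency list keyed by URIs is useful, the following sketch resolves the different spellings that might arrive with a data set to a single identifier. It is a toy example under stated assumptions: the URIs under example.org, the alias lists, and the function names are hypothetical and do not reflect the prototype described in the interview.

```python
# Hypothetical canonical list: each agency is identified by one URI.
AGENCIES = {
    "https://example.org/agency/census-bureau": {
        "name": "U.S. Census Bureau",
        "aliases": ["Census Bureau", "Census", "Bureau of the Census"],
    },
    "https://example.org/agency/department-of-labor": {
        "name": "Department of Labor",
        "aliases": ["DOL", "Labor Department", "U.S. Department of Labor"],
    },
}


def _normalize(label):
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(label.lower().replace(".", "").split())


def build_index(agencies):
    """Map every known spelling of an agency name to its URI."""
    index = {}
    for uri, record in agencies.items():
        for label in [record["name"], *record["aliases"]]:
            index[_normalize(label)] = uri
    return index


def resolve_agency(label, index):
    """Return the URI for a free-text agency label, or None if unknown."""
    return index.get(_normalize(label))


if __name__ == "__main__":
    index = build_index(AGENCIES)
    for submitted in ["census bureau", "D.O.L.", "Dept. of Widgets"]:
        print(submitted, "->", resolve_agency(submitted, index))
```

Once two data sets carry the same agency URI, software can treat them as belonging to the same agency without comparing free-text names.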
FGIT: So, when did the beta vocab.data.gov go live?
CM: We haven't had the time to talk to agencies about that and really think through what we want it to be and how that will work. In other words, how will we allow agencies to add their own vocabulary? We need to kind of think through how agencies will do that and what the interface looks like. We set that up, I think, in April this year.