Q&A: Michael Daconta on NIEM and the semantic web

Federal agencies must report no later than May 1 on whether the National Information Exchange Model would be suitable for information exchange with other agencies, according to recently-released guidance (.pdf) from Federal Chief Architect Kshemendra Paul.

That deadline, plus other recent examples on NIEM in the news--an HHS official, for example, denying that NIEM is a CIA plot--led FierceGovernmentIT to interview Michael Daconta, data guru and NIEM co-founder.

As the Homeland Security Department's Metadata Program manager in 2004 and 2005, Daconta also spearheaded a pan-government information sharing initiative called the Data Reference Model. In private sector work and the government, he's been a proponent of the semantic web, a metadata-fueled vision of data context understandable by computers.

These days, Daconta is chief technology officer and a co-founder of Accelerated Information Management of Vienna, Va. Check out some screen captures of his newest semantic web project or read his new company's latest white paper (.pdf) on its core methodology. You can also read a paper on NIEM he co-authored with IBM by clicking here (.pdf).

FierceGovernmentIT vigorously suggests reading the entire Q&A, and only clicking to this article if you want a synopsis of the bits about the Data Reference Model and NIEM. Among the insights Daconta offers is how big data will eventually run into a wall, why "metacrap" is crap and why arguments over the level of semantic specification miss the big point.

FGIT: Can you briefly describe NIEM and what it does?

DACONTA: The short definition would be that NIEM is an information exchange model. Its focus is on creating XML-based messages to send between computer systems to conduct some business transaction.

It started in the law enforcement community, and it performs transactions like "register a sex offender," or "send a rap sheet on a criminal," things like that. It is very much focused on messages - easily and rapidly creating messages across different IT systems.

As I said, it started in the law enforcement domain, under a different name. It was called the Global Justice XML Data Model. When I was in government, I was looking for a way to assist in integrating the different organizations within DHS, as well as improve our information sharing, both inside of DHS and outside of DHS. A big part of that was 'How do I include in the information sharing puzzle better access to state and locals?'

I evaluated several different existing data models. One was UBL. The other was GJXDM - those were the two major ones. GJXDM was technically superior and had much more traction amongst the state and locals. We expanded GJXDM both architecturally and content-wise, to include multiple domains which are part of homeland security. In doing that, I felt it was very important to make it clear that we couldn't have "justice" in the model because justice was one of many domains. So I felt it important that we had to change the name. It was led extremely well after myself and Jim Feagans [at the time, the Justice Department NIEM lead] initiated it.

Kshemendra Paul, who's now chief architect at OMB took over, and after Kshemendra, Donna Roy of DHS. Actually, Donna Roy and the new CIO over at DHS [Richard Spires] really got behind NIEM in the manner that originally we had hoped.

Now with continued support from Kshemendra, NIEM was used for recovery.gov for reporting for the states. It is now being considered and piloted by HHS to include the health care domain.

So, in that purpose of being an information exchange model, you implement that by being able to rapidly create a message - and a message represents some kind of transaction. There are two parts to all transactional type messages. You have data about the transaction itself, and then you have content, or entities, which the transaction operates on. To support that piece, NIEM has a registry and a data model, which is a set of data entities about the things you want to include in your data model piece.

It does include a data model. It's data modeling for a purpose, and the purpose is to send a message between two or more IT systems.
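
The two-part message structure Daconta describes can be illustrated with a minimal sketch. Every element name below ("ExchangeMessage", "Transaction", "Content", "Person" and the fields inside them) is a hypothetical placeholder chosen for illustration and is not taken from the NIEM schemas or registry.

```python
import xml.etree.ElementTree as ET


def build_exchange_message(transaction_type, sender, receiver, person):
    """Build a two-part XML message: transaction data plus entity content.

    All element names are hypothetical; they are not real NIEM elements.
    """
    message = ET.Element("ExchangeMessage")

    # Part 1: data about the transaction itself.
    transaction = ET.SubElement(message, "Transaction")
    ET.SubElement(transaction, "Type").text = transaction_type
    ET.SubElement(transaction, "Sender").text = sender
    ET.SubElement(transaction, "Receiver").text = receiver

    # Part 2: the entities that the transaction operates on.
    content = ET.SubElement(message, "Content")
    person_element = ET.SubElement(content, "Person")
    for field, value in person.items():
        ET.SubElement(person_element, field).text = value

    return message


message = build_exchange_message(
    "RegisterOffender",
    "StateAgencyA",
    "FederalRegistryB",
    {"GivenName": "John", "SurName": "Doe"},
)
xml_text = ET.tostring(message, encoding="unicode")
print(xml_text)
```

In a real exchange, the sending and receiving systems would each map their own database fields to the agreed element names rather than share a database layout.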

FGIT: Manually?

DACONTA: There is a manual component. There are tools available, and Donna Roy has increased the awareness among the tool vendors and has gotten several tool vendors to step up and incorporate the NIEM model into their tools, so I would say at this point the tool support is really getting there.

FGIT: So why would anybody say NIEM is not a data model?

DACONTA: That's just not true. Any complex thing is going to have its people who like it and its detractors. It is a data model. Now, you could argue - part of this problem is that the data modeling space in general is not as well organized as other IT domains, like, say software programming.

If you look at the software programming field, even in terms of software modeling, software modeling was just as disorganized 10 years ago. But it had a meeting of the minds with the three major proponents of the three major modeling languages and they agreed to create the Unified Modeling Language, UML, which has since been adopted by the Object Management Group.

The data modeling community does not have its act together nearly as well as the software development community. So, when you talk about data modeling, you can't say - let's take a simple example - which data modeling notation is the standard. Or is the number one standard. There is none. It all depends on which data modeling tool is the number one tool. Probably ERwin, which focuses on a specific type of database data modeling. Now, of course, semantic web is coming in with other standards like OWL. So, what I would say is that [a dispute about NIEM being a data model is] probably some sort of technical disagreement over what you mean by the term 'data model' more so than whether NIEM has a data model.

You can create a UML diagram of the data model, and it'll display entities, attributes, and relationships, which would be the three core things that anybody creating a data model would say, ‘yeah, it's a data model, it has all the basics.'

The only other aspect to discuss in this issue is that you could argue that it is an information exchange model, or information exchange data model, which means that you're scoping its purpose to say that this is a data model for creating information exchanges, not necessarily a data model for creating relational databases. Database construction and database development are so prevalent, and such an important part of IT systems, that there is this subculture in the data modeling space where, when people say data modeling, they just think about database modeling. In that case, you could take the NIEM data model and create databases from it, but not in a direct fashion, not in a push button fashion. If that's your criterion, I disagree with it - that's database modeling, whereas data modeling is a broader term.

FGIT: Isn't the whole point of NIEM to be above the database level?

DACONTA: It is a step above, but it absolutely does have data and metadata. But it is a step above in the sense that it's above your IT systems, as an information exchange model that is agreed upon by multiple organizations. By having an information exchange model, you are by definition saying that ‘I will map and populate a message, and it's okay that I map to it,' since when you're doing information exchange, you don't assume that my database is law. No, you know that just by definition by the act of communication, that I have to come to agreement with other parties. NIEM is an abstraction layer above the database, above those IT systems, and it's supposed to be.

FGIT: So, what is a semantic specification?

DACONTA: Semantics and the semantic web have been evolving for a number of years. Most rapidly, and most prominently, via the public support of Tim Berners-Lee relating to the semantic web, which has always been part of his vision. Most of the time now when you talk about semantics, you're starting to say ‘How well does your data model express additional semantic characteristics of the objects in the real world that you're trying to model?'

Let's take an example. So, I'm trying to model people. People are very complex. If I have a model that just says, the only thing you can do about people is characteristics - height, weight, etc. - and you can't do relationships, saying that person A is married to person B, then you can't express relationships between things.

Let's say you can express relationships. Can you express some detailed semantic properties of those relationships, like is it a symmetric relationship, or is it an inverse relationship - inverse to some other relationships. Let's take the inverse property. If I can say ‘I have a relationship Child Of,' which is the inverse of ‘Parent To,' therefore you can specify that. Or the most popular one - and powerful one - is whether you can specify that something is a transitive relationship. That's one of the simplest ways to get inference. If I can specify something like ‘Relative Of,' if I say ‘Jack is the Relative Of Bob,' and ‘Bob is a Relative Of Suzy,' I can successfully infer that Jack is a relative of Suzy.
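
The inverse and transitive properties described above can be sketched in a few lines. This is a plain illustration using sets of pairs, not OWL or any semantic web toolkit, and the function names and the "Ann" fact are assumptions made for the example.

```python
def transitive_closure(pairs):
    """Return every (a, c) pair implied by chaining (a, b) and (b, c)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def inverse(pairs):
    """Return the inverse relation: every (a, b) becomes (b, a)."""
    return {(b, a) for a, b in pairs}


# Transitive: 'Relative Of' facts taken from the example in the interview.
relative_of = {("Jack", "Bob"), ("Bob", "Suzy")}
inferred = transitive_closure(relative_of)
print(("Jack", "Suzy") in inferred)  # True

# Inverse: 'Child Of' is derived from 'Parent To' (hypothetical fact).
parent_to = {("Ann", "Bob")}
child_of = inverse(parent_to)
print(child_of)  # {('Bob', 'Ann')}
```

The inferred pair ("Jack", "Suzy") was never stated directly; it follows only because the relationship was declared transitive.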

The one thing about this, though - and this is the, in my opinion, key Achilles Heel of semantic fidelity - is that while there are languages and tools like OWL that can specify those things, we are still very weak on the side of the business community to understand and determine how would I exploit that. We don't have very good outcomes that absolutely require that type of semantic fidelity. Given that, you could say, ‘Why would I go through the trouble of trying to figure out the semantic characteristics of all my relations if I have no idea how I would use that?'

So, NIEM has the basic level of semantics - it does have relationships between things. It does not have some of the more advanced semantic capabilities of a language like OWL. OWL has more semantic fidelity around types of relationships, and has the ability to define your entity types, like ‘person,' etc. using set theory. OWL does allow you to do that. But let me pick on OWL for a minute. There are other languages, like KIF, which allows even more semantic expressivity than OWL. So the semantic expressivity argument is far from over and there have been invented languages of varying degrees of expressivity. Again, from an IT community perspective, we could invent a million levels of semantic expressivity. That's not the important part to me. The important part to me is educating the business community and even proving how people can gain benefits and outcomes by having more semantically expressive models. I'll talk about that later in terms of the things I'm trying to do in trying to bring the semantic web to the mainstream.

So, NIEM does have a semantic specification. It is not as semantically expressive as something like OWL, but the bottom line is that the NIEM user community has not expressed a need, yet, for more semantic expressivity. That could change, in terms of health care, because under the interim final rule that HHS released on health IT standards, some of their electronic records stuff does require decision-based rule systems. The medical domain has used expert systems for years in helping with the diagnosing, the distinction between symptoms and disease. Symptoms to disease is a great semantic matching problem, because there are so many symptoms and so many diseases, it's difficult for a doctor to keep all that in their head, so there have been expert system efforts in the domain for years and years. They are farther along in the semantic space. As NIEM gets into domains like healthcare, it will need to increase its semantic specificity.

FGIT: When you were in government, you were also very active in putting together the data reference model [.pdf]. What ultimately happened to it?

DACONTA: I wouldn't have left until I finished it, because otherwise I wouldn't have considered my time as complete. It has been used and referenced in terms of some current administration proposals like data.gov - in fact, the current data.gov CONOPS [.doc] does talk about implementing things like query points, which is a concept right from the data reference model.

I have discussed this with Kshemendra Paul, and he is very much on board and believes that the concepts in the DRM coming to fruition is key to successfully implementing data.gov over the long haul.

That's one part of it. The other part is that the DRM received some pushback, specifically by the intelligence community, over the initial specification. I wanted it to be very detailed, very specific to the point where we had begun developing a specific DRM XML data model. I look at things very directly, and my charter was how do I improve information sharing. In order to guarantee improvement of information sharing, I believed you had to get to that level of specificity. Having said that, because we received some pushback from an important member of the interagency working group, we had to come to a compromise, and the compromise was to make the DRM an abstract data model, of which multiple communities - similar to the way the DoD does it with communities of interest - could implement the DRM in their own ways.

I do believe, though, that has caused some difficulty in implementing the DRM. Some organizations, like DHS, [are farther along than others], probably because I laid a lot of framework and the people who followed me, like Donna Roy, understood where we were trying to go and kept that going. DHS has in their enterprise data management office very good, highly qualified technical people who understand how to implement these concepts in an organization.

You always have to be careful about a one size fits all approach, because you do want things to compete. However, I do not believe that standards, especially government standards, are an area where you look for too much competition. Too much competition kills the whole notion of standards. When you have ridiculous statements that it is acceptable for there to be more than one office documents standard, it defeats the whole purpose and definition of the word ‘standard.'

FGIT: Does the recent requirement from OMB [.pdf] that agencies at least consider using NIEM make what you had hoped the DRM would achieve closer to reality?

DACONTA: Yes. Let me explain. The DRM has three parts. It has data description, data context and information exchange, with exchange packages. It used the word "exchange packages."

NIEM - certainly for DHS - was absolutely our implementation of the exchange part of the DRM. The short answer is that NIEM absolutely is a good implementation for the information exchange part of the DRM.

I think this is very promising. Getting a standard information exchange model across the government will absolutely improve information sharing. It's not a question of ‘maybe,' it will improve it.

FGIT: You mentioned some semantic web projects...

DACONTA: I do have a tool that I'm working on. It's a knowledge based editor. If you go to Accelerated Information Management's website, and you go to toolbox, you'll see I started posting some screen caps of the program, and I'm working on expanding its capabilities. I still consider it in the very early alpha stages. There's a lot of functionality missing. I will send out versions if anyone is interested in alpha-testing it. I have two user scenarios completed for this, and by that I mean semantic user scenarios. One of them is posted on that site - the genealogy inference of simple semantic user scenario. And then I have another one done for budget alerting - and that's posted on there also.

But I'm working on two more sophisticated semantic use cases, like ones that would take advantage of transitive matching. I've decided that I'm not going to publicly release it until I have those top three or four user scenarios completed, and then I can be very confident that even the freeware version is at least useful in those three or four key ways.

So, having said that, the reason why I'm doing this is that several years ago I wrote one of the first major books on the semantic web back in 2004. And, I believe it's been stuck in academia. I think it has improved, has progressed, I think more people now do believe that more semantic expressivity is the only way to go.

Google is still getting good traction and still making improvements off what they would call big data techniques - expanded brute force and non-semantic heuristics or non-semantic techniques for taking action.

You have only a limited number of ways to do things. You can either directly act off of the data, and that's what databases do. You have these very specific, tight models of data, with limited semantics, mostly just around the weight of this is X, the height of this, etc. And you can have computers make very simple decisions based on those facts.

The other technique that Google is exploiting is big data. What big data means is if I can look at what a billion other people have said when they typed in the same thing, I've got a high probability that you probably want the same thing. It's still not perfect, it still won't work in all situations, but big data certainly helps probabilistic matching. That's what Google is exploiting right now, even in terms of things like translation. Google doesn't need to construct the rules for translating, if they have enough data.

If I've got enough data that says that I get this text and I look at a translation and I see how a million other people have translated it, I can pretty reliably say ‘This is the way you translate it,' without even knowing if it's right.
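
The probabilistic idea described here, picking whatever most people chose before, can be sketched as a simple frequency count. This is only an illustration of the principle and is not how Google's systems are built; the function name and the sample observations are invented for the example.

```python
from collections import Counter


def most_likely_translation(phrase, observed):
    """Return the most frequently observed translation and its share.

    `observed` is a list of (phrase, translation) pairs; the data used
    below is invented purely for illustration.
    """
    candidates = Counter(t for p, t in observed if p == phrase)
    if not candidates:
        return None, 0.0
    best, count = candidates.most_common(1)[0]
    return best, count / sum(candidates.values())


observed = [
    ("bonjour", "hello"),
    ("bonjour", "hello"),
    ("bonjour", "good day"),
    ("merci", "thanks"),
]
best, share = most_likely_translation("bonjour", observed)
print(best, round(share, 2))  # hello 0.67
```

No translation rule appears anywhere in the code. The answer comes entirely from past behavior, which is why the result can be popular without being verified as correct.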

So I have exact matching techniques - database world. I have probabilistic techniques - they're going strong right now, but even they will hit the wall. And even they are not good for all domains, because you're going to get domains where you say ‘Probably guessing based on past performance is not good enough. I need concrete decision support based on detailed semantics and guaranteed logical entailment.' Once your data techniques fail, you're going to have to go to more logical, semantic entailment techniques.

But let me make this clear - I'm not just a fan of saying ‘it's the semantic, logical technique, or nothing.' I say, use all of them. You use both semantic techniques and probabilistic techniques. You even use direct matching techniques, because if you can direct match - done! Direct match is first, probabilistic match techniques are second, semantic techniques are third. Go for all of them.
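
The ordering Daconta gives (direct match first, probabilistic second, semantic third) can be sketched as a simple fallback chain. All names, the 0.6 threshold, and the sample data are assumptions made for illustration; the semantic step is reduced to a single inverse-relationship rule.

```python
def resolve(query, exact_index, frequency_index, inference_rules, min_share=0.6):
    """Try direct, then probabilistic, then semantic matching, in that order."""
    # 1. Direct match: if the answer is stored, we are done.
    if query in exact_index:
        return exact_index[query], "direct"

    # 2. Probabilistic match: accept the most common past answer if it
    #    accounts for at least `min_share` of observations (arbitrary cutoff).
    counts = frequency_index.get(query, {})
    total = sum(counts.values())
    if total:
        best = max(counts, key=counts.get)
        if counts[best] / total >= min_share:
            return best, "probabilistic"

    # 3. Semantic match: derive an answer from stated facts and rules.
    for rule in inference_rules:
        answer = rule(query)
        if answer is not None:
            return answer, "semantic"

    return None, "unresolved"


# Hypothetical data for the three tiers.
exact_index = {"DHS": "Department of Homeland Security"}
frequency_index = {"niem": {"National Information Exchange Model": 9, "other": 1}}
parent_to = {("Ann", "Bob")}  # stated fact: Ann is 'Parent To' Bob


def child_of_rule(query):
    """Answer 'child_of:<name>' by applying the inverse of 'Parent To'."""
    prefix = "child_of:"
    if not query.startswith(prefix):
        return None
    name = query[len(prefix):]
    for parent, child in parent_to:
        if child == name:
            return parent
    return None


rules = [child_of_rule]
print(resolve("DHS", exact_index, frequency_index, rules))
print(resolve("niem", exact_index, frequency_index, rules))
print(resolve("child_of:Bob", exact_index, frequency_index, rules))
```

Cheaper techniques run first, so the more expensive semantic step is only reached for queries the earlier tiers cannot answer.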

You've got to look at these techniques like a pyramid. I call it the data optimization pyramid. In other words, how much time you want to spend optimizing your data? And the more time and energy you want to spend - and if you look at it like a pyramid, that's important, because you're not going to optimize all your data. But there is a piece at the top of mission-critical data for things like e-discovery where you would say ‘It is so important that we get these things right that we're going to spend the time semantically optimizing the top portion of that data.'

FGIT: So I guess this is as good a point as anywhere to bring up the ‘metacrap' argument.

DACONTA: Oh yes, I'd be happy to discuss that. There's a book, "Data and Reality," which goes into this problem in much more detail - the idea that there are limits to semantic techniques, that a semantic representation is still a model and a model of reality is never reality.  

If you take that notion that the models have to be perfect, then, yes, you can come up with an argument [like metacrap]. But I refute it based on a very simple practical example. Maps are also not perfect representations of the terrain, but they are very useful, and they get the job done, and we use them all the time. I don't have to have a perfect model of reality for that model to be useful.

Let's talk a little more about metadata. Metadata is key to being able to reliably create information. Because metadata is all the description information about your data that tells you how to better use it, or who should use it, etc. Most organizations don't want to say ‘now let's spend a little bit more time and money describing what that data is, what is it about, how do I use it.' Because, even in government and organizationally, it takes discipline to think in an enterprise manner. It is not a manner that many organizations, especially the federal government, have mastered.

The biggest impediment to that is that most funding is on a project-by-project basis. Therefore, the people with the funding don't really care, most of the time, about enterprise issues. Until we change things like how we fund things, by maybe giving - as an example, there's a FISMA tax. I absolutely believe that there should in general be an enterprise-level tax on programs to pay for the necessary enterprise level functions that need to occur. Until we get serious about that, the enterprise level functions will always suffer.

FGIT: And what are you doing now at your new company?

DACONTA: A little bit more than a year ago, myself and a partner, a gentleman named Harold Klink, created an organization specially focused around the information management area. So we started Accelerated Information Management with the clear goal that most people doing information management techniques are just focused on the tasks. ‘Hey, I need to create a business glossary or, hey, I need to do data integration, or I need to do master data management.' But, unfortunately, since some of those things can be money pits, they have a bad reputation. But they shouldn't. They have a bad reputation, because people are focused on the activity, and not on the outcome.

What we've been working on is what are the specific steps, and the architectural framework of how you take the outcome and work backward from that to the activity sets and then work from the activity sets to the dependencies to the resources you need. We just released a new white paper [.pdf]