A RESTful JSON-LD Architecture for Unraveling Hidden References to Research Data

One partner is in Stuttgart, and we are at the Mannheim University Library. We are funded by the DFG, the German Research Foundation, and we are now in the second funding phase. Just so that we are all on the same level, let's start with some easy stuff: research data. Well, that's just some raw data. Imagine some numbers you have measured. You have run some experiments and taken measurements, and that is maybe an intermediate step in your research process. You are then heading for the publication, where you do some nice analysis.

Producing the data yourself would be the first possibility, but you can also take research data from a data provider or from official statistics, for example for countries. A lot of data is available there, also from other organizations. Or you can just take the research data from your colleague, given that he or she shares it with you or maybe has even published it. So you may be using someone else's work.

If you use other scholarly works, then you have to cite them, and a citation is just a more formal, structured reference to another scholarly work. A data citation is then the same thing where the scholarly work is research data. So let's start with when data citation actually began. I have a timeline here; I think we have more or less seen at least some of the dates already. Now for the first question.

I would like to ask: when was the first structured data citation used in a publication? I claim maybe around the year 2000. If you have any proof or more hints, just send them to us, as we are interested in finding a more accurate date. Next question: when was the first unstructured reference to research data used in a publication? Here we say that's 1609 or before, and the proof follows here. This is one of the first unstructured data citations. It's a paper, or a book actually, by Kepler.

So he's the author, and the title, well, actually the whole text up here is the title. If you translate it into English, it says: "New Astronomy, Based upon Causes, or Celestial Physics, Treated by Means of Commentaries on the Motions of the Star Mars, from the Observations of Tycho Brahe". So we see that he mentions these observations, his research data from Tycho Brahe. He cites it, so he does a kind of data citation.

How does he do that? Well, as shown a little bit above: "from the observations of Tycho Brahe". That's just a phrase that is part of the title, and that is the data citation, kind of. Well, that was a long time ago, and now we actually have some data citation principles. These are the eight data citation principles. The first one is importance: data citation is as important as other citations, so you should do the same thing as you do for other citations. Then credit and attribution: you should make it easy to give credit and attribution to the authors or contributors of the research data. Then evidence: you should do data citation whenever you're using some research data.

You should cite the data, as a proper citation. Then unique identification, so globally unique identifiers, for example; access, so how can you access the research data; persistence; and two more. Currently there are exactly one hundred institutional supporters. This means that if your institution wants to become the one hundred and first, you may have to hurry up a little bit. There are some data centers, publishers and societies, also some library societies, among them, and some other supporters.

Well, those were the principles. How about the practice? What does a data citation look like? Well, here is one format suggested by DataCite. It's just a format you can use. You start with the creator, then the publication year in parentheses, colon, title, period, version, period, publisher, period, resource type, period, identifier. So nothing fancy in there. There is an example here. You can maybe move things around a little and use another order, with the publication year at the end maybe, or have some other rules about the separators and so on. That's the normal thing that citation styles force you to do. And there are also some other well-known citation styles.
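As a small illustration of the DataCite-style format just described (creator, year in parentheses, colon, then title, version, publisher, resource type and identifier separated by periods), here is a minimal Python sketch. The class name, field names and all example values are assumptions made for illustration; they are placeholders, not a real dataset.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """Hypothetical container for the citation elements named in the talk."""
    creator: str
    publication_year: int
    title: str
    version: str
    publisher: str
    resource_type: str
    identifier: str

    def format(self) -> str:
        # Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier
        return (
            f"{self.creator} ({self.publication_year}): {self.title}. "
            f"{self.version}. {self.publisher}. {self.resource_type}. "
            f"{self.identifier}"
        )


# All values below are invented placeholders.
example = DataCitation(
    creator="Doe, Jane",
    publication_year=2014,
    title="Example Survey",
    version="Version 1.0.0",
    publisher="Example Data Archive",
    resource_type="Dataset",
    identifier="doi:10.1234/example",
)
citation = example.format()
print(citation)
```

Other citation styles would mostly reorder these elements or change the separators, which here would mean a different `format` method over the same fields.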

Styles like APA already have data citation guidelines included. Some other examples I have here are from the NLM or the Chicago Manual, which at least have something for databases. Some journals are listed here as well. But in practice people are still doing the same thing as 400 years ago: they cite or reference the research data in the text, just by mentioning some words.

So, for example, the first one is the caption of a table, and somewhere in there is a reference to the research data. The second one just mentions the IGLU study, and if you read around it a little you see there should be some connection to reading literacy, maybe not the first thing that comes to mind when you see "IGLU". And in the third example the reference is scattered around the text: here the years are not in the same place as the name, with some other words in between.

So how do you process this? How can you now find the research data? What steps do you need? Well, there are several steps you have to perform. The first one we have just done, namely the detection of data citations in the running full text. The second one is that you have to resolve and normalize the data citations. For example, IGLU stands for the German "Internationale Grundschul-Lese-Untersuchung", and SOEP is the abbreviation for the Socio-Economic Panel, or in German the "Sozio-oekonomisches Panel", and you can even write that differently.

So there are different possibilities or variants here. The next step is to uniquely identify the data citations. There was an IGLU study in 2001, another one in 2006, and one in 2011. Which one was referenced in the paper before? And the last step is to actually find the cited research data, so you are after some URL or maybe just the location.
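The four steps above (detect, resolve and normalize, uniquely identify, locate) can be sketched as a tiny pipeline. This is only a toy illustration under stated assumptions: the variant table, the registry and its `example.org` URLs are invented, and real detection is far harder than the keyword lookup used here.

```python
import re

# Hypothetical lookup table: spelling variants mapped to a canonical study name.
VARIANTS = {
    "iglu": "IGLU",
    "soep": "SOEP",
    "socio-economic panel": "SOEP",
    "sozio-oekonomisches panel": "SOEP",
}

# Hypothetical registry: (study, year) mapped to a location. URLs are placeholders.
REGISTRY = {
    ("IGLU", 2001): "https://example.org/data/iglu-2001",
    ("IGLU", 2006): "https://example.org/data/iglu-2006",
    ("IGLU", 2011): "https://example.org/data/iglu-2011",
}


def detect(text):
    """Step 1: find known study names and four-digit years in the running text."""
    lowered = text.lower()
    names = [v for v in VARIANTS
             if re.search(r"\b" + re.escape(v) + r"\b", lowered)]
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return names, years


def resolve(names):
    """Step 2: normalize spelling variants to canonical study names."""
    return sorted({VARIANTS[n] for n in names})


def identify(studies, years):
    """Step 3: keep only (study, year) pairs that exist in the registry."""
    return [(s, y) for s in studies for y in years if (s, y) in REGISTRY]


def locate(identified):
    """Step 4: look up where the identified research data can be found."""
    return [REGISTRY[key] for key in identified]


text = "Reading literacy was measured in the IGLU study; we use the 2006 wave."
names, years = detect(text)
links = locate(identify(resolve(names), years))
print(links)
```

The point of the sketch is the separation of concerns: each step can fail or be ambiguous on its own, for example when the name is found but no year appears nearby.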

These steps are kind of annoying. You don't want to spend a lot of time on them, actually maybe no time at all; you just want to see the result, maybe even right at the beginning. That would be nice. So the question here is: can we leave some of these steps out? Maybe some tools, some algorithms can help us here, and that's exactly the goal, or one of the goals, of the InFoLiS project: automating these processing steps.

So this means automatically unraveling hidden references to research data in the running text into structured data citations with URIs, and all of this should happen in a flexible, long-term sustainable infrastructure. Here's an overall view of the project. As always, we need some data, like full texts and metadata from research data and from publications, and then our algorithms can work on that.

There are some data mining algorithms, and they use bootstrapping strategies and some other things. We will not say much more about the algorithms, but focus on the technical architecture they are embedded in. The technical architecture relies on linked open data and will provide some RESTful APIs.

In between there is some abstract modeling, some structure and semantics, which connects the algorithms and the technical architecture, and you can get things out of it. So there is an integration: we are trying to integrate it into as many systems as possible, and maybe you see that better here. As the end user you can, for example, search in a discovery system. Then you receive some publications matching your search, and it would be nice to see which data these publications rely on. You can also go the other way.

If you're searching in a data repository and find some research data, it would be nice to see which publications were built on top of this research data. So we need some linking in between. However, users are not only searching in discovery systems and data repositories; they are also searching elsewhere, which I will mention in a minute. First, there is also an open question here: how do we best incorporate data connections into the library catalog? That question comes from the NMC Horizon Report 2014 Library Edition. Users also search somewhere else, for example in Google Scholar, or on the journal website, or wherever, somewhere on the web.

They can search anywhere, so it's also a good question to ask where and how the integration of data citations is most useful for our users. We see there are a lot of different systems we would like to cover in this integration, and so we really need a flexible infrastructure which allows us to do that. That's what we want to show you next.

All right, so we've just seen that there are various agents involved, or possibly involved, with the results of our project, and this is like a 10,000-foot view of our architecture.

We have an internal API which does all the heavy lifting, that's all the text extraction and so on. It's written in Java and should be a mostly self-contained server. On the other hand, we have our public API, which should be as flexible as possible. It should support as many different serializations and data formats as possible and allow the data model to be as complex as needed, but still be really fast. Speed is of the essence for us, and that's why we relied on some principles when we started designing this whole architecture.

The main thing is that API usability is more important than the expressivity of all parts of the model. We want to support expressivity in the right places, but in general the API should be easy to maintain and easy to consume for developers, and it should be possible to understand the data model. So we tried to postpone making the data model extremely complicated until later and to start with something simple.

Of course it should be RESTful-ish. Not all aspects of RESTful architecture are followed closely or orthodoxly, but it's still protocol independent, so we can reproduce everything on a local client without HTTP. That's really important, because it has to be fast, as we said. And we decided to use a JSON store rather than a triple store, because it's really fast. It has native ordered lists, or arrays, which everyone who has developed RDF software knows are a real pain with RDF, and it has a deterministic structure, which again RDF has not. That makes it easy to use for closed-world validation, which is really important for us. So in general we started out with keeping it simple, and that's also understandable if we look at the main operations in InFoLiS at the moment.
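To make the two arguments for a JSON store a bit more concrete, here is a minimal Python sketch. It shows that a JSON array keeps its order through serialization (in RDF, order has to be modeled explicitly, for example with `rdf:List` chains), and that a fixed document structure allows closed-world validation, where anything not declared is an error. The field names and the schema are assumptions for illustration only.

```python
import json

# Hypothetical closed-world schema: every allowed field and its expected type.
EXECUTION_FIELDS = {"algorithm": str, "inputFiles": list}


def validate(doc):
    """Closed-world check: reject unknown fields, missing fields and wrong types."""
    errors = [f"unknown field: {k}" for k in doc if k not in EXECUTION_FIELDS]
    for name, expected in EXECUTION_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected):
            errors.append(f"wrong type: {name}")
    return errors


# The array order is part of the data and survives a serialization round trip.
doc = {"algorithm": "bootstrap", "inputFiles": ["b.pdf", "a.pdf", "c.pdf"]}
round_tripped = json.loads(json.dumps(doc))

ok_errors = validate(round_tripped)
bad_errors = validate({"algorithm": "bootstrap", "colour": "red"})
print(ok_errors, bad_errors)
```

Under an open-world reading, the extra `colour` field would simply be one more statement; the closed-world check treats it as a mistake, which is what an API usually wants for incoming data.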

There's this bootstrapping part, where we try to learn new patterns for finding dataset references from a simple seed word, as Phillip showed before. There are multiple levels of recursion involved, it's an iterative process, and it's really tough on CPU and RAM. So here speed is much more important than expressivity. The same goes for text extraction, so extracting text from PDFs, which we do a lot, and for applying the patterns that we found with this bootstrapping process to text files.
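The bootstrapping idea can be illustrated with a deliberately tiny Python sketch: start from one seed word, collect the words around it, turn frequent contexts into patterns, apply those patterns to find new names, and repeat. This is not the project's algorithm, only a toy version under assumptions: contexts are one word to the left and right, candidate names must start with a capital letter, and the three example sentences are invented.

```python
import re
from collections import Counter


def contexts(seed, texts):
    """Collect (left word, right word) pairs around each occurrence of the seed."""
    rx = re.compile(r"(\w+)\s+" + re.escape(seed) + r"\s+(\w+)")
    found = []
    for text in texts:
        found.extend(rx.findall(text))
    return found


def learn_patterns(seeds, texts, min_count=2):
    """Turn contexts seen at least min_count times into regular expressions."""
    counts = Counter()
    for seed in seeds:
        counts.update(contexts(seed, texts))
    return [
        re.compile(re.escape(left) + r"\s+([A-Z][A-Za-z]+)\s+" + re.escape(right))
        for (left, right), n in sorted(counts.items())
        if n >= min_count
    ]


def bootstrap(seed, texts, rounds=2):
    """Iterate: known names -> patterns -> new names, until nothing new appears."""
    known = {seed}
    for _ in range(rounds):
        new = set()
        for pattern in learn_patterns(known, texts):
            for text in texts:
                new.update(pattern.findall(text))
        if new <= known:
            break
        known |= new
    return known


# Invented example sentences.
texts = [
    "We analyse the ALLBUS data from 2010.",
    "Our results use the ALLBUS data as well.",
    "For comparison, the SOEP data were used.",
]
names = bootstrap("ALLBUS", texts)
print(sorted(names))
```

Even in this toy form the cost structure is visible: every round rescans every text with every pattern, which is why raw speed matters more than model expressivity for this step.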

Again, these must be really fast, and we shouldn't lose any time with serialization or deserialization or complicated data structure problems. Now for the dataset resolution. That's the part about: if we have some string like "SOEP", what does that refer to? Which databases must be searched? How do we rank the results? How can we automate the intuition that people put into resolving these dataset references? Here the expressivity is much more important than speed, because we want perfect results.

So I still think that the modeling has its merit, and it's important for us. Take dataset granularity, for example. If someone refers to SOEP, do they mean the whole panel, the whole survey with every year, or just a single year? That's one aspect. Then there are dataset references which cannot be automatically resolved without context, like when people write "as the results of our study show". We have to know who those people are and what the context is.

Where did we get this from? Or something like "page 15" of some panel that we cannot identify, because we can't find it anywhere, but we still want to state that we found someone referencing something by that name. Also for doing something like a bibliometric analysis, or graphing the relationships of the entities in our data store, we want to have the possibility to model this deeply. And we also want to mine the provenance of the things in our data store.

For that, it would be really helpful to understand the data as a set of statements instead of a set of documents. So the question is: how do we get the best of both worlds, the modeling and keeping it simple? To show you that, I'll just briefly explain our architecture. We have an HTTP server which handles the API calls and which has an RDF/JSON-LD content-negotiating middleware. We have MongoDB storage, because that's really easy to set up, easy to deploy, and fast. We're using Mongoose, an object-document mapper, to make it a bit easier to work with in code, and then we have a mapping tool that maps between the Mongoose models and the incoming data.
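Content negotiation, as mentioned for the middleware, means choosing the response serialization from the request's `Accept` header. The following Python sketch shows the core decision only; it is a simplified illustration, not the project's middleware, and the list of supported media types is an assumption. Partial wildcards such as `text/*` are not handled.

```python
# Assumed set of serializations the server can produce; the first is the default.
SUPPORTED = ["application/json", "application/ld+json", "text/turtle"]


def negotiate(accept_header, supported=SUPPORTED):
    """Return the supported media type with the highest q-value, or None."""
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type, q = fields[0].lower(), 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        # Sort by descending q, then by position in the header.
        candidates.append((-q, position, media_type))
    for neg_q, _, media_type in sorted(candidates):
        if neg_q == 0:
            continue  # q=0 means the client explicitly rejects this type
        if media_type == "*/*":
            return supported[0]
        if media_type in supported:
            return media_type
    return None


print(negotiate("text/html;q=0.9, application/ld+json;q=0.8"))
```

With this in place, the same stored document can be answered as plain JSON for API clients and as JSON-LD or Turtle for linked data clients, and a `None` result would typically map to a 406 response.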

So for one, this handles RDF requests. It handles requests for our schema, but it also handles the RESTful API requests and exposes our data model in different serializations. That's all controlled by something we call TSON. And if you are asking yourself what this spidery thing in the middle is, with the arrows to everything else, I will explain that now. TSON is our self-developed format. It's basically just JSON-LD with a slightly different syntax, more oriented towards Turtle, because that's easier to read and easier to write. In it we keep all the different aspects; we keep the descriptive part.

We keep the database schema part and also the presentation part. So these are the parts that describe the RDF semantics of our data model: we have a class Execution with two properties, log and algorithm, which are described in a context; we happily stole that from JSON-LD. Then we have the database schema part: there's a collection for Execution with the property algorithm, and the algorithm may be required, or should be indexed, and so on. This is just an example.

This one shouldn't be displayed in the API front end. So yes, we're mixing different levels, but we keep them all in one place, which makes it really easy to adapt and to fix things. One schema to rule them all, that's our general idea. From this one file we generate our ontology, we generate our REST API endpoints and the documentation for them, we generate our database schema and the indexes that make this database fast, and a data model explorer which allows us to get a better understanding of how it all works. And now let's hope that it works, the joy of live demos. All right.
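The "one schema to rule them all" idea can be sketched in Python as a single dictionary from which several artifacts are derived. This is only an illustration of the principle: the dictionary layout, the `example.org` IRIs, the `hidden` flag and the generated endpoint shapes are all assumptions, not the actual schema file or its syntax.

```python
# Hypothetical single source of truth mixing three levels:
# RDF semantics ("@context"), database hints ("required", "index"),
# and presentation hints ("hidden").
SCHEMA = {
    "Execution": {
        "@context": {"@id": "http://example.org/schema/Execution"},
        "collection": "executions",
        "properties": {
            "algorithm": {
                "@context": {"@id": "http://example.org/schema/algorithm"},
                "required": True,
                "index": True,
            },
            "log": {
                "@context": {"@id": "http://example.org/schema/log"},
                "hidden": True,
            },
        },
    }
}


def jsonld_context(schema):
    """Derive a JSON-LD context mapping class and property names to IRIs."""
    context = {}
    for cls, spec in schema.items():
        context[cls] = spec["@context"]["@id"]
        for prop, prop_spec in spec["properties"].items():
            context[prop] = prop_spec["@context"]["@id"]
    return {"@context": context}


def db_indexes(schema):
    """Derive (collection, field) pairs that should get a database index."""
    return [
        (spec["collection"], prop)
        for spec in schema.values()
        for prop, prop_spec in spec["properties"].items()
        if prop_spec.get("index")
    ]


def rest_endpoints(schema):
    """Derive a minimal list of REST endpoints per collection."""
    endpoints = []
    for spec in schema.values():
        base = "/" + spec["collection"]
        endpoints += [("GET", base), ("POST", base), ("GET", base + "/{id}")]
    return endpoints


def visible_fields(schema, cls):
    """Derive which properties a front end should display."""
    return [p for p, s in schema[cls]["properties"].items() if not s.get("hidden")]


print(jsonld_context(SCHEMA), db_indexes(SCHEMA), rest_endpoints(SCHEMA))
```

The trade-off is the one named in the talk: levels that are usually kept apart live in one place, which is less pure but means a field is added or fixed exactly once.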

So what we see here are the aspects of our data model. I won't dive into much detail; I just want to say that Execution is the most important thing, because we are doing heavily algorithmic stuff here, but we also have links between entities, for example, and patterns. If I open one of these, I see that the context aspect is always the RDF part, and I can just open all the RDF descriptions. So here we're on the RDF level, but we could also check out the database level, for example to find out why some queries are really slow.

Maybe some field isn't indexed. And we can always jump into the real RDF description of some class, for example here a search query, described as Turtle in this case. I always find it helpful to look at it in JSON-LD, because that's really terse, or in anything else, or I could even go crazy and look at some visualization. Right, so that's the data model part. Let's look at what we can do with that.

I just showed that in the demo, that's nice, right, so I'll jump right into it. We have our API exposed using the Swagger interface, where all the REST-related things are generated from that file. We can do all the HTTP verbs that are relevant for us, GET, POST and so on, but what I want to show you is our simplified API call to execute something. What I want to execute is a short version of that learning algorithm. I'll just copy it and paste it, so just really quickly.

What I'm doing here is executing this frequency-based bootstrapping algorithm for all the files that were tagged with this particular tag, and I start with the seed ALLBUS. So I know that ALLBUS is a dataset reference, and I want to find out what can be learned from just this information and a lot of files. So let's try it out. Okay, the thing is posted and I get a response, including the Location header, and now it has started an execution.

That's running asynchronously on the server. I open that up and I get the triple view again. I could look at this, but we have a slightly nicer interface, where we see the algorithm is at fifty percent. I just have to hit F5 a few times. While it runs, I can show you: these are all the parameters that you can configure for an algorithm. Let's see if it's finished. Yes, it's finished, and it has generated a lot of patterns.
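The interaction shown in the demo (post an execution, receive a Location header, then poll that location until the job is finished) can be imitated with a small in-memory Python sketch. Everything here is hypothetical: the class, the `/execution/{id}` path, the status names, the 201 status code and the request body are assumptions for illustration, and no real HTTP is involved.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


class ExecutionService:
    """In-memory stand-in for a server that runs executions asynchronously."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._ids = itertools.count(1)
        self._executions = {}

    def post(self, body):
        """Register the execution, start it in the background, return at once."""
        exec_id = str(next(self._ids))
        record = {"status": "PENDING", "progress": 0, "input": body}
        self._executions[exec_id] = record
        self._pool.submit(self._run, record)
        return 201, {"Location": f"/execution/{exec_id}"}

    def get(self, location):
        """Return the current state of the execution behind a location."""
        record = self._executions[location.rsplit("/", 1)[-1]]
        return 200, {"status": record["status"], "progress": record["progress"]}

    def _run(self, record):
        record["status"] = "STARTED"
        for step in range(1, 5):
            time.sleep(0.01)  # placeholder for real work
            record["progress"] = step * 25
        record["status"] = "FINISHED"


service = ExecutionService()
status, headers = service.post(
    {"algorithm": "bootstrap", "seeds": ["ALLBUS"], "tags": ["demo"]}
)

# Polling loop: the programmatic equivalent of hitting F5 a few times.
while service.get(headers["Location"])[1]["status"] != "FINISHED":
    time.sleep(0.01)
final = service.get(headers["Location"])[1]
print(status, headers, final)
```

Returning immediately with a location keeps the request short even when the algorithm runs for a long time, and the execution itself becomes a resource that can be inspected like any other.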

Oops. All these patterns are, of course, referenceable resources. It has also generated a lot of textual references, so let's open one, maybe in Turtle. These are the extracted elements of the text, so the words left of the thing we found and so on, and the pattern itself as well, which is just a fancy word for a regular expression. Everything can be tagged, and that's how we organize our stuff, because that proved to be really fast and simple.
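Since a pattern is just a regular expression, applying one and recording the surrounding words can be sketched in a few lines of Python. The record layout, the three-word window, the example pattern and the example sentence are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class TextualReference:
    """Hypothetical record of one match: context words, term and pattern used."""
    left: str
    term: str
    right: str
    pattern: str


def apply_patterns(patterns, text, window=3):
    """Apply each regular expression and keep `window` words of context per side."""
    references = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            left_words = text[:match.start(1)].split()[-window:]
            right_words = text[match.end(1):].split()[:window]
            references.append(
                TextualReference(
                    left=" ".join(left_words),
                    term=match.group(1),
                    right=" ".join(right_words),
                    pattern=pattern,
                )
            )
    return references


# Invented pattern: group 1 captures an upper-case name between "the" and "data".
patterns = [r"the\s+([A-Z]{3,})\s+data"]
refs = apply_patterns(
    patterns, "As shown before, we analyse the ALLBUS data from 2010 in detail."
)
print(refs)
```

Keeping the pattern string on each reference is one simple way to record where a finding came from.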

And now that we've learned something, let's apply it to a PDF file. For that we've written a small JavaScript library, which is really thin; it just wraps what I showed you. I will choose a file, choose two files, and we'll try to analyze them. What happens now is: it uploads those files to our store, it extracts the text from the PDF files, and then it tries to apply the patterns with a certain tag, the one I just created, to those text files.

There's a kind of funny bug where it just jumps around. But let's look at it again in the triple view and then jump into the nicer view, which does the same thing. It has already found a lot of patterns, and it has found some links, so let's open one of those. We see this is a link. Again, these things are referenceable, and the entities from which and to which they link, in this case from a publication to a dataset, are referenceable as well.
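A link of this kind, itself referenceable and pointing from one referenceable entity to another, can be written down as a small JSON-LD document. The sketch below builds one in Python; the property names `fromEntity` and `toEntity` and all `example.org` IRIs are assumptions for illustration, not the project's vocabulary.

```python
import json


def make_link(link_id, publication_id, dataset_id):
    """Build a JSON-LD document for a link between two referenceable entities."""
    return {
        "@context": {
            # "@type": "@id" tells JSON-LD that the values are IRIs, not strings.
            "fromEntity": {
                "@id": "http://example.org/schema/fromEntity",
                "@type": "@id",
            },
            "toEntity": {
                "@id": "http://example.org/schema/toEntity",
                "@type": "@id",
            },
        },
        "@id": link_id,
        "fromEntity": publication_id,
        "toEntity": dataset_id,
    }


link = make_link(
    "http://example.org/link/1",
    "http://example.org/publication/42",
    "http://example.org/dataset/7",
)
serialized = json.dumps(link, indent=2)
print(serialized)
```

Because the link has its own `@id`, statements about the link itself, such as which pattern produced it, could be attached later without changing the two entities.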

It's just a small thing that they are not turned into links in this interface, but they should still be dereferenceable. Okay, and we see that it has found a reference from this publication, which is referenced by this entity, to something that's called ALLBUS, and that's the DOI of the thing. So we've gone the full way, and that's it. If you have any questions, feel free to ask them or get in touch with us.

If you have any data that you want to run through this, try it out. It's kind of not that stable, or in rapid development, depending on how you look at it. Thank you very much.