How to build corpora in OpenMinTeD
Scholarly publications that have been imported into the OpenMinTeD platform can be used by researchers for TDM processing via a query-based creation of corpora.
Researchers can build a corpus by selecting publications from various sources based on specific criteria, e.g. "a corpus of English articles in the biomedicine area", and then apply TDM services to it.
End-users issue a query in the OpenMinTeD registry in a simple way: they are presented with a faceted view of the OpenMinTeD contents (i.e. of all registered content providers) and gradually build a query by selecting from a range of criteria. Results from all registered content providers are presented to the end-user and, once the query has been refined and finalised, the associated content is transferred to the OpenMinTeD registry and becomes available (in the form of a corpus) for the subsequent steps of a TDM application.
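As an illustration of how a faceted view supports gradual query building, the following minimal sketch counts the values available for each facet and narrows a result set as criteria are selected. The record fields ("language", "domain") and the function names are assumptions made for this example; they are not taken from the OpenMinTeD registry.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def facet_counts(records: List[Record], facets: List[str]) -> Dict[str, Counter]:
    """Count how many records carry each value of each facet."""
    counts = {facet: Counter() for facet in facets}
    for record in records:
        for facet in facets:
            if facet in record:
                counts[facet][record[facet]] += 1
    return counts


def refine(records: List[Record], query: Dict[str, str]) -> List[Record]:
    """Keep only the records that match every criterion selected so far."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in query.items())
    ]


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [
        {"title": "A", "language": "en", "domain": "biomedicine"},
        {"title": "B", "language": "en", "domain": "agriculture"},
        {"title": "C", "language": "de", "domain": "biomedicine"},
    ]
    query: Dict[str, str] = {}
    query["language"] = "en"          # first selection
    query["domain"] = "biomedicine"   # second selection narrows further
    print(facet_counts(records, ["language", "domain"]))
    print(refine(records, query))
```

Each selection adds one criterion to the query, and the facet counts can be recomputed on the refined results to show what remains selectable.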
Implementation details for content providers
OpenMinTeD has investigated several architectural options for integrating existing content providers (such as, but not limited to, OpenAIRE and CORE) and chose an approach whereby content is managed in the providers' external services but is accessible in the OpenMinTeD platform through a federated search strategy. Content is made available to the OpenMinTeD platform through a simple API that defines basic operations to search and retrieve content.
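The federated approach can be pictured with a minimal sketch in which every provider exposes the same two operations and the platform fans a query out to all of them. The interface below (ContentProvider, search, retrieve, federated_search) is hypothetical and is not the actual OpenMinTeD connector API; the in-memory provider exists only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class Hit:
    provider: str    # name of the provider that returned the hit
    identifier: str  # provider-local identifier of the publication
    title: str


class ContentProvider(Protocol):
    """Hypothetical connector with two operations: search and retrieve."""

    name: str

    def search(self, criteria: Dict[str, str]) -> List[Hit]: ...

    def retrieve(self, identifier: str) -> bytes: ...


class InMemoryProvider:
    """Toy provider standing in for an external service."""

    def __init__(self, name: str, items: Dict[str, Tuple[Dict[str, str], bytes]]):
        self.name = name
        self._items = items  # identifier -> (metadata, content)

    def search(self, criteria: Dict[str, str]) -> List[Hit]:
        return [
            Hit(self.name, identifier, metadata["title"])
            for identifier, (metadata, _content) in self._items.items()
            if all(metadata.get(k) == v for k, v in criteria.items())
        ]

    def retrieve(self, identifier: str) -> bytes:
        return self._items[identifier][1]


def federated_search(
    providers: List[ContentProvider], criteria: Dict[str, str]
) -> List[Hit]:
    """Send the same criteria to every provider and merge the hits."""
    hits: List[Hit] = []
    for provider in providers:
        hits.extend(provider.search(criteria))
    return hits
```

In this arrangement the content itself stays with each provider; the platform only calls retrieve for the publications that end up in a corpus.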
A lazy deposit/caching strategy has been employed to avoid redundant queries (in simple terms, a record is fetched only the first time it is requested and remains persistent locally for further requests). Extra care is taken to ensure reproducibility of the created corpus by storing an exact version of the content used in it.
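The lazy strategy can be sketched as a small store that calls the provider only on a cache miss and serves the locally persisted copy afterwards. The class name, the fetch callback and the on-disk layout are assumptions made for this illustration, not the platform's implementation.

```python
import hashlib
from pathlib import Path
from typing import Callable, Union


class LazyContentStore:
    """Fetch a record on first request, then serve the local copy."""

    def __init__(self, root: Union[str, Path], fetch: Callable[[str], bytes]):
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)
        self._fetch = fetch  # hypothetical call to the remote provider

    def _path(self, record_id: str) -> Path:
        # Hash the identifier to obtain a filesystem-safe file name.
        digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
        return self._root / digest

    def get(self, record_id: str) -> bytes:
        path = self._path(record_id)
        if path.exists():
            # Cache hit: no remote request is made.
            return path.read_bytes()
        # Cache miss: fetch once and persist for later requests.
        content = self._fetch(record_id)
        path.write_bytes(content)
        return content
```

Because the stored copy is never refreshed implicitly, later requests see exactly the bytes that were fetched the first time, which is the property that reproducibility relies on in this sketch.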
Thus, a corpus included in the OpenMinTeD Registry essentially consists of a list of publications. Each publication is identified by the hash value of its content (e.g. the full-text PDF), which serves as the equivalent of a primary key, and is accompanied by a set of metadata files (compatible with the OMTD-SHARE schema) that describe the resource. In most cases, this set consists of just one item; still, it cannot be ruled out that the same resource is described by multiple metadata files (for example, different metadata files from OpenAIRE or CORE, updates in metadata fields, richer metadata from a content provider, etc.).
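The structure described above can be sketched as a mapping from content hash to publication, where each publication carries one or more metadata files. The class and field names are hypothetical, and SHA-256 is an assumption: the text does not say which hash function the registry uses.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def hash_content(content: bytes) -> str:
    """Hash the publication content; SHA-256 is assumed for this sketch."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class Publication:
    content_hash: str  # plays the role of a primary key
    metadata_files: List[str] = field(default_factory=list)


@dataclass
class Corpus:
    name: str
    publications: Dict[str, Publication] = field(default_factory=dict)

    def add(self, content: bytes, metadata_file: str) -> Publication:
        """Register content; identical content shares one publication entry."""
        key = hash_content(content)
        publication = self.publications.setdefault(key, Publication(key))
        if metadata_file not in publication.metadata_files:
            publication.metadata_files.append(metadata_file)
        return publication


if __name__ == "__main__":
    corpus = Corpus("english-biomedicine")
    pdf = b"%PDF-1.4 example bytes"
    # The same content described by two providers (file names are made up).
    corpus.add(pdf, "openaire-record.xml")
    corpus.add(pdf, "core-record.xml")
    print(len(corpus.publications))
    print(next(iter(corpus.publications.values())).metadata_files)
```

Keying on the content hash means that the same full text arriving with different metadata is stored once, with every describing metadata file attached to that single entry.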