Providers of corpora
Content may be imported into OpenMinTeD either as single documents or already packaged as corpora, i.e. collections of single documents.
Corpora may come (upon bilateral agreements) from repositories of language resources or from discipline-specific repositories, or they may be uploaded by users for processing with TDM applications.
If you wish to share corpora through OpenMinTeD, you will find more information here.
What types of corpora
Corpora in the OpenMinTeD framework are mainly collections of documents that will be used as the mining source in the TDM process. Corpora uploaded to OpenMinTeD need not be composed of scholarly works. Examples include reference corpora (i.e. corpora deemed representative of general language or of a sublanguage), news corpora, collections of domain-specific texts such as manuals and technical reports, and annotated corpora such as treebanks and morphologically tagged gold-standard corpora. To be mined, however, they must follow the technical requirements that have been defined for corpora built through the OpenMinTeD mechanism [1]. Corpora that do not meet these requirements can still be used (upon availability of the respective components/applications) for other purposes, such as training Machine Learning models or evaluating the performance of applications.
Minimum requirements for corpora
If you want to share your corpus through OpenMinTeD, you must:
- ensure that the single documents comprising the corpus adhere to the minimal level of the OpenMinTeD Interoperability specifications,
- describe the corpus with a metadata record compliant with the OMTD-SHARE schema, at least at the minimal level,
- prepare, package and register a zipped file with the contents (texts) of the corpus according to the instructions for uploading corpora.
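The packaging step in the last requirement can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `package_corpus`, the accepted file suffixes, and the flat "documents relative to the corpus root" layout are hypothetical and not taken from the OpenMinTeD specifications. The authoritative layout is the one given in the instructions for uploading corpora.

```python
import zipfile
from pathlib import Path


def package_corpus(source_dir, archive_path, suffixes=(".txt", ".xml", ".pdf")):
    """Zip the documents found under source_dir and return their archive names.

    Assumption: the suffix list and the archive layout are illustrative only;
    adapt them to the official upload instructions.
    """
    root = Path(source_dir)
    # Collect the documents first, so no archive is created for an empty corpus.
    documents = [
        path
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix.lower() in suffixes
    ]
    if not documents:
        raise ValueError(f"no documents found under {source_dir}")

    archived = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in documents:
            # Store paths relative to the corpus root, with forward slashes.
            arcname = path.relative_to(root).as_posix()
            zf.write(path, arcname)
            archived.append(arcname)
    return archived
```

For example, `package_corpus("my_corpus", "my_corpus.zip")` would write `my_corpus.zip` and return the list of document names it contains. Files with other suffixes are skipped. Registering the archive and creating the OMTD-SHARE metadata record remain separate steps that this sketch does not cover.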
1. In the case of single documents (publications) uploaded to the registry, the OpenMinTeD platform includes a mechanism for automatically generating corpora based on criteria that the user selects from a faceted view. More details are given in the section "Building corpora of scholarly content offered in OpenMinTeD".