I want to annotate a couple of XML files with the German STW Thesaurus for Economics. The thesaurus is available here as ZIP archives in RDF/XML, N3 and Turtle (~14 MB each).
So I wrote a Python script that removes stopwords, lemmatizes, and does part-of-speech tagging. Now I want to check whether a noun in one of the XML files occurs in the STW ontology. If it does, I'd like to try several options as preparation for a later automated classification:
- If it is a `skos:altLabel` word, replace it with the `skos:prefLabel` word.
- Do nothing with the text, but add the `skos:prefLabel`s at the end of the file, with a count of the appearances of each `skos:prefLabel` and its associated `skos:altLabel`s.
- Use e.g. `skos:broader` to find e.g. the economic sectors or the commodities related to the `skos:prefLabel`.
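To make the three options concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: the triples are toy data with made-up `ex:` concepts (not real STW entries), the function names are hypothetical, and matching is done on lowercased labels. In practice the triples would have to come from an RDF parser (for example by iterating over an `rdflib` graph, which I have not tried on the STW files yet), and labels would probably need filtering by language tag.

```python
from collections import Counter, defaultdict

SKOS = "http://www.w3.org/2004/02/skos/core#"
PREF, ALT, BROADER = SKOS + "prefLabel", SKOS + "altLabel", SKOS + "broader"

# Toy (subject, predicate, object) triples standing in for the thesaurus.
# The ex: concepts are hypothetical, not taken from the STW.
TRIPLES = [
    ("ex:car", PREF, "Automobil"),
    ("ex:car", ALT, "Auto"),
    ("ex:car", ALT, "PKW"),
    ("ex:car", BROADER, "ex:vehicle"),
    ("ex:vehicle", PREF, "Fahrzeug"),
    ("ex:vehicle", BROADER, "ex:goods"),
    ("ex:goods", PREF, "Ware"),
]


def build_index(triples):
    """Build lookup tables: label -> concept, concept -> prefLabel, concept -> broader concepts."""
    label_to_concept = {}
    pref = {}
    broader = defaultdict(set)
    for s, p, o in triples:
        if p == PREF:
            pref[s] = o
            label_to_concept[o.lower()] = s
        elif p == ALT:
            # Do not overwrite a label already registered for another concept.
            label_to_concept.setdefault(o.lower(), s)
        elif p == BROADER:
            broader[s].add(o)
    return label_to_concept, pref, broader


def normalize(nouns, label_to_concept, pref):
    """Option 1: replace every known label by its prefLabel, keep unknown nouns."""
    result = []
    for noun in nouns:
        concept = label_to_concept.get(noun.lower())
        result.append(pref.get(concept, noun))
    return result


def count_pref_labels(nouns, label_to_concept, pref):
    """Option 2: count hits per prefLabel, with altLabel hits counted too."""
    counts = Counter()
    for noun in nouns:
        concept = label_to_concept.get(noun.lower())
        if concept in pref:
            counts[pref[concept]] += 1
    return counts


def ancestors(concept, broader):
    """Option 3: collect all concepts reachable via skos:broader."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in broader.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


label_to_concept, pref, broader = build_index(TRIPLES)
nouns = ["Auto", "PKW", "Automobil", "Haus"]  # nouns as produced by the POS tagger
normalized = normalize(nouns, label_to_concept, pref)
counts = count_pref_labels(nouns, label_to_concept, pref)
car_ancestors = ancestors("ex:car", broader)
```

With dictionaries like these the lookups per noun are cheap, so the open question is mainly how best to get the labels and `skos:broader` links out of the STW files in Python.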
I know that GATE and Apolda can do this, but they're Java-based, and in the end I'd like to do everything from a single Python script.
Are there any suggestions?