I have a folder of XML files and I'm trying to count the occurrences of specific words inside one element. More specifically, I want to count the occurrences of, for example, the word "Impfstoff" (which is in the element "mpeg7text:Keyword") across all XML files in the folder. The XML looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<Description xsi:type="ContentEntityType">
  <MultimediaContent xsi:type="mpeg7text:TextType">
    <mpeg7text:Text>
      <mpeg7text:ModelMetadata>
        <mpeg7text:Version>tfidf_newmodel</mpeg7text:Version>
      </mpeg7text:ModelMetadata>
      <mpeg7text:TextDescriptor xsi:type="mpeg7text:KeywordExtractionType">
        <mpeg7text:Keyword>
          <mpeg7text:Keyword>Impfstoff</mpeg7text:Keyword>
          <mpeg7text:Relevance>121.58288081128799</mpeg7text:Relevance>
          <mpeg7text:Frequency>22</mpeg7text:Frequency>
          <mpeg7text:Confidence>1.0</mpeg7text:Confidence>
        </mpeg7text:Keyword>
The code I have so far is:
import os
import lxml.etree as et

for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    root = et.parse(fullname)
    root.xpath('count(.//mpeg7text:Keyword/mpeg7text:Keyword[.=Corona])')
But I get this error:
XPathEvalError Traceback (most recent call last)
<ipython-input-11-0ca6983bfdd7> in <module>
7 #root = tree.getroot(tree)
8 root = et.parse(fullname)
----> 9 root.xpath('count(.//mpeg7text:Keyword/mpeg7text:Keyword[.=Corona])')
src/lxml/etree.pyx in lxml.etree._ElementTree.xpath()
src/lxml/xpath.pxi in lxml.etree.XPathDocumentEvaluator.__call__()
src/lxml/xpath.pxi in lxml.etree._XPathEvaluatorBase._handle_result()
XPathEvalError: Undefined namespace prefix
I am kind of lost as to what to do now to count the words in the element mpeg7text:Keyword. Has anyone done something like this before and can help here? That would be awesome!
Cheers!
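A possible way forward, as a minimal sketch rather than a definitive fix: "Undefined namespace prefix" suggests that the query uses the prefix mpeg7text without a mapping from that prefix to its namespace URI. The sketch below passes such a mapping explicitly. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The URI in NS is a placeholder assumption, because the excerpt above does not show the xmlns:mpeg7text declaration; the real value has to be copied from the root element of the actual files. The function name count_keyword is likewise hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder URI. Replace it with the value of the
# xmlns:mpeg7text attribute found on the root element of the real files.
NS = {"mpeg7text": "urn:example:mpeg7text"}


def count_keyword(folder, word, ns=NS):
    """Count inner <mpeg7text:Keyword> elements whose text equals `word`,
    summed over all .xml files in `folder` (hypothetical helper)."""
    total = 0
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".xml"):
            continue  # skip anything that is not an XML file
        tree = ET.parse(os.path.join(folder, filename))
        # The prefix in the path is resolved through the `ns` mapping,
        # which is the piece the failing query did not have.
        hits = tree.findall(".//mpeg7text:Keyword/mpeg7text:Keyword", ns)
        # Compare the element text in Python, ignoring surrounding whitespace.
        total += sum(1 for el in hits if (el.text or "").strip() == word)
    return total
```

Calling count_keyword(path, "Impfstoff") would then return the total for the folder. With lxml, the same mapping can be passed to the original query via the namespaces keyword, as in root.xpath(..., namespaces=NS). The compared word also needs quotes inside the predicate, for example [.="Impfstoff"]; an unquoted Corona is read by XPath as the name of a child element, not as a string.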