I have a folder of XML files and I'm trying to count the occurrences of specific words inside one element. More specifically, I want to count the occurrences of, for example, the word "Impfstoff" (which is in the element "mpeg7text:Keyword") across all XML files in the folder. The XML looks like this:


<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<Description xsi:type="ContentEntityType">
   <MultimediaContent xsi:type="mpeg7text:TextType">
      <mpeg7text:Text>
         <mpeg7text:ModelMetadata>
            <mpeg7text:Version>tfidf_newmodel</mpeg7text:Version>
         </mpeg7text:ModelMetadata>
         <mpeg7text:TextDescriptor xsi:type="mpeg7text:KeywordExtractionType">
            <mpeg7text:Keyword>
               <mpeg7text:Keyword>Impfstoff</mpeg7text:Keyword>
               <mpeg7text:Relevance>121.58288081128799</mpeg7text:Relevance>
               <mpeg7text:Frequency>22</mpeg7text:Frequency>
               <mpeg7text:Confidence>1.0</mpeg7text:Confidence>
            </mpeg7text:Keyword>

The code I have so far is:

import os
import lxml.etree as et
for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        root = et.parse(fullname)
        root.xpath('count(.//mpeg7text:Keyword/mpeg7text:Keyword[.=Corona])')

But I get this error:

XPathEvalError                            Traceback (most recent call last)
<ipython-input-11-0ca6983bfdd7> in <module>
      7         #root = tree.getroot(tree)
      8         root = et.parse(fullname)
----> 9         root.xpath('count(.//mpeg7text:Keyword/mpeg7text:Keyword[.=Corona])')

src/lxml/etree.pyx in lxml.etree._ElementTree.xpath()

src/lxml/xpath.pxi in lxml.etree.XPathDocumentEvaluator.__call__()

src/lxml/xpath.pxi in lxml.etree._XPathEvaluatorBase._handle_result()

XPathEvalError: Undefined namespace prefix

I am lost as to what to do now to count the words in the element mpeg7text:Keyword. Has anyone done something like this before and can help here? That would be awesome!

Cheers!

kathi94

2 Answers


To use namespaces in Python, see this answer.
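The "Undefined namespace prefix" error means the XPath engine was never told which URI the prefix mpeg7text stands for. A minimal sketch of passing that mapping to lxml follows. The namespace URI and the inline sample document are hypothetical stand-ins, since the question does not show the real xmlns:mpeg7text declaration; the real URI is whatever the root element of your files declares.

```python
import lxml.etree as et

# Hypothetical sample: "urn:example:mpeg7text" is a placeholder URI,
# not the real one from the question's files.
SAMPLE = b"""<Description xmlns:mpeg7text="urn:example:mpeg7text">
  <mpeg7text:Keyword>
    <mpeg7text:Keyword>Impfstoff</mpeg7text:Keyword>
    <mpeg7text:Frequency>22</mpeg7text:Frequency>
  </mpeg7text:Keyword>
</Description>"""

root = et.fromstring(SAMPLE)

# root.nsmap holds the prefix-to-URI declarations on the root element.
# XPath cannot use a default (None) prefix, so drop it if present.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

# The word is passed as an XPath variable, which avoids quoting mistakes.
count = root.xpath(
    "count(.//mpeg7text:Keyword/mpeg7text:Keyword[. = $word])",
    namespaces=ns,
    word="Impfstoff",
)
print(count)
```

Reading the mapping from root.nsmap only works when the prefix is declared on the root element; otherwise the dictionary has to be written out by hand.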

As an alternative XPath, you could use:

root.xpath("count(.//*[local-name()='Keyword']/*[local-name()='Keyword'][text()='Corona'])")

Note the quotes around 'Corona': without them, XPath reads Corona as the name of a child element rather than as a string.

But since you are already using // and therefore finding all Keyword elements, you could just as well use:

root.xpath("count(.//*[local-name()='Keyword'][text()='Corona'])")
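Each of these expressions runs against one parsed file, so a total for the whole folder needs the per-file counts added up. A minimal sketch, assuming every file ending in .xml in the folder is well-formed; count_keyword is a hypothetical helper name, not part of lxml:

```python
import os
import lxml.etree as et


def count_keyword(folder, word):
    """Sum the matches for `word` over every .xml file in `folder`."""
    # $word is an XPath variable filled in by lxml on each call.
    xpath = "count(.//*[local-name()='Keyword'][text()=$word])"
    total = 0
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".xml"):
            continue
        tree = et.parse(os.path.join(folder, filename))
        # count() returns a float; accumulate it as an integer.
        total += int(tree.xpath(xpath, word=word))
    return total
```

It would be called as count_keyword(path, "Impfstoff"), where path is the folder variable from the question. Keeping the running total outside the loop matters: assigning the result of each xpath call to the same variable would leave only the last file's count.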
Siebe Jongebloed
  • Thanks, I tried it with your edits and it did something, but I think the script does not iterate over the whole folder of XMLs. When I ran: ` for filename in files: if not filename.endswith('.xml'): continue if filename.endswith('.xml'): fullname = os.path.join(dirpath, filename) root = et.parse(fullname) tree= root.getroot() out = tree.xpath("count(.//*[local-name()='Keyword']/*[local-name()='Keyword'][text()='Eigentum'])") ` I got 0.0, although this word occurs a few times... – kathi94 Jun 28 '21 at 10:17

If you switch to XPath 2.0+ (available in Python using the Saxon/C library) then you can search the whole directory with a single XPath expression:

count(collection('my/folder?select=*.xml')//*:Keyword[.='Impfstoff'])
Michael Kay
  • Sounds interesting, how can I use this expression? I tried: out = count(collection('C:/my/directory/path?select=*.xml')//*:Keyword[.='Impfstoff']) But that doesn't work. Sorry if that sounds weird, but I'm quite new to coding. – kathi94 Jun 28 '21 at 11:29
  • It has to be a URI not a filename, so "file:///C:/my/directory". But it's possible you also went wrong at the API level. Best to raise a new question to supply full details of what you were doing and details of how it failed. – Michael Kay Jun 28 '21 at 14:06
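The filename-to-URI step from the last comment can be sketched with the standard library. This only builds the XPath 2.0 expression as a string; how that string is then handed to Saxon is not shown in this thread and is left out here. The directory is the placeholder path from the comment above.

```python
from pathlib import PureWindowsPath

# Placeholder directory taken from the comment above.
folder = PureWindowsPath("C:/my/directory/path")

# as_uri() turns the absolute Windows path into a file:/// URI.
uri = folder.as_uri()

# The whole expression is a string to be evaluated by an XPath 2.0+ engine,
# not Python code that can be assigned directly.
expression = f"count(collection('{uri}?select=*.xml')//*:Keyword[.='Impfstoff'])"
print(expression)
```

The attempt in the comment above also failed for a second reason: the expression was written as bare Python instead of being passed as a string to an XPath processor.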