Extract XML-data with python

Question

I have a huge list of different authors and their selected works in a <list> in XML (named bibliography.xml). Here is an example:

<list type="index">
                <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                        Cat 1843 (<abbr>Cat.</abbr>).</bibl> — <bibl>The Gold-Bug 1843
                            (<abbr>Bug.</abbr>).</bibl> — <bibl>The Raven 1845
                        (<abbr>Rav.</abbr>).</bibl></item>

                <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                            (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                        (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                            (<abbr>PolyL.</abbr>)</bibl></item>
                
                <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                        Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                            (<abbr>Gil.</abbr>)</bibl></item>
            </list>

import xml.etree.ElementTree as ET

tree = ET.parse('bibliography.xml')
root = tree.getroot()

for work in root:
    if(work.tag=='item'):
        print work.get('persName')
            if (attr.tag=='abbr')
                print (attr.text)

Obviously it's not working, but since I'm absolutely new to Python, I can't wrap my mind around what I'm doing wrong. It would be highly appreciated if someone could help me out here.

Okay, that's weird because Oxygen and some other validators are fine with the XML. Keep in mind that I just posted a snippet of the ``, not the whole TEI-Header, body etc. – SparrowSilencio, Feb 27 '21 at 12:33

score 0 · Answer 1 · answered Feb 27 '21 at 12:43

I tried the same approach as you did and ran into the same problem. I had no option but to convert the whole XML into pretty-printed XML and treat it as a single string, then iterate over each line to look for a specific tag.

import xml.dom.minidom

dom = xml.dom.minidom.parse("bibliography.xml")
pretty_xml = dom.toprettyxml()
pretty_xml = pretty_xml.split("\n")
start, end = [], [] # store the beginning and the end of "item" tag

for idx in range(len(pretty_xml)):
    if "item" in pretty_xml[idx]:
        if "/" not in pretty_xml[idx]:
            start.append(idx)  # opening tag
        else:
            end.append(idx)  # closing tag

Now you know that your first data point lies between start[0] and end[0]. Likewise, iterate over the elements of both lists in parallel with "if" conditions. The structure would be somewhat like this (I am not writing the whole code):

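for idx in range(len(start)):
    for line in pretty_xml[start[idx] + 1 : end[idx]]:
        # Check for the tag first: on a line without it, [1] raises IndexError
        if "persName" in line:
            name = line.split("persName")[1].replace("<","").replace(">","").replace("/","")
        ...
        ...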

(If you find a better structured approach, do let me know.)
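
The line-scanning idea above could be completed roughly as follows. This is only a sketch under stated assumptions: SAMPLE is a shortened stand-in for bibliography.xml, the helper name names_from_pretty_xml is made up for illustration, and it relies on minidom writing an element that contains only text (such as persName) on a single line, which current versions appear to do.

```python
import xml.dom.minidom

# Shortened sample in the question's format (an assumption; the real data is in bibliography.xml)
SAMPLE = """<list type="index">
<item><persName>Poe, Edgar Allan</persName>, <bibl>The Raven 1845 (<abbr>Rav.</abbr>).</bibl></item>
<item><persName>Melville, Herman</persName>, <bibl>Moby-Dick 1851 (<abbr>MobD.</abbr>)</bibl></item>
</list>"""


def names_from_pretty_xml(xml_text):
    """Scan pretty-printed XML line by line and collect the persName of each item."""
    lines = xml.dom.minidom.parseString(xml_text).toprettyxml().split("\n")

    # Record the line index of every opening and closing item tag
    start = [i for i, line in enumerate(lines) if "<item" in line]
    end = [i for i, line in enumerate(lines) if "</item" in line]

    names = []
    for first, last in zip(start, end):
        for line in lines[first + 1:last]:
            # Guard first: indexing the split result on a line without the tag raises IndexError
            if "<persName>" in line:
                names.append(line.split("<persName>")[1].split("</persName>")[0].strip())
    return names


print(names_from_pretty_xml(SAMPLE))
```

String scanning like this depends on how the pretty-printer lays out the lines, so it is probably more fragile than querying the parsed tree directly.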

Thanks a lot for your answer. I tried it, but I got a response saying: IndexError: list index out of range. There's some kind of solution for my problem here on Stack Overflow as well (https://stackoverflow.com/questions/37619848/python-loop-list-index-out-of-range), but I still can't make it work. – SparrowSilencio, Feb 27 '21 at 17:50
Are you getting the error in the first part or the second part of the code snippet that I shared? Were you able to populate the "start" and "end" lists? – Tanveer, Mar 02 '21 at 09:43
The first part points to "dom = xml.dom.minidom.parse("bibliography.xml")"; the second part gives the error already mentioned above. – SparrowSilencio, Mar 05 '21 at 08:38

Greg · Answer 2 · 2021-02-28T22:26:46.493

0

Consider using XPath to get the data. Simply call tree.xpath("//item") to return all items.

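Below is a working example based on the XML snippet. tree.getroot() is only needed when the document is loaded with etree.parse(), which returns a tree object; etree.fromstring() already returns the root element.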

Basic working example:

import lxml.etree as etree

xml = '''<list type="index">
            <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                    Cat 1843 <abbr>(Cat.).</abbr></bibl> — <bibl>The Gold-Bug 1843
                        <abbr>(Bug.)</abbr>.</bibl> — <bibl>The Raven 1845
                    <abbr>(Rav.)</abbr>.</bibl></item>

            <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                        (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                    (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                        (<abbr>PolyL.</abbr>)</bibl></item>
            
            <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                    Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                        (<abbr>Gil.</abbr>)</bibl></item>
        </list>
'''
tree = etree.fromstring(xml)
#root = tree.getroot()

for work in tree.xpath("//item"):
    persName = work.find('persName').text.strip()
    abbr =' '.join([x.text for x in work.xpath('bibl/abbr')])
    print (f'{persName} {abbr}')

Output:

Poe, Edgar Allan (Cat.). (Bug.) (Rav.)
Melville, Herman Ben. MobD. PolyL.
Barth, John Fac. Gil.
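
For comparison, roughly the same per-item extraction can be sketched with the standard library's xml.etree.ElementTree (the module used in the question), reading from a file instead of an inline string. The shortened SAMPLE data, the temporary file, and the helper name abbreviations_by_author are assumptions for illustration; in practice the path would simply be 'bibliography.xml'.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened sample in the question's format (an assumption, not the real file)
SAMPLE = """<list type="index">
<item><persName>Poe, Edgar Allan</persName>, <note>1809-1849</note>, <bibl>The Black Cat 1843 (<abbr>Cat.</abbr>).</bibl> <bibl>The Raven 1845 (<abbr>Rav.</abbr>).</bibl></item>
<item><persName>Barth, John</persName>, <bibl>Giles Goat-Boy 1960 (<abbr>Gil.</abbr>)</bibl></item>
</list>"""


def abbreviations_by_author(path):
    """Parse an XML file and map each author name to that author's abbreviations."""
    root = ET.parse(path).getroot()  # parse() returns a tree, so getroot() is needed here
    result = {}
    # iter() walks all descendants, so items nested deeper in a larger document are found too
    for item in root.iter("item"):
        name = item.findtext("persName", default="").strip()
        # A relative path keeps the search inside the current item
        result[name] = [abbr.text for abbr in item.findall("bibl/abbr")]
    return result


# Write the sample to a temporary file so the sketch runs without bibliography.xml
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False, encoding="utf-8") as handle:
    handle.write(SAMPLE)

try:
    for author, abbrs in abbreviations_by_author(handle.name).items():
        print(author, " ".join(abbrs))
finally:
    os.remove(handle.name)
```

If the full file declares a default namespace, as TEI documents commonly do, the plain tag names used here would probably not match and would need to be namespace-qualified.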

edited Feb 28 '21 at 22:26

answered Feb 27 '21 at 13:11


Thanks a lot, that worked. But only if there's just one `` as an author, if there are more ``'s, each in one `` as in my case, it will just print one name followed by all ``'s, without printing the related name. And I'm wondering if I have to put the whole xml-data into that script as well or if I can just link to the xml-file within the script? – SparrowSilencio Feb 27 '21 at 17:54
You can probably replace `work.find('persName')` with `work.xpath('//persName')` or `work.findall('persName')` and perform a for-each loop on the results. If you supply an XML example, then I can update the answer. – Greg Feb 27 '21 at 19:09
Depending on XML, you may be able to do `for work in tree.xpath("//persName"):` – Greg Feb 27 '21 at 19:11
Thanks for the reply. If I replace it with each of your suggestions, I get the response `AttributeError: 'list' object has no attribute 'text'`. I'll give more XML data in my post above, I just edited it. (P.S.: the list goes on and on in the same way, about 500 bibliographical entries.) – SparrowSilencio Feb 27 '21 at 19:54
Running the XML in your question still achieved the correct results. As for the error `AttributeError: 'list' object has no attribute 'text'`: most likely you're calling `.text` on a list (and not on an item). – Greg Feb 28 '21 at 10:48
With the first code you've posted I do get results, but unfortunately not the right ones, e.g. the first line looks like this: `Poe, Edgar Allan Cat. Bug. Rav. Ben. MobD. PolyL. Fac. Gil.` It just lists all of the abbreviations, not just the ones that belong to each author. Replacing `work.find('persName')` with your two posted suggestions gives back `AttributeError: 'list' object has no attribute 'text'` twice. The third suggestion gives back `AttributeError: 'NoneType' object has no attribute 'text'`. – SparrowSilencio Feb 28 '21 at 18:54
A double slash (`//`) at the start of an XPath expression searches the entire document, not just the current node. Therefore you'll have to select relative to the node. I've updated the answer with `work.xpath('bibl/abbr')` (you can replace xpath with findall). – Greg Feb 28 '21 at 22:29
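
The errors discussed in the comments above can be reproduced in a few lines. This is a sketch only: it swaps in the standard library's xml.etree.ElementTree for lxml, and SAMPLE is a shortened stand-in for the real data. A search from the root plays the role of the document-wide `//` search here.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the question's format (an assumption, not the real file)
SAMPLE = """<list type="index">
<item><persName>Poe, Edgar Allan</persName>, <bibl>The Raven 1845 (<abbr>Rav.</abbr>).</bibl></item>
<item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855 (<abbr>Ben.</abbr>)</bibl> <bibl>Moby-Dick 1851 (<abbr>MobD.</abbr>)</bibl></item>
</list>"""

root = ET.fromstring(SAMPLE)
first_item = root.find("item")

# Searching from the root reaches every abbr in the document
document_wide = [abbr.text for abbr in root.findall(".//abbr")]

# Searching relative to one item reaches only that item's abbr elements
item_only = [abbr.text for abbr in first_item.findall("bibl/abbr")]

# findall() returns a list, so .text has to be read per element, not on the list itself
names = [person.text for person in root.findall("item/persName")]

# find() returns None when nothing matches, which would explain a 'NoneType' error on .text
missing = root.find("persName")  # persName is not a direct child of the list

print(document_wide)  # ['Rav.', 'Ben.', 'MobD.']
print(item_only)      # ['Rav.']
print(names)          # ['Poe, Edgar Allan', 'Melville, Herman']
print(missing)        # None
```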
