Parsing XML in Python

Question

I have a large XML file, and I need to extract data from particular elements and write only that data to another file. The file contains conversations, each with an id, and each conversation contains messages with an author tag holding the author's id and a text tag. I do not need the texts from all authors, only from the specific authors whose ids I have. How do I write a function that selects and writes out only the conversations where author = id1 or id2 or id3, etc.? This is what the document looks like:

 <conversations>
  <conversation id="e621da5de598c9321a1d505ea95e6a2d">
    <message line="1">
      <author>97964e7a9e8eb9cf78f2e4d7b2ff34c7</author>
      <time>03:20</time>
      <text>Hola.</text>
    </message>
    <message line="2">
      <author>0158d0d6781fc4d493f243d4caa49747</author>
      <time>03:20</time>
      <text>hi.</text>
    </message>
  </conversation>
  <conversation id="3c517e43554b6431f932acc138eed57e">
    <message line="1">
      <author>505166bca797ceaa203e245667d56b34</author>
      <time>18:11</time>
      <text>hi</text>
    </message>
    <message line="2">
  </conversation>
  <conversation id="3c517e43554b6431f932acc138eed57e">
     <author>505166bca797ceaa203e245667d56b34</author>
      <time>18:11</time>
      <text>Aujourd.</text>
    </message>
    <message line="3">
      <author>4b66cb4831680c47cc6b66060baff894</author>
      <time>18:11</time>
      <text>hey</text>
    </message>
  </conversation>

   </conversations>

What have you tried so far? There are lots of questions about XML parsing in Python here on StackOverflow, and lots of examples elsewhere. We can provide you with better answers if you ask specific technical questions (I tried *this* and I expect it to do *that* but instead it did *something else*...) – larsks, Aug 19 '17 at 01:26
Your XML is not formatted correctly: line 21 is not closed, and neither is line 33 – diek, Aug 19 '17 at 03:09
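
Unclosed tags like the ones mentioned in the comment above can also be located with the standard library parser. This is a minimal sketch, not part of the original thread: the helper name `find_parse_error` and the `BROKEN` sample are hypothetical, and the check only covers well-formedness, not validation against a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately broken sample: <message> is never closed.
BROKEN = """<conversations>
  <conversation id="a">
    <message line="2">
  </conversation>
</conversations>"""


def find_parse_error(xml_text):
    """Return (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple reported by the parser.
        return err.position
    return None


position = find_parse_error(BROKEN)
if position is not None:
    print("not well-formed near line {}, column {}".format(*position))
```

The parser stops at the first problem it finds, so a file with several unclosed tags has to be fixed and re-checked one error at a time.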

diek · Answer 1 · 2017-08-19T04:15:28.400


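import xml.etree.ElementTree as ET

tree = ET.parse('conversations.xml')
# Walk every element in document order and print its tag with its
# attributes (containers) or its text (author, time, text).
for node in tree.iter():
    if node.tag == "conversations":
        continue  # root element, nothing to print
    if node.tag == "conversation":
        print("\n")  # visual break, new conversation
        print("{} {}".format(node.tag, node.attrib))
        continue
    if node.tag == "message":
        print("{} {}".format(node.tag, node.attrib))
        continue
    # Leaf elements: author, time, text
    print("{} {}".format(node.tag, node.text))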

Using the above, you should be able to check for an id with similar logic. If you are searching for 97964e7a9e8eb9cf78f2e4d7b2ff34c7, etc., make a list or dict of the ids:

authors = ['97964e7a9e8eb9cf78f2e4d7b2ff34c7']
for node in tree.iter():
    if node.tag == "author" and node.text in authors:
        print('found')
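
To also get the text of each matching message and write it to a file, one option is to iterate over the message elements instead of every node, so that author and text are read from the same parent. The following is a minimal sketch under assumptions: the function names `messages_by_authors` and `write_matches`, the tab-separated output format, and the inline `SAMPLE` (a well-formed excerpt of the question's data) are illustrative choices, not something given in the thread.

```python
import xml.etree.ElementTree as ET

# Well-formed excerpt of the question's data, used only for illustration.
SAMPLE = """<conversations>
  <conversation id="e621da5de598c9321a1d505ea95e6a2d">
    <message line="1">
      <author>97964e7a9e8eb9cf78f2e4d7b2ff34c7</author>
      <time>03:20</time>
      <text>Hola.</text>
    </message>
    <message line="2">
      <author>0158d0d6781fc4d493f243d4caa49747</author>
      <time>03:20</time>
      <text>hi.</text>
    </message>
  </conversation>
</conversations>"""


def messages_by_authors(root, wanted):
    """Yield (conversation id, author, text) for messages whose author is in wanted."""
    for conversation in root.iter("conversation"):
        for message in conversation.iter("message"):
            # findtext returns None when the child is missing, hence the fallback.
            author = (message.findtext("author") or "").strip()
            if author in wanted:
                text = message.findtext("text", default="")
                yield conversation.get("id"), author, text


def write_matches(root, wanted, out_path):
    """Write one tab-separated line per matching message; return the match count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for conv_id, author, text in messages_by_authors(root, wanted):
            out.write("{}\t{}\t{}\n".format(conv_id, author, text))
            count += 1
    return count


root = ET.fromstring(SAMPLE)
wanted = {"97964e7a9e8eb9cf78f2e4d7b2ff34c7"}  # a set gives fast membership tests
matches = list(messages_by_authors(root, wanted))
```

For a file on disk, `root = ET.parse('conversations.xml').getroot()` would replace the `ET.fromstring(SAMPLE)` line, and `write_matches(root, wanted, 'selected.txt')` would produce the output file (the file name `selected.txt` is a placeholder).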

edited Aug 19 '17 at 04:15

answered Aug 19 '17 at 03:59


Thank you so much Diek, you are a life saver. – T. A Aug 20 '17 at 04:11
@T.A glad to help, xml can be a pain. Please accept the answer when you have a minute, thank you – diek Aug 20 '17 at 15:23
I am actually getting the message location, i.e. something like this: , how do I get the node tag and value and export them to a file saved on my computer? Thanks, I really do not know much about XML and etree. – T. A Aug 20 '17 at 18:20
I am actually getting the message location, i.e. something like this: , how do I get the node tag and value and export them to a file saved on my computer? Thanks, I really do not know much about XML and etree. This is actually what I've done: – T. A Aug 20 '17 at 19:35
Thanks, I've been able to print out the node tags and texts, but it did not apply the condition on the authors list and printed out every author available. – T. A Aug 20 '17 at 23:13
This is my code: import xml.etree.ElementTree as ET tree = ET.parse(location.xml) root = tree.getroot() for node in tree.iter(): authors = ['97964e7a9e8eb9cf78f2e4d7b2ff34c7'] if node.tag == "author" and node.text in authors: print('found') – T. A Aug 20 '17 at 23:34
I mentioned this before: ensure that your XML is compliant, and if there are any mistakes you need to fix them. Use this online checker: http://www.xmlvalidation.com/ Keep in mind that there is a structure: compliant XML has an opening and a closing tag for each element. There are other ways to get to subtrees, but your data had a simple pattern that allowed for my approach, which works well imo. – diek Aug 21 '17 at 01:41
Here is a demo. As you can see, it only identifies 2 of the 4 authors. http://i.imgur.com/PCASheO.png – diek Aug 21 '17 at 01:53
Exactly, the output of the demo is the same as what I get, but what I want back is not only the text enclosed in the author tag, which is what happens right now, but also the text in the text tag. This I have found a bit difficult. – T. A Aug 21 '17 at 22:14
Exactly, the output of the demo is the same as what I get, but what I want back is not only the text enclosed in the author tag, which is what happens right now, but also the text in the text tag. This I have found a bit difficult. In your example, the iteration is done over the all_authors list, but what I actually want is for the iteration to be done over the whole XML, i.e. each conversation, – T. A Aug 21 '17 at 22:56
and while iterating over the child tags under conversation, it should check whether the text in the author tag can be found in a list of strings in the created dictionary. If found, that string and the text in the text tag of that particular conversation should be printed before it moves on to the next conversation. – T. A Aug 21 '17 at 22:57
The conversations tag is the root tag, the conversation tag is a child of conversations, and message is a child of conversation. message has 3 child tags: author, time and text; these tags are siblings. The condition is applied to the author tag; if it is met, the text in the author tag and the text in the text tag are to be printed for each conversation in conversations. – T. A Aug 21 '17 at 22:57
The first example shows how to get the text from the text tag. Please post the code you are trying. Use a pastebin, it is easy and simple: paste, save, and it will generate a URL. https://bpaste.net/+python – diek Aug 21 '17 at 23:27
I am seeing a bit of the problem, the author comes after the message – diek Aug 22 '17 at 01:15
Exactly. I am thinking that if I can turn the string in the author tag into an attribute of author, i.e. author id=string, and then make the text tag a child of the author tag, it will become easier to apply the condition, but I don't know how to write the code to do that. – T. A Aug 22 '17 at 05:39
Is this XML a live document, i.e. is it getting updated continually, or is this a one-off? I am thinking of putting the XML in a more sensible data structure, then extracting what you need. – diek Aug 22 '17 at 14:04
Looking at your data, is this the actual information? I ask because there are 2 conversations with the same id. – diek Aug 22 '17 at 23:31
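
The idea of moving the XML into a more sensible data structure could look roughly like the sketch below: flatten every message into a dict, then group the dicts by author. This is only one possible reading of that suggestion. The names `load_messages` and `group_by_author` and the repaired `SAMPLE` (the question's data with the unclosed message and the repeated conversation removed) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Repaired excerpt of the question's data (assumption: the malformed
# second conversation is reduced to its two complete messages).
SAMPLE = """<conversations>
  <conversation id="e621da5de598c9321a1d505ea95e6a2d">
    <message line="1">
      <author>97964e7a9e8eb9cf78f2e4d7b2ff34c7</author>
      <time>03:20</time>
      <text>Hola.</text>
    </message>
    <message line="2">
      <author>0158d0d6781fc4d493f243d4caa49747</author>
      <time>03:20</time>
      <text>hi.</text>
    </message>
  </conversation>
  <conversation id="3c517e43554b6431f932acc138eed57e">
    <message line="1">
      <author>505166bca797ceaa203e245667d56b34</author>
      <time>18:11</time>
      <text>hi</text>
    </message>
    <message line="3">
      <author>4b66cb4831680c47cc6b66060baff894</author>
      <time>18:11</time>
      <text>hey</text>
    </message>
  </conversation>
</conversations>"""


def load_messages(root):
    """Flatten the tree into a list of dicts, one per message."""
    rows = []
    for conversation in root.iter("conversation"):
        for message in conversation.iter("message"):
            rows.append({
                "conversation": conversation.get("id"),
                "line": message.get("line"),
                "author": (message.findtext("author") or "").strip(),
                "time": message.findtext("time", default=""),
                "text": message.findtext("text", default=""),
            })
    return rows


def group_by_author(rows):
    """Map each author id to the list of that author's message dicts."""
    by_author = defaultdict(list)
    for row in rows:
        by_author[row["author"]].append(row)
    return by_author


rows = load_messages(ET.fromstring(SAMPLE))
by_author = group_by_author(rows)
```

With the messages grouped this way, selecting the texts for a known author id becomes a dictionary lookup, for example `[row["text"] for row in by_author["505166bca797ceaa203e245667d56b34"]]`.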
