How to parse xml with lxml

Question

For example, I have this XML document:

<?xml version="1.0"?>
<a>
  <b>Text I need</b>
</a>
<a>
  <b>Text I need2</b>
</a>

How do I extract the text inside all of the b elements? I read my whole file into a string. I only know how to parse HTML, so I tried applying the same approach to XML, but it failed.

from lxml import html   
string = myfile.read();
tree = html.fromstring(string);
result = tree.xpath('//a/@b');

But it won't work.

What does "won't work" mean? Do you get an error or a blank result? — ErlVolton, Oct 29 '14 at 14:50
Did you read the `lxml` documentation? Why use the HTML parser if you have XML, in any case? — Martijn Pieters, Oct 29 '14 at 14:51
yes, i get empty string. I didnt understand the documentation for xml part. It was confusing. — Dancia, Oct 29 '14 at 14:51

score 1 · Accepted Answer · answered Oct 29 '14 at 18:40

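The first thing that you should do is make sure that your XML file is well-formed. An XML document must have exactly one root element, and the document in the question has two top-level a elements, so the lxml XML parser will fail on it. The root element can have any name; "body" is just the one used here. May I make this suggestion: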

<?xml version="1.0"?>
<body>
  <a>
    <b>Text I need</b>
  </a>
  <a>
    <b>Text I need2</b>
  </a>
</body>
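
To see the single-root rule in action, here is a minimal sketch that feeds both shapes to the parser from in-memory bytes. The variable names and the wrapper name "items" are my own, chosen only for illustration; any root name would do.

```python
from lxml import etree as et

# Two top-level <a> elements: not well-formed, because XML allows only one root.
broken = b"<a><b>Text I need</b></a><a><b>Text I need2</b></a>"

# The same content wrapped in a single root element ("items" is an arbitrary name).
fixed = b"<items>" + broken + b"</items>"

try:
    et.fromstring(broken)
    broken_parsed = True
except et.XMLSyntaxError:
    # lxml rejects the second top-level element as extra content.
    broken_parsed = False

root = et.fromstring(fixed)

print(broken_parsed)
print(root.tag, len(root))
```

The first input is rejected with an XMLSyntaxError, while the wrapped version parses and exposes both a elements as children of the root.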

Let us refer to this file as "foo.xml". Now that the document is well-formed, import etree from the lxml library:

from lxml import etree as et

Now it is time to parse the data and create a root object from which to start:

file_name = r"C:\foo.xml"
xmlParse = et.parse(file_name)  # Parse the XML file into an ElementTree
root = xmlParse.getroot()  # Get the root element (<body>)
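
Since the question mentions that the whole file has already been read into a string, here is a sketch of the in-memory alternative, assuming the content is available as bytes (the name xml_bytes is hypothetical). et.fromstring() returns the root element directly, so no getroot() call is needed. Passing bytes rather than str also avoids the ValueError that lxml raises for a str that carries an encoding declaration.

```python
from lxml import etree as et

# Assumed in-memory copy of foo.xml; in practice this could come from
# open(file_name, "rb").read().
xml_bytes = b"""<?xml version="1.0"?>
<body>
  <a>
    <b>Text I need</b>
  </a>
  <a>
    <b>Text I need2</b>
  </a>
</body>
"""

# fromstring() parses the bytes and hands back the root element itself.
root = et.fromstring(xml_bytes)

print(root.tag)
print([child.tag for child in root])
```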

Once we have the root object, we can use the iter() method to iterate over all b elements (the older getiterator() method is deprecated in favor of it). Because iter() returns an iterator, we can pass it to list() to save the element objects in a list. From there we can edit the text between the b tags:

bTags = list(root.iter("b"))  # Collect every b element in document order
bTags[0].text = "Change b tag 1."  # Was "Text I need"
bTags[1].text = "Change b tag 2."  # Was "Text I need2"
xmlParse.write(file_name)  # Overwrite the original XML file

The final output should look something like this:

<?xml version="1.0"?>
<body>
  <a>
    <b>Change b tag 1.</b>
  </a>
  <a>
    <b>Change b tag 2.</b>
  </a>
</body>
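
If the goal is only to read the text inside the b elements, as the question asks, a sketch like the following may be enough. It assumes the wrapped document from above, held in a hypothetical xml_bytes variable. It also shows why the original '//a/@b' came back empty: @b selects an attribute named b on a, not a child element.

```python
from lxml import etree as et

# Assumed in-memory copy of the well-formed document.
xml_bytes = b"""<?xml version="1.0"?>
<body>
  <a><b>Text I need</b></a>
  <a><b>Text I need2</b></a>
</body>
"""

root = et.fromstring(xml_bytes)

# '@b' asks for an attribute called b; the a elements have none.
attr_result = root.xpath("//a/@b")

# 'b/text()' asks for the text nodes of the b child elements.
texts = root.xpath("//a/b/text()")

# The same result without XPath, using iter() and the .text property.
texts_iter = [b.text for b in root.iter("b")]

print(attr_result)
print(texts)
print(texts_iter)
```

Both approaches give the same two strings here; the XPath form is handy when the selection depends on the parent path, while iter() is simpler when every b in the document is wanted.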
