I am a beginner in Python. I use Python 2.7 with ElementTree to parse XML files. I have a large XML file (~700 MB) that contains multiple root elements, each with its own XML declaration, for example:

 <?xml version="1.0" ?> <foo> <bar> <sometag> Mehdi  </sometag> <someothertag> blahblahblah </someothertag> . . . </bar> </foo>
 <?xml version="1.0" ?> <foo> <bar> <sometag> Hamidi </sometag> <someothertag> blahblahblah </someothertag> . . . </bar> </foo>
...
...

Each XML instance is on its own line. I need to parse this file in Python. I used ElementTree this way:

import xml.etree.ElementTree as ET
tree = ET.parse('filename.xml')
root = tree.getroot()

but it seems to access only the first XML instance (the first line). What is the proper way to parse all the XML instances in this type of file?

Mehdi Hamidi
  • That's not valid XML. Split the file into single lines and parse each line as an XML document. Use `manyLinesString.split('\n')`. – Thomas Weller Jul 10 '16 at 10:37
  • As @ThomasWeller said, this is not valid XML. As I see it, you have two options: (1) build your XML properly, or (2) if possible, read the file line by line and parse each line as a separate XML document. – Gal Dreiman Jul 10 '16 at 10:39
  • See also: http://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list-with-python – Thomas Weller Jul 10 '16 at 10:41
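
The line-by-line approach suggested in the comments could look roughly like the sketch below. It assumes every non-empty line holds exactly one complete XML document, as in the question's sample. The helper name `iter_documents` is made up for illustration and is not part of ElementTree. It streams over the lines instead of calling `split('\n')` on the whole content, so a ~700 MB file does not have to be loaded into memory at once.

```python
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield the root element of each one-line XML document.

    `lines` is any iterable of lines, e.g. a file opened in binary mode.
    `iter_documents` is a hypothetical helper name, not a library function.
    """
    for raw_line in lines:
        # Strip surrounding whitespace: the XML declaration must be the very
        # first thing in the string handed to the parser.
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Each line is parsed as an independent XML document.
        yield ET.fromstring(line)


# Possible usage (assumes 'filename.xml' from the question exists):
# with open('filename.xml', 'rb') as handle:
#     for root in iter_documents(handle):
#         for bar in root.iter('bar'):
#             print(bar.findtext('sometag'))
```

A line that is not a well-formed document will raise `ET.ParseError`, so you may want to catch that if the file contains anything else.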

2 Answers

You probably want to do this:

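from xml.etree import ElementTree as ET
# Note: ET.parse() expects a single well-formed document with one root element.
root = ET.parse("file.xml").getroot()
# iter() returns an iterator over every <bar> element under the root.
bars = root.iter("bar")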

You can also validate your XML online: https://www.xmlvalidation.com/

netkool

You can also use the lxml.etree.iterparse() method, which is fast. IBM recommends it: https://www.ibm.com/developerworks/xml/library/x-hiperfparse/

from lxml import etree

for _, elem in etree.iterparse("filename.xml"):
    if elem.tag == 'bar':
        print(elem.text)
    # Free the element's memory once it has been processed.
    elem.clear()