
I have a 4 GB XML file that I want to parse and convert to a pandas DataFrame to work on. Because the file is so large, the following code cannot convert it: it just keeps running and never produces any output. When I use the same code on a similar, smaller file, I get the correct output.

Can anyone suggest a solution? Perhaps code that speeds up the conversion from XML to a DataFrame, or a way to split the XML file into smaller subsets.

Also, should I work with such large XML files on my personal system (2 GB RAM), or should I use Google Colab? If Google Colab, is there a way to upload such large files to Drive, and from there to Colab, more quickly?

Here is the code I used:

import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("Badges.xml")
root = tree.getroot()

#Column names for DataFrame
columns = ['row Id',"UserId",'Name','Date','Class','TagBased']

#Creating DataFrame
df = pd.DataFrame(columns = columns)

#Converting XML Tree to a Pandas DataFrame

for node in root: 
    
    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")
    
    df = df.append(pd.Series([row_Id,UserId,Name,Date,Class,TagBased], index = columns), ignore_index = True)

Here is my XML file:

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
Ishan Dutta
  • Another approach, instead of parsing the whole XML at once, is to first create chunks of, say, 250 MB and parse them in parallel (a rough sketch of this idea follows the comment thread). Of course, this only works if the XML is a long, list-like structure of, say, transactions, people, or items where you know what to expect. You could have a reader jump to roughly the 250 MB mark in the file, find the right cutting point, jump again, and so on. – Willem Hendriks Jun 25 '20 at 15:33
  • One serious performance issue you experience: [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Jun 25 '20 at 15:36
  • @user3184950 Can you please write the code for converting the XML file into small chunks? I tried it but am facing problems. – Ishan Dutta Jun 25 '20 at 16:12
  • @Parfait Can you please suggest alternative code for what you mentioned? – Ishan Dutta Jun 25 '20 at 16:13
  • How to split the large file depends largely on the structure of your XML. It is a very searchable topic, with example code available here on Stack Overflow. – Willem Hendriks Jun 30 '20 at 13:35
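
A rough sketch of the splitting idea from the comments, assuming (as in the Stack Exchange data dumps) that every <row .../> element sits on its own line, so the file can be cut safely between lines. The chunk size and the split_xml helper are illustrative, not part of the original question:

def split_xml(file_path, rows_per_chunk=1_000_000, out_prefix="badges_chunk"):
    """Write the <row .../> elements into smaller, well-formed XML files."""
    chunk_idx, row_count, out = 0, 0, None
    with open(file_path, encoding="utf-8") as src:
        for line in src:
            if not line.lstrip().startswith("<row"):
                continue  # skip the XML declaration and the <badges> wrapper tags
            if out is None:
                out = open(f"{out_prefix}_{chunk_idx}.xml", "w", encoding="utf-8")
                out.write("<badges>\n")
            out.write(line)
            row_count += 1
            if row_count >= rows_per_chunk:
                out.write("</badges>\n")
                out.close()
                out, row_count, chunk_idx = None, 0, chunk_idx + 1
    if out is not None:  # close the final, partially filled chunk
        out.write("</badges>\n")
        out.close()

split_xml("Badges.xml")

Each resulting file is valid XML and can be parsed on its own, or in parallel.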

2 Answers


Consider using cElementTree instead of ElementTree

https://effbot.org/zone/celementtree.htm

The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory.

The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:

from elementtree import ElementTree

to

import cElementTree as ElementTree
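
Applied to the code in the question, the swap might look like the lines below. Note that on Python 3.3+ the standard xml.etree.ElementTree already uses the C accelerator automatically, and the cElementTree alias was removed in Python 3.9, so this change mainly helps on older interpreters:

try:
    import xml.etree.cElementTree as ET   # C implementation (removed in Python 3.9)
except ImportError:
    import xml.etree.ElementTree as ET    # already C-accelerated on Python 3.3+

tree = ET.parse("Badges.xml")
root = tree.getroot()
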
Mads Hansen
  • I tried cElementTree, but I am facing the same problem: the code gets stuck and no output is produced, it just keeps running. Can you suggest another version of the code, or code to split my XML file into smaller files? – Ishan Dutta Jun 25 '20 at 16:11

Consider iterparse for fast streaming processing that builds the tree incrementally. On each iteration, append to a list of dictionaries that you then pass to the pandas.DataFrame constructor once, outside the loop. Adjust the code below to the tag name of the repeating nodes (the root's children):

from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(elem.attrib)      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        elem.clear()      # discard the processed element to keep memory usage low

df = pd.DataFrame(dict_list)
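
Note that iterparse returns every attribute as a string, so all columns of the resulting DataFrame have object dtype. If you need typed columns for analysis, you could convert them afterwards, for example:

df['Date'] = pd.to_datetime(df['Date'])
df['Class'] = df['Class'].astype(int)
df['TagBased'] = df['TagBased'] == 'True'   # attribute values are the strings 'True'/'False'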
Parfait
  • I used the code you gave; it took 40 minutes to load, but I got many errors, which are mentioned below. – Ishan Dutta Jun 25 '20 at 17:10
  • I have added the XML file, please have a look. – Ishan Dutta Jun 26 '20 at 06:17
  • The code I had written gave no output because appending to the DataFrame on every iteration took too long; using your approach of appending to a list of dictionaries not only gave me an output but also took far less time. – Ishan Dutta Jun 30 '20 at 13:43
  • I used the same code for an almost identical XML file but am getting a key error that should not happen. Please have a look at the question if possible. Link: https://stackoverflow.com/questions/62660270/how-to-solve-key-error-while-xml-file-parsing-in-python – Ishan Dutta Jun 30 '20 at 14:54
  • If you are only interested in 'row' tags and use `lxml.etree.iterparse` (the standard-library `iterparse` does not accept a `tag` argument), you can specify the tag in the call: `for _, elem in iterparse(file_path, events=('end',), tag='row'):`. Then you can get rid of your `if elem.tag` test (a minimal sketch follows). – Tom Johnson Jul 04 '22 at 15:40
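
A minimal sketch of that lxml-based variant, assuming the third-party lxml package is installed (the tag= argument belongs to lxml.etree.iterparse; the standard-library iterparse does not accept it):

from lxml.etree import iterparse   # third-party lxml, not xml.etree
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

# lxml delivers only the matching <row> elements, so no tag check is needed
for _, elem in iterparse(file_path, events=("end",), tag="row"):
    dict_list.append(dict(elem.attrib))
    elem.clear()

df = pd.DataFrame(dict_list)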