
I am working on a project that requires me to parse massive XML files into JSON. I have written code that works, but it is too slow. I have looked at using lxml and BeautifulSoup but am unsure how to proceed.

I have included my code below. It works exactly as it is supposed to, except that it is too slow: it took around 24 hours to parse 100,000 records from a sub-100 MB file.

product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()


def record_string_to_dict(record_string):
    '''Takes a single record in string form, iterates through it, and
    sorts it into a dictionary. Only the nodes present in the parent_rss
    dict are added to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and then emptied.'''

    #Iterating through the string to find keys and values to put in to
    #single_record_dict.
    while record_string != record_string[::-1]:

        try:
            k = record_string.index('<')

            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l+1:]
            m = record_string.index('<')
            temp_value = record_string[:m]

            #Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]

            n = record_string.index('\n')
            record_string = record_string[n+2:]

            #Checking parent_rss dict to see if the key from the record is present. If it is,
            #the key is replaced with keys and added to single_record_dictionary.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value

        except Exception:
            break


#Counters and containers used by the main loop. mapped_nodes (the key-mapping
#dict checked inside record_string_to_dict) is defined elsewhere in the script.
final_list = []
single_record_dict = {}
break_counter = 0
flush_counter = 0
file_counter = 0

while len(read_product_data) > 10:

    #Goes through read_product_data to create blocks, each of which is a single
    #record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]

    #Runs the previous function with the input being the single string found above.
    record_string_to_dict(single_record_string)

    #Flushes single_record_dict to final_list, and empties the dict for the next
    #record.
    final_list.append(single_record_dict)
    single_record_dict = {}

    #Removes the record that was previously processed.
    read_product_data = read_product_data[j:]

    #For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')

    #Keeps track of the number of records. Once the set value is reached
    #in the if block, final_list is flushed to a new file.
    break_counter += 1
    flush_counter += 1

    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_' + str(file_counter) + '.txt', 'w')
        record_list.write(str(final_list))

        #file_counter keeps track of how many files have been created, so the next
        #file has a different int at the end.
        file_counter += 1
        record_list.close()

        #Resets break counter.
        break_counter = 0
        final_list = []

    #For testing purposes. Causes execution to stop once the number of files written
    #matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')
Zain Mobhani
  • Please include input XML and desired output JSON for a [reproducible](https://stackoverflow.com/help/mcve) example. – Parfait Jun 29 '17 at 18:50

2 Answers


Any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples: How can i convert an xml file into JSON using python?

Relevant code reproduced from above post:

xml2json

import xml2json
s = '''<?xml version="1.0"?>
    <note>
       <to>Tove</to>
       <from>Jani</from>
       <heading>Reminder</heading>
       <body>Don't forget me this weekend!</body>
    </note>'''
print xml2json.xml2json(s)

xmltodict

import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'

See this post if working in Python 3: https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/

import json
import xmltodict

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:    # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
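If xmltodict isn't available, the same nesting behaviour can be approximated with the standard library. This is a minimal sketch (the `etree_to_dict` helper is my own illustration, not part of any package above) that reproduces the list-collapsing shown in the xmltodict example:

```python
import json
import xml.etree.ElementTree as ET

def etree_to_dict(elem):
    # Recursively convert an Element into nested dicts, collecting
    # repeated child tags into lists (xmltodict-style behaviour).
    children = list(elem)
    if not children:
        return elem.text
    d = {}
    for child in children:
        value = etree_to_dict(child)
        if child.tag in d:
            if not isinstance(d[child.tag], list):
                d[child.tag] = [d[child.tag]]
            d[child.tag].append(value)
        else:
            d[child.tag] = value
    return d

root = ET.fromstring('<e> <a>text</a> <a>text</a> </e>')
print(json.dumps({root.tag: etree_to_dict(root)}))  # {"e": {"a": ["text", "text"]}}
```

Note this drops mixed text content and attributes, which xmltodict handles; it is only meant to show the shape of the conversion.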
aatishk
  • I would definitely try the item_callback argument here to add elements to the JSON file incrementally. Indeed, I am not sure the whole file can be held in memory as a single dictionary. Check out help(xmltodict.parse) for more. – Q Caron Jun 29 '17 at 09:35
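On the memory point: for files that are large relative to RAM, `xml.etree.ElementTree.iterparse` from the standard library lets you process one record at a time instead of building the whole document. A minimal sketch, assuming the `<record>` elements from the question (`stream_records` and `convert_stream` are hypothetical helpers, not an xmltodict API):

```python
import json
import xml.etree.ElementTree as ET

def stream_records(xml_path, record_tag='record'):
    # iterparse yields each element as its closing tag is read, so the
    # whole tree never has to be held in memory at once.
    for _event, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # discard the processed record to free memory

def convert_stream(xml_path, json_path):
    # Write one JSON object per line so the output never accumulates either.
    with open(json_path, 'w') as out:
        for record in stream_records(xml_path):
            out.write(json.dumps(record) + '\n')
```

Writing one JSON object per line (JSON Lines) keeps the output append-only, which matches the chunked flushing the question's code is doing by hand.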

You definitely don't want to be hand-parsing the XML. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100 MB you would benefit from a streaming processor such as Saxon-EE, but up to that kind of size the open-source Saxon-HE should be able to handle it.

You haven't shown the source XML or target JSON, so I can't give you specific code. The assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how different parts of your input XML should be handled.

Michael Kay