Modify large xml file using lxml

Question

Language :- Python 2.7.6

File Size :- 1.5 GB

XML Format

<myfeed>
    <product>
        <id>876543</id>
        <name>ABC</name>
        ....
     </product>

    <product>
        <id>876567</id>
        <name>DEF</name>
        ....
     </product>

    <product>
        <id>986543</id>
        <name>XYZ</name>
        ....
     </product>

I have to

A) Read all the nodes <product>

B) Delete some of these nodes ( if the <id> attribute's text is in python set()

C) Update/Alter few nodes ( if the <id> attribute's text is in python dict

D) Append/Write some new nodes

The problem is my XML file is huge ( approx 1.5 GB ). I did some research and decide to use lxml for all these purposes.

I am trying to use iterparse() with element.clear() to achieve this because it will not consume all my memory.

for event, element in etree.iterparse(big_xml_file,tag = 'product'):
        for child in element:
            if child.tag == unique_tag:
                if child.text in products_id_hash_set_to_delete: #python set()
                    #delete this element node

                else:
                    if child.text in products_dict_to_update:
                        #update this element node  
                        else:
                            print child.text
        element.clear()

Note:- I want to achieve all these 4 task in one scan of the XML file

Questions

1) Can I achieve all this in one scan of the file ?

2) If yes, how to delete and update the element nodes I am processing?

3) Should I use tree.xpath() instead ? If yes, how much memory will it consume for 1.5 GB file or does it works in same way as iterparse()

I am not very experienced in python. I am from Java background.

Martin Bonner supports Monica · Accepted Answer · 2015-12-21T16:24:30.673

2

You can't edit an XML file in-place. You have to write the output to a new (temporary) file, and then replace the original file with the new file.

So the basic algorithm is:

Loop over all elements.
If the node is one to delete, proceed to the next element
If the node is one to change, change its value
Write out the node ««« This is the crucial bit you are missing
When you are about to finish processing a node which is a parent of one of the new nodes, write out the new node, and remove it from the collection of new nodes.
Close the output file
Rename.

To answer the supplemental question: You need to realize that an XML file is a (long) string of characters. If you want to insert a character, you have to shuffle all the other ones up; if you want to delete a character, you have to shuffle all the other ones down. You can't do that with a file; you can't just delete a character from the middle of a file.

If you have millions of elements (and this is a real problem, not an exercise for a class), then you need to use a database. SQLite is my first thought when somebody says "database", but as Charles Duffy points out below, an XQuery database would probably be a better place to start given you already have XML. See BaseX or eXist for some open-source implementations.

edited Dec 21 '15 at 16:24

answered Dec 16 '15 at 08:19

Martin Bonner supports Monica

28,528
3
51
88

Is there no way to edit the XML in-place? If not in python , than may be in Java or any other language.( I understand that this process is more OS dependent). How 'sed -i' command works? sed also streams the edited content into new file? – Yogesh Yadav Dec 17 '15 at 04:21
My file is very big (millions of element ) and I need to delete and update only few (may be 300-500) elements . How much time it would take to write the whole file again ? – Yogesh Yadav Dec 17 '15 at 04:24
1

According to this answer 'sed -i' also writes the whole file and rename it behind the scenes. http://stackoverflow.com/questions/12696125/solaris-sed-edit-the-file-in-place – Yogesh Yadav Dec 17 '15 at 04:28
Performance: You'll have to try it. `grep -v "complicated pattern that doesn't match"` should give you an order-of-magnitude estimate. – Martin Bonner supports Monica Dec 17 '15 at 06:20
1

Eh? If we're talking "obvious choice", that would be an XQuery database (with native support for storing, searching through, indexing, and updating XML structures). See BaseX or eXist for some comprehensive open-source implementations. – Charles Duffy Dec 21 '15 at 16:15
SQLite is great, but the work involved in building a SQL schema to match a given XML structure... well, it might actually be pretty easy if the OP's data is no more complex than their example here, but I don't want to make the assumption. – Charles Duffy Dec 21 '15 at 16:19

Modify large xml file using lxml

1 Answers1