1

Language :- Python 2.7.6

File Size :- 1.5 GB

XML Format

<myfeed>
    <product>
        <id>876543</id>
        <name>ABC</name>
        ....
     </product>

    <product>
        <id>876567</id>
        <name>DEF</name>
        ....
     </product>

    <product>
        <id>986543</id>
        <name>XYZ</name>
        ....
     </product>

I have to

A) Read all the nodes <product>

B) Delete some of these nodes ( if the <id> attribute's text is in python set()

C) Update/Alter few nodes ( if the <id> attribute's text is in python dict

D) Append/Write some new nodes

The problem is my XML file is huge ( approx 1.5 GB ). I did some research and decide to use lxml for all these purposes.

I am trying to use iterparse() with element.clear() to achieve this because it will not consume all my memory.

for event, element in etree.iterparse(big_xml_file,tag = 'product'):
        for child in element:
            if child.tag == unique_tag:
                if child.text in products_id_hash_set_to_delete: #python set()
                    #delete this element node

                else:
                    if child.text in products_dict_to_update:
                        #update this element node  
                        else:
                            print child.text
        element.clear()

Note:- I want to achieve all these 4 task in one scan of the XML file

Questions

1) Can I achieve all this in one scan of the file ?

2) If yes, how to delete and update the element nodes I am processing?

3) Should I use tree.xpath() instead ? If yes, how much memory will it consume for 1.5 GB file or does it works in same way as iterparse()

I am not very experienced in python. I am from Java background.

Mr Lister
  • 45,515
  • 15
  • 108
  • 150
Yogesh Yadav
  • 4,557
  • 6
  • 34
  • 40

1 Answers1

2

You can't edit an XML file in-place. You have to write the output to a new (temporary) file, and then replace the original file with the new file.

So the basic algorithm is:

  • Loop over all elements.
  • If the node is one to delete, proceed to the next element
  • If the node is one to change, change its value
  • Write out the node ««« This is the crucial bit you are missing
  • When you are about to finish processing a node which is a parent of one of the new nodes, write out the new node, and remove it from the collection of new nodes.
  • Close the output file
  • Rename.

To answer the supplemental question: You need to realize that an XML file is a (long) string of characters. If you want to insert a character, you have to shuffle all the other ones up; if you want to delete a character, you have to shuffle all the other ones down. You can't do that with a file; you can't just delete a character from the middle of a file.

If you have millions of elements (and this is a real problem, not an exercise for a class), then you need to use a database. SQLite is my first thought when somebody says "database", but as Charles Duffy points out below, an XQuery database would probably be a better place to start given you already have XML. See BaseX or eXist for some open-source implementations.

  • Is there no way to edit the XML in-place? If not in python , than may be in Java or any other language.( I understand that this process is more OS dependent). How 'sed -i' command works? sed also streams the edited content into new file? – Yogesh Yadav Dec 17 '15 at 04:21
  • My file is very big (millions of element ) and I need to delete and update only few (may be 300-500) elements . How much time it would take to write the whole file again ? – Yogesh Yadav Dec 17 '15 at 04:24
  • 1
    According to this answer 'sed -i' also writes the whole file and rename it behind the scenes. http://stackoverflow.com/questions/12696125/solaris-sed-edit-the-file-in-place – Yogesh Yadav Dec 17 '15 at 04:28
  • Performance: You'll have to try it. `grep -v "complicated pattern that doesn't match"` should give you an order-of-magnitude estimate. – Martin Bonner supports Monica Dec 17 '15 at 06:20
  • 1
    Eh? If we're talking "obvious choice", that would be an XQuery database (with native support for storing, searching through, indexing, and updating XML structures). See BaseX or eXist for some comprehensive open-source implementations. – Charles Duffy Dec 21 '15 at 16:15
  • SQLite is great, but the work involved in building a SQL schema to match a given XML structure... well, it might actually be pretty easy if the OP's data is no more complex than their example here, but I don't want to make the assumption. – Charles Duffy Dec 21 '15 at 16:19