
I have a file called Books.xml. It is huge (2 GB), with a structure similar to this:

<Books>
    <Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
            <Author>Steve McConnell</Author>
            <Pages>960</Pages>
            <ISBN>0735619670</ISBN>        
            <BookName>Application Architecture Guide 2</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
    <Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
            <Author>Guido Rossum</Author>
            <Pages>960</Pages>
            <ISBN>0735619690</ISBN>
            <BookName>Python Rocks</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
</Books>

I have tried to split it on the Book tag like this

import xml.etree.cElementTree as etree
filename = r'D:\test\Books.xml'
context = iter(etree.iterparse(filename, events=('start', 'end')))
_, root = next(context)
for event, elem in context:
    if event == 'start' and elem.tag == 'Book':
        print(etree.dump(elem))
        root.clear()

I get the result like this

<Book>
        <Detail ID="67">
            <BookName>Code Complete 2</BookName>
            <Author>Steve McConnell</Author>
            <Pages>960</Pages>
            <ISBN>0735619670</ISBN>
            <BookName>Application Architecture Guide 2</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>

None
<Book>
        <Detail ID="87">
            <BookName>Rocking Python</BookName>
            <Author>Guido Rossum</Author>
            <Pages>960</Pages>
            <ISBN>0735619690</ISBN>
            <BookName>Python Rocks</BookName>
            <Author>Microsoft Team</Author>
            <Pages>496</Pages>
            <ISBN>073562710X</ISBN>
        </Detail>
    </Book>
None
  1. How do I get rid of the None?
  2. I would like to store the fragments, split on Book, in some sort of queue and then have another program dequeue them.
user3249433

1 Answer


Here is how it can be done with Celery for inter-process queueing and lxml for parsing, serializing and pretty-printing the XML:

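# tasks.py
from lxml import etree
from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def print_book(book_xml):
    # book_xml arrives as a str holding one serialized <Book> element
    book = etree.fromstring(book_xml)
    # do something interesting ...
    print(etree.tostring(book, pretty_print=True, encoding='unicode'))

# caller.py
from tasks import print_book
from lxml import etree

# with tag="Book", iterparse yields each <Book> once it is fully parsed
for _, book in etree.iterparse('Books.xml', tag="Book"):
    # serialize to str (not bytes) so the argument works with Celery's default JSON serializer
    book_xml = etree.tostring(book, encoding='unicode')
    print_book.delay(book_xml)
    # release the element's children so the whole file is not kept in memory
    book.clear()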
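
On the None in the question's output: etree.dump() writes the element to stdout itself and returns None, so wrapping it in print() prints that return value as well. The question's loop also reacts to 'start' events, at which point the element is not guaranteed to be complete. Below is a minimal standard-library sketch of the same split, assuming Python 3. The names iter_book_fragments and enqueue_books are made up for this illustration, and queue.Queue is only an in-process stand-in for whatever broker (such as the Celery setup above) the second program would read from.

```python
import io
import queue
import xml.etree.ElementTree as ET


def iter_book_fragments(source, tag="Book"):
    """Yield each <tag> element of `source` serialized as a str.

    `source` may be a filename or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        # act on 'end' so the element and all its children are fully parsed
        if event == "end" and elem.tag == tag:
            # tostring returns the markup instead of printing it, so no None
            yield ET.tostring(elem, encoding="unicode")
            # drop finished children from the root to keep memory bounded
            root.clear()


def enqueue_books(source, q):
    """Put every Book fragment on queue `q` and return how many were queued."""
    count = 0
    for fragment in iter_book_fragments(source):
        q.put(fragment)
        count += 1
    return count


if __name__ == "__main__":
    # small in-memory sample standing in for the 2 GB file
    sample = io.BytesIO(
        b"<Books>"
        b"<Book><Detail ID='67'><BookName>Code Complete 2</BookName></Detail></Book>"
        b"<Book><Detail ID='87'><BookName>Rocking Python</BookName></Detail></Book>"
        b"</Books>"
    )
    q = queue.Queue()
    enqueue_books(sample, q)
    while not q.empty():
        print(q.get())
```

For the real file, passing the path (for example r'D:\test\Books.xml') as source should work the same way, since iterparse accepts either a filename or a file object.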
Guy Gavriely