
I've been googling for the past few days, but I simply couldn't find any remotely similar problem :(

My script in Python 3 has a simple objective:

  1. connect to MySQL database and fetch data
  2. create XML with lxml
  3. save that XML to file

Usually I have no problems with XML files containing 5000+ elements, but in this case I'm hitting the memory limit of my VPS (Amazon EC2 micro). My code (core part):

# imports needed by this snippet; config(), Trips, TripsImages and target_xml
# are defined elsewhere in the script
from datetime import datetime
from lxml import etree
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(config('DB_URI'))
Session = sessionmaker(bind=engine)
session = Session()

query = session.query(Trips.Country,
                      Trips.Region,
                      Trips.Name,
                      Trips.Rebate,
                      Trips.Stars,
                      Trips.PromotionName,
                      Trips.ProductURL,
                      Trips.SubProductURL,
                      Trips.Date,
                      Trips.City,
                      Trips.Type,
                      Trips.Price,
                      TripsImages.ImageURL) \
    .join(TripsImages) \
    .all()

# define namespace xmlns:g
XMLNS = "{http://base.google.com/ns/1.0}"
NSMAP = {"g": "http://base.google.com/ns/1.0"}

# create root rss and channel
rss = etree.Element("rss", nsmap=NSMAP, attrib={"version": "2.0"})
channel = etree.SubElement(rss, "channel", attrib={"generated": str(datetime.now())})

# add <channel> title and description
channel_title = etree.SubElement(channel, "title")
channel_link = etree.SubElement(channel, "link")
channel_description = etree.SubElement(channel, "description")

channel_title.text = "Trips"
channel_link.text = "https://example.com"
channel_description.text = "Description"

# generate xml elements
for count, elem in enumerate(query):
    item = etree.SubElement(channel, "item")

    url = "/".join(["https://example.com",
                    elem.ProductURL,
                    elem.SubProductURL,
                    datetime.strftime(elem.Date, '%Y%m%d')
                    ])
    price_discounted = round(elem.Price - elem.Price * (elem.Rebate / 100))

    etree.SubElement(item, XMLNS + "id").text = str(count)
    etree.SubElement(item, XMLNS + "title").text = elem.Country
    etree.SubElement(item, XMLNS + "description").text = elem.Product
    etree.SubElement(item, XMLNS + "link").text = url
    etree.SubElement(item, XMLNS + "image_link").text = elem.ImageURL
    etree.SubElement(item, XMLNS + "condition").text = "new"
    etree.SubElement(item, XMLNS + "availability").text = "in stock"
    etree.SubElement(item, XMLNS + "price").text = str(elem.Price)
    etree.SubElement(item, XMLNS + "sale_price").text = str(price_discounted)
    etree.SubElement(item, XMLNS + "brand").text = "Brand"
    etree.SubElement(item, XMLNS + "additional_image_link").text = elem.ImageURL
    etree.SubElement(item, XMLNS + "custom_label_0").text = elem.Date.strftime("%Y-%m-%d")
    etree.SubElement(item, XMLNS + "custom_label_1").text = elem.Type
    etree.SubElement(item, XMLNS + "custom_label_2").text = str(elem.Stars / 10)
    etree.SubElement(item, XMLNS + "custom_label_3").text = elem.City
    etree.SubElement(item, XMLNS + "custom_label_4").text = elem.Country
    etree.SubElement(item, XMLNS + "custom_label_5").text = elem.PromotionName


# finally, serialize XML and save as file
with open(target_xml, "wb") as file:
    file.write(etree.tostring(rss, encoding="utf-8", pretty_print=True))

I'm using SQLAlchemy for querying the database and lxml for generating the XML file. Fetching the data from the DB already creates a list with 228,890 elements, which uses a lot of memory. Building the XML then creates another in-memory object, resulting in a total of approximately 1.5 GB of RAM.

This code works fine on my laptop with 8 GB of RAM, but when executing on Amazon EC2 with 1 GB of RAM and 1 GB of swap, I hit the write() operation and get 'Killed' from Linux.

There's a lot covered on Stack Overflow when it comes to parsing large XML files, but I couldn't find anything about writing large files in Python, apart from advice to avoid multiple I/O operations :(

Dave

1 Answer


I think what you need is the yield_per() function, so you do not have to process all the results at once but can work through them in chunks. This way you save memory. You can read more about this function in the SQLAlchemy documentation.
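
A minimal sketch of how that would look with the query from the question (the batch size, and the trimmed column list, are my assumptions):

query = session.query(Trips.Country,
                      Trips.Price,
                      TripsImages.ImageURL) \
    .join(TripsImages) \
    .yield_per(1000)  # stream rows in batches of 1000 instead of calling .all()

for count, elem in enumerate(query):
    ...  # build one <item> per row here; the full result list is never materialized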

However, take note that yield_per() may skip some of your query rows in certain cases; the answer in the linked question provides a detailed explanation. If, after reading it, you decide you do not want to use yield_per(), you may refer to the other answers posted on that Stack Overflow question as well.

Another tip for processing a large list is to use yield, so you do not have to load all the entries into memory at once but can process them one by one. Hope it helps.
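
As a rough illustration of the yield idea (build_item() and write_out() are hypothetical helpers, not part of the question's code):

def iter_items(query):
    for count, elem in enumerate(query):
        yield build_item(count, elem)  # produce one <item> element at a time

for item in iter_items(query):
    write_out(item)  # each item is handled and dropped; no big list is kept around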

Da_Pz
  • Hey, thanks for the tips! I used the yield_per() method and it seemed to improve memory use. Unfortunately it seems the problem lies in writing the file. This is what logging returns for me: INFO:__main__:Adding product no. 228887 INFO:__main__:Adding product no. 228888 INFO:__main__:Adding product no. 228889 INFO:__main__:Saving XML to file Killed – Dave Mar 06 '18 at 07:37
  • Are you doing file I/O in the yield_per() as well? – Da_Pz Mar 06 '18 at 07:49
  • No, I replaced session.query().all() with session.query().yield_per(100). The I/O operation is outside the loop where the XML tree is created (just like in the sample code in the question). I was advised to append data to the file inside the loop, but it seems lxml can't do that without closing the root tag (or at least I haven't found any info about it in the docs). Right now I've increased the swap size on the VPS and that helped, but the problem persists. – Dave Mar 06 '18 at 08:01
  • In order to write such large XML data to a file, I would suggest you use SAX, and lxml does support it. By doing so, you can write XML elements in chunks without causing memory issues (see the sketch after these comments). – Da_Pz Mar 06 '18 at 08:47
  • Thanks, will look into it! Closing question then :) – Dave Mar 06 '18 at 10:44
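
For reference, a minimal sketch of incremental serialization with lxml's etree.xmlfile API, which writes each element to the file as it is built instead of holding the whole tree in memory. XMLNS, NSMAP, target_xml and query are the names from the question; using xmlfile rather than the SAX interface mentioned above is my substitution, and the channel metadata is trimmed for brevity:

from lxml import etree

with etree.xmlfile(target_xml, encoding="utf-8") as xf:
    with xf.element("rss", nsmap=NSMAP, attrib={"version": "2.0"}):
        with xf.element("channel"):
            title = etree.Element("title")
            title.text = "Trips"
            xf.write(title)  # small header elements can still be built normally
            for count, elem in enumerate(query):  # query streamed with yield_per()
                item = etree.Element("item")
                etree.SubElement(item, XMLNS + "id").text = str(count)
                # ... build the remaining g: sub-elements exactly as in the question ...
                xf.write(item)  # serialized to the file and freed right away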