
I'm currently working on a Pythonic way of parsing OpenStreetMap province/state dumps, which as far as I know just comes down to handling very large XML files (right?).

I'm currently using lxml's etree.iterparse to parse the dump for the Province of Quebec (quebec-latest.osm.bz2). I'd like to pull every entry that has highway information, convert it to JSON, save it to file, and flush, but it doesn't seem to be working.

I'm running an i7-4770, 16 GB of RAM, a 128 GB SSD, and OS X 10.9. When I launch the code below, my RAM fills up completely within a couple of seconds and my swap within 30 seconds. After that my system either asks me to close applications to make room, or eventually freezes.

Here's my code. You'll most likely notice a lot of bad/garbage code in there, but I got to the point where I was plugging in whatever I could find in the hope of making it work. Any help on this is greatly appreciated. Thanks!

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
import xmltodict, json, sys, os, gc

hwTypes = ['motorway', 'trunk', 'primary', 'secondary', 'tertiary', 'pedestrian', 'unclassified', 'service']

# Enable garbage collection
gc.enable()

def processXML(tagType):

    f = open('quebecHighways.json', 'w')
    f.write('[')
    print 'Processing'
    for event, element in etree.iterparse('quebec-latest.osm', tag=tagType):
        data = etree.tostring(element)
        data = xmltodict.parse(data)
        keys = data[tagType].keys()
        if 'tag' in keys:
            if isinstance(data[tagType]['tag'], dict):
                if data[tagType]['tag']['@k'] == 'highway':
                    if data[tagType]['tag']['@v'] in hwTypes:
                        f.write(json.dumps(data)+',')
                        f.flush() #Flush Python
                        os.fsync(f.fileno()) #Flush System
                        gc.collect() # Garbage collect
            else:
                for y in data[tagType]['tag']:
                    if y['@k'] == 'highway':
                        if y['@v'] in hwTypes:
                            f.write(json.dumps(data)+',')
                            f.flush()
                            os.fsync(f.fileno())
                            gc.collect()
                            break

        # Supposedly this is supposed to help clean up my RAM.
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

    f.write(']')
    f.close()
    return 0

processXML('way')
Dustin

2 Answers


The xmltodict library stores the generated dictionary in memory, so if your data dictionaries are big, this is not really a good idea. Using iterparse alone would be more efficient.

Another option would be to use the streaming mode offered by xmltodict. More info at http://omz-software.com/pythonista/docs/ios/xmltodict.html.
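For illustration, here is a minimal sketch of the streaming mode, reusing the file names and highway list from your question. Per the xmltodict docs, item_callback receives a (path, item) pair for each element at item_depth and must return True to keep parsing; the output layout here is just my own choice, so double-check the details against the README:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import xmltodict

HW_TYPES = {'motorway', 'trunk', 'primary', 'secondary', 'tertiary',
            'pedestrian', 'unclassified', 'service'}

out = open('quebecHighways.json', 'w')
out.write('[')
state = {'first': True}

def handle_item(path, item):
    # path is a list of (tag_name, attributes) tuples leading to this element;
    # item is the parsed subtree of one depth-2 element (<node>, <way>, ...).
    name = path[-1][0]
    if name == 'way':
        tags = (item or {}).get('tag') or []
        if isinstance(tags, dict):  # a single <tag> child parses as a dict, not a list
            tags = [tags]
        for tag in tags:
            if tag.get('@k') == 'highway' and tag.get('@v') in HW_TYPES:
                if not state['first']:
                    out.write(',')
                out.write(json.dumps({'way': item}))
                state['first'] = False
                break
    return True  # returning False would abort the parse

with open('quebec-latest.osm', 'rb') as f:
    # item_depth=2 hands each second-level element to the callback and then
    # discards it, so only one <way> is ever held in memory at a time.
    xmltodict.parse(f, item_depth=2, item_callback=handle_item)

out.write(']')
out.close()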

Ignacio

I would say that you are making your life more complicated than necessary. You are effectively serialising each whole subtree back to a string and dumping it into xmltodict, which then has to parse that whole subtree all over again.

If I were you, I would drop xmltodict, sit down, read a tutorial or two, and just use something standard: xml.sax (it is really not that difficult if you don't need too many jumps ahead; I am currently using it to convert the Bible) or iterparse, and stick with just that. It is really not that complicated.
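As a rough sketch of the iterparse-only route, reusing the file names and highway list from the question: read the <tag k="..." v="..."/> children straight off the element instead of round-tripping through tostring() and xmltodict. The JSON record layout (id, tags, nodes) is just one possible choice, not anything mandated by the data:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
from lxml import etree

HW_TYPES = {'motorway', 'trunk', 'primary', 'secondary', 'tertiary',
            'pedestrian', 'unclassified', 'service'}

with open('quebecHighways.json', 'w') as out:
    out.write('[')
    first = True
    for event, way in etree.iterparse('quebec-latest.osm', tag='way'):
        # Collect the way's tags directly from the element attributes;
        # no re-serialising, no second parse.
        tags = dict((t.get('k'), t.get('v')) for t in way.iterfind('tag'))
        if tags.get('highway') in HW_TYPES:
            record = {'id': way.get('id'),
                      'tags': tags,
                      'nodes': [nd.get('ref') for nd in way.iterfind('nd')]}
            if not first:
                out.write(',')
            out.write(json.dumps(record))
            first = False
        # Free the finished element and everything before it, so the tree
        # never grows; this is what keeps memory flat.
        way.clear()
        while way.getprevious() is not None:
            del way.getparent()[0]
    out.write(']')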

mcepl