I'm working on a Pythonic way of parsing OpenStreetMap province/state dumps, which as far as I know just comes down to knowing how to deal with very large XML files (right?).
I'm currently using lxml's etree.iterparse to parse the dump for the Province of Quebec (quebec-latest.osm.bz2). I'd like to pull every entry that has highway information, convert it to JSON, write it to a file, and flush, but it doesn't seem to be working.
I'm running an i7-4770 with 16 GB of RAM, a 128 GB SSD, and OS X 10.9. When I launch the code below, my RAM fills up completely within a couple of seconds, and my swap within 30 seconds. After that, my system either asks me to close applications to make room or eventually freezes.
Here's my code. You'll most likely notice a lot of bad/garbage code in there, but I got to the point where I was plugging in whatever I could find in the hope that it would work. Any help on this is greatly appreciated. Thanks!
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
import xmltodict, json, sys, os, gc

hwTypes = ['motorway', 'trunk', 'primary', 'secondary', 'tertiary', 'pedestrian', 'unclassified', 'service']

# Enable garbage collection
gc.enable()

def processXML(tagType):
    f = open('quebecHighways.json', 'w')
    f.write('[')
    print 'Processing '
    for event, element in etree.iterparse('quebec-latest.osm', tag=tagType):
        data = etree.tostring(element)
        data = xmltodict.parse(data)
        keys = data[tagType].keys()
        if 'tag' in keys:
            # xmltodict yields a dict for a single <tag> child, a list for several
            if isinstance(data[tagType]['tag'], dict):
                if data[tagType]['tag']['@k'] == 'highway':
                    if data[tagType]['tag']['@v'] in hwTypes:
                        f.write(json.dumps(data)+',')
                        f.flush()  # Flush Python's buffer
                        os.fsync(f.fileno())  # Flush the OS buffer
                        gc.collect()  # Garbage collect
            else:
                for y in data[tagType]['tag']:
                    if y['@k'] == 'highway':
                        if y['@v'] in hwTypes:
                            f.write(json.dumps(data)+',')
                            f.flush()
                            os.fsync(f.fileno())
                            gc.collect()
                            break
        # Supposedly this is supposed to help clean up my RAM.
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    f.write(']')
    f.close()
    return 0

processXML('way')
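
One likely cause, offered as an assumption rather than a confirmed diagnosis: with tag='way', iterparse only reports way elements, but the parser still builds every node element into the tree. OSM dumps list all nodes before the first way, so the cleanup code never gets a chance to run until millions of nodes are already in memory. The sketch below illustrates the usual bounded-memory pattern of looking at every top-level element as it closes and then emptying the root. It is not a drop-in fix for the script above: it targets Python 3, swaps lxml for the standard library's xml.etree.ElementTree so that it has no dependencies, skips xmltodict, and uses a hypothetical function name (extract_highways) and a hypothetical record layout (id, nodes, tags). It also writes separators between items only, since a trailing comma before ']' is not valid JSON.

```python
import io
import json
import xml.etree.ElementTree as ET

# Same highway values as hwTypes in the question, as a set for fast lookup.
HW_TYPES = {'motorway', 'trunk', 'primary', 'secondary', 'tertiary',
            'pedestrian', 'unclassified', 'service'}


def extract_highways(source, out):
    """Stream an OSM XML file and write matching ways to `out` as a JSON array.

    `source` is a path or a binary file object; `out` is a text file object.
    Returns the number of ways written. (Hypothetical helper, not an lxml API.)
    """
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the 'start' of the root <osm>
    depth = 0                # nesting depth below the root element
    count = 0
    out.write('[')
    for event, elem in context:
        if event == 'start':
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            # Nested element (<nd>, <tag>, ...) or the root itself closing.
            continue
        if elem.tag == 'way':
            tags = {t.get('k'): t.get('v') for t in elem.iter('tag')}
            if tags.get('highway') in HW_TYPES:
                record = {
                    'id': elem.get('id'),
                    'nodes': [nd.get('ref') for nd in elem.iter('nd')],
                    'tags': tags,
                }
                # Comma goes before every item except the first.
                out.write((',' if count else '') + json.dumps(record))
                count += 1
        # Drop every finished top-level element (nodes included), so the
        # tree never holds more than what the parser has just read.
        root.clear()
    out.write(']')
    return count
```

On the real dump, something like extract_highways(bz2.open('quebec-latest.osm.bz2', 'rb'), open('quebecHighways.json', 'w')) should work without decompressing to disk first, though that has not been measured against a file of this size. With lxml, the equivalent change would presumably be to drop the tag= filter, check element.tag inside the loop, and keep the clear/delete cleanup for every element.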