I have a large XML file (1.5 GB). It's made up of elements called <node>, and each node element has an "id" attribute, e.g. <node id="834839483"/>.
I would like to search the file for nodes that have duplicate ids and produce a dictionary or other structure with ids as keys and the number of duplicates of each as values, or print "no duplicates found" if there are none.
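To make the desired output concrete, here is a minimal sketch of the structure I have in mind. It assumes the ids have already been collected into a list, and it takes "number of duplicates" to mean the total number of occurrences of an id; the name `summarize_dups` and the sample ids are made up for illustration.

```python
from collections import Counter


def summarize_dups(ids):
    """Map each id that occurs more than once to its number of occurrences."""
    counts = Counter(ids)
    return {node_id: n for node_id, n in counts.items() if n > 1}


# Hypothetical sample ids, not taken from the real file.
sample_ids = ["27195852", "27195856", "27195856", "27195856"]
dups = summarize_dups(sample_ids)
if dups:
    print(dups)  # {'27195856': 3}
else:
    print("no duplicates found")
```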
I wrote something that works for a file a tenth the size:
# xml.etree.cElementTree was removed in Python 3.9; ElementTree behaves the same here.
import xml.etree.ElementTree as ET
from collections import Counter


def find_node_id_dups(filename):
    node_id_dups = set()
    empty_set = set()
    empty_set.add("None")
    node_counter = Counter()
    x = False
    # First pass: count how often each node id occurs.
    for _, element in ET.iterparse(filename):
        if element.tag == "node":
            katt = element.attrib['id']
            node_counter[katt] += 1
    # Second pass: collect every id that was seen more than once.
    for id_num in node_counter:
        if node_counter[id_num] != 1:
            node_id_dups.add(id_num)
            x = True
    if not x:
        return empty_set
    return node_id_dups


# REAL_FILE is the path to the 1.5 GB XML file.
node_id_dups = find_node_id_dups(REAL_FILE)
print("Node Id Duplicates\n")
print("\n".join(sorted(node_id_dups)))
I thought this would be a fast way to search because it only has to read over each element twice, but in the end I'm still trying to cram 1.5 GB of data into a single Counter object.
I don't know how to solve this because, in theory, I need to hold onto each id until the very end, since a duplicate could turn up at any stage of the search.
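One thing I am not sure about is whether the memory really goes to the counter or to the parsed tree. As far as I understand, `iterparse` still builds the whole tree unless finished elements are discarded along the way, so only the ids would need to be kept until the end. A rough sketch of that idea follows. It is not tested on the real file; `count_node_id_dups` is a made-up name, the inline sample is invented, and it assumes every <node> is a direct child of the root element.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_node_id_dups(source):
    """Stream `source` and return {id: occurrences} for ids seen more than once."""
    counts = Counter()
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, element in context:
        if event == "end" and element.tag == "node":
            counts[element.attrib["id"]] += 1
            # Assumption: nodes are direct children of the root, so clearing
            # the root drops the finished node and keeps the tree from growing.
            root.clear()
    return {node_id: n for node_id, n in counts.items() if n > 1}


# Small invented sample; a file path can be passed instead of a file object.
sample = io.BytesIO(
    b'<osm>'
    b'<node id="1"><tag k="a" v="b" /></node>'
    b'<node id="2" />'
    b'<node id="2" />'
    b'</osm>'
)
dups = count_node_id_dups(sample)
print(dups if dups else "no duplicates found")
```

With this approach only the id strings and their counts stay in memory, which should be far smaller than the file itself.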
EDIT: Here is an example of the file:
<?xml version="1.0" encoding="UTF-8"?>
<osm>
  <node changeset="7632877" id="27195852" lat="45.5408932" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:street" v="North Green St." />
  </node>
  <node changeset="7632878" id="27195856" lat="45.5408936" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:city" v="Lower case" />
  </node>
  <node changeset="7632878" id="27195856" lat="45.5408936" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:city" v="aower Lase" />
  </node>
  <node changeset="7632878" id="27195856" lat="45.5408936" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:city" v="aower Lase" />
  </node>
</osm>