
I have a large XML file (1.5 GB). It's made up of elements called <node>, and each node element has an "id" attribute, e.g. <node id="834839483"/>.

I would like to search the file for nodes that have duplicate ids and produce a dictionary or other structure with the ids as keys and the number of duplicates of each as values, or print "no duplicates found" if applicable.

I wrote something that works for a file a tenth the size.

import xml.etree.cElementTree as ET
from collections import Counter

def find_node_id_dups(filename):
    node_counter = Counter()

    # first pass: count how many times each node id occurs
    for _, element in ET.iterparse(filename):
        if element.tag == "node":
            node_counter[element.attrib['id']] += 1

    # second pass, over the counter: keep the ids seen more than once
    node_id_dups = {id_num for id_num, count in node_counter.items() if count > 1}
    if not node_id_dups:
        return {"None"}
    return node_id_dups

node_id_dups = find_node_id_dups(REAL_FILE)

print("Node Id Duplicates\n")
print("\n".join(sorted(node_id_dups)))

I thought this would be a fast way to search because it only has to touch each id twice, once while parsing and once while scanning the Counter, but in the end I'm still trying to cram 1.5 GB of data into a single Counter object.

I don't know how to solve this, because in theory I need to hold onto every id until the very end: a duplicate could be found at any stage of the search.

EDIT: Here is an example of the file:

<?xml version="1.0" encoding="UTF-8"?>
<osm>
<node changeset="7632877" id="27195852" lat="45.5408932" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:street" v="North Green St." />
</node>
<node changeset="7632878" id="27195856" lat="45.5408936" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:city" v="Lower case" />
</node>
<node changeset="7632878" id="27195856" lat="45.5408936" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:city" v="aower Lase" />
</node>
<node changeset="7632878" id="27195856" lat="45.5408936" lon="-122.8675556" timestamp="2011-03-21T23:25:58Z" uid="393906" user="Grant Humphries" version="11">
    <tag k="addr:city" v="aower Lase" />
</node>
</osm>
  • If the data is really too big, you might consider breaking the ids into ranges, ignoring all ids outside of a particular range on a given pass. Then do multiple passes, one for each range. That would let you limit how many items you have in your counter (see the multi-pass sketch after these comments). – Tom Karzes Nov 07 '17 at 15:49
  • It would help if you posted some input xml fragment – RomanPerekhrest Nov 07 '17 at 15:50
  • I think once you've built a list of ids, this is pretty much the same as https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python , other than perhaps you don't want to include ids with count of zero. – Acccumulation Nov 07 '17 at 15:59
  • Possible duplicate of [How can I count the occurrences of a list item in Python?](https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python) – Acccumulation Nov 07 '17 at 16:00
  • Splitting XML into smaller files might be an option. – BVengerov Nov 07 '17 at 16:09
  • @Acccumulation The file is too big to load into a list all at once like that – John Ingles Nov 07 '17 at 16:12
  • If I split into multiple files then the duplicates could be spread out and I would miss them – John Ingles Nov 07 '17 at 16:13
  • 1. You don't have to load the file, just the ids. How much space do the ids take? 2. If you split the file into multiple files based on id, then the duplicates won't be spread out because they will all have the same id. – Acccumulation Nov 07 '17 at 16:50
  • OK, I solved the issue by adding element.clear() inside the parsing loop. I didn't realize that iterparse doesn't drop elements after each iteration (see the iterparse sketch below). – John Ingles Nov 07 '17 at 21:06
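A minimal sketch of the fix described in the last comment, assuming the same iterparse loop as the question's code (the count_duplicate_ids name is mine): clearing each element as soon as it has been counted means only the Counter grows, not the parsed tree.

import xml.etree.cElementTree as ET
from collections import Counter

def count_duplicate_ids(filename):
    counts = Counter()
    for _, element in ET.iterparse(filename):
        if element.tag == "node":
            counts[element.attrib['id']] += 1
        # without this, iterparse keeps every parsed element in memory
        element.clear()
    return {id_num: n for id_num, n in counts.items() if n > 1}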
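And a minimal sketch of the multi-pass idea from the first comment, in case even a Counter of all distinct ids is too large. NUM_PASSES, the function name, and the hash-bucket test are illustrative assumptions; each pass re-reads the file but only has to hold roughly 1/NUM_PASSES of the ids at once.

import xml.etree.cElementTree as ET
from collections import Counter

NUM_PASSES = 8  # assumption: tune so one bucket of ids fits in memory

def dups_by_passes(filename):
    dups = {}
    for bucket in range(NUM_PASSES):
        counts = Counter()
        for _, element in ET.iterparse(filename):
            if element.tag == "node":
                id_num = element.attrib['id']
                # only count ids whose hash falls in the current bucket
                if hash(id_num) % NUM_PASSES == bucket:
                    counts[id_num] += 1
            element.clear()
        dups.update((k, v) for k, v in counts.items() if v > 1)
    return dups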

1 Answer


I would use a SAX parser for such a large XML file:

SAX Parser DOC

ContentHandler class DOC

Example code for what you are doing:

import xml.sax

class MySaxHandler(xml.sax.ContentHandler):
    def startDocument(self):
        self.my_nodes = {}

    def startElement(self, name, attrs):
        # count the 'id' attribute of every <node> element
        if name == "node":
            node_id = attrs.get('id')
            if node_id is not None:
                self.my_nodes[node_id] = self.my_nodes.get(node_id, 0) + 1

    def endDocument(self):
        # report only the ids that occurred more than once
        dups = {k: v for k, v in self.my_nodes.items() if v > 1}
        if not dups:
            print("no duplicates found")
        for key, val in sorted(dups.items()):
            print('id: {}  count: {}'.format(key, val))

parser = xml.sax.make_parser()
parser.setContentHandler(MySaxHandler())
parser.parse("your_filename.xml")
– ma22
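Note that SAX, like iterparse with element.clear(), streams the document instead of building a tree, so memory use is dominated by the id dictionary rather than by the 1.5 GB file itself; whether that dictionary fits in memory depends only on how many distinct ids the file contains.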