
I have to process a large XML document on which I need to perform several data cleaning and manipulation tasks.

The basic code below uses xml.etree.ElementTree. As the file is very large (about 2 GB), I would like to be able to print the value of my tagCounts accumulator variable at regular intervals.

What is the cleanest way to implement a timer using ElementTree printing every 3 minutes the content of self.tagCounts?

Thanks

import xml.etree.ElementTree as ET
import pprint

class TagCounter:
    def __init__(self):
        self.tagCounts = {}

    def start(self, tag, attrib):
        if tag in self.tagCounts:
            self.tagCounts[tag] += 1
        else:
            self.tagCounts[tag] = 1        

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        return self.tagCounts

def count_tags(filename):
    parser = ET.XMLParser(target = TagCounter())
    with open(filename, mode='r') as f:
        for line in f:
            parser.feed(line)
    t = parser.close()
    return t

if __name__ == "__main__":
    tags = count_tags("file.osm")
    pprint.pprint(tags)
Michael

1 Answer


What is the cleanest way to implement a timer using ElementTree printing every 3 minutes the content of self.tagCounts?

I don't see what ElementTree has to do with implementing a timer:

class TagCounter:
    def __init__(self):
        self.tag_counts = {}
        self.cancel_print = call_repeatedly(3*60, pprint.pprint, self.tag_counts)

    # ...

    def close(self):
        self.cancel_print()
        return self.tag_counts

where call_repeatedly(interval, function, *args) calls function(*args) every interval seconds.
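
call_repeatedly is not part of the standard library, so it has to be defined somewhere. A minimal sketch of one possible implementation, assuming a background thread built on threading.Event is acceptable (the daemon=True flag is an assumption made here so that the timer thread never keeps the program alive on its own):

```python
from threading import Event, Thread

def call_repeatedly(interval, function, *args):
    """Call function(*args) every `interval` seconds on a background thread.

    Returns a zero-argument callable that stops further calls.
    """
    stopped = Event()

    def loop():
        # Event.wait() returns False on timeout and True once the event
        # is set, so the first call happens after `interval` seconds.
        while not stopped.wait(interval):
            function(*args)

    # Assumption: a daemon thread, so an unfinished timer cannot block exit.
    Thread(target=loop, daemon=True).start()
    return stopped.set
```

Because the dictionary is printed from a second thread while the parser keeps updating it, printing a copy (for example dict(self.tag_counts)) may be the safer choice.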

jfs
  • Thanks @j-f-sebastian for your reply. Unfortunately, this is a bit confusing to me: the use of `ElementTree` in my example is an event-based parsing approach, chosen because of the large file size. My idea is to print a log to the console at regular intervals, and I don't understand how your code helps with that. Could you elaborate a bit more? Your suggestion would definitely be useful for non-event-based parsing. – Michael Feb 21 '15 at 13:47
  • @Michael: I've updated the answer to show that the code goes into *your* `class TagCounter` – jfs Feb 21 '15 at 13:55