
The job is to read a very large XML file line by line, accumulating what has been read so far in a string. Whenever the string contains a full record between the tags 'player' and '/player', the values of all XML tags within that record should be written to a text file as a tab-separated line, and the record removed from the accumulated chunk.

At the end of the process the unremoved part (the remainder) should be printed, to check that all records have been properly processed and nothing was left unprocessed.

I already have this code in Perl and it runs swiftly, but I want to switch to Python.

The Python script I currently have is extremely slow.

Is Python really that slow, or am I doing something wrong with the regular expressions?

import re

fh=open("players_list_xml.xml")
outf=open("players.txt","w")

x=""
cnt=0

while cnt<10000:
    line=fh.readline().rstrip()
    x+=line
    mo=re.search(r"<player>(.*)</player>",x)
    while mo:
        cnt=cnt+1
        if (cnt%1000)==0:
            print("processing",cnt)
        x=re.sub(re.escape(mo.group()),"",x)
        print("\t".join(re.findall(r"<[a-z]+>([^<]+)<[^>]+>",mo.group(1))),file=outf)
        mo=re.search(r"<player>(.*)</player>",x)

print("remainder",x)

outf.close()
fh.close()
javachessgui
  • Review the code you have posted. It's indented awfully. – hrust Oct 30 '15 at 18:38
  • Don't use regular expressions to parse XML. Use an [XML parser](https://docs.python.org/3.4/library/xml.html) to parse XML. – dtanders Oct 30 '15 at 18:41
  • It is properly indented in my text editor, but when I paste it with four spaces before it, it falls apart. – javachessgui Oct 30 '15 at 18:42
    I just wonder if regular expressions really are that much slower in Python than in Perl. – javachessgui Oct 30 '15 at 18:43
  • **Don't use tabs**! As you can see: they break code when pasting on SO. Just use spaces. The problem is trivial: it's a case of catastrophic backtracking. Python regexes don't do anything to handle that, while I believe Perl handles a few special cases efficiently. In any case, using a proper XML parser is simpler and faster. – Bakuriu Oct 30 '15 at 18:48
  • @javachessgui Why aren't you showing your text editor to us, then? Are these lines inside the loop, or is it formatted some other way? `line=fh.readline().rstrip()`, `x+=line`, `mo=re.search(r"<player>(.*)</player>",x)` What about all the other lines? – hrust Oct 30 '15 at 18:48
  • Does an XML parser work with extremely large files? My file is 220MB and contains more than half a million player records (each having more than 10 tags). – javachessgui Oct 30 '15 at 18:55
  • XML parsers are typically built to do stream parsing, which can handle a file of any size because it doesn't read the whole thing into memory at once. – dtanders Oct 30 '15 at 18:58
  • As I said, this works quickly in Perl. There was simply no need in Perl to learn how to use an XML parser. My question is mainly motivated by the fact that I have read Perl is on the verge of disappearing and Python is the future. I wonder what kind of future that is, if it does regular expressions 10 times slower than Perl. – javachessgui Oct 30 '15 at 19:28

3 Answers


Your regex is slow because of "backtracking", since you are using a "greedy" expression (this answer provides a simple Python example). Also, as mentioned in a comment, you should be using an XML parser to parse XML. Regex has never been very good for XML (or HTML).

In an attempt to explain why your specific expression is slow...

Let's assume you have three <player>...</player> elements in your XML. Your regex would start by matching the first opening <player> tag (that part is fine). Then, because you are using a greedy match, it would skip to the end of the document and start working backwards (backtracking) until it matched the last closing </player> tag. With a poorly written regex, it would stop there (all three elements would be in one match, with all the non-player elements between them as well). However, that match would obviously be wrong, so you make a few changes. Then the new regex would continue where the previous one left off, backtracking until it found the first closing </player> tag. Then it would backtrack further until it determined there were no additional </player> tags between the opening tag and the most recently found closing tag. Then it would repeat that process for the second set of tags, and again for the third.

All that backtracking takes a lot of time, and that is for a relatively small file. In a comment you mention your files contain "more than half a million records". Ouch! I can't imagine how long that would take. And you're actually matching all elements, not just "player" elements; then you are running a second regex against each element to check whether it is a player element. I would never expect this to be fast.

To avoid all that backtracking, you can use a "nongreedy" or "lazy" regex. For example (greatly simplified from your code):

r"<player>(.*?)</player>"

Note that the ? indicates that the preceding pattern (.*) is nongreedy. In this instance, after finding the first opening <player> tag, the engine moves forward through the document (rather than jumping to the end) until it finds the first closing </player> tag; it is then satisfied that the pattern has matched and moves on to find the second occurrence (searching only the part of the document after the end of the first occurrence).

Naturally, the nongreedy expression will be much faster. In my experience, nongreedy is almost always what you want when doing * or + matches (except for the rare cases when you don't).
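The difference is easy to see on a toy document (a made-up two-record string, not the asker's actual file):

```python
import re

# Toy XML with two player records and an unrelated element between them.
doc = "<player><name>a</name></player><x/><player><name>b</name></player>"

# Greedy: the single match runs from the first <player> to the LAST </player>,
# swallowing the <x/> element and the second record along the way.
greedy = re.search(r"<player>(.*)</player>", doc).group(1)
print(greedy)  # <name>a</name></player><x/><player><name>b</name>

# Nongreedy: each match stops at the first </player> that follows it.
lazy = re.findall(r"<player>(.*?)</player>", doc)
print(lazy)  # ['<name>a</name>', '<name>b</name>']
```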

That said, as stated previously, an XML parser is much more suited to parsing XML. In fact, many XML parsers offer some sort of streaming API which allows you to feed the document in pieces, in order to avoid loading the entire document into memory at once (regex does not offer this advantage). I'd start with lxml and then move to some of the builtin parsers if the C dependency doesn't work for you.
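As a sketch of what the streaming approach looks like, here is the standard library's ElementTree version (lxml exposes the same iterparse API); the tag names and the two-record document below are made up for illustration and would need adjusting to the real schema:

```python
import io
import xml.etree.ElementTree as ET

def dump_players(xml_file, outf):
    # iterparse yields each element as its closing tag is seen, so the
    # whole document never has to sit in memory at once.
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "player":
            # One tab-separated line per record, children in document order.
            print("\t".join(child.text or "" for child in elem), file=outf)
            elem.clear()  # release the processed record

# Hypothetical two-record document standing in for the real 220MB file.
xml = ("<list>"
       "<player><name>Carlsen</name><elo>2850</elo></player>"
       "<player><name>Anand</name><elo>2790</elo></player>"
       "</list>")
out = io.StringIO()
dump_players(io.BytesIO(xml.encode()), out)
print(out.getvalue())
```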

Waylan
  • I use the same regex in Perl, and do a bunch of additional processing, like counting value frequencies of keys, sorting keys etc. and the whole thing runs ten times faster than this simplified little Python script. How come I don't have to worry about greedy patterns in Perl? – javachessgui Oct 30 '15 at 19:39
  • I don't have much Perl experience, so I can't answer that question. However, I seem to recall reading somewhere (many years ago) that in Perl regex is often used for everything (even when it is ill-advised), while in Python regex is a last resort when nothing else will do the trick. If you are moving from Perl to Python, you might need to let go of your reliance on regex as an all-encompassing solution. In fact, some advanced things are not even possible with regex alone in Python due to differences in the supported syntax. – Waylan Oct 30 '15 at 20:00

With an XML parser:

import xml.parsers.expat

cnt=0
state="idle"
current_key=""
current_value=""
fields=[]

def start_element(name, attrs):
    global state
    global current_key
    global current_value
    if name=="player":
        state="player"
    elif state=="player":
        current_key=name
        current_value=""  # reset here: character data may arrive in chunks

def end_element(name):
    global state
    global current_key
    global current_value
    global fields
    global cnt
    if state=="player":
        if name=="player":
            state="idle"
            line="\t".join(fields)
            print(line,file=outf)
            fields=[]
            cnt+=1
            if (cnt%10000)==0:
                print(cnt,"players processed")
        else:
            fields.append(current_value)
            current_key=""
            current_value=""

def char_data(data):
    global current_value
    if state=="player" and current_key!="":
        current_value+=data  # append, not assign: expat may split text nodes

p = xml.parsers.expat.ParserCreate()
p.buffer_text = True  # ask expat to coalesce character data where possible

p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

fh=open("players_list_xml.xml")
outf=open("players.txt","w")

# Keep the newline: rstrip() would make a blank line look like EOF.
line=fh.readline()
while cnt<1000000 and line:
    p.Parse(line)
    line=fh.readline()
p.Parse("",True)  # signal end of document to expat

outf.close()
fh.close()
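One expat detail worth knowing: CharacterDataHandler may be called several times for a single text node (for example around entity references or buffer boundaries), so assigning rather than appending the data can silently drop part of a value. Setting buffer_text asks expat to coalesce the pieces. A minimal demonstration on a toy document:

```python
import xml.parsers.expat

chunks = []
p = xml.parsers.expat.ParserCreate()
p.buffer_text = True  # coalesce contiguous character data into one callback
p.CharacterDataHandler = chunks.append
p.Parse("<a>hello &amp; goodbye</a>", True)
print(chunks)  # the text arrives whole rather than split around the entity
```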

This is quite a lot of code.

At least this produces a 29MB text file from the original XML, which seems about the right size.

The speed is reasonable, though this is a simplistic version; more processing is needed on the records.

At the end of the day, it seems that a Perl script using only regexes works at the speed of a dedicated XML parser, which is remarkable.

javachessgui

The correct answer, as everyone else has said, is to use an XML parser to parse XML.

The answer to your question about why it's so much slower than your Perl version is that, for some reason, Python's regular expressions are just slow, much slower than Perl's at handling the same expression. I often find that code that uses regexps runs more than twice as fast in Perl.
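Independent of any engine difference, one habit that helps in Python is compiling the pattern once (avoiding repeated cache lookups by the module-level functions) and keeping it nongreedy; a small sketch on a toy input:

```python
import re

# Compile once instead of going through re's internal cache on every call.
player_re = re.compile(r"<player>(.*?)</player>")

chunk = "<player><a>1</a></player><player><a>2</a></player>"
print(player_re.findall(chunk))  # ['<a>1</a>', '<a>2</a>']
```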