Your regex is slow because of "backtracking": you are using a "greedy" expression (this answer provides a simple Python example). Also, as mentioned in a comment, you should be using an XML parser to parse XML. Regex has never been a good fit for XML (or HTML).
In an attempt to explain why your specific expression is slow...
Let's assume you have three <player>...</player> elements in your XML. Your regex would start by matching the first opening <player> tag (that part is fine). Then, because you are using a greedy match, it would skip to the end of the document and start working backwards (backtracking) until it matched the last closing </player> tag. With a poorly written regex, it would stop there (all three elements would be in one match, along with all the non-player elements between them). However, that match would obviously be wrong, so you make a few changes. The new regex would then continue where the previous one left off, backtracking until it found the first closing </player> tag. Then it would continue to backtrack until it determined there were no additional </player> tags between the opening tag and the most recently found closing tag. Then it would repeat that process for the second set of tags, and again for the third. All that backtracking takes a lot of time, and that is for a relatively small file. In a comment you mention your files contain "more than half a million records". Ouch! I can't imagine how long that would take. And you're actually matching all elements, not just "player" elements, and then running a second regex against each element to check whether it is a player element. I would never expect this to be fast.
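To make the greedy behaviour concrete, here is a minimal sketch. The sample document and the simplified pattern are made up for illustration (they are not your actual data or regex), and re.DOTALL is an assumption so that . can also cross line breaks in a real file.

```python
import re

# Hypothetical sample: three player elements with another element in between.
xml = (
    "<team>"
    "<player>Alice</player>"
    "<coach>Carol</coach>"
    "<player>Bob</player>"
    "<player>Dave</player>"
    "</team>"
)

# Greedy: .* consumes as much as possible, then backs up to the LAST </player>.
greedy_matches = re.findall(r"<player>(.*)</player>", xml, re.DOTALL)

print(len(greedy_matches))  # a single match that spans all three elements
print(greedy_matches[0])
```

The single captured group runs from just after the first opening tag to just before the last closing tag, so the coach element and the inner player tags all end up inside it.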
To avoid all that backtracking, you can use a "nongreedy" or "lazy" regex. For example (greatly simplified from your code):
r"<player>(.*?)</player>"
Note that the ? indicates that the previous pattern (.*) is nongreedy. In this instance, after finding the first opening <player> tag, it would continue to move forward through the document (not jumping to the end) until it found the first closing </player> tag. It would then be satisfied that the pattern had matched and move on to find the second occurrence (but only by searching the part of the document after the end of the first occurrence).
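As a rough illustration of the lazy pattern in action, the sketch below runs it over a made-up one-line sample (the names and the sample string are hypothetical; re.DOTALL is an assumption so the pattern would also work when an element spans several lines).

```python
import re

# Hypothetical sample: three player elements with another element in between.
sample = (
    "<player>Alice</player>"
    "<coach>Carol</coach>"
    "<player>Bob</player>"
    "<player>Dave</player>"
)

# Lazy: .*? stops at the FIRST </player> after each opening tag.
lazy_matches = re.findall(r"<player>(.*?)</player>", sample, re.DOTALL)

print(lazy_matches)
```

Each element comes back as its own match, and the coach element between them is skipped rather than swallowed.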
Naturally, the nongreedy expression will be much faster. In my experience, nongreedy is almost always what you want when doing * or + matches (except for the rare cases when you don't).
That said, as stated previously, an XML parser is much better suited to parsing XML. In fact, many XML parsers offer some sort of streaming API, which allows you to feed the document in pieces and so avoid loading the entire document into memory at once (regex does not offer this advantage). I'd start with lxml and then move to one of the built-in parsers if the C dependency doesn't work for you.
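For a feel of what streaming looks like, here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse rather than lxml, so it runs without extra dependencies (lxml provides its own iterparse with a similar interface). The helper name iter_player_names and the sample document are made up for illustration, and I am assuming your player elements contain plain text.

```python
import io
import xml.etree.ElementTree as ET


def iter_player_names(source):
    """Yield the text of each <player> element as soon as it has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "player":
            yield elem.text
            # Drop the element's contents once handled to keep memory use down.
            elem.clear()


# Hypothetical in-memory document; a real file path or open file works the same way.
doc = io.BytesIO(
    b"<team><player>Alice</player><coach>Carol</coach><player>Bob</player></team>"
)
names = list(iter_player_names(doc))
print(names)
```

With a half-million-record file you would pass the filename (or an open file) instead of the BytesIO object and process each element inside the loop instead of collecting everything into a list.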