I have a file I need to parse. The parsing is built incrementally, so that on each iteration the expressions become more case-specific.

The code segment which overloads the system looks roughly like this:

    for item in ret:
        pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<'%item[1]
        r=re.compile(pat, re.DOTALL)
        match = r.findall(f)

The file is a rather large HTML file (scraped from Bing Maps), and each answer must match its exact id.

Before applying this change the workflow ran fine. Is there anything I can do to avoid this, or to optimize the code?

ire_and_curses
242Eld
  • Ha! That's what you get for using a regex to parse HTML. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ignacio Vazquez-Abrams May 18 '11 at 21:20
  • First of all, don't use regex to parse HTML! Second of all, what kind of crash? Segfault, or Python exception? Any useful message? – Santa May 18 '11 at 21:25
  • The Python environment just stops responding, though keyboard interrupts do "wake it up" @santa – 242Eld May 19 '11 at 08:16

1 Answer

My only guess is that you are getting too many matches and running out of memory. Though this doesn't seem very reasonable, it might be the case. Try using finditer instead of findall to get one match at a time without creating a monster list of matches. If that doesn't fix your problem, you might have stumbled on a more serious bug in the re module.
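A minimal sketch of the finditer suggestion, reusing the pattern from the question. The sample HTML, the `title` attributes, and the helper name `first_pushpin_id` are hypothetical stand-ins, since the real file is not shown; `re.escape` is added on the assumption that the label text is meant literally.

```python
import re

# Hypothetical stand-in for the large Bing Maps HTML (the real file is not shown).
html = (
    '<a style="x" class="VEAPI_Pushpin" id="msftve1001" title="t">Alpha</a>\n'
    '<a style="y" class="VEAPI_Pushpin" id="msftve1002" title="u">Beta</a>\n'
)

def first_pushpin_id(label, text):
    """Return the first id captured for `label`, or None if nothing matches."""
    # Same pattern as in the question; re.escape treats the label as literal text.
    pat = (r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<'
           % re.escape(label))
    # finditer yields match objects one at a time, so returning on the first
    # one means no list of all matches is ever built.
    for m in re.finditer(pat, text, re.DOTALL):
        return m.group(1)
    return None

for label in ("Alpha", "Beta"):
    print(label, first_pushpin_id(label, html))
```

This only changes how the results are collected. The regex engine still does the same amount of work to find each individual match.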

dusktreader
  • Running out of memory will throw him a MemoryError (or something to that effect, I don't remember the error name exactly). – Santa May 19 '11 at 17:24
  • I think that was the problem: I had too many matches/compiled items, to the point that I ran out of memory. Thankfully, it's solved now! – 242Eld May 20 '11 at 10:07