I've just begun learning Python and I've run into a small problem. I need to parse a text file, more specifically an HTML file (but its syntax is so weird: divs after divs after divs, the result of Google's 'View as HTML' for a certain PDF whose text I can't seem to extract because it has a messy table done in m$ word).

Anyway, I chose a rather low-level approach because I just need the data asap and, since I'm beginning to learn Python, I figured learning the basics would do me some good too.

I've got everything done except for a small part in which I need to retrieve a set of integers from a set of divs. Here's an example:

<div style="position:absolute;top:522;left:1020"><nobr>*88</nobr></div>

Now, the numbers I want to retrieve are all the ones inside <nobr></nobr> (in that case, '588') and, since it's quite a messy file, I have to make sure that what I am getting is correct. To do so, the number inside <nobr></nobr> must be preceded by "left:1020", "left:1024" or "left:1028". This is because of the automatic conversion, and in my opinion the best choice would be to get all the numbers preceded by left:102[0-9].

To do so, I was trying to use:

for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index])
    out = o.group(1)

But so far, no such luck... How can I get those numbers?

Thanks in advance, J.

SilentGhost
João Pereira
  • Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – kennytm Jan 28 '10 at 11:37
  • I am just checking, but in the line below the data you are digesting you have `(in that case, '588')` but in the line of data what is between `<nobr></nobr>` is `*88`. I would edit and fix it, but I do not know which is the correct entry. – deadstump Aug 17 '12 at 20:21

1 Answer

Don't use regular expressions to parse HTML. BeautifulSoup will make light work of this.
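To give a rough idea of what parser-based extraction looks like, here is a minimal sketch. Note that it uses the standard library's html.parser rather than BeautifulSoup, so that it runs without installing anything. The class name NobrCollector and the sample HTML (including its digits) are made up for illustration, and it assumes the divs of interest are not nested.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input; the values are invented for this sketch.
html = (
    '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>\n'
    '<div style="position:absolute;top:540;left:300"><nobr>12</nobr></div>\n'
    '<div style="position:absolute;top:558;left:1024"><nobr>73</nobr></div>\n'
)


class NobrCollector(HTMLParser):
    """Collect <nobr> text when the enclosing <div> style ends in left:102x."""

    def __init__(self):
        super().__init__()
        self.div_matches = False  # current div has a matching style
        self.in_nobr = False      # currently inside a <nobr> element
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            style = dict(attrs).get("style") or ""
            self.div_matches = re.search(r"left:102[0-9]$", style) is not None
        elif tag == "nobr":
            self.in_nobr = True

    def handle_endtag(self, tag):
        if tag == "nobr":
            self.in_nobr = False
        elif tag == "div":
            self.div_matches = False

    def handle_data(self, data):
        # Keep text only when both conditions hold.
        if self.in_nobr and self.div_matches:
            self.values.append(data.strip())


parser = NobrCollector()
parser.feed(html)
parser.close()
print(parser.values)  # ['588', '73']
```

BeautifulSoup wraps this kind of bookkeeping behind a search API, which is why it is usually the more convenient choice for messy files.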

As for your specific problem, it might be that you are missing a colon at the end of the first line:

for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index]):
    out = o.group(1)

If this isn't the problem, please post the error you are getting, and what you expect the output to be.
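For reference, a self-contained version of the loop might look like the sketch below. The sample string and its digits are invented for illustration (the real file will differ), and the capture group is tightened from (.*?) to (\d+) on the assumption that only plain integers are wanted.

```python
import re

# Hypothetical sample input; the values are invented for this sketch.
html = (
    '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>\n'
    '<div style="position:absolute;top:540;left:300"><nobr>12</nobr></div>\n'
    '<div style="position:absolute;top:558;left:1024"><nobr>73</nobr></div>\n'
)

# Match only divs whose style ends in left:1020 .. left:1029 and
# capture the digits inside <nobr>...</nobr>.
pattern = re.compile(r'left:102[0-9]"><nobr>(\d+)</nobr></div>')

numbers = []
for match in pattern.finditer(html):
    numbers.append(int(match.group(1)))

print(numbers)  # [588, 73]
```

Collecting the matches in a list, rather than overwriting a single variable on each pass, keeps every number instead of only the last one.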

Mark Byers
  • Yeah, I've heard about it but I wasn't sure it would manage to get all those weird divs, hence the low-level approach – João Pereira Jan 28 '10 at 11:38
  • @Hal: BeautifulSoup can find tags based on attributes, and it can even accept regex as arguments for the search if you need that. – Mark Byers Jan 28 '10 at 11:41
  • Cool, didn't know it was so powerful. Anyway, I've practically finished the script, all that's missing is getting those integers. I guess I could simply make 10 searches, but that would be plain dumb and I'd like to learn how one could use regex on that string. – João Pereira Jan 28 '10 at 11:43
  • You did it. I wasn't getting any error at all, for some reason the damn thing would just output a blank space. Thanks for putting up with this noob crap, it's guys like you that make StackOverflow so awesome. – João Pereira Jan 28 '10 at 11:48