I am trying to write a regex that finds all names, URLs and phone numbers in an HTML page, but I'm having trouble with the phone number part. I think the problem is that it searches until it finds the next </strong>, and in the process it skips people instead of producing an empty string when a person has no phone number. Simply put, instead of a list like url1+name1+num1 | url2+name2+"" | url3+name3+num3, it returns a list like url1+name1+num1 | url2+name2+num3, with url3+name3 lost in the process.

for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):

I am searching for people in a single very long line. A person may or may not have a URL or a phone number. An example of a person with a URL and a phone number:

 <tr>  <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html"> dr. Ivan Bratko  akad. prof.</a></div></td>  <td class="lablinksMail"><a href="javascript:void(cmPopup('sendMessage', '/si/ivan-bratko/mailer.html', true, 350, 350));"><img src="/Static/images/gui/mail.gif" height="8" width="11"></a></td> <td class="lablinksPhone"><div><strong>T:</strong> +386  1 4768 393 </div></td> </tr>

And an example of a person with no URL or phone number:

 <tr>  <td class="lablinksName"><div> dr. Branko Matjaž  Jurič   prof.</div></td>  <td class="lablinksMail"><a href="javascript:void(cmPopup('sendMessage', '/si/branko-matjaz-juric/mailer.html', true, 350, 350));"><img src="/Static/images/gui/mail.gif" height="8" width="11"></a></td> <td class="lablinksPhone"><div> </div></td> </tr>

I hope I was clear enough, and I would appreciate any help.

Brian
Hauba
  • [You don't parse (X?HT|X)ML with regex.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) Period. – delnan Dec 28 '10 at 15:48
  • @delnan: while that's very good advice, it's not a universal maxim. There are certainly times when it's ok to use regex to parse data that looks like xml (for example, tiny fragments that have only a single tag). Instead of blindly following certain rules, learn the strengths and weaknesses of your tools and decide for yourself. – Bryan Oakley Dec 28 '10 at 16:03
  • @Bryan: Yes, of course. I take that for granted, regardless of the topic. I'm just too lazy to mention it every single time, although I probably should, to avoid creating mindless best-practices-must-be-obeyed zombies :( That being said, if you can use BeautifulSoup or lxml, you'd be hard-pressed to find an excuse not to, as they're very powerful and can often do this in even fewer characters. – delnan Dec 28 '10 at 16:05

4 Answers

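import lxml.html

# Parse the page and select every row of the contact table.
root = lxml.html.parse("http://my.example.com/page.html").getroot()
rows = root.xpath("//table[@id='contactinfo']/tr")

for r in rows:
    # The name is either plain text in the div or the text of a link inside it.
    nameText = r.xpath("td[@class='lablinksName']/div/text() | td[@class='lablinksName']/div/a/text()")
    name = ''.join(nameText).strip()

    # The link is optional, so fall back to an empty string.
    urls = r.xpath("td[@class='lablinksName']/div/a/@href")
    url = urls[0] if urls else ''

    # Only text directly inside the div is taken, so the <strong>T:</strong> label is skipped.
    phoneText = r.xpath("td[@class='lablinksPhone']/div/text()")
    phone = ''.join(phoneText).strip()

    print(name, url, phone)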

For the purpose of this code, I assume <table id="contactinfo">{your table rows}</table>.

Hugh Bothwell

If you're having this kind of difficulty, it's usually a good sign you're using the wrong approach. In particular, if I were doing this via regexp, I wouldn't even try unless the line in question had the "<td class="lablinksPhone">" tag in it.
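One way to read this advice is to handle each row on its own, so that a missing phone number in one row can never pull in data from the next. A minimal sketch, assuming every row ends with </tr> as in the question's samples; the parse_rows helper, the two patterns and the shortened sample page (mail cell left out) are made up for illustration and are not part of the answer:

```python
import re

# Hypothetical patterns, each applied to a single row only.
NAME_RE = re.compile(r'lablinksName"><div>(?:<a href="([^"]*)">)?([^<]*)')
PHONE_RE = re.compile(r'lablinksPhone"><div>(?:<strong>T:</strong>)?([^<]*)')


def parse_rows(page):
    """Return a list of (url, name, phone) tuples, one per person row."""
    people = []
    for row in page.split("</tr>"):
        name_match = NAME_RE.search(row)
        if name_match is None:
            continue  # not a person row, e.g. the text after the last </tr>
        phone_match = PHONE_RE.search(row)
        url = name_match.group(1) or ""  # the link is optional
        name = " ".join(name_match.group(2).split())  # collapse whitespace
        phone = " ".join(phone_match.group(1).split()) if phone_match else ""
        people.append((url, name, phone))
    return people


# Shortened versions of the question's two sample rows, on a single line.
page = (
    '<tr> <td class="lablinksName"><div> dr. Branko Matjaž Jurič prof.</div></td>'
    ' <td class="lablinksPhone"><div> </div></td> </tr>'
    ' <tr> <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html">'
    ' dr. Ivan Bratko akad. prof.</a></div></td>'
    ' <td class="lablinksPhone"><div><strong>T:</strong> +386 1 4768 393 </div></td> </tr>'
)

people = parse_rows(page)
for url, name, phone in people:
    print(repr(url), repr(name), repr(phone))
```

Unlike the regex in the question, this sketch keeps the full href (including the /si prefix) and gives an empty string for a missing link or number.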

Jay Maynard K5ZC
  • the line "<td class="lablinksPhone"><div>" is always present and is followed by a <strong>T:</strong> and a number, or by an empty space if the person has no number. The problem is that if no number is present it reads until the next number, while I would like it to stop at the first </div>. – Hauba Dec 28 '10 at 16:26

Looks like a job for Beautiful Soup.

I love the quote: "You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

Paulo Scardine

The quick and dirty way to fix it:

Replace

for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page):

with

for url, name, pnumber in re.findall('Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?',page.replace("<tr>","\n")):

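The issue is that the .*? in .*?</strong> can match almost anything, including strings containing td class="lablinksMail and everything up to the next person's row. So when a person has no number, it keeps going until it reaches the next person's </strong>, and that person is swallowed by the match. It cannot match \n, so once every row starts on its own line the search stops at the end of the row and the optional phone group comes back empty. Any time you use . in a regex (rather than something like [^<]), this kind of annoyance tends to happen.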
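To see the difference, here is a small self-contained check. It is only a sketch: the sample page is a shortened version of the two rows from the question (mail cell left out, the person without a number placed first), and the names PATTERN, broken and fixed are made up for this example.

```python
import re

# The regex from the question, unchanged.
PATTERN = r'Name"><div>(?:<a href="/si([^">]*)"> )?([^<]*)(?:.*?</strong>([^<]*))?'

# Two rows on one long line; the first person has no link and no number.
page = (
    '<tr> <td class="lablinksName"><div> dr. Branko Matjaž Jurič prof.</div></td>'
    ' <td class="lablinksPhone"><div> </div></td> </tr>'
    ' <tr> <td class="lablinksName"><div><a href="/si/ivan-bratko/default.html">'
    ' dr. Ivan Bratko akad. prof.</a></div></td>'
    ' <td class="lablinksPhone"><div><strong>T:</strong> +386 1 4768 393 </div></td> </tr>'
)

# Without the fix, .*? runs across the row boundary: the first person
# picks up the second person's number and the second person is skipped.
broken = re.findall(PATTERN, page)

# With a newline at every row start, .*? has to stop at the end of the row.
fixed = re.findall(PATTERN, page.replace("<tr>", "\n"))

print(len(broken), broken)
print(len(fixed), fixed)
```

On this input the unmodified call returns a single tuple, while the version with the replace returns one tuple per person, with an empty string for the missing number.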

Brian
  • I am assuming `"<tr>"` is always present. If it is not, replace `` with `"\n"` or `"\n"` instead. You already said that `` was always present. – Brian Dec 28 '10 at 19:00