
I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.

<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>

I can get the first value by using

match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)

That works because everything is on one line. However, I also need the second value, which is on the line after the first, and I cannot get it to work. I have tried the following, but I do not get a match:

match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
                       '<td width="65.+?value="(.+?)"></td>').findall(html_source_det)

Perhaps it fails because the text spans multiple lines, but I added "\n" at the end of the first line of the pattern, so I thought that would handle it. It did not.

What am I doing wrong?

The html_source is retrieved by downloading it (it is not a static HTML string as shown above; I only included it here so you could see the text). Maybe this is not the best way of getting the source.

I am obtaining the html_source like this:

new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Peter Mortensen
moster67
  • Why regex over something like BeautifulSoup? – Andy Jul 07 '15 at 15:16
  • Take a look at ^ and $ for regular expressions (instead of using \n) – g3rv4 Jul 07 '15 at 15:19
  • Your code [works for me](http://ideone.com/h4QAVK). Please provide the shortest possible **complete** program that demonstrates your error. See http://stackoverflow.com/help/mcve for more info. – Robᵩ Jul 07 '15 at 15:21
  • Take a look at [this](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) post that suggests why you shouldn't use a regex. – Malik Brahimi Jul 07 '15 at 15:21
  • Rob: indeed your code works! Maybe it works because the html_source is a static string. I posted the string so you could see it, but I actually get it by downloading it. I updated my question with the code showing how I get the html_source. Maybe there are some encoding issues or stray non-printable characters I need to get rid of... – moster67 Jul 07 '15 at 16:16
  • I hear you! I have read your comments about why not to use regex for HTML. I was not aware of BeautifulSoup and I will have a look at it. I am not a regular Python programmer; I only needed to put together a quick script to get some data, and I thought regex would do just fine. As you can see from my earlier comment, the pattern actually works, as Rob showed, but maybe by downloading the page I also retrieve other invisible, stray characters... – moster67 Jul 07 '15 at 16:19
  • Re: Don't use regex for html: See [this](http://stackoverflow.com/a/1732454/3665278) SO meme – Mitch Jul 07 '15 at 17:24
  • It is even [an official meme](http://meta.stackexchange.com/questions/19478/the-many-memes-of-meta/216029#216029). It starts out as: "You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions." – Peter Mortensen Jul 14 '15 at 18:37
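
The comments above suspect invisible characters in the downloaded source. One plausible cause (an assumption, since the real page is not shown) is that the server sends Windows-style "\r\n" line endings, in which case a pattern that expects a bare "\n" right after </td> cannot match. A minimal sketch of how to check for that, using a hypothetical stand-in string in place of the real download:

```python
import re

# Hypothetical stand-in for the downloaded page: the markup from the
# question, but with "\r\n" line endings (assumed, not confirmed).
html_source_det = (
    '<tr>\r\n'
    '<td width="35%">Demand No</td>\r\n'
    '<td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>\r\n'
)

# repr() makes otherwise invisible characters such as "\r" visible.
print(repr(html_source_det[:40]))

# The pattern from the question: requires exactly "\n" after the first </td>.
strict = re.compile(r'<td width="35.+?">(.+?)</td>\n'
                    r'<td width="65.+?value="(.+?)"></td>')

# Same pattern, but "\s*" accepts any run of whitespace between the cells.
tolerant = re.compile(r'<td width="35.+?">(.+?)</td>\s*'
                      r'<td width="65.+?value="(.+?)"></td>')

strict_matches = strict.findall(html_source_det)
tolerant_matches = tolerant.findall(html_source_det)
print(strict_matches)
print(tolerant_matches)
```

If repr() of the real download shows "\r\n" (or tabs or spaces) between the cells, that would explain why the pattern works on a pasted string but not on the downloaded one.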

1 Answer


Please do not try to parse HTML with regex, as HTML is not a regular language. Instead, use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:

from bs4 import BeautifulSoup

html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''

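# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(html, 'html.parser')
# Locate the 65%-wide cell, then read the value of the input that follows it
print(soup.find('td', attrs={'width': '65%'}).find_next('input')['value'])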

Or more simply:

print(soup.find('input', attrs={'name': 'T1'})['value'])
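
Since the question mentions that the HTML repeats, here is a rough sketch of collecting every label/value pair with a real parser rather than a regex. To stay dependency-free it uses the Python 3 standard library's html.parser instead of BeautifulSoup. The class name LabelValueParser and the second row ("Amount", "T2") are made up for illustration, and the sketch assumes every row has the same label-cell-then-input shape as the one in the question:

```python
from html.parser import HTMLParser  # Python 3 standard library


class LabelValueParser(HTMLParser):
    """Collect (label, value) pairs from rows shaped like the question's."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # collected (label, value) tuples
        self._in_cell = False  # True while inside a <td>
        self._label = None     # text of the most recent label cell

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_cell = True
        elif tag == 'input' and self._label is not None:
            # Pair the input's value with the label seen just before it.
            value = dict(attrs).get('value')
            if value is not None:
                self.pairs.append((self._label, value))
                self._label = None

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_cell and text:
            self._label = text


# Sample input: the row from the question plus one hypothetical row,
# joined with "\r\n" to show that line endings do not matter here.
html = (
    '<tr>\r\n'
    '<td width="35%">Demand No</td>\r\n'
    '<td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>\r\n'
    '<tr>\r\n'
    '<td width="35%">Amount</td>\r\n'
    '<td width="65%"><input type="text" name="T2" size="12" '
    'value="150.00"></td>\r\n'
    '</tr>\r\n'
)

parser = LabelValueParser()
parser.feed(html)
parser.close()
pairs = parser.pairs
print(pairs)
```

Because the parser works on tags rather than on raw text, whitespace and line endings between the cells no longer affect the result.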
heinst
  • Thank you for this and your suggestions about not using regex with html. I will definitely have a look at this BeautifulSoup library. – moster67 Jul 07 '15 at 16:21