-6

I have the following string and I would like to extract the value of the field:

<td class="label" width="150"">State</td><td width="" class="field">Approved&nbsp;</td>

In this case it should be Approved.

Sometimes the input can look like this:

<td class="label" width="150"">Type</td><td width="" class="field">Technical&nbsp;Document&nbsp;</td>

which should result in Technical Document.

Sometimes it can be

 <td class="label" width="150"">Title</td><td width="" class="field">Reversal Plate</td>

In this case it should be Reversal Plate.

How can we write a regular expression for such strings?

2 Answers

1

Don't use regex for this, you should use some HTML/XML parser, like BeautifulSoup for example.

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'html.parser')  # `s` is your HTML string
# find_all is the current bs4 name; findAll is the older alias for it
for td in soup.find_all('td', class_="field"):
    print(td.get_text())

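The above gets the field text for all of your examples. Note that &nbsp; is decoded to a non-breaking space, so call .strip() on the result if you want to drop the trailing one.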

Demo -

>>> s = """<td class="label" width="150"">State</td><td width="" class="field">Approved&nbsp;</td>"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s,'html.parser')
>>> for td in soup.findAll('td',class_="field"):
...     print(td.get_text())
...
Approved 
>>> s = """<td class="label" width="150"">Type</td><td width="" class="field">Technical&nbsp;Document&nbsp;</td>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> for td in soup.findAll('td',class_="field"):
...     print(td.get_text())
...
Technical Document 
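If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library alone. This is only an illustration, not part of the original answer: the names FieldExtractor and extract_fields are hypothetical, and it assumes every value you want lives in a td whose class attribute contains "field". It also replaces the non-breaking spaces and strips the result, so the output matches what the question asks for.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of every <td class="field"> cell (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; to "\xa0"
        super().__init__(convert_charrefs=True)
        self.in_field = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing (None) or hold several class names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "field" in classes:
            self.in_field = True
            self.fields.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_field = False

    def handle_data(self, data):
        if self.in_field:
            self.fields[-1] += data


def extract_fields(html_text):
    """Return the cleaned text of all field cells in html_text."""
    parser = FieldExtractor()
    parser.feed(html_text)
    parser.close()
    # Turn non-breaking spaces into plain spaces, then trim the ends
    return [f.replace("\xa0", " ").strip() for f in parser.fields]


s = '<td class="label" width="150"">Type</td><td width="" class="field">Technical&nbsp;Document&nbsp;</td>'
print(extract_fields(s))
```

extract_fields returns a list, so a string holding several field cells yields one entry per cell.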
Anand S Kumar
  • I would like to try out the available options, is it possible to get it using regex? – Ridhi Jain Oct 13 '15 at 06:21
  • @RidhiJain You can look at the other answer, but please note it will only work in very specific cases. If you are 100% sure that the three examples you gave are the only cases you want to find, you can use that. It would stop working if there is even a small space between `"` and `>` in the tag. But most regex solutions you are going to get would be like that. – Anand S Kumar Oct 13 '15 at 06:29
  • How do I install Beautiful Soup? I'm new to Python. How do I check what version I'm running? – Ridhi Jain Oct 13 '15 at 06:49
  • `import sys; print(sys.version)` should give you your current Python version. You can install Beautiful Soup from `pip` using `pip install beautifulsoup4`. – Anand S Kumar Oct 13 '15 at 06:51
  • http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip This might help you. – Anand S Kumar Oct 13 '15 at 07:12
0

As mentioned by @Anand S Kumar, you don't have to use regex; using BeautifulSoup is simpler and more robust. However, since you asked for a regex solution, you can use the code below:

import re

s = '<td class="label" width="150"">State</td><td width="" class="field">Approved&nbsp;</td>'
# Non-greedy (.*?) stops at the first "<" after the field text
pattern = re.compile(r'"field">(.*?)<')
print(pattern.search(s).group(1))

Output:

Approved&nbsp;

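This regex captures whatever sits between class="field"> and the closing </td>. HTML entities such as &nbsp; are left as-is, so you still have to replace them and strip the result to get Approved.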
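To get exactly the values the question asks for, the captured text can be post-processed. The following is a minimal sketch rather than a definitive solution: FIELD_RE and extract_field are hypothetical names, and the pattern assumes the class="field" attribute is the last one in the tag and that the cell contains no nested tags.

```python
import html
import re

# Assumes class="field" is the last attribute before ">" (as in the examples)
FIELD_RE = re.compile(r'class="field"\s*>(.*?)</td>', re.DOTALL)


def extract_field(s):
    """Return the cleaned text of the first field cell, or None if absent."""
    match = FIELD_RE.search(s)
    if match is None:
        return None
    # html.unescape turns &nbsp; into "\xa0"; make it a plain space and trim
    text = html.unescape(match.group(1))
    return text.replace("\xa0", " ").strip()


s = '<td class="label" width="150"">State</td><td width="" class="field">Approved&nbsp;</td>'
print(extract_field(s))
```

Returning None when nothing matches avoids the AttributeError that calling .group(1) on a failed search would raise.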

Joe T. Boka