regular expression - try to find name in html result

Question

I'm trying to extract certain data from a webpage. I'm using Python and urllib to fetch the page, but the data I want is surrounded by a lot of irrelevant markup. I figured that the best solution to get this information is to use a regular expression.

I'm looking for the name "Huisman, D.J." in the following string of text. This text is already a selection of the full text:

\n    \n    \n</div>\n        <div class="col-sm-8 col-md-6" id="id12">\n
        <div>\n                \n                <div class="col-xs-11">\n
<div>Huisman, D.J.</div>\n</div>\n                \n            </div>\n
    </div>\n        \n    </div>\n</div>\n            </div><div 
id="id13">\n                <div id="id14">\n    \n    <div class="row">\n 
   <div class="col-sm-2 col-md-2">\n

I tried the following two expressions. The first one tries to capture everything between <div> and </div>. The expression is:

r'<div>+(.*?)</div>'

But it fails because there is already another <div> before the one I want to capture from. So I get:

['\\n                \\n                <div class="col-xs-11">\\n    <div>Huisman, D.J.']

So I thought I could match from the first capital letter up to </div>, but the capture only starts after the first capital. Code and result:

#expression:
r'[A-Z]+(.*?)</div>'
#result
['uisman, D.J.']

Can somebody help me?

Following your logic, I think you can use `r'([A-Z].*?)'`. I think you are using Python, I added the tag. — Wiktor Stribiżew, Feb 06 '16 at 21:41
Thanks, I feel like a total noob now. Maybe I am. But that's the answer! — B.Termeer, Feb 06 '16 at 21:42
Are you sure you won't have any other uppercase letters in other input? It does not sound like a final solution to me. — Wiktor Stribiżew, Feb 06 '16 at 21:45
[You can't parse HTML with regex.](http://stackoverflow.com/a/1732454/5276734) — bastelflp, Feb 06 '16 at 21:47
"I figured it out that the best solution to get this information, is to use regular expression" I would like to know how you came upon that idea — OneCricketeer, Feb 06 '16 at 21:52

Dušan Maďar · Accepted Answer · 2016-02-06T22:07:12.313

3

Use an HTML parsing library such as BeautifulSoup instead of a regular expression. Also, the HTML in your example is not valid.

from bs4 import BeautifulSoup

html = """
<div class="col-sm-8 col-md-6" id="id12">\n
        <div>\n                \n                <div class="col-xs-11">\n
<div>Huisman, D.J.</div>\n</div>\n                \n            </div>\n
    </div>\n        \n    </div>\n</div>\n            </div><div 
id="id13">\n                <div id="id14">\n    \n    <div class="row">\n 
   <div class="col-sm-2 col-md-2">\n
"""

html = html.strip()
soup = BeautifulSoup(html, 'html.parser')

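# Select every <div> whose class is "col-xs-11" and print its stripped text.
target_divs = soup.find_all('div', {'class': 'col-xs-11'})
for div in target_divs:
    print(div.get_text().strip())  # print() call so this also runs on Python 3

# Output:
# Huisman, D.J.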
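
If a regular expression has to be used anyway, the first pattern from the question can be made to skip the outer `<div>` by forbidding `<` and `>` inside the captured group. This is only a sketch of that idea, not a general HTML solution: it assumes the name itself never contains angle brackets, and the `html` string below is a shortened, hypothetical version of the fragment from the question.

```python
import re

# Shortened fragment modelled on the question (whitespace trimmed).
html = '<div>\n<div class="col-xs-11">\n<div>Huisman, D.J.</div>\n</div>\n</div>'

# [^<>]+? cannot cross a tag boundary, so a match can only start at a
# <div> whose content is plain text, not at an outer <div>.
names = re.findall(r'<div>\s*([^<>]+?)\s*</div>', html)
print(names)
```

The character class is what prevents the match from starting at the first `<div>`; the lazy `.*?` in the original pattern happily runs across nested tags.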

edited Feb 06 '16 at 22:07

answered Feb 06 '16 at 21:53


Note that, in general, this will only print the text inside the divs with `{'class': 'col-xs-11'}` – OneCricketeer Feb 06 '16 at 21:55
Thanks, I didn't know that could be used as well. I'm trying it now – B.Termeer Feb 06 '16 at 21:59
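
As the comment above notes, the answer depends on the `col-xs-11` class. A possible variation that does not rely on any class name is to collect every piece of non-blank text whose nearest enclosing tag is a `<div>`. The sketch below swaps in the standard library's `html.parser` rather than BeautifulSoup, so that it runs without extra packages; `DivTextCollector` and the sample `html` are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect non-blank text whose nearest open tag is a <div>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # names of tags opened and not yet closed
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Caveat: void tags such as <br> are pushed too and are only
        # dropped when an enclosing tag closes.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags (the question's fragment has several).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags and self.open_tags[-1] == 'div':
            self.texts.append(text)


# Hypothetical fragment modelled on the question, including a stray </div>.
html = """
<div class="col-sm-8 col-md-6" id="id12">
    <div>
        <div class="col-xs-11">
            <div>Huisman, D.J.</div>
        </div>
    </div>
</div>
</div><div id="id13"><div class="row"></div></div>
"""

collector = DivTextCollector()
collector.feed(html)
collector.close()
print(collector.texts)
```

On a real page this would return the text of every text-bearing `<div>`, so some further filtering (by position or by a pattern on the text) would probably still be needed.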
