Python re regex matching issue

Question

OK, please be gentle: this is my first Stack Overflow question and I've struggled with this for a few hours. I'm sure the answer is something obvious that's staring me in the face, but I give up.

I'm trying to grab an element from a page on a name website (i.e., determine the gender of a name).

The Python code I've written is here:

import re
import urllib2

response=urllib2.urlopen("http://www.behindthename.com/name/janet")
html=response.read()
print html

patterns = ['Masculine','Feminine']

for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern, html)

    if re.findall(pattern, html):
        print "Found a match!"
        break  # stop at the first match (a bare `exit` does nothing)
    else:
        print "No match!"

When I dump html I see Feminine there, but the re.findall isn't matching. What in the world am I doing wrong?

You know that with such simple regex you can just do `if pattern in html`? — vaultah, Jul 29 '14 at 20:27
Other answers notwithstanding, I don't see any reason why your code as given won't actually work. What do you get if you `print re.findall(pattern, html)` in the loop? — Greg Hewgill, Jul 29 '14 at 20:36
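
A minimal sketch of the substring check suggested in the first comment, run against an inline snippet instead of the live page. The sample `html` string and the `find_genders` helper are assumptions made up for illustration; the real page markup may differ.

```python
import re

# Hypothetical stand-in for the downloaded page (not the real markup).
html = '<div class="nameinfo"><span class="info">Feminine</span></div>'

patterns = ['Masculine', 'Feminine']


def find_genders(html, patterns):
    """Return the patterns that occur in html, using a plain substring test."""
    return [p for p in patterns if p in html]


found = find_genders(html, patterns)
print(found)  # ['Feminine']

# For a literal pattern like this, re.findall finds the same text.
print(re.findall('Feminine', html))  # ['Feminine']
```

If both checks succeed on a sample like this but fail on the real download, printing `re.findall(pattern, html)` inside the loop, as the second comment suggests, is probably the quickest way to see what is actually being matched.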

score 1 · Answer 1 · edited May 23 '17 at 11:50


Do not parse HTML with regex; use a specialized tool, an HTML parser.

Example using BeautifulSoup:

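from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.behindthename.com/name/janet'
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Select the first <span class="info"> inside <div class="nameinfo">
print soup.select('div.nameinfo span.info')[0].text  # prints "Feminine"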

Or, you can find an element by text:

gender = soup.find(text='Feminine')

Then check whether the result is None (meaning not found): gender is None.
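
For readers without BeautifulSoup installed, here is a rough sketch of the same "parse, then check for None" idea using the standard library's `html.parser` instead (a swapped-in technique, not the approach shown above). It is written for Python 3, and the `InfoSpanCollector` class, the `find_gender` helper, and the sample markup are all hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser


class InfoSpanCollector(HTMLParser):
    """Collect the text inside every <span class="info"> element."""

    def __init__(self):
        super().__init__()
        self._in_info = False  # True while inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'info' in classes:
            self._in_info = True
            self.texts.append('')

    def handle_endtag(self, tag):
        # Simplification: any closing </span> ends collection,
        # so spans nested inside the matching one are not handled.
        if tag == 'span':
            self._in_info = False

    def handle_data(self, data):
        if self._in_info:
            self.texts[-1] += data


def find_gender(html):
    """Return 'Masculine' or 'Feminine' if found in an info span, else None."""
    parser = InfoSpanCollector()
    parser.feed(html)
    parser.close()
    for text in parser.texts:
        if text.strip() in ('Masculine', 'Feminine'):
            return text.strip()
    return None


# Hypothetical sample markup, loosely modelled on the selector used above.
sample = '<div class="nameinfo"><span class="info">Feminine</span></div>'
gender = find_gender(sample)
missing = find_gender('<p>no gender here</p>')

print(gender is None)   # False: a gender was found
print(missing is None)  # True: nothing matched
```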

answered Jul 29 '14 at 20:26 by alecxe; edited May 23 '17 at 11:50 by Community

@alecxe yes, the top answer in that link was very clear and easy to understand. No confusion at all. (I'm bookmarking that for later use.) – TheSoundDefense Jul 29 '14 at 20:37
