
Ok please be gentle - this is my first stackoverflow question and I've struggled with this for a few hours. I'm sure the answer is something obvious, staring me in the face but I give up.

I'm trying to grab an element from a page on a name website (i.e., to determine the gender of a name).

The python code I've written is here:

import re
import urllib2

# Download the page and keep its HTML as a string.
response = urllib2.urlopen("http://www.behindthename.com/name/janet")
html = response.read()
print html

patterns = ['Masculine', 'Feminine']

for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern, html)

    if re.findall(pattern, html):
        print "Found a match!"
        break  # stop at the first match; a bare `exit` does nothing
    else:
        print "No match!"

When I dump html I see Feminine there, but the re.findall isn't matching. What in the world am I doing wrong?

  • You know that with such a simple regex you can just do `if pattern in html`? – vaultah Jul 29 '14 at 20:27
  • Other answers notwithstanding, I don't see any reason why your code as given won't actually work. What do you get if you `print re.findall(pattern, html)` in the loop? – Greg Hewgill Jul 29 '14 at 20:36
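To make the two comments above concrete, here is a minimal sketch comparing the plain substring test with the regex test. It runs against a small hard-coded string rather than the live page; `sample_html` and both function names are made up for illustration, and the real page markup may differ. It is written for Python 3.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = '<div class="nameinfo"><span class="info">Feminine</span></div>'

PATTERNS = ("Masculine", "Feminine")


def find_gender_substring(html, patterns=PATTERNS):
    """Return the first pattern that occurs in html as a plain substring."""
    for pattern in patterns:
        if pattern in html:
            return pattern
    return None


def find_gender_regex(html, patterns=PATTERNS):
    """Same check using re.search; re.escape keeps the pattern literal."""
    for pattern in patterns:
        if re.search(re.escape(pattern), html):
            return pattern
    return None


print(find_gender_substring(sample_html))
print(find_gender_regex(sample_html))
```

For literal words like these, both functions should agree, which is consistent with the observation that the regex in the question ought to match if the word really is in the downloaded text.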

1 Answer


Do not parse HTML with regex; use a specialized tool, an HTML parser.

Example using BeautifulSoup:

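from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.behindthename.com/name/janet'
# Naming the parser explicitly keeps the result consistent across machines.
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Take the first span.info inside div.nameinfo.
print soup.select('div.nameinfo span.info')[0].text  # prints "Feminine"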

Or, you can find an element by text:

gender = soup.find(text='Feminine')

Then check whether it was found: if gender is None, there was no match.
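If you want to try the idea without installing anything or hitting the network, here is a rough sketch that uses the standard library's `html.parser` module in place of BeautifulSoup. It is written for Python 3. `SAMPLE_HTML`, `InfoSpanParser` and `extract_gender` are names made up for this example, and the sample markup is only an assumption about what the relevant part of the page looks like.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the relevant part of the page; real markup may differ.
SAMPLE_HTML = """
<div class="nameinfo">
  <span class="label">Gender:</span>
  <span class="info">Feminine</span>
</div>
"""


class InfoSpanParser(HTMLParser):
    """Collect the text of every <span class="info"> element.

    Nested spans are not handled; this is only a sketch.
    """

    def __init__(self):
        super().__init__()
        self._in_info = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "info" in classes:
            self._in_info = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_info = False

    def handle_data(self, data):
        if self._in_info:
            self.values[-1] += data


def extract_gender(html):
    """Return the first non-empty span.info text, or None if there is none."""
    parser = InfoSpanParser()
    parser.feed(html)
    parser.close()
    values = [v.strip() for v in parser.values if v.strip()]
    return values[0] if values else None


gender = extract_gender(SAMPLE_HTML)
print("not found" if gender is None else gender)
```

The `None` return mirrors the `gender is None` check above: a missing element shows up as `None` rather than as an exception.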

  • @alecxe yes, the top answer in that link was very clear and easy to understand. No confusion at all. (I'm bookmarking that for later use.) – TheSoundDefense Jul 29 '14 at 20:37