I am learning regex and Beautiful Soup, and I am working through the Google tutorial on regex. I am using the HTML files provided on the tutorial website (the exercise set in the setup section of the tutorial).

The code is the following:

from bs4 import BeautifulSoup as bs
with open(filepath, "r") as f:
    soup = bs(f, 'lxml')
soup.title

out

<title>Popular Baby Names</title>

code:

h3 = soup.find_all("h3") # find_all() captures the <h3> tags (in fact only one <h3> tag exists,
                         # containing the year)

h3[0].get_text() 

out

u'Popularity in 1990'

code:

import re
pattern = re.compile(r'.+(\d\d\d\d).+')
string = h3[0].get_text()
pattern.match(string).group(0)

out

AttributeError                            Traceback (most recent call last)
<ipython-input-61-2e4daef3292c> in <module>()
----> 1 pattern.match(string).group(0)

AttributeError: 'NoneType' object has no attribute 'group'

I cannot explain why match() does not capture the year as it should.

Your advice will be appreciated.

MattDMo
gk7
  • Your string ends with `1990`, so the later `.+` can't match anything. – Sebastian Proske Jan 10 '17 at 21:24
  • As other comments have stated, your regex doesn't work - you can test here: https://regex101.com/r/d2NjKz/1 – ti7 Jan 10 '17 at 21:25
  • Possible duplicate of [Python: Extract numbers from a string](http://stackoverflow.com/questions/4289331/python-extract-numbers-from-a-string) – ti7 Jan 10 '17 at 21:28
  • Thank you. The problem was the .+ at the end. When I deleted it, it worked. I had tested the pattern on regex101, but instead of passing only the text I had passed the tags as well, so it matched there. – gk7 Jan 10 '17 at 21:39

1 Answer

Because the trailing .+ requires at least one character after the year, but your string ends with 1990, so match() returns None. Try .* instead of .+ at the end.
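A minimal sketch of the difference, assuming Python 3 and using the heading text from the question as a literal string (the names `text`, `broken`, `fixed`, and `year` are only for illustration). It also shows that group(0) is the whole match, while group(1) is the captured year:

```python
import re

text = "Popularity in 1990"  # the <h3> text shown in the question

# Original pattern: the trailing .+ needs at least one character after the year.
broken = re.compile(r'.+(\d\d\d\d).+')
# Suggested change: .* also accepts an empty tail.
fixed = re.compile(r'.+(\d\d\d\d).*')

print(broken.match(text))  # None, because nothing follows "1990"

m = fixed.match(text)
print(m.group(0))  # the whole matched string
print(m.group(1))  # only the four captured digits

# Alternative (an assumption about intent, not part of the answer above):
# search() scans the string, so no leading or trailing wildcard is needed.
year = re.search(r'\d{4}', text).group(0)
print(year)
```

If only the year is wanted, group(1) with the corrected pattern or the search() variant is probably the more direct choice, since group(0) returns the full heading.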

palako