Pattern matching with regex returns None while it should not

Question

I am learning regex and Beautiful Soup and working through the Google tutorial on regex. I am using the HTML files provided on the tutorial website (the exercise set in the setup section of the tutorial).

The code is the following:

from bs4 import BeautifulSoup as bs  # assumed import; the question uses the alias bs
with open(filepath, "r") as f:
    soup = bs(f, 'lxml')
soup.title

out

<title>Popular Baby Names</title>

code:

h3 = soup.find_all("h3")  # find_all() captures the <h3> tags (only one exists,
                          # and it contains the year)

h3[0].get_text()

out

u'Popularity in 1990'

code:

import re
pattern = re.compile(r'.+(\d\d\d\d).+')
string = h3[0].get_text()
pattern.match(string).group(0)

out

AttributeError                            Traceback (most recent call last)
<ipython-input-61-2e4daef3292c> in <module>()
----> 1 pattern.match(string).group(0)

AttributeError: 'NoneType' object has no attribute 'group'

I cannot explain why match() returns None instead of capturing the year.

Your advice will be appreciated.

Your string ends with `1990`, so the later `.+` can't match anything. — Sebastian Proske, Jan 10 '17 at 21:24
As other comments have stated, your regex doesn't work - you can test here: https://regex101.com/r/d2NjKz/1 — ti7, Jan 10 '17 at 21:25
Possible duplicate of [Python: Extract numbers from a string](http://stackoverflow.com/questions/4289331/python-extract-numbers-from-a-string) — ti7, Jan 10 '17 at 21:28
Thank you. The problem was the `.+` at the end. When I deleted it, it worked. However, I had tested it on regex101, but instead of passing only the text I had passed the tags as well, so it matched there. — gk7, Jan 10 '17 at 21:39

score 1 · Answer 1 · answered Jan 10 '17 at 21:24

Because the trailing `.+` expects at least one character after the year, and your string ends with the year. Try `.*` instead of `.+`.
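A minimal sketch of the difference, assuming the heading text is exactly the string shown in the question. The names `strict` and `relaxed` are only illustrative. Note that `group(1)` is the captured year, while `group(0)` is the whole matched string.

```python
import re

text = u'Popularity in 1990'  # heading text from the question

# Original pattern: the trailing .+ needs at least one character after the year.
strict = re.compile(r'.+(\d\d\d\d).+')
print(strict.match(text))  # no match, because nothing follows "1990"

# Suggested change: .* also accepts zero characters after the year.
relaxed = re.compile(r'.+(\d\d\d\d).*')
m = relaxed.match(text)

# match() returns None on failure, so check before calling group().
year = m.group(1) if m else None
print(year)
```

Checking the match object before calling `group()` avoids the `AttributeError` from the question when a pattern does not match.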

palako


Why does it need to match `.*`? – Peter Wood Jan 10 '17 at 21:25
`*` matches zero or more of the preceding character, so no more characters are required to get a match. – ti7 Jan 10 '17 at 21:26
It doesn't. I'm assuming he used `.+` because he might want something after the year, but `+` requires at least one character. Zero or more is `*`. – palako Jan 10 '17 at 21:27
@palako But it isn't used to match anything. – Peter Wood Jan 11 '17 at 14:55
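Following up on the point in the comments that the trailing wildcard is not used to match anything: a minimal sketch, assuming the only goal is to pull the four-digit year out of the heading text. It swaps `match()` for `re.search()`, which scans the whole string, so neither a leading nor a trailing wildcard is needed.

```python
import re

text = u'Popularity in 1990'  # heading text from the question

# search() looks for the pattern anywhere in the string,
# unlike match(), which only tries at the start.
m = re.search(r'\d{4}', text)

# Guard against None in case the text contains no four-digit run.
year = m.group(0) if m else None
print(year)
```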
