I am trying to extract information from a page parsed with beautifulsoup4, using regular expressions, but I get the following error:

AttributeError: 'NoneType' object has no attribute 'group'

I do not understand what is wrong. I am trying to:

  1. get the Typologie name: 'herenhuizen'
  2. get the weblink

Here is my code:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://inventaris.onroerenderfgoed.be/erfgoedobjecten/4778'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
text = soup.prettify()

##block
p = re.compile('(?s)(?<=(Typologie))(.*?)(?=(</a>))', re.VERBOSE)
block = p.search(text).group(2)


##typo_url
p = re.compile('(?s)(?<=(href=\"))(.*?)(?=(\">))', re.VERBOSE)
typo_url = p.search(block).group(2)


## typo_name
p = re.compile('\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)

Does anyone have an idea where the mistake is?

francois

1 Answer

I would change this:

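## typo_name
# Reverse the block so that the last word becomes the first match.
block_reverse = block[::-1]
p = re.compile(r'(\w+)')  # raw string, so \w reaches the regex engine unchanged
typo_name_reverse = p.search(block_reverse).group(1)
typo_name = typo_name_reverse[::-1]  # reverse back to the original order
print(typo_name)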

Sometimes it's easier to just reverse the string if you are looking for stuff at the end. This just finds the name at the end of your block. There are a number of ways to find what you are looking for, and we could come up with all kinds of clever regexes, but if this works that's probably enough :)
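As one of those other ways, here is a minimal sketch that takes the last word without reversing the string. The `block` value below is a made-up stand-in for the real one (the URL in it is hypothetical), so treat this as an illustration rather than something tested against the actual page:

```python
import re

# Hypothetical stand-in for the `block` string built in the question.
block = '<a href="https://example.org/typologie/1">\n herenhuizen\n  '

# Approach from this answer: reverse, take the first word, reverse back.
reversed_name = re.search(r'(\w+)', block[::-1]).group(1)
name_via_reverse = reversed_name[::-1]

# Alternative: collect every word and keep the last one.
name_via_findall = re.findall(r'\w+', block)[-1]

print(name_via_reverse, name_via_findall)
```

Both variants assume the block contains at least one word character; on an empty block the first raises AttributeError and the second IndexError.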

update

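However, I just noticed why the original regex was not working: in a normal string literal, \b is the backspace character, not a word boundary. So p.search(block) found nothing and returned None, which is where the AttributeError comes from. To use \b in a pattern, it needs to be escaped as \\b or the string needs to be raw, like this: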

## typo_name
p = re.compile(r'\b(\w+)(\W*?)$', re.VERBOSE)
typo_name = p.search(block).group(1)

There is some good related Q&A here: Does Python re module support word boundaries (\b)?
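To see the difference for yourself, here is a small sketch comparing the two string literals. The `block` value is again a hypothetical stand-in for the real one:

```python
import re

# Hypothetical stand-in for the `block` string built in the question.
block = '<a href="https://example.org/typologie/1">\n herenhuizen\n'

# In a normal string literal, '\b' is a single backspace character (0x08).
is_backspace = ('\b' == '\x08')

# In a raw string, the backslash survives, so the regex engine sees \b.
is_two_chars = (r'\b' == '\\b')

# Non-raw '\b': the pattern looks for a literal backspace and finds none.
# (\\w and \\W are escaped here only so that '\b' is the sole difference.)
no_match = re.search('\b(\\w+)(\\W*?)$', block)

# Raw string: \b is a word boundary and the last word is captured.
match = re.search(r'\b(\w+)(\W*?)$', block)

print(is_backspace, is_two_chars, no_match, match.group(1))
```

Since `no_match` is None, calling `.group(1)` on it is what produces the "'NoneType' object has no attribute 'group'" error from the question.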

sniperd
  • It works fine. Thanks! And yes, the important thing is that it works and gets what we want. But I am still wondering why the regexp in the typo_name block did not work. – francois Jun 12 '18 at 07:14
  • @francois ah, I just realized it. To use `\b` you need to either make the pattern raw, `p = re.compile(r'\b(\w+)(\W*?)$', re.VERBOSE)`, or escape it as `\\b`. There is a similar question with some answers here: https://stackoverflow.com/questions/3995034/does-python-re-module-support-word-boundaries-b – sniperd Jun 12 '18 at 12:47