How can I extract content from web source code with re.findall?

Question

I have extracted the source code of a long web page, and what I want to pull out of it is the content inside the span tag:

<span class="a-size-medium a-color-base a-text-normal">
  Apple iPhone 6S, GSM Unlocked, 16GB - Rose Gold (Renewed)
</span>

i.e. I want to retrieve 'Apple iPhone 6S, GSM Unlocked, 16GB - Rose Gold (Renewed)'

How can I use re.findall to extract the relevant content within the tags? Or is there an easier way to do so? Thanks.

You should generally avoid [parsing HTML with regex](https://stackoverflow.com/a/1732454/1678362); consider using an existing HTML parser instead. In particular, BeautifulSoup makes extracting values from HTML very easy — Aaron, Apr 30 '19 at 14:33
You can use code blocks (the `{}` button in the editor) to correctly represent HTML in your question, you should check the edit history of your question as others tried to fix that and you might just want to rollback to a previous version. — Aaron, Apr 30 '19 at 14:41
I have tried BeautifulSoup4 and HtmlParser. I actually like BS4 the best for HTML parsing. It's been a while, but it may not be installed on your machine by default and may need to be fetched with pip. — Fallenreaper, Apr 30 '19 at 15:00

score 1 · Accepted Answer · answered Apr 30 '19 at 14:42

You should use BeautifulSoup or something similar for this kind of task. Once you have your page's HTML in a variable (html in my example below), it is easy to find elements. Use the .text property to extract the text you are looking for.

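from bs4 import BeautifulSoup

# The HTML snippet from the question
html = """
<span class="a-size-medium a-color-base a-text-normal">
  Apple iPhone 6S, GSM Unlocked, 16GB - Rose Gold (Renewed)
</span>
"""
soup = BeautifulSoup(html, 'html.parser')
# Matches any <span> whose class list contains 'a-size-medium'
items = soup.find_all('span', {'class': 'a-size-medium'})

for item in items:
    # .text keeps the surrounding whitespace, so strip it
    print(item.text.strip())
# Apple iPhone 6S, GSM Unlocked, 16GB - Rose Gold (Renewed)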

This works for the example code you provided, but on the full page I suspect you will have to play around with narrowing the search to isolate the part you want to parse.
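
One way to isolate the right element is to require all three classes rather than just one. Below is a minimal sketch that uses only the standard library's html.parser, in case BeautifulSoup is not installed. The SpanTextParser class and the extra non-matching span are made up for illustration and are not part of your page.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text of <span> elements that carry every required class."""

    def __init__(self, required):
        super().__init__()
        self.required = set(required)
        self.depth = 0      # > 0 while we are inside a matching span
        self.parts = []     # text fragments of the span being collected
        self.results = []   # stripped text of every matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A nested span inside a match: track it so the end tags pair up.
            self.depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.required <= classes:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Hypothetical input: the snippet from the question plus one span that
# shares only a single class and should therefore be skipped.
html = """
<span class="a-size-medium">Some other text</span>
<span class="a-size-medium a-color-base a-text-normal">
  Apple iPhone 6S, GSM Unlocked, 16GB - Rose Gold (Renewed)
</span>
"""

parser = SpanTextParser(["a-size-medium", "a-color-base", "a-text-normal"])
parser.feed(html)
parser.close()
titles = parser.results
print(titles)
```

On a real page you would pass the downloaded source to feed() and adjust the class list to whatever identifies the elements you need.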

score 1 · Answer 2 · answered Apr 30 '19 at 14:56

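As Brian Cohan answered, it is not best practice to use regex to parse HTML source code. I would recommend using BS4 or html.parser. Still, to answer your question, you can use this regex: (?:(?<=<span)(.*)(?<=>)).*(?=</span>) to obtain the data. Note that re.findall returns only the captured group when the pattern contains one, and . does not match newlines unless re.DOTALL is passed, so the pattern needs adjusting for the multi-line snippet in the question.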
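
For the multi-line snippet in the question, a simpler non-greedy pattern combined with re.DOTALL may be easier to work with. This is a minimal sketch; the pattern and variable names are my own assumptions, and it will break on nested spans or unusual markup, which is exactly why a parser is the safer choice.

```python
import re

# The HTML snippet from the question
html = """
<span class="a-size-medium a-color-base a-text-normal">
  Apple iPhone 6S, GSM Unlocked, 16GB - Rose Gold (Renewed)
</span>
"""

# [^>]* skips the attributes of the opening tag; (.*?) lazily captures
# everything up to the first </span>. re.DOTALL lets . match newlines.
pattern = re.compile(r"<span\b[^>]*>(.*?)</span>", re.DOTALL)

# findall returns the contents of the single capture group for each match.
matches = [text.strip() for text in pattern.findall(html)]
print(matches)
```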

score 0 · Answer 3 · answered Apr 30 '19 at 14:59

Scrapy (https://scrapy.org/) is a good library for what you want; it provides plenty of utilities for selecting tags and patterns in an HTML web page.

answered by sslloo
