Use regex to extract all values from HTML

Question

I need a nudge to finish out this script.

I'm scraping a newsletter site for a particular substring. The intent is to parse the page for a section called "Companies mentioned" and collect the name of each company into a list.

Here is what I have so far. It works, but it only gets the first item:

from bs4 import BeautifulSoup as bs4
import requests
import re

url = 'http://news.hipsternomics.com/issues/how-much-is-your-personal-data-worth-on-the-black-market-148489'
r = requests.get(url).text
soup = bs4(r, 'html.parser')
companies = []
for elem in soup(text=re.compile(r'^(.*?Companies mentioned\b)')):
    companies.append(elem)

Desired Outcome:

I'd like to get the mentioned companies into a list as such: [Google, Apple, Tesla, Nike, TJX, Ross, L Brands, Dominoes]

I'm also open to ways I can improve the regex to catch variations like "Companies mentioned in this issue:" or "Companies mentioned:" as seen here. Thanks.

Generally using regex to parse HTML is a very bad idea. You should rely on a full-featured XML/HTML parser. — Pinke Helga, Jan 11 '19 at 03:26

score 2 · Accepted Answer · answered Jan 11 '19 at 03:28

You can access the content by providing the div class value:

import re
import requests
from bs4 import BeautifulSoup as soup

url = 'http://news.hipsternomics.com/issues/how-much-is-your-personal-data-worth-on-the-black-market-148489'
d = soup(requests.get(url).text, 'html.parser')
# Pick the first div.revue-p whose text contains the section label
new_d = [i for i in d.find_all('div', {'class': 'revue-p'}) if 'Companies mentioned' in i.text][0]
# Keep only the bare text nodes, strip the label and stray separators, and drop the last entry
*final_results, _ = [re.sub(r'^[\w\s]+[,\s:]+|^[,\s]+|\s+$', '', i) for i in new_d.contents if isinstance(i, str)]

Output:

['Google', 'Apple', 'Tesla', 'Nike', 'TJX', 'Ross', 'L Brands', 'Domino’s']

answered Jan 11 '19 at 03:28

Ajax1234


what does the syntax on the last line left hand side mean please? Looks like an unpacking. – QHarr Jan 11 '19 at 08:35

@QHarr Yes, it is unpacking. `_` is known as a [throwaway](https://stackoverflow.com/questions/36315309/how-does-python-throw-away-variable-work) variable. – Ajax1234 Jan 11 '19 at 14:28
@QHarr Glad to help! – Ajax1234 Jan 11 '19 at 14:30
super helpful. thanks. did you use a tool to build the regex function? – ezeagwulae Jan 11 '19 at 23:07

Hanxue · Answer 2 · 2019-01-11T09:25:54.620

What you want to achieve cannot be done with a single regex match alone. A capture group holds only one value (if the group repeats, only the last repetition is kept), and there is no way to create capture groups dynamically. This article has further explanation.
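
The limitation can be seen in a small sketch with Python's `re` module. The sample string and the patterns are made up for illustration and are not taken from the page. A repeated group inside one match keeps only its final repetition, whereas scanning the text repeatedly (here with `re.findall`) yields one result per match:

```python
import re

text = 'Google, Apple, Tesla'

# One match with a repeated group: group 1 is overwritten on every repetition.
single = re.match(r'(?:(\w+)(?:, )?)+', text)
print(single.groups())  # ('Tesla',)

# Repeated scanning returns one result per match instead.
print(re.findall(r'\w+', text))  # ['Google', 'Apple', 'Tesla']
```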

What I would do is first get the string containing all the companies:

# orig_text is the page text, e.g. soup.get_text() as in the full script below
all_companies = re.search(r'Companies mentioned YTD:\s(.*)', orig_text).group(1)
print(all_companies, '\n')

Next, split the string on ', ':

companies_percent = all_companies.split(', ')

# print(companies_percent, '\n')
# Output
# ['Google -1%', 'Apple 0%', 'Tesla +15%', 'Nike +17%', 'TJX +18%', 'Ross -2%', 'L Brands -47%', 'Domino’s +37%']

And finally remove the percentage after the company name:

# Capture everything before the trailing, optionally signed, percentage
companies = list(map(lambda x: re.search(r'(.*)\s[+-]?\d+%', x).group(1), companies_percent))

# print(companies, '\n')
# Output
# ['Google', 'Apple', 'Tesla', 'Nike', 'TJX', 'Ross', 'L Brands', 'Domino’s']

Putting it all together:

import re
from bs4 import BeautifulSoup as bs4
import requests

url = 'http://news.hipsternomics.com/issues/how-much-is-your-personal-data-worth-on-the-black-market-148489'
r = requests.get(url).text
soup = bs4(r, 'html.parser')

all_companies = re.search(r'Companies mentioned YTD:\s(.*)', soup.get_text()).group(1)
companies_percent = all_companies.split(', ')
companies = list(map(lambda x: re.search(r'(.*)\s[+-]?\d+%', x).group(1), companies_percent))

Runnable example at https://repl.it/@hanxue/capturingrepeatedtextgrouppython
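
The question also asks about label variants such as "Companies mentioned in this issue:" or "Companies mentioned:". One possible sketch, assuming the label and the comma-separated names sit on a single line of the extracted text (this depends on the page markup and is not verified against the site). The helper name `extract_companies` and the sample strings are hypothetical:

```python
import re

# Hypothetical helper, not part of the original answer.
# Accept any wording between "Companies mentioned" and the colon.
LABEL = re.compile(r'Companies mentioned[^:\n]*:\s*(.*)')
# Optional trailing percentage such as " -1%" or " +15%".
PERCENT = re.compile(r'\s*[+-]?\d+%$')

def extract_companies(text):
    """Return the company names listed after the label, or [] if it is absent."""
    match = LABEL.search(text)
    if match is None:
        return []
    names = (PERCENT.sub('', part.strip()) for part in match.group(1).split(','))
    return [name for name in names if name]

sample_ytd = 'Companies mentioned YTD: Google -1%, Apple 0%, Tesla +15%, L Brands -47%'
sample_plain = 'Companies mentioned in this issue: Nike, TJX, Ross'

print(extract_companies(sample_ytd))    # ['Google', 'Apple', 'Tesla', 'L Brands']
print(extract_companies(sample_plain))  # ['Nike', 'TJX', 'Ross']
```

Returning an empty list when the label is missing avoids the `AttributeError` that calling `.group(1)` on a failed `re.search` would raise. A company name that itself contains a comma would be split incorrectly by this sketch.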
