
I need a nudge to finish out this script.

I'm scraping a newsletter site for a particular substring. The goal is to find the section of the page called "Companies mentioned" and collect the name of each company in a list.

Here is what I have so far. It works, but it only gets the first item:

from bs4 import BeautifulSoup as bs4
import requests
import re

url = 'http://news.hipsternomics.com/issues/how-much-is-your-personal-data-worth-on-the-black-market-148489'
r = requests.get(url).text
soup = bs4(r, 'html.parser')
companies = []
for elem in soup(text=re.compile(r'^(.*?Companies mentioned\b)')):
    companies.append(elem)    

Desired Outcome:

  • I'd like to get the mentioned companies into a list like this: [Google, Apple, Tesla, Nike, TJX, Ross, L Brands, Dominoes]

I'm also open to ways I can improve the regex to catch variants like "Companies mentioned in this issue:" or "Companies mentioned:", as seen here. Thanks.

ezeagwulae

2 Answers

You can access the content by providing the div class value:

import re
import requests
from bs4 import BeautifulSoup as soup

url = 'http://news.hipsternomics.com/issues/how-much-is-your-personal-data-worth-on-the-black-market-148489'
d = soup(requests.get(url).text, 'html.parser')
# Take the first div of class 'revue-p' whose text mentions the section.
new_d = [i for i in d.find_all('div', {'class': 'revue-p'}) if 'Companies mentioned' in i.text][0]
# Clean each text node (raw-string pattern); the last text node is discarded.
*final_results, _ = [re.sub(r'^[\w\s]+[,\s:]+|^[,\s]+|\s+$', '', i) for i in new_d.contents if isinstance(i, str)]

Output:

['Google', 'Apple', 'Tesla', 'Nike', 'TJX', 'Ross', 'L Brands', 'Domino’s']
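
To make the cleanup pattern easier to follow, here is a small offline sketch that applies it to a few text nodes. The node strings are hypothetical: they assume each company name sits in a text node that starts with a comma (or with the section header) and ends with whitespace, which may not match the real page exactly.

```python
import re

# Hypothetical text nodes, assumed to resemble the strings found between
# the child tags of the matching div; the real page may differ.
text_nodes = ["Companies mentioned YTD: Google ", ", Apple ", ", L Brands "]

# Same pattern as above, written as a raw string:
#   ^[\w\s]+[,\s:]+  strips a leading header such as "Companies mentioned YTD: "
#   ^[,\s]+          strips a leading comma and whitespace
#   \s+$             strips trailing whitespace
pattern = re.compile(r'^[\w\s]+[,\s:]+|^[,\s]+|\s+$')

cleaned = [pattern.sub('', node) for node in text_nodes]
print(cleaned)
```

Note that the first alternative would empty a node such as "Apple " (a bare name followed by whitespace, with no leading comma), so the pattern depends on the layout assumed here.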
Ajax1234

What you want to achieve cannot be done with a single regex. In Python's re module, a capture group that is repeated keeps only its last match, so there is no way to capture a variable number of groups. This article has further explanation.

What I would do is first get the string containing all the companies (orig_text is the text of the page):
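
As a quick illustration of that limitation, the sketch below runs both approaches on a made-up sample string (the string is an assumption, not text taken from the page). A repeated group keeps only its final repetition, whereas capturing the whole list once and splitting it recovers every name:

```python
import re

# Hypothetical sample text, shaped like the line in the newsletter.
sample = "Companies mentioned: Google, Apple, Tesla"

# A repeated capture group keeps only the text of its final repetition.
repeated = re.search(r"Companies mentioned:(?:\s*(\w+),?)+", sample)
last_only = repeated.group(1)

# Capturing the whole list once and splitting it recovers every name.
whole = re.search(r"Companies mentioned:\s*(.*)", sample).group(1)
names = [name.strip() for name in whole.split(",")]

print(last_only)
print(names)
```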

all_companies = re.search(r'Companies mentioned YTD:\s(.*)', orig_text).group(1)
print(all_companies, '\n')

Next, split the string on ', ':

companies_percent = all_companies.split(', ')

# print(companies_percent, '\n')
# Output
# ['Google -1%', 'Apple 0%', 'Tesla +15%', 'Nike +17%', 'TJX +18%', 'Ross -2%', 'L Brands -47%', 'Domino’s +37%'] 

And finally, remove the percentage after each company name:

# [+-]? matches an optional sign; a '|' inside a character class would be a literal pipe
companies = [re.search(r'(.*)\s[+-]?\d+%', x).group(1) for x in companies_percent]

# print(companies, '\n')
# Output
# ['Google', 'Apple', 'Tesla', 'Nike', 'TJX', 'Ross', 'L Brands', 'Domino’s'] 
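
The three steps above can also be sketched offline, without fetching the page. This version is only an illustration: the names `HEADER`, `PERCENT`, and `extract_companies` are hypothetical, the sample line is copied from the output shown above, and the header pattern is widened on the assumption that the only variants are the ones listed in the question ("Companies mentioned:", "Companies mentioned in this issue:") plus "Companies mentioned YTD:".

```python
import re

# Header variants: "Companies mentioned:", "Companies mentioned YTD:" and
# "Companies mentioned in this issue:" (assumed to be the only forms).
HEADER = re.compile(r"Companies mentioned(?: YTD| in this issue)?:\s*(.*)")
# Optional trailing percentage such as "-1%", "0%" or "+15%".
PERCENT = re.compile(r"\s*[+-]?\d+%\s*$")


def extract_companies(text):
    """Return the company names following a 'Companies mentioned' header."""
    match = HEADER.search(text)
    if match is None:
        return []  # no such section in the text
    parts = match.group(1).split(",")
    names = [PERCENT.sub("", part).strip() for part in parts]
    return [name for name in names if name]


sample = ("Companies mentioned YTD: Google -1%, Apple 0%, Tesla +15%, Nike +17%, "
          "TJX +18%, Ross -2%, L Brands -47%, Domino’s +37%")
print(extract_companies(sample))
print(extract_companies("Companies mentioned in this issue: Google, Apple"))
```

Returning an empty list when the header is missing avoids the AttributeError that calling .group(1) on a failed re.search would raise.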

Putting it all together:

import re
from bs4 import BeautifulSoup as bs4
import requests

url = 'http://news.hipsternomics.com/issues/how-much-is-your-personal-data-worth-on-the-black-market-148489'
r = requests.get(url).text
soup = bs4(r, 'html.parser')

all_companies = re.search(r'Companies mentioned YTD:\s(.*)', soup.get_text()).group(1)
companies_percent = all_companies.split(', ')
companies = [re.search(r'(.*)\s[+-]?\d+%', x).group(1) for x in companies_percent]

Runnable example at https://repl.it/@hanxue/capturingrepeatedtextgrouppython

Hanxue