
I am trying to create a web-scraping script that scrapes data from a website based on keywords. If a keyword occurs on the website, it should return the entire paragraph (or, better, the entire job listing with its description). However, my code currently returns only the keyword I searched for instead of the paragraph the keyword is in. Here is my code:

import requests
from bs4 import BeautifulSoup as Bsoup

keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

headers = {}  # note: the original {''} is a set, not a dict, and breaks requests.get

url = "https://www.auftrag.at//tenders.aspx"

data = requests.get(url, headers=headers, timeout=5)
soup = Bsoup(data.text, 'html.parser')

# jobs = soup.find_all('div', {'class': 'article'})
jobs = soup.find_all(string=["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
                             "Machine Learning"])

print(jobs)

for word in jobs:
    print(word)
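(The reason only the keywords come back: `find_all(string=...)` matches the text nodes themselves, so each result is a `NavigableString`, not the tag around it. You can climb back up to the enclosing tag with `find_parent()`. A minimal sketch against an invented HTML snippet, since the live page's markup isn't shown here:)

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
  <p>Suchen Experten für Big Data und Analytics.</p>
  <p>Baumeisterarbeiten am Standort Wien.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# string=... with a callable matches substrings; with a list it only
# matches text nodes that EQUAL one of the keywords exactly.
hits = soup.find_all(string=lambda s: s and "Big Data" in s)

# Each hit is a NavigableString, so climb to the enclosing <p>
# to recover the whole paragraph.
for hit in hits:
    paragraph = hit.find_parent("p")
    print(paragraph.get_text(strip=True))
```

Note the second pitfall this sketch highlights: passing a list to `string=` requires an exact match of the whole text node, which is another reason the original code finds so little.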
asked by Erik D

2 Answers


You can change your find_all to match text with a regex:

jobs = soup.find_all('p', text=re.compile(r'|'.join(keywords)))

So the full code will be:

import requests
import re
from bs4 import BeautifulSoup as Bsoup

keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

url = "https://www.auftrag.at//tenders.aspx"
data = requests.get(url, timeout=5)
soup = Bsoup(data.text, 'html.parser')

# jobs = soup.find_all('div', {'class': 'article'})
jobs = soup.find_all('p', text=re.compile(r'|'.join(keywords)))

print(len(jobs))

for word in jobs:
    print(word)

My output here gives me 136 results.

EDIT:

I would add word boundaries to avoid false matches, like KI matching inside KILL.

So I would write this regex:

jobs = soup.find_all('p', text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords)))
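(As a side note, two small refinements are worth considering when building the pattern this way: `re.escape` protects against keywords that happen to contain regex metacharacters, and `re.IGNORECASE` makes the duplicated "Data"/"data" and "Big Data"/"big data" entries unnecessary. A sketch with a trimmed, hypothetical keyword list:)

```python
import re

keywords = ["KI", "AI", "Big Data", "Analytics", "Digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

# re.escape guards against metacharacters in keywords,
# re.IGNORECASE collapses the case variants into one entry each,
# and \b keeps KI from matching inside KILL.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, keywords)),
                     re.IGNORECASE)

print(bool(pattern.search("Ausschreibung: big data Projekt")))  # True
print(bool(pattern.search("KILLERFEATURE")))                    # False
```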
– Maaz

Since the consensus seems to be that using regex for HTML is not a good idea, here's a non-regex alternative:

import requests
from bs4 import BeautifulSoup as bs4

keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

url = "https://www.auftrag.at//tenders.aspx"
data = requests.get(url, timeout=5)
soup = bs4(data.text, 'html.parser')
jobs = soup.find_all('p')

for keyword in keywords:
    for job in jobs:
        if keyword in str(job):
            print(job)

Output is 138 results, compared with 136 in @Maaz's answer (not sure why the discrepancy).
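(One plausible source of the discrepancy: with the keyword loop on the outside, a paragraph containing several keywords is printed once per keyword. A sketch that collects each paragraph at most once by using `any()` over the keywords, again against an invented snippet:)

```python
from bs4 import BeautifulSoup

keywords = ["Big Data", "Analytics", "Baumeisterarbeiten"]

html = """
<p>Big Data und Analytics Projekt.</p>
<p>Baumeisterarbeiten in Graz.</p>
<p>Keine Schlagwörter hier.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# any() stops at the first matching keyword, so a paragraph that
# contains several keywords is still collected exactly once.
matches = [p for p in soup.find_all("p")
           if any(kw in p.get_text() for kw in keywords)]

for p in matches:
    print(p.get_text())
```

Here the first paragraph matches two keywords but appears only once in `matches`.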

– Jack Fleeting
  • Thanks a lot for this, very neat :) and a really interesting read, thanks for sharing! – Erik D Mar 07 '19 at 16:50
  • Reading the post you linked, I think there is no problem in my answer, because it uses bs4 to find the tag and the regex to match the _text_ inside the tag. The regex is not used here to match the _HTML_ directly, so using a regex in this case should not be a problem in my opinion. But if someone has a better explanation, I'll take it of course :-) – Maaz Mar 07 '19 at 18:58
  • By the way, the number of matching `p` tags was 123 using the word boundaries, if I remember. – Maaz Mar 07 '19 at 19:00
  • @Maaz - the anti-regex comment was not a knock against your answer; it's just that I've seen the idea so many times in SO answers (and other places) that I've developed a knee-jerk reaction: I see HTML, I close down the regex part of my brain (such as it is...) – Jack Fleeting Mar 07 '19 at 19:52
  • Yes, I'd understood, no worries :-) But I thought it was interesting to debate about it. – Maaz Mar 07 '19 at 19:57