
I am trying to create a web-scraping script that scrapes data from a website based on keywords. If a keyword occurs on the website, it should return the entire paragraph (or, better, the entire job listing with its description). However, my code currently returns only the keyword I searched for instead of the paragraph the keyword is in. Here is my code:

import requests
from bs4 import BeautifulSoup as Bsoup

keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

headers = {}  # note: the original {''} is a set, not a dict, and breaks requests.get

url = "https://www.auftrag.at//tenders.aspx"

data = requests.get(url, headers=headers, timeout=5)
soup = Bsoup(data.text, 'html.parser')

# jobs = soup.find_all('div', {'class': 'article'})
jobs = soup.find_all(string=["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
                             "Machine Learning"])

print(jobs)

for word in jobs:
    print(word)
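(The reason only the keywords come back: `find_all(string=...)` matches the text nodes themselves, so each result is a `NavigableString`, not the tag around it. You can climb back up to the enclosing tag with `find_parent()`. A minimal sketch against an invented HTML snippet, since the live page's markup isn't shown here:)

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
  <p>Suchen Experten für Big Data und Analytics.</p>
  <p>Baumeisterarbeiten am Standort Wien.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# string=... with a callable matches substrings; with a list it only
# matches text nodes that EQUAL one of the keywords exactly.
hits = soup.find_all(string=lambda s: s and "Big Data" in s)

# Each hit is a NavigableString, so climb to the enclosing <p>
# to recover the whole paragraph.
for hit in hits:
    paragraph = hit.find_parent("p")
    print(paragraph.get_text(strip=True))
```

Note the second pitfall this sketch highlights: passing a list to `string=` requires an exact match of the whole text node, which is another reason the original code finds so little.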
asked by Erik D

2 Answers


You can change your find_all to match text with a regex:

jobs = soup.find_all('p', text=re.compile(r'|'.join(keywords)))

So the full code will be:

import requests
import re
from bs4 import BeautifulSoup as Bsoup

keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

url = "https://www.auftrag.at//tenders.aspx"
data = requests.get(url, timeout=5)
soup = Bsoup(data.text, 'html.parser')

# jobs = soup.find_all('div', {'class': 'article'})
jobs = soup.find_all('p', text=re.compile(r'|'.join(keywords)))

print(len(jobs))

for word in jobs:
    print(word)

My output here gives me 136 results.

EDIT:

I would add word boundaries to avoid false matches, like KI matching inside KILL.

So I would write this regex:

jobs = soup.find_all('p', text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords)))
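(As a side note, two small refinements are worth considering when building the pattern this way: `re.escape` protects against keywords that happen to contain regex metacharacters, and `re.IGNORECASE` makes the duplicated "Data"/"data" and "Big Data"/"big data" entries unnecessary. A sketch with a trimmed, hypothetical keyword list:)

```python
import re

keywords = ["KI", "AI", "Big Data", "Analytics", "Digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

# re.escape guards against metacharacters in keywords,
# re.IGNORECASE collapses the case variants into one entry each,
# and \b keeps KI from matching inside KILL.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, keywords)),
                     re.IGNORECASE)

print(bool(pattern.search("Ausschreibung: big data Projekt")))  # True
print(bool(pattern.search("KILLERFEATURE")))                    # False
```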
– Maaz

Since the consensus seems to be that using regex for HTML is not a good idea, here's a non-regex alternative:

import requests
from bs4 import BeautifulSoup as bs4

keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
            "Machine Learning", "Baumeisterarbeiten"]

url = "https://www.auftrag.at//tenders.aspx"
data = requests.get(url, timeout=5)
soup = bs4(data.text, 'html.parser')
jobs = soup.find_all('p')

for keyword in keywords:
    for job in jobs:
        if keyword in str(job):
            print(job)

Output is 138 results, compared with 136 in @Maaz's answer (not sure why the discrepancy).
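(One plausible source of the discrepancy: with the keyword loop on the outside, a paragraph containing several keywords is printed once per keyword. A sketch that collects each paragraph at most once by using `any()` over the keywords, again against an invented snippet:)

```python
from bs4 import BeautifulSoup

keywords = ["Big Data", "Analytics", "Baumeisterarbeiten"]

html = """
<p>Big Data und Analytics Projekt.</p>
<p>Baumeisterarbeiten in Graz.</p>
<p>Keine Schlagwörter hier.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# any() stops at the first matching keyword, so a paragraph that
# contains several keywords is still collected exactly once.
matches = [p for p in soup.find_all("p")
           if any(kw in p.get_text() for kw in keywords)]

for p in matches:
    print(p.get_text())
```

Here the first paragraph matches two keywords but appears only once in `matches`.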

– Jack Fleeting
  • Thanks a lot for this, very neat :) and a really interesting read, thanks for sharing! – Erik D Mar 07 '19 at 16:50
  • Reading the post you linked, I think there is no problem in my answer, because it uses bs4 to find the tag and the regex to match the _text_ inside the tag. The regex is not used here to match the _HTML_ directly, so using a regex in this case should not be a problem in my opinion. But if someone has a better explanation, I'll take it of course :-) – Maaz Mar 07 '19 at 18:58
  • By the way, the number of matching `p` tags was 123 using the word boundaries, if I remember. – Maaz Mar 07 '19 at 19:00
  • @Maaz - the anti-regex comment was not a knock against your answer; it's just that I've seen the idea so many times in SO answers (and other places) that I've developed a knee-jerk reaction: I see HTML, I close down the regex part of my brain (such as it is...) – Jack Fleeting Mar 07 '19 at 19:52
  • Yes, I'd understood, no worries :-) But I thought it was interesting to debate about it. – Maaz Mar 07 '19 at 19:57