
I am trying to extract a list of hyperlink texts (along with each URL and date) from the website http://www.efsa.europa.eu/en/news using regular expressions.

An example of this text would be "Veterinary drug residues in animals and food: compliance with safety levels still high"

However, my expression returns more text than required, e.g.:

<span class="field-content"><a href="/en/news/veterinary-drug-residues-animals-and-food-compliance-safety-levels-still-high">Veterinary drug residues in animals and food: compliance with safety levels still high"

Here is my code:

import bs4, requests, re


res = requests.get('http://www.efsa.europa.eu/en/news')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'html.parser')
elems = soup.select('body > div.l-page > div > div > div > div > div > div > div.view-content.news-page-display')

a = str(elems[0])

text = re.findall(r'">(.+?)</a></span> </div>',a)

# Print each match followed by a blank line
for headline in text:
    print(headline + '\n')

Does anyone have any idea what might be causing this? I have been trying for an hour and now given up :(

Thanks in advance!

Tobias Funke
  • If you have BeautifulSoup at your disposal, why are you resorting to regexes? – Mark Apr 26 '20 at 19:02
  • If you are already using `BeautifulSoup`, why not use it to find the `<a>` elements as well? – Eric Truett Apr 26 '20 at 19:03
  • For longer explanations, see duplicate threads linked at the top, in short: you should have written `>([^>]+) `, but to actually have no trouble like this, just go on with BS. Regex is not necessary here, to just get the text from the `a` tag. – Wiktor Stribiżew Apr 26 '20 at 19:06
  • I am using BeautifulSoup to get the whole website content and then regex to find the news headlines, hyperlinks, and dates. The code iterates over 20 sites with a regex for each site's data. Is this not right? I am brand new to python and just learning at the moment. – Tobias Funke Apr 26 '20 at 19:06
  • @blahblahvvvvv you can get the anchors with `el = soup.select('div.view-content.news-page-display a')`. Then each element of `el` exposes its link: `for a in el: a['href']` – Mark Apr 26 '20 at 19:08
  • @MarkMeyer Thank you! I wish I had known this when I started my project weeks ago! – Tobias Funke Apr 26 '20 at 19:11
  • You can simply match `.*> *` and replace it with an empty string. [Demo](https://regex101.com/r/4bX3vk/2/). As `.*` is greedy, it will gobble up all characters, including `>`'s, until it reaches the last `>`. It will then consume that character and any following spaces. In other words, it will consume all characters up to the beginning of the string you wish to extract. – Cary Swoveland Apr 26 '20 at 20:42
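
To make the comments above concrete, here is a minimal sketch of why the original pattern over-matches and of the two suggested fixes. It runs on a small hard-coded HTML fragment modelled on the markup quoted in the question rather than on the live page, so the `views-row` wrapper and the variable names are assumptions for illustration. The real page may be structured differently, and the date is not extracted because its markup is not shown in the question.

```python
import re

import bs4

title = ('Veterinary drug residues in animals and food: '
         'compliance with safety levels still high')
href = ('/en/news/veterinary-drug-residues-animals-and-food-'
        'compliance-safety-levels-still-high')

# Hypothetical fragment: only the span/a part is taken from the question;
# the surrounding div elements are assumed for the sake of the example.
html = (
    '<div class="view-content news-page-display">'
    '<div class="views-row">'
    '<span class="field-content">'
    f'<a href="{href}">{title}</a></span> </div>'
    '</div>'
)

# Original pattern. A regex match begins at the leftmost position that can
# succeed, so it starts at the first '">' in the string. The lazy '.+?'
# only shortens the right-hand end, not the left-hand one.
lazy = re.findall(r'">(.+?)</a></span> </div>', html)

# Regex fix: forbid angle brackets inside the captured text, so the capture
# can only begin after the '>' that closes the <a ...> tag.
fixed = re.findall(r'>([^<>]+)</a>', html)

# BeautifulSoup approach from the comments: select the anchors directly,
# then read the text and the href attribute from each one.
soup = bs4.BeautifulSoup(html, 'html.parser')
anchors = soup.select('div.view-content.news-page-display a')
links = [(a.get_text(strip=True), a['href']) for a in anchors]

print(lazy[0][:40])   # begins with leftover markup, not the headline
print(fixed)
print(links)
```

With the selector approach no regex is needed at all, which is probably the more robust choice when the same script has to cover many differently structured sites.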
