I am trying to extract a list of hyperlink texts (as well as each link's URL and date) from the website http://www.efsa.europa.eu/en/news using regular expressions.
An example of this text would be "Veterinary drug residues in animals and food: compliance with safety levels still high"
However, my expression returns more text than required, e.g.:
<span class="field-content"><a href="/en/news/veterinary-drug-residues-animals-and-food-compliance-safety-levels-still-high">Veterinary drug residues in animals and food: compliance with safety levels still high"
Here is my code:
import bs4, requests, re
res = requests.get('http://www.efsa.europa.eu/en/news')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'html.parser')
elems = soup.select('body > div.l-page > div > div > div > div > div > div > div.view-content.news-page-display')
a = str(elems[0])
text = re.findall(r'">(.+?)</a></span> </div>',a)
for title in text:
    print(title + '\n')
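The behaviour can be reproduced without a network call. The following is only a minimal sketch: the `html` string is a hypothetical snippet whose opening tags are copied from the output shown above and whose closing tags are the ones the pattern expects, so the real page may differ. It runs the pattern from the code above next to a narrower comparison pattern (an assumption, not something taken from the site) whose capture group cannot cross a tag boundary.

```python
import re

# Hypothetical snippet: opening tags copied from the output shown above,
# closing tags taken from what the pattern expects. The real page may differ.
html = (
    '<div><span class="field-content">'
    '<a href="/en/news/veterinary-drug-residues-animals-and-food-compliance-safety-levels-still-high">'
    'Veterinary drug residues in animals and food: compliance with safety levels still high'
    '</a></span> </div>'
)

# Pattern from the code above: the search starts at the first '">' in the
# string, which closes the <span> tag rather than the <a> tag.
original = re.findall(r'">(.+?)</a></span> </div>', html)

# Comparison pattern (an assumption): [^<]+ cannot contain '<', so the
# capture can only begin right after the <a ...> tag.
narrowed = re.findall(r'">([^<]+)</a></span> </div>', html)

print(original[0])
print(narrowed[0])
```

On this snippet the first pattern captures the `<a href=...>` tag together with the title, while the second captures only the title. Whether the narrower pattern holds up on the live page depends on the markup matching this assumed snippet.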
Does anyone have any idea what might be causing this? I have been trying for an hour and have now given up :(
Thanks in advance!