Regex in Python not getting the result I want

Question

I am new to Python and am trying to write a script that scrapes a website and returns the text of a couple of links. I cannot figure out why it is not working and would like to learn why. My regular expression is:

> regex = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')

Full code:

import requests, re

response = requests.get('websithere')

websiteDate = response.text

regex = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')
mo = regex.findall(websiteDate)
print(mo)

I put the (.+) in a group thinking it would find any text listed in there. The 3 links it's scanning through are:

> <a target="_blank" title="Farm Business &amp; Production Management
> Instructor" href="/uploadedpdfs/job-opportunities/Farm Business
> Production Mgt Instructor 8-17.pdf">Farm Business &amp; Production
> Management Instructor</a>
> 
> <a target="_blank" title="Paramedic Tech Adjunct Instructor Aide"
> href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor
> Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a>
> 
> <a target="_blank" title="Technology Support Specialist"
> href="/uploadedpdfs/job-opportunities/Technology Support
> Specialist.pdf">Technology Support Specialist</a>

Instead, the only result I get back is: 'Technology Support Specialist'

What am I doing wrong here? I'm just trying to return the text inside of the tag. I've tried playing around with it a bit and just can't get it to work. Any help would be appreciated.

Thanks!

Which statement did you execute to produce the output shown in your post? Please paste all relevant code. As a side note, DO NOT USE REGEX TO PARSE HTML. https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la. Use BeautifulSoup. – DYZ, Aug 07 '17 at 02:41

score 0 · Answer 1 · answered Aug 07 '17 at 02:46

In brief: .+ is greedy, meaning it matches as many characters as it can. As a result, the part title=".+" of your regular expression consumes everything from the beginning of the first title to the end of the last title:

Farm Business & Production Management Instructor" href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">Farm Business & Production Management Instructor</a> <a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a> <a target="_blank" title="Technology Support Specialist
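To see the greediness concretely, here is a minimal sketch. It assumes the three links arrive on a single line (which the output in the question suggests); the html string and the lazy pattern are illustrations made up for this example, not the asker's actual page or code.

```python
import re

# The three links from the question, assumed to sit on one line of the page.
html = (
    '<a target="_blank" title="Farm Business &amp; Production Management Instructor" '
    'href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">'
    'Farm Business &amp; Production Management Instructor</a> '
    '<a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" '
    'href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">'
    'Paramedic Tech Adjunct Instructor Aide</a> '
    '<a target="_blank" title="Technology Support Specialist" '
    'href="/uploadedpdfs/job-opportunities/Technology Support Specialist.pdf">'
    'Technology Support Specialist</a>'
)

# Original pattern: every .+ is greedy, so the whole line becomes one match.
greedy = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')

# Illustrative variant: .+? is lazy (stops at the first possible point),
# and the dot before "pdf" is escaped so it only matches a literal dot.
lazy = re.compile(r'<a target="_blank" title=".+?" href=".+?\.pdf">(.+?)</a>')

print(greedy.findall(html))  # ['Technology Support Specialist']
print(lazy.findall(html))    # all three link texts, with &amp; left undecoded
```

The lazy version happens to work on this input, but it still breaks if the attributes change order or a tag spans several lines, which is part of why a parser is the safer tool.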

DO NOT USE REGEX TO PARSE HTML

Use BeautifulSoup instead.
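A rough sketch of what that could look like, assuming the bs4 package is installed and that the links you want are exactly the anchors whose href ends in .pdf. The html string below is a stand-in built from the links in the question; in the real script you would pass response.text instead.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text, built from the links quoted in the question.
html = (
    '<a target="_blank" title="Farm Business &amp; Production Management Instructor" '
    'href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">'
    'Farm Business &amp; Production Management Instructor</a> '
    '<a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" '
    'href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">'
    'Paramedic Tech Adjunct Instructor Aide</a> '
    '<a target="_blank" title="Technology Support Specialist" '
    'href="/uploadedpdfs/job-opportunities/Technology Support Specialist.pdf">'
    'Technology Support Specialist</a>'
)

# Parse with the parser that ships with Python, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that have an href ending in ".pdf", then take their text.
titles = [
    a.get_text(strip=True)
    for a in soup.find_all("a", href=True)
    if a["href"].endswith(".pdf")
]

print(titles)
```

Unlike the regex, this does not care about attribute order or line breaks inside the tag, and get_text() also decodes entities such as &amp; for you.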

DYZ

OK, so I'm not super familiar with BeautifulSoup, but I have worked with it a bit. Is there something else in BeautifulSoup besides regex that I can use to narrow down my results, and that I can read about? What's the reasoning behind not using regex for websites? – Aug 07 '17 at 02:54
The extensive BeautifulSoup documentation has examples of how to extract link titles from HTML. Help yourself. – DYZ Aug 07 '17 at 03:00
