I am new to Python and am trying to write a script that scrapes a website and returns the text of a couple of links. I cannot figure out why it is not working and would like to learn why. My regular expression is:

> regex = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')

Full code:

import requests, re

response = requests.get('websithere')

websiteDate = response.text

regex = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')
mo = regex.findall(websiteDate)
print(mo)

I put the (.+) in a group thinking it would find any text listed in there. The 3 links it's scanning through are:

> <a target="_blank" title="Farm Business &amp; Production Management
> Instructor" href="/uploadedpdfs/job-opportunities/Farm Business
> Production Mgt Instructor 8-17.pdf">Farm Business &amp; Production
> Management Instructor</a>
> 
> <a target="_blank" title="Paramedic Tech Adjunct Instructor Aide"
> href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor
> Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a>
> 
> <a target="_blank" title="Technology Support Specialist"
> href="/uploadedpdfs/job-opportunities/Technology Support
> Specialist.pdf">Technology Support Specialist</a>

Instead, my result contains only: 'Technology Support Specialist'

What am I doing wrong here? I'm just trying to return the text inside of the tag. I've tried playing around with it a bit and just can't get it to work. Any help would be appreciated.

Thanks!

  • Which statement did you execute to produce the output shown in your post? Please paste all relevant code. As a side note, DO NOT USE REGEX TO PARSE HTML. https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la. Use BeautifulSoup. – DYZ Aug 07 '17 at 02:41
  • Don't use regex for parsing html. – cs95 Aug 07 '17 at 02:41

1 Answer

In brief: + is greedy, so the part title=".+" of your regular expression consumes everything from the beginning of the first title to the end of the last title:

Farm Business & Production Management Instructor" href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">Farm Business & Production Management Instructor</a> <a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a> <a target="_blank" title="Technology Support Specialist
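To see the greedy behaviour in isolation, here is a minimal sketch. It assumes two of the links from the question sit on a single line of the page source (by default `.` does not match newlines), and the names `greedy` and `lazy` are made up for illustration. The second pattern uses the non-greedy `.+?` and escapes the dot before `pdf`; it is shown only to explain the symptom, not as a recommended way to scrape the page.

```python
import re

# Two of the links from the question, joined on one line (an assumption
# about how they appear in the real page source).
html = (
    '<a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" '
    'href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">'
    'Paramedic Tech Adjunct Instructor Aide</a> '
    '<a target="_blank" title="Technology Support Specialist" '
    'href="/uploadedpdfs/job-opportunities/Technology Support Specialist.pdf">'
    'Technology Support Specialist</a>'
)

# Original pattern: every .+ grabs as much as it can, so the whole line
# becomes a single match and only the last link text is captured.
greedy = re.compile(r'<a target="_blank" title=".+" href=".+.pdf">(.+)</a>')

# Non-greedy variant: .+? stops at the first place the rest can match.
lazy = re.compile(r'<a target="_blank" title=".+?" href=".+?\.pdf">(.+?)</a>')

greedy_result = greedy.findall(html)
lazy_result = lazy.findall(html)

print(greedy_result)  # ['Technology Support Specialist']
print(lazy_result)    # ['Paramedic Tech Adjunct Instructor Aide', 'Technology Support Specialist']
```

Even the non-greedy version breaks as soon as the attribute order changes or a tag is split across lines, which is why a real parser is the better tool.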

DO NOT USE REGEX TO PARSE HTML

Use BeautifulSoup instead.
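A rough sketch of what that could look like, assuming the `beautifulsoup4` package is installed and that the links you want are exactly the anchors whose `href` ends in `.pdf`. The inline `html` string stands in for `response.text`, and the extra "About" link is invented here only to show the filter doing something.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; the last link is hypothetical and is only
# there to show that non-PDF links get filtered out.
html = """
<a target="_blank" title="Farm Business &amp; Production Management Instructor" href="/uploadedpdfs/job-opportunities/Farm Business Production Mgt Instructor 8-17.pdf">Farm Business &amp; Production Management Instructor</a>
<a target="_blank" title="Paramedic Tech Adjunct Instructor Aide" href="/uploadedpdfs/job-opportunities/Paramedic Adjunct Instructor Aide.pdf">Paramedic Tech Adjunct Instructor Aide</a>
<a target="_blank" title="Technology Support Specialist" href="/uploadedpdfs/job-opportunities/Technology Support Specialist.pdf">Technology Support Specialist</a>
<a href="/about.html">About</a>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Collect the visible text of every <a> whose href ends in .pdf.
link_texts = [
    a.get_text(strip=True)
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".pdf")
]

print(link_texts)
```

The parser also decodes entities, so `&amp;` in the first link comes back as a plain `&` in the extracted text.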

DYZ
  • Ok, so I'm not super familiar with BeautifulSoup, but I have worked with it a bit. Is there something other than regex that I can use in BeautifulSoup to narrow down my results, and that I can read about? What's the reasoning behind not using regex for websites? –  Aug 07 '17 at 02:54
  • The extensive BeautifulSoup documentation has examples of how to extract link titles from HTML. Help yourself. – DYZ Aug 07 '17 at 03:00