
I am trying to check whether a link ends with ".pdf".

I am skipping all the characters before ".pdf" using [\w/.-]+ in the regular expression and then checking whether the rest is ".pdf". I am new to regular expressions.

The code is:

import urllib2
import json
import re
from bs4 import BeautifulSoup
url = "http://codex.cs.yale.edu/avi/os-book/OS8/os8c/slide-dir/"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
links = soup.find_all('a')
for link in links:
    name = link.get("href")
    if(re.match(r'[\w/.-]+.pdf',name)):
        print name

I want name to match links of the following form:

PDF-dir/ch1.pdf

Shivam Mitra

2 Answers


You don't need regular expressions. Use a CSS selector to check that an href ends with pdf:

for link in soup.select("a[href$=pdf]"):
    print(link["href"])
alecxe

I made a small change to your code, using re.search with the pattern anchored at the end of the string:

for link in links:
    name = link.get("href")
    # re.search scans the whole string; \. is a literal dot and $ anchors the end
    if re.search(r'\.pdf$', name):
        print name

The output is like:

PDF-dir/ch1.pdf
PDF-dir/ch2.pdf
PDF-dir/ch3.pdf
PDF-dir/ch4.pdf
PDF-dir/ch5.pdf
PDF-dir/ch6.pdf
PDF-dir/ch7.pdf
PDF-dir/ch8.pdf
PDF-dir/ch9.pdf
PDF-dir/ch10.pdf
PDF-dir/ch11.pdf
PDF-dir/ch12.pdf
PDF-dir/ch13.pdf
PDF-dir/ch14.pdf
PDF-dir/ch15.pdf
PDF-dir/ch16.pdf
PDF-dir/ch17.pdf
PDF-dir/ch18.pdf
PDF-dir/ch19.pdf
PDF-dir/ch20.pdf
PDF-dir/ch21.pdf
PDF-dir/ch22.pdf
PDF-dir/appA.pdf
PDF-dir/appC.pdf
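
One caveat with the loop above: `link.get("href")` returns None for an `<a>` tag that has no href attribute, and passing None to `re.search` raises a TypeError. The sketch below is one possible way to guard against that, and it swaps the regex for a plain `str.endswith` check. It is written for Python 3, the helper name `pdf_links` and the sample values are hypothetical, and the case-insensitive comparison is an added assumption, not something the question asked for.

```python
def pdf_links(hrefs):
    """Keep hrefs that end in ".pdf" (case-insensitive), skipping missing ones."""
    return [h for h in hrefs if h is not None and h.lower().endswith(".pdf")]

# Hypothetical sample values; None stands for an <a> tag without an href.
sample = ["PDF-dir/ch1.pdf", None, "index.html", "PDF-dir/appA.PDF"]
print(pdf_links(sample))  # ['PDF-dir/ch1.pdf', 'PDF-dir/appA.PDF']
```

In the scraping loop this would be applied to the values of `link.get("href")` for each link.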

Rohan Amrute
  • Since he only wants '.pdf' at the end, it's better to do: if(re.search(r'\.pdf$',name)) – soungalo Dec 14 '15 at 10:49
  • This code works. But I am asking why my original code doesn't work? – Shivam Mitra Dec 14 '15 at 10:54
  • Refer to this link, difference between `re.search()` and `re.match()` http://stackoverflow.com/questions/180986/what-is-the-difference-between-pythons-re-search-and-re-match – Rohan Amrute Dec 14 '15 at 10:58
  • @ShivamMitra Basically `re.match()` attempts to match a pattern at the **beginning** of the string. `re.search()` attempts to match the pattern **throughout** the string until it finds a match. – Rohan Amrute Dec 14 '15 at 11:15
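
The difference between `re.match()` and `re.search()` described in these comments can be seen in a small, self-contained sketch. It compares the pattern from the question with the anchored pattern from the answer. The sample strings are made up for illustration, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# Pattern from the question: the dot before "pdf" is unescaped (matches any
# character) and there is no end-of-string anchor.
loose = re.compile(r'[\w/.-]+.pdf')
# Pattern from the answer: literal dot, anchored at the end of the string.
strict = re.compile(r'\.pdf$')

# Hypothetical hrefs chosen to show where the two approaches disagree.
samples = ["PDF-dir/ch1.pdf", "PDF-dir/ch1.pdf.bak", "PDF-dir/ch1_pdf", "?file=ch1.pdf"]

def classify(name):
    """Return (loose.match found, strict.search found) for one href."""
    return (loose.match(name) is not None, strict.search(name) is not None)

results = {name: classify(name) for name in samples}
for name, (matched, searched) in results.items():
    print(name, matched, searched)
```

With these inputs, the question's pattern used with `re.match()` accepts `PDF-dir/ch1.pdf`, but it also accepts `PDF-dir/ch1.pdf.bak` and `PDF-dir/ch1_pdf`, and it rejects `?file=ch1.pdf` because the first character is outside the character class. The anchored pattern used with `re.search()` accepts only the two strings that really end in ".pdf".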