Getting incorrect match while using regular expressions

Question

I am trying to check whether a link ends with ".pdf".

I am skipping all the characters before ".pdf" using [\w/.-]+ in the regular expression and then checking whether ".pdf" follows. I am new to regular expressions.

The code is:

import urllib2
import json
import re
from bs4 import BeautifulSoup
url = "http://codex.cs.yale.edu/avi/os-book/OS8/os8c/slide-dir/"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
links = soup.find_all('a')
for link in links:
    name = link.get("href")
    if(re.match(r'[\w/.-]+.pdf',name)):
        print name

I want name to match links of the following form:

PDF-dir/ch1.pdf

Do you want all results containing ".pdf" or just one? If you want all of them, maybe you should use search instead of match. — codelover123, Dec 14 '15 at 10:30
1. `for links in links` is bad and will surely cause problems. 2. Have you tried a simple `os.path.splitext`? — TigerhawkT3, Dec 14 '15 at 10:37
@RohanAmrute That will find all the ".pdf". It will also return true if link = "www.pdfsite.com". — Shivam Mitra, Dec 14 '15 at 10:42
Ok, so, why not use the pattern like `re.search(r'[\w-]+/[\w-]+\.pdf$',name)`? — Wiktor Stribiżew, Dec 14 '15 at 10:43
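
The two suggestions in the comments above can be compared side by side. This is only a sketch: the helper names `has_pdf_extension` and `matches_dir_pdf` are made up for illustration, and apart from `PDF-dir/ch1.pdf` and `www.pdfsite.com` the sample hrefs are assumptions, not links from the actual page.

```python
import os.path
import re

def has_pdf_extension(href):
    """Return True if href is a non-empty string whose extension is .pdf."""
    if not href:  # link.get("href") gives None for <a> tags without an href
        return False
    return os.path.splitext(href)[1].lower() == ".pdf"

def matches_dir_pdf(href):
    """Apply the pattern from the last comment: dir/name.pdf at the end of the string."""
    return bool(href) and re.search(r'[\w-]+/[\w-]+\.pdf$', href) is not None

# Hypothetical sample values; None stands in for an anchor with no href.
for href in ["PDF-dir/ch1.pdf", "www.pdfsite.com", "PDF-dir/ch1.pdf.bak", None]:
    print(href, has_pdf_extension(href), matches_dir_pdf(href))
```

Both checks accept `PDF-dir/ch1.pdf` and reject `www.pdfsite.com`, because they look at the end of the string and not at ".pdf" anywhere inside it.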

score 3 · Answer 1 · answered Dec 14 '15 at 14:24

You don't need regular expressions. Use a CSS selector to check that an href ends with pdf:

for link in soup.select("a[href$=pdf]"):
    print(link["href"])

alecxe

score 1 · Rohan Amrute · Accepted Answer · 2015-12-14T10:52:32.717

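I made a small change to your code: it uses re.search with the pattern anchored at the end of the string.

for link in links:
    name = link.get("href")
    # skip anchors without an href (get returns None), then test the ending
    if name and re.search(r'\.pdf$', name):
        print(name)

The output looks like: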

PDF-dir/ch1.pdf
PDF-dir/ch2.pdf
PDF-dir/ch3.pdf
PDF-dir/ch4.pdf
PDF-dir/ch5.pdf
PDF-dir/ch6.pdf
PDF-dir/ch7.pdf
PDF-dir/ch8.pdf
PDF-dir/ch9.pdf
PDF-dir/ch10.pdf
PDF-dir/ch11.pdf
PDF-dir/ch12.pdf
PDF-dir/ch13.pdf
PDF-dir/ch14.pdf
PDF-dir/ch15.pdf
PDF-dir/ch16.pdf
PDF-dir/ch17.pdf
PDF-dir/ch18.pdf
PDF-dir/ch19.pdf
PDF-dir/ch20.pdf
PDF-dir/ch21.pdf
PDF-dir/ch22.pdf
PDF-dir/appA.pdf
PDF-dir/appC.pdf

edited Dec 14 '15 at 10:52

answered Dec 14 '15 at 10:45

Rohan Amrute

Since he only wants '.pdf' at the end, it's better to do: if(re.search(r'\.pdf$',name)) – soungalo Dec 14 '15 at 10:49
This code works. But I am asking why my original code doesn't work? – Shivam Mitra Dec 14 '15 at 10:54
See this link for the difference between `re.search()` and `re.match()`: http://stackoverflow.com/questions/180986/what-is-the-difference-between-pythons-re-search-and-re-match – Rohan Amrute Dec 14 '15 at 10:58
@ShivamMitra Basically `re.match()` attempts to match a pattern at the **beginning** of the string. `re.search()` attempts to match the pattern **throughout** the string until it finds a match. – Rohan Amrute Dec 14 '15 at 11:15
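
To make the match/search difference concrete, here is a small sketch. It also shows one plausible reason the original pattern gives unexpected matches: the `.` before `pdf` is unescaped and there is no `$`. The thread does not say which links were wrongly matched, so the names `docs/xpdf` and `PDF-dir/ch1.pdf.html` are made-up test inputs, and `loose` / `strict` are illustrative variable names.

```python
import re

name = "PDF-dir/ch1.pdf"

# re.match only tries at position 0; re.search scans the whole string.
print(re.match(r'\.pdf$', name))          # None: the string does not start with ".pdf"
print(bool(re.search(r'\.pdf$', name)))   # True

loose = r'[\w/.-]+.pdf'      # pattern from the question: "." matches any character, no end anchor
strict = r'[\w/.-]+\.pdf$'   # escaped dot plus "$" so ".pdf" must be the literal ending

# Hypothetical inputs chosen to show where the two patterns disagree.
for candidate in ["PDF-dir/ch1.pdf", "docs/xpdf", "PDF-dir/ch1.pdf.html"]:
    print(candidate, bool(re.match(loose, candidate)), bool(re.match(strict, candidate)))
```

With these inputs the loose pattern accepts all three strings, while the strict one accepts only `PDF-dir/ch1.pdf`.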
