Narrowing Regex results

Question

I'm creating a regex. This is my test dataset:

<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>

I'm trying to capture from "href=" to "pdf". The following regex:

href=.*?\.pdf

captures the right data when the link is alone on its line, but on the last line it matches the following:

href="test.html">test1</a><a href="testtime.pdf

I only want the match from the last "href" to the ".pdf". I don't want the first "href" on the line or anything that comes between it and the second "href". Is it possible to modify the regex to match this properly?

Thanks.

You want the name of the last linked file only if it's a PDF? — Waxi, Apr 18 '17 at 13:20
Please note that parsing HTML with regexes is fraught with peril. See http://htmlparsing.com/regexes.html for examples of why. — Andy Lester, Apr 18 '17 at 13:28

score 2 · Accepted Answer · edited May 23 '17 at 11:54

Require the attribute value to start with a double quote, and don't allow the value to contain that quote:

href="[^"]*?\.pdf

Demo: https://regex101.com/r/UuRin3/1
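
As a rough illustration, here is a minimal Python sketch comparing the question's pattern with the one above on the last line of the test dataset. Python and its `re` module are an assumption here, since the question doesn't name a language. The lazy `.*?` still begins at the leftmost `href=`, while `[^"]*?` cannot cross the closing quote of the first attribute.

```python
import re

# Last line of the test dataset from the question.
line = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'

# Question's pattern: lazy, but the match still begins at the first "href=".
original = re.findall(r'href=.*?\.pdf', line)

# Pattern from this answer: the value may not contain a double quote.
fixed = re.findall(r'href="[^"]*?\.pdf', line)

print(original)  # ['href="test.html">test1</a><a href="testtime.pdf']
print(fixed)     # ['href="testtime.pdf']
```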

P.S.

Don't use Regex to parse HTML
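
For comparison, a parser-based approach might look like the sketch below. It assumes Python and its standard-library `html.parser` module; the class name `PdfLinkCollector` is hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags ending in .pdf."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.pdf_links.append(value)


# Test dataset from the question.
html = (
    '<a href="test.html">test1</a>\n'
    '<a href="test.pdf">test2</a>\n'
    '<a href="test.html">test1</a>\n'
    '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>\n'
)

collector = PdfLinkCollector()
collector.feed(html)
print(collector.pdf_links)  # ['test.pdf', 'testtime.pdf']
```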

answered Apr 18 '17 at 13:21 by Dmitry Egorov · edited by Community

This helped me out, thanks. By the way, I am not using Regex to parse HTML. I am trying to find instances of linked PDFs on a site with 9000 HTML pages. – Katori Apr 18 '17 at 13:56

score 0 · Answer 2 · answered Apr 18 '17 at 13:20

First of all, use capturing groups. They let you match the whole string but extract only part of it. For example, href="([^"]*\.pdf)" should match the href="xxxx.pdf" string but extract only the xxxx.pdf part. Using [^"]* rather than .* keeps the match from running across two links on the same line.

How you read the captured group depends on the language or tool you use to run the regex. Somehow I doubt that it is HTML itself.
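
As one possibility, here is a minimal sketch in Python (an assumption, since the question doesn't say which language or tool runs the regex). The group here uses `[^"]*`, so the match cannot run past the closing quote into a later link on the same line. With a single capturing group, `findall` returns only the captured part.

```python
import re

# Test dataset from the question.
text = (
    '<a href="test.html">test1</a>\n'
    '<a href="test.pdf">test2</a>\n'
    '<a href="test.html">test1</a>\n'
    '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>\n'
)

# The parentheses capture only the file name; href=" and the closing quote
# must still be present for a match, but are not returned.
pattern = re.compile(r'href="([^"]*\.pdf)"')

names = pattern.findall(text)
print(names)  # ['test.pdf', 'testtime.pdf']
```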
