
I'm creating a regex. This is my test dataset:

<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>

I'm trying to capture from "href=" to "pdf", but the following regex:

href=.*?\.pdf

will capture the right data when the link is alone on a line, but on the last line it also matches the following:

href="test.html">test1</a><a href="testtime.pdf

I only want the text from the last "href" to the ".pdf"; I don't want the first "href" on the line or anything that comes between it and the second "href". Is it possible to modify the regex to match this properly?

Thanks.

Andy Lester
Katori

2 Answers


Require the attribute value to start with a quote and not to contain that quote:

href="[^"]*?\.pdf

Demo: https://regex101.com/r/UuRin3/1
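
To see the difference outside the demo, here is a minimal Python sketch that runs both patterns against the test data from the question. Python is only an assumption here (the question does not say which regex engine is in use), and the variable names are made up for illustration.

```python
import re

# The four test lines from the question.
lines = [
    '<a href="test.html">test1</a>',
    '<a href="test.pdf">test2</a>',
    '<a href="test.html">test1</a>',
    '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>',
]

lazy = re.compile(r'href=.*?\.pdf')         # pattern from the question
bounded = re.compile(r'href="[^"]*?\.pdf')  # pattern from this answer

# Compare both patterns on the problematic last line.
last = lines[-1]
lazy_matches = lazy.findall(last)
bounded_matches = bounded.findall(last)
print(lazy_matches)     # starts at the first href and runs to .pdf
print(bounded_matches)  # stays inside a single quoted value

# Run the bounded pattern over every line.
all_matches = [m for line in lines for m in bounded.findall(line)]
print(all_matches)
```

The lazy quantifier only makes the match end as early as possible; it does not move the starting point, which is why the original pattern still begins at the first href. Excluding the quote character stops the match from leaving the attribute value.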

P.S.

Don't use Regex to parse HTML

Community
Dmitry Egorov
  • This helped me out, thanks. By the way, I am not using Regex to parse HTML. I am trying to find instances of linked PDFs on a site with 9000 HTML pages. – Katori Apr 18 '17 at 13:56

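First of all, use capturing groups: they let you match the whole string but extract only part of it. For example, href="([^"]*\.pdf)" matches the href="xxxx.pdf" string but extracts only the xxxx.pdf part. Using [^"]* instead of .* inside the group keeps the match from running across several attributes on the same line.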

How you read the group depends on what technology you use to run the regex. Somehow I doubt it is HTML itself.
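
As one possible illustration, assuming Python's re module is the technology in question, the group can be read back as follows. The pattern is written with [^"]* inside the group so that it cannot span more than one attribute; the names PDF_LINK and pdf_links are hypothetical.

```python
import re

# Group 1 holds only the attribute value, without href= or the quotes.
PDF_LINK = re.compile(r'href="([^"]*\.pdf)"')

def pdf_links(html):
    """Return the values of all href attributes that end in .pdf."""
    # With exactly one group, findall returns the group contents.
    return PDF_LINK.findall(html)

sample = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'
links = pdf_links(sample)
print(links)
```

Other engines expose the same idea under different names, for example a match object's group accessor or a numbered backreference in a replacement string.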