I want to fetch a certain HTML node from a large HTML text, but something in my regex is wrong.

I want to fetch all urls that look like this:

<a href="ftp://mysite.com"> some stuff </a>

I am trying to do:

/<a href="ftp:(.+)">/

Sometimes it works, but sometimes it grabs everything up to a later closing >.

Is there a way to rewrite this regex so it will stop at the first >?

Unihedron
Nick Ginanto

3 Answers

Make your regex ungreedy:

/<a href="ftp:(.+?)">/
//        here __^

or:

/<a href="ftp:([^>"]+)">/

But it's better to use a parser.
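
As a rough sketch of the parser route, assuming Python and its standard-library `html.parser` (the question doesn't name a language, and `FtpLinkCollector` and `find_ftp_links` are made-up names for illustration):

```python
from html.parser import HTMLParser


class FtpLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags that start with 'ftp:'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith("ftp:"):
                self.links.append(value)


def find_ftp_links(html):
    """Return every ftp: href found in <a> tags of the given HTML text."""
    collector = FtpLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p><a href="ftp://mysite.com"> some stuff </a> <a href="http://other.example">x</a></p>'
print(find_ftp_links(sample))
```

Because the parser hands over attributes already split out, extra attributes or a different attribute order on the tag should not matter here, which is the main thing the regex versions struggle with.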

Toto

* and + are greedy (they match as much as possible). By appending ? after them, you can make them non-greedy.

/<a href="ftp:(.+?)">/

or you can exclude " using a negated character class ([^...]):

/<a href="ftp:([^"]+)">/

BTW, it's not a good idea to use regular expressions to parse HTML.
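
To see the greedy and non-greedy behaviour side by side, here is a small sketch in Python (an assumption, since the question doesn't say which language is used; the sample line with two links is made up):

```python
import re

# Hypothetical input: two ftp links on the same line.
line = '<a href="ftp://mysite.com"> some stuff </a> <a href="ftp://other.example"> more </a>'

# Greedy: .+ runs to the end of the line, then backtracks to the LAST ">.
greedy = re.findall(r'<a href="ftp:(.+)">', line)

# Non-greedy: .+? stops at the FIRST "> it can reach.
lazy = re.findall(r'<a href="ftp:(.+?)">', line)

print(greedy)
print(lazy)
```

The greedy pattern yields a single capture that spans both links, while the non-greedy one yields one capture per link.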

Community
falsetru

+ is a greedy operator, meaning it matches as much as it possibly can while still allowing the rest of the regex to match. For this, I recommend using a negated character class, [^"]+, meaning any character except ", one or more times.

/<a href="ftp:([^"]+)">/
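
A quick sketch of why the negated class is the stricter choice, again assuming Python (the inputs below are made up): since [^"] can never step over a quote, the capture cannot leak out of the href value, whereas a non-greedy .+? still can when the tag carries another attribute.

```python
import re

negated = re.compile(r'<a href="ftp:([^"]+)">')
lazy = re.compile(r'<a href="ftp:(.+?)">')

# Hypothetical inputs: a plain link, and one with an extra attribute after href.
plain = '<a href="ftp://mysite.com"> some stuff </a>'
tricky = '<a href="ftp://mysite.com" class="x"> some stuff </a>'

print(negated.findall(plain))   # capture stays inside the quotes
print(lazy.search(tricky))      # matches, but the capture runs into class="x
print(negated.search(tricky))   # no match at all, rather than a wrong one
```

On the second input the negated class simply fails to match instead of returning a polluted capture. If tags with extra attributes need to be handled, the pattern would have to be loosened, or a parser used.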

hwnd