Regular expressions to pull links from HTML

Question

Hi, I'm trying to use a regular expression to pull links out of a piece of HTML as follows:

<p>some random text < hr ef="http://url.co.uk/link/">link text</a> some more random text.</p>

The regular expression I am using is:

preg_match_all('/(< href="http:\/\/url.co.uk\/([\d\D]*?)\/">([\d\D]*?)<\/a>)/', $content, $matches);

This works fine until part of the link has a carriage return in the middle of it due to a line wrap, as follows:

<p>some random text < href="
http://url.co.uk/link/">link text</a> some more random text.</p>

The carriage return can be anywhere within the link, which means the link doesn't get matched.

Can anyone suggest a way around this, either by tightening up the regular expression or by doing something to remove the carriage return before the regular expression acts on the text?

Is your HTML misspelled on purpose? It sure as heck seems so. — BoltClock, Jan 23 '11 at 19:18
can you preprocess the input stripping out all the newlines? — Ass3mbler, Jan 23 '11 at 19:25
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Hello71, Jan 23 '11 at 19:33
Hi BoltClock, yes. I was getting errors about having more than one link in the question box, so I couldn't paste the question. I'm new here, so I'm not sure how to use it properly yet! So I tried cheating by making the HTML not proper HTML so it wouldn't complain about the links. — Epiphanisation, Jan 24 '11 at 00:27
Hi Ass3mbler, yes, preprocessing the input string is a possibility if that turns out to be more sensible and easier than making the regex more universal so it doesn't get upset by the carriage returns. Any suggestions on what preprocessing method to use? String replace? I just need to be careful that it doesn't upset anything else in the string, but it shouldn't, as the carriage returns should be HTML tags anyway for the screen-printed text. — Epiphanisation, Jan 24 '11 at 00:29
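
One way the preprocessing idea from these comments might look is sketched below. It is only an illustration, not something posted in the thread: it assumes the stray line breaks are wraps inside attribute values (a break between a tag name and an attribute would glue the two together), and it writes the tag as a well-formed `<a href=...>` rather than the deliberately mangled form in the question. The variable name `$cleaned` is made up for this example.

```php
<?php
// Sample input from the question, with a line break inside the href value.
$content = "<p>some random text <a href=\"\nhttp://url.co.uk/link/\">link text</a> some more random text.</p>";

// Remove CR/LF characters only inside tags (between < and >), so that
// line breaks in the visible text are left untouched.
$cleaned = preg_replace_callback(
    '/<[^>]*>/',
    function ($m) {
        return str_replace(array("\r", "\n"), '', $m[0]);
    },
    $content
);

// The question's original pattern (with the <a restored) now matches.
preg_match_all('/(<a href="http:\/\/url.co.uk\/([\d\D]*?)\/">([\d\D]*?)<\/a>)/', $cleaned, $matches);

// $matches[2] holds the path part of each URL, $matches[3] the link text.
print_r($matches[2]);
print_r($matches[3]);
```

Restricting the replacement to the inside of tags avoids running words together in the visible text, which a blanket `str_replace` over the whole string would do.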

score 4 · Answer 1 · answered Jan 23 '11 at 19:32

An HTML parser can do this job more reliably than a regular expression, and simplehtmldom is very simple to use (requires PHP 5+): http://simplehtmldom.sourceforge.net/

AJJ


Cool, I'll have a look at this and see if I can implement it easily. Cheers – Epiphanisation Jan 24 '11 at 00:30
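
As a rough sketch of the parser approach, the example below uses PHP's bundled `DOMDocument` class instead of simplehtmldom, purely so that it runs without installing anything; it is not the library this answer recommends. The sample input writes the tag as a well-formed `<a href=...>`, and the `$links` array layout is an assumption made for illustration.

```php
<?php
// Sample input from the question, with a line break inside the href value.
$content = "<p>some random text <a href=\"\nhttp://url.co.uk/link/\">link text</a> some more random text.</p>";

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is often imperfect.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // Strip all whitespace (including wrapped line breaks) from the URL.
    $url = preg_replace('/\s+/', '', $a->getAttribute('href'));
    // Keep only links that point at the site of interest.
    if (strpos($url, 'http://url.co.uk/') === 0) {
        $links[] = array('url' => $url, 'text' => trim($a->textContent));
    }
}

print_r($links);
```

Because the parser hands back the attribute value as a string, the line break can sit anywhere in the URL and is removed in one step.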

score 2 · Accepted Answer · answered Jan 23 '11 at 19:26

You can use \s* to eat away extraneous whitespace and line breaks. You should also make the pattern stricter by replacing each [\d\D]* with a negated character class:

preg_match_all('#<a[^>]+href="\s*(http://url.co.uk/[^"]+)">([^<]+)</a>#', $content, $matches);

You might want to add more \s* before and after the equals sign. The [^>] is a common idiom for skipping over extra HTML attributes, and [^"] likewise works best for matching attribute values, while [^<] matches text content that does not contain HTML tags.

Furthermore, this version returns only the URL (not the complete tag) in $matches[1], and the contained text in $matches[2].
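
A small runnable sketch of this pattern follows, with the extra \s* around the equals sign added and the dots in the host name escaped. Those two tweaks and the `$pattern` variable are additions for illustration; the sample input is the question's HTML with the tag written as a well-formed `<a href = ...>`.

```php
<?php
// Sample input with spaces around "=" and a line break right after the quote.
$content = "<p>some random text <a href = \"\nhttp://url.co.uk/link/\">link text</a> some more random text.</p>";

// '#' is used as the pattern delimiter, so the slashes in the URL
// do not need to be escaped.
$pattern = '#<a[^>]+href\s*=\s*"\s*(http://url\.co\.uk/[^"]+)"[^>]*>([^<]+)</a>#';

preg_match_all($pattern, $content, $matches);

// $matches[1] holds the URLs, $matches[2] the link texts.
print_r($matches[1]);
print_r($matches[2]);
```

This tolerates whitespace at the positions where \s* appears. A line break in the middle of the literal `http://url.co.uk/` part would still prevent a match, and one later in the URL would be captured as part of it.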

mario


Unfortunately, your match only works if the carriage return is after the = sign, whereas I need it to still match wherever the carriage return happens within the string. I like the other bits, but again they don't quite work with the strings I am playing with. Still, it has increased my knowledge, as I would not have thought to use negated character classes that much. I'm not sure what the # does at the beginning and end; I will have to look that one up. Thanks for the post. – Epiphanisation Jan 24 '11 at 00:54

score 0 · Answer 3 · answered Jan 23 '11 at 19:24

Use the s modifier (PCRE_DOTALL) so that . matches all characters, including newlines. See this.

kelloti


If fishboy doesn't know where the line wraps, where does he insert the dot? – André Paramés Jan 23 '11 at 19:27
This is the programming part. It's an extensive regex with a lot of special characters. It's not difficult, it's just time-consuming to produce. He has to use a lot of `\s*` char sets, etc. – kelloti Jan 23 '11 at 19:30
I don't quite follow this. As André says, where do I put the dot, given that the carriage return can happen anywhere between the tags? I hope you're not suggesting I have to put it in every other character? – Epiphanisation Jan 24 '11 at 00:33
Can you elaborate more on using the s modifier, the dot, and the use of a lot of \s* sets? – Epiphanisation Jan 24 '11 at 00:57
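
One possible reading of the s-modifier suggestion is sketched here; it is an interpretation, not something the answerer spelled out. The idea is to capture the whole href value loosely with a dot, let the s modifier carry the dot across the line break, and strip the whitespace from the captured URL afterwards. The variable names and the well-formed `<a href=...>` sample are assumptions for illustration.

```php
<?php
// Sample input with a line break in the middle of the URL itself.
$content = "<p>some random text <a href=\"http://url.co.uk/li\nnk/\">link text</a> some more random text.</p>";

// Without the s modifier, . does not match the line break, so nothing matches.
$withoutS = preg_match('#<a href="(.+?)">#', $content);

// With the s modifier (PCRE_DOTALL), . also matches newlines.
$withS = preg_match('#<a href="(.+?)">#s', $content, $m);

// Remove the stray whitespace from the captured URL afterwards.
$url = preg_replace('/\s+/', '', $m[1]);

var_dump($withoutS, $withS, $url);
```

The cleaned `$url` can then be compared against the `http://url.co.uk/` prefix with an ordinary string function, so the dot does not need to be inserted between every character of the pattern.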
