0

Hi, I'm trying to use a regular expression to pull links out of a piece of HTML, as follows:

<p>some random text < hr ef="http://url.co.uk/link/">link text</a> some more random text.</p>

The regular expression I am using is:

preg_match_all('/(< href="http:\/\/url.co.uk\/([\d\D]*?)\/">([\d\D]*?)<\/a>)/', $content, $matches);

This works fine until part of the link has a carriage return in the middle of it due to a line wrap, as follows:

<p>some random text < href="
http://url.co.uk/link/">link text</a> some more random text.</p>

The carriage return can be anywhere within the link, and it means that the link doesn't get matched.

Can anyone suggest a way out of this, either by tightening up the regular expression or by doing something to remove the carriage return before the regular expression acts on the text?

John Parker
  • Is your HTML misspelled on purpose? It sure as heck seems so. – BoltClock Jan 23 '11 at 19:18
  • Can you preprocess the input, stripping out all the newlines? – Ass3mbler Jan 23 '11 at 19:25
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Hello71 Jan 23 '11 at 19:33
  • Hi BoltClock, yes. I was getting errors about having more than one link in the question box, so I couldn't paste the question. I'm new here, so I'm not sure how to use it properly yet! So I tried cheating by making the HTML not proper HTML so it wouldn't complain about the links. – Epiphanisation Jan 24 '11 at 00:27
  • Hi Ass3mbler, yes, preprocessing the input string is a possibility if that turns out more sensible and easier than making the regex more universal so it doesn't get upset by the carriage returns. Any suggestions on what preprocessing method to use? String replace? I just need to be careful that it doesn't upset anything else in the string, but it shouldn't, as the carriage returns in the screen-printed text should be HTML tags anyway. – Epiphanisation Jan 24 '11 at 00:29
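
On the preprocessing idea raised in the comments above, here is a minimal sketch, assuming PHP 7+ and properly spelled `<a href=...>` markup. The helper name `strip_newlines_in_tags` is hypothetical and not from the thread. It removes line breaks only between a `<` and the next `>`, so breaks in the visible text are left alone:

```php
<?php
// Hypothetical helper (not from the thread): remove CR/LF characters that
// fall inside a tag, i.e. between "<" and the next ">". Text outside tags
// is left untouched.
function strip_newlines_in_tags(string $html): string
{
    $result = preg_replace_callback(
        '/<[^>]*>/', // a negated class matches line breaks too
        function (array $m): string {
            return str_replace(["\r", "\n"], '', $m[0]);
        },
        $html
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

$content = "<p>some random text <a href=\"\nhttp://url.co.uk/link/\">link text</a> some\nmore random text.</p>";
$clean = strip_newlines_in_tags($content);

// The pattern from the question, with the tag spelled properly and dots escaped.
preg_match_all('/<a href="http:\/\/url\.co\.uk\/([\d\D]*?)\/">([\d\D]*?)<\/a>/', $clean, $matches);
print_r($matches[1]); // captured path segments
print_r($matches[2]); // captured link texts
```

One caveat with this sketch: if the wrap falls between `<a` and `href`, deleting the break glues the two together, so replacing it with a space may be the safer choice for that case.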

3 Answers

4

An HTML parser could do the job for you with no errors, and simplehtmldom is very simple to use (requires PHP 5+): http://simplehtmldom.sourceforge.net/
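
To illustrate the parser approach without an extra download, here is a rough sketch that swaps in PHP's bundled DOMDocument class instead of simplehtmldom. It assumes the DOM extension is enabled and that the links are written as proper `<a href=...>` tags. The `$links` array layout is just for illustration:

```php
<?php
$content = "<p>some random text <a href=\"\nhttp://url.co.uk/link/\">link text</a> some more random text.</p>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // Drop any whitespace (including wrapped line breaks) from the URL.
    $href = preg_replace('/\s+/', '', $a->getAttribute('href'));
    if (strpos($href, 'http://url.co.uk/') === 0) {
        $links[] = ['url' => $href, 'text' => $a->textContent];
    }
}
print_r($links);
```

Because the parser reads the attribute as a whole, it does not matter where in the tag the line break falls.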

AJJ
2

You can use \s* to eat away extraneous whitespace and line breaks. Also you should make it more strict by replacing each [\d\D]* with a negated character class:

preg_match_all('#<a[^>]+href="\s*(http://url.co.uk/[^"]+)">([^<]+)</a>#', $content, $matches);

You might want to apply more \s* before and after the equals sign. The [^>] is a common idiom for skipping over extra HTML attributes, and [^"] likewise works best for matching HTML attribute values, while [^<] matches text content that does not contain HTML tags.

Furthermore this version returns only the URL (not the complete tag) as $matches[1], and the contained text as $matches[2].
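
As a quick check of the pattern above, here is a small sketch run against a wrapped link. It assumes properly spelled `<a href=...>` markup, and the dots in the host name are escaped as a small tightening of my own. The second sample input, with the break inside the path, is made up for illustration:

```php
<?php
$pattern = '#<a[^>]+href="\s*(http://url\.co\.uk/[^"]+)">([^<]+)</a>#';

// Line break directly after the opening quote: \s* absorbs it.
$content = "<p>some random text <a href=\"\nhttp://url.co.uk/link/\">link text</a> some more random text.</p>";
$count = preg_match_all($pattern, $content, $matches);
var_dump($matches[1], $matches[2]);

// Line break inside the path: [^"] matches it, so the link is still found,
// but the captured URL then contains the line break and needs cleaning.
$wrapped = "<a href=\"http://url.co.uk/li\nnk/\">link text</a>";
$countWrapped = preg_match_all($pattern, $wrapped, $m);
$url = str_replace(["\r", "\n"], '', $m[1][0]);
var_dump($url);
```

A break inside the literal `http://url.co.uk/` part would still prevent a match, since literal characters in the pattern cannot skip over a line break.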

mario
  • Unfortunately your match only works if the carriage return is after the = sign, whereas I need it to still match wherever the carriage return happens within the string. I like the other bits, but again they don't quite work with the strings I am playing with. It has increased my knowledge, though, as I would not have thought to use negated character classes that much. Not sure what the # does at the beginning and end; I will have to look that one up. Thanks for the post. – Epiphanisation Jan 24 '11 at 00:54
0

Use the s modifier (PCRE_DOTALL) to have the . match all characters, including newlines. See this.
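
A minimal sketch of what the s modifier changes, using a made-up input where the link text wraps onto a second line:

```php
<?php
$text = "<a href=\"http://url.co.uk/link/\">link\ntext</a>";

// Without the s modifier, "." does not match "\n", so this finds nothing.
$without = preg_match('#<a href="[^"]*">(.*?)</a>#', $text, $m1);

// With the s modifier (PCRE_DOTALL), "." also matches "\n".
$with = preg_match('#<a href="[^"]*">(.*?)</a>#s', $text, $m2);

var_dump($without, $with, $m2[1]);
```

Note that the `[\d\D]` class used in the question already matches any character, line breaks included, so `.` with the s modifier behaves the same way. Neither helps when the break falls inside a literal part of the pattern.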

kelloti