What is the best pattern for preg_match_all to extract portion of strings?

Question

Context:

• Using file_get_contents on a URL, I get a lot of markup such as <item></item>, <url></url>, etc.

• I'm using preg_match_all to extract the url, title, etc.

Example:

$jStringToSubStract = '<a>stuffA</a><b>stuffB</b><url>http...</url>';
preg_match_all("#<url>(.*?)<\/url>#sx", $jStringToSubStract, $subItems, PREG_SET_ORDER);
foreach ($subItems as $subItem) {
    if (strlen($subItem[1]) > 0) {
        echo $subItem[1]; // prints the http... found INSIDE <url></url>
    }
}

But it's slow on a large amount of data.

Is there a faster alternative to preg_match_all for extracting portions of strings?

They never ever learn: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jorge Campos, Apr 23 '17 at 21:00
@JazZ: simplehtmldom isn't so simple; it is largely based on regex (for information) and is slow. PHP has the built-in DOMDocument and DOMXPath classes, and there are also two other ways to deal with XML (SAX and XMLReader). – Casimir et Hippolyte, Apr 23 '17 at 21:10
Thanks for the information, @CasimiretHippolyte. Deleted my comment. – JazZ, Apr 23 '17 at 21:12
@mickmackusa: Since he asks for a code improvement (with already working code), the question is indeed a better fit for Code Review. But even if this question were posted to the appropriate site, a problem would remain: John R takes for granted that this piece of code is slow, and I suspect the problem lies elsewhere (the surrounding algorithm, the way the XML file is loaded, its size, ...). It's possible to speed up this code a little with `#<url>([^<]+)<\/url>#`, removing the 4th param, the loop and the if test, but the gain will be limited. Without context, this question is also too broad. – Casimir et Hippolyte, May 03 '17 at 12:53
@mickmackusa: John R also speaks about "a large amount". But from a performance point of view, this approach is already the fastest. All parsers are slower; however, `DOMDocument` can be worthwhile with a sufficient number of searches (to amortize the cost of building the DOM tree), and `XMLReader` is the fastest one, forces a lazy-evaluation code design and saves memory (but the code is difficult to write). These approaches are more rigorous, and I could answer along those lines, but once again, without context it isn't possible. – Casimir et Hippolyte, May 03 '17 at 13:22
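
The small speed-up suggested in the comments above might look roughly like the sketch below. It assumes the pattern keeps the `<url>` opening tag from the question's own pattern, and it reuses the question's sample string. Because `[^<]+` needs at least one character, empty `<url></url>` pairs never match, so the `strlen` test and the loop become unnecessary.

```php
<?php
// Sample string taken from the question.
$jStringToSubStract = '<a>stuffA</a><b>stuffB</b><url>http...</url>';

// Negated character class instead of a lazy dot; no "s"/"x" flags,
// and no 4th parameter (the default ordering is PREG_PATTERN_ORDER).
preg_match_all('#<url>([^<]+)</url>#', $jStringToSubStract, $subItems);

// With the default ordering, $subItems[1] already holds every captured value.
$urls = $subItems[1];

echo implode(PHP_EOL, $urls), PHP_EOL;
```

As the comment itself notes, the gain from this change is likely to be limited; if the script is slow, the cause may be elsewhere.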

mickmackusa · Accepted Answer · 2017-05-27T07:40:09.763

0

After seeing your posted solution, I now understand what you are trying to achieve. Since you are capturing only substrings in the format of [attrname]=[attrvalue] (which may be single quoted, double quoted, or not quoted at all), these are optimized patterns for you...

This one will get ALL attributes: `\K\S+=["']?[^>"']+["']?>??`

This one will get specific attributes: `\K(?:alt|title|src|href)=["']?[^>"']+["']?>??`

These patterns do not use capture groups. This means your code will avoid unnecessary result array bloat and access the substrings as fullstring matches. Both of these patterns will run more efficiently than the patterns you have posted.
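
As a rough illustration of the point about full-string matches, the sketch below runs the specific-attribute pattern exactly as quoted above against a made-up sample string (the `$html` value and the variable names are assumptions for illustration, not part of the original post). With no capture groups, the result array has a single element, `$matches[0]`.

```php
<?php
// Hypothetical sample input; quoted attribute values only.
$html = '<a href="att9" title="title1">DESC 9</a><img src=\'att7\'>';

// Pattern copied from the answer above; "~" is used as the delimiter.
$pattern = '~\K(?:alt|title|src|href)=["\']?[^>"\']+["\']?>??~';

preg_match_all($pattern, $html, $matches);

// Only index 0 exists because the pattern has no capture groups.
foreach ($matches[0] as $attribute) {
    echo $attribute, PHP_EOL;
}
```

Each element of `$matches[0]` is a whole `name=value` substring, so there is no second sub-array to skip over.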

I should also mention that both my patterns and your patterns are not 100% reliable because there is no check that these substrings are actually inside of html tags. This is the reason why html-parsing programs are strenuously encouraged. If you are certain that the text that you'll be reading won't have any free floating \S=\S formatted strings outside of the tags, then the results will be fine.
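
For comparison, here is a minimal sketch of the parser route, using the built-in `DOMDocument` and `DOMXPath` classes mentioned in the comments on the question. It assumes the input is well-formed XML; the `<root>` wrapper and the sample fragment are made up for illustration. Only real attributes are returned, so the `src="..."` text inside the `<p>` element is ignored.

```php
<?php
// Hypothetical fragment: two real attributes plus attribute-like free text.
$fragment = '<anytag aa="att1">DESC 1</anytag>'
          . "<anytag src='att7'>DESC 7</anytag>"
          . '<p>free text src="not-an-attribute"</p>';

$doc = new DOMDocument();
// Wrap the fragment so the document has a single root element (assumption).
$doc->loadXML('<root>' . $fragment . '</root>');

$xpath = new DOMXPath($doc);

$attributes = [];
// "//@*" selects every attribute node in the document.
foreach ($xpath->query('//@*') as $attr) {
    $attributes[] = $attr->name . '=' . $attr->value;
}

print_r($attributes);
```

To restrict the result to specific attributes, the query could be narrowed to something like `//@src | //@href | //@title | //@alt`.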

edited May 27 '17 at 07:40

answered May 25 '17 at 06:53

mickmackusa


Is there a way with a regex to get ALL the stuff inside the single quotes AND double quotes, like this: `DESC 1 DESC 2 DESC 3 DESC 4 DESC 5 DESC 5 DESC 5 DESC 5`, to give 1 array with att1, att2, att3, att4? – May 26 '17 at 02:29
@JohnR Let me ask more specifically before I offer a pattern... You want to capture all single and double quoted attribute values that exist inside of any html tag. Correct? Do you need to distinguish between the singles and doubles? or can they all be lumped together into one capture group? – mickmackusa May 26 '17 at 02:35
see the next answer for more details – May 26 '17 at 02:45
@JohnR I found another pattern on SO, and modified it slightly to make it more efficient. Here is the [demo](https://regex101.com/r/j60by9/2). I am afraid you are stretching the reasonable limits of preg_matching html. As mentioned earlier, you may need to investigate the inclusion of a third-party product. Oh, and you should delete your posted answer -- posting non-answers as answers is super-frowned-upon. In the future, use pastebin.com and drop a link into a comment. I hope this helps you. – mickmackusa May 26 '17 at 03:30
@JohnR Now that you have shown, more specifically, what you are trying to achieve, I've provided better patterns for you. If this doesn't do what you want, please let me know and I'll fix it up. – mickmackusa May 27 '17 at 07:36
this is what I was looking for. Thanks ;) – May 27 '17 at 15:20

score 0 · Answer 2 · 2017-05-26T09:41:03.080

FROM

$string='
<anytag aa="att1">DESC 1</anytag>
<item aa="att2">DESC 2</item>
<anytag bb="att3">DESC 3</anytag>
<anytag cc="att4">DESC 4</anytag>
<anytag src="att5">DESC 5</anytag>
<anytag src="att6">DESC 6</anytag>
<anytag src=\'att7\'>DESC 7</anytag>
<anytag src=\'att8\'>DESC 8</anytag>
<anytag href="att9" title="title1">DESC 9</anytag>
<anytag blabla="att10">DESC 10</anytag>
';

// this one will get ALL attributes
preg_match_all("#\S+=[\"'](?:.(?![\"'] +\S+=|[>\"']))+.[\"']#sx", $string , $subItems);
foreach ( $subItems[0] as $subItem  ) { echo $subItem.'<br>'; }

// this one will get specific attributes
$patterns = 'alt|title|src|href';
preg_match_all("#($patterns)=[>\"'](.*?)[>\"']#sx", $string , $subItems);
foreach ( $subItems[0] as $subItem  ) { echo $subItem.'<br>'; }
