Searching a PHP String

Question

I am struggling with PHP a bit.

I created an array and filled a few positions with some curl return data.

I don't see how I would search each array position for `<p><strong>` and return every character from that to `</p>`.

From a terminal I might do something like this:

grep -A 2 strong | sed -e 's/<p><strong>//' -e 's/<\/strong><br\/>//' -e 's/<br \/>//' -e 's/<\/p>//' -e 's/--//' -e 's/^[ \t]*//;s/[ \t]*$//'

but I am lost doing this in PHP.

Any advice?

Edit: I want the contents of every `<p><strong>` to the `</p>`.

Edit 2: Here is the code I am trying:

$m=array();
preg_match_all('/<p><strong>(.*?)<\/p>/',$buffer,$m);
$sizeM = count($m);

for ( $counter2 = 0; $counter2 <= $sizeM; $counter2++)
{
    $displayString.= $m[$counter2];
}

And I am getting `ArrayArrayArray...` as my `$displayString`.

Edit 3: I am doing this:

$curl_handle=curl_init();
curl_setopt($curl_handle,CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.04 (lucid) Firefox/3.6.15");
curl_setopt($curl_handle, CURLOPT_HEADER, 0);
curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1);

$buffer = curl_exec($curl_handle);

curl_close($curl_handle);

$m=array();
preg_match_all('/<p>.*?<strong>(.*?)<\/p>/i',$buffer,$m);

foreach($m[1] as $mnum=>$match) {
    $displayString.='Match '.$mnum.' is: '.$match."\n";
}

Please clarify your question. You want the contents of every `<p>` element which starts with a `<strong>` element? — vbence, Mar 24 '11 at 16:53

score 2 · Answer 1 · edited May 23 '17 at 12:19

In PHP, as in many other languages, it is better not to use string functions or regular expressions to match HTML: HTML is not a regular language, so this kind of matching gets buggy quickly.

What you should look at instead is a DOM parser, which lets you iterate through the HTML as objects, much as JavaScript accesses the DOM.

The native PHP DOMDocument class is a good place to start: http://php.net/manual/en/class.domdocument.php

You can use it like so:

$xml = new DOMDocument();

// Load the URL's contents into the DOM
$xml->loadHTMLFile($url);

// Loop through each <a> tag in the DOM and print its href attribute
foreach ($xml->getElementsByTagName('a') as $link)
{
    // href is an attribute, not a property, so read it with getAttribute()
    echo $link->getAttribute('href') . "\n";
}

This finds all the links in the document.

Also, please see a post I created and the great answer from Gordon: How do you parse and process HTML/XML in PHP?

can you give an example of `it can get real buggy` in relation to regexp? — Shad, Mar 24 '11 at 17:59

Shad · Answer 2 · 2011-03-25T03:24:36.727

2

preg_match_all()

$m=array();
$displayString=''; // initialise before appending to avoid an undefined-variable notice
preg_match_all('/<p>\s*<strong>([\s\S]*?)<\/p>/i',$string,$m);
foreach($m[1] as $mnum=>$match){
    $displayString.='Match '.$mnum.' is: '.$match."\n";
}

$m now contains all matches. $m[0] holds the entire matches, $m[1] holds the parenthetical (captured) matches.
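To make the structure of `$m` concrete, here is a minimal sketch. The sample markup is an assumption modelled on the snippets quoted in this thread, and the `<br>` cleanup step is only one possible way to mimic the sed pipeline from the question.

```php
<?php
// Sample input; the markup is an assumption, not the asker's real page.
$string = "<p>\n<strong>San Jose College Park</strong><br/>2.33 miles<br />Out Of Stock</p>"
        . "<p><strong>Palo Alto</strong><br/>9.10 miles<br />In Stock</p>";

$m = array();
$count = preg_match_all('/<p>\s*<strong>([\s\S]*?)<\/p>/i', $string, $m);

// With the default ordering, $m[0] is an array of full matches and
// $m[1] is an array of the first captured group. Appending $m[0] or $m[1]
// directly to a string is what produces "Array".
$cleaned = array();
foreach ($m[1] as $match) {
    // Turn <br> variants into separators, drop remaining tags, trim whitespace.
    $text = preg_replace('/<br\s*\/?>/i', ' | ', $match);
    $cleaned[] = trim(strip_tags($text));
}

print_r($cleaned);
```

Looping over `$m` itself, as in the question's Edit 2, visits the two inner arrays rather than the matched strings, which would explain the `ArrayArrayArray` output.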

edited Mar 25 '11 at 03:24

answered Mar 24 '11 at 16:51

Shad


http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – RobertPitt Mar 24 '11 at 16:53
preg_match_all is a perfect example of using regular expressions to solve this problem. RobertPitt, your link does not address the author's question as directly as this posted solution. – Ryre Mar 24 '11 at 16:55
@Toast, yes it does, he is using curl to fetch contents and then using things like explode to get them into an array, if he replaced his current code with a DOM object then that would solve his issue, do you mind clicking the link again and reading it in full, as well as the comments. – RobertPitt Mar 24 '11 at 16:58
A rant on why regular expressions are bad, without providing any solution, is NOT a solution to this question. The second answer down supports regular expressions, and is the first answer that addresses this question. Please take your personal vendetta elsewhere, and let us help him. – Ryre Mar 24 '11 at 16:59
You're not helping him whatsoever, and I'm looking out for all programmers who come across this thread, this is the wrong way to parse HTML, there is no rant, there is no vendetta, just someone who fails to learn from others! – RobertPitt Mar 24 '11 at 17:03

@RobertPitt - the OP has already stated that he would ordinarily use a regular expression to solve this problem if using the command-line. It is reasonable, therefore, for answers to be given that show him how to do a similar thing inside PHP; that is what he's asking for. No, regex is not ideal for parsing HTML, but the OP's regex skills appear to be good enough to do a reasonable job, and you don't know the circumstance; he may have a very specific HTML snippet to parse with known properties, in which case regex is actually a reasonable tool. – Spudley Mar 24 '11 at 17:04
@chad = so I could access $m in a for loop for each position and pull out the data, massage it and display it. I just get ArrayArrayArray. see my edit above if you have time. I am misunderstanding something – Todd Mar 24 '11 at 17:32
@Todd `foreach($m[1] as $mnum=>$match){ $displayString.='Match '.$mnum.' is: '.$match."\n";}` – Shad Mar 24 '11 at 17:43
@todd each result is going to return an array, so you will need a nested loop to get the data from each of the arrays in the $m array – superultranova Mar 24 '11 at 17:43
@shad, using the above my displayString is empty. `$string` contains the results of my curl according to firebug. I don't understand obviously – Todd Mar 24 '11 at 19:13
It worked for me, throw a `var_dump` on your `$m` to see its structure – Shad Mar 24 '11 at 19:56
@shad 0 i get `array(2) { [0]=> array(0) { } [1]=> array(0) { } }` – Todd Mar 24 '11 at 19:59
@Todd that means it didn't find _any_ matches. `var_dump` your `$string` and make sure there are matches to be found. – Shad Mar 24 '11 at 20:15
@shad i get `string(13862) "` and if I look at the variable I am returning from curl it does have an html doc with the data I expect in it.. – Todd Mar 24 '11 at 20:21
@Todd well, then the empty `$m` arrays mean it can't find a match for the given pattern in the given string... Are you sure you are calling `preg_match_all()` ON the var that caught the CURL return? (Between the example and your initial question you are using `$string` and `$buffer`.) – Shad Mar 24 '11 at 20:46
@shad I am doing `$buffer = curl_exec($curl_handle);` and `$m=array(); preg_match_all('/<p><strong>(.*?)<\/p>/',$buffer,$m);` – Todd Mar 24 '11 at 20:53
@shad if I look in FireBug it shows data like: `
San Jose College Park
2.33 miles
Out Of Stock
` – Todd Mar 24 '11 at 20:55
@shad the data is a whole HTML page being returned as a result of my curl call. the `
` are wrapped in `` with the start of the URL the same for each location..
– Todd Mar 24 '11 at 20:56
@Todd Then there are no matches... I guess I would add the `i` flag to the pattern (for case-insensitivity), and double check the encoding of the response is the same as the encoding of the script (ISO, UTF-8?) `preg_match_all('/<p><strong>(.*?)<\/p>/i',$buffer,$m);` – Shad Mar 24 '11 at 20:58
@Todd If you look at the actual source, not Firebug, are there any spaces/characters between the opening `p` tag and the opening `strong` tag? That would require a different regexp. See my updated answer – Shad Mar 24 '11 at 21:01
@shad if I paste into a browser my url and view the source each entry I want is made up like this: `
San Jose College Park
2.33 miles
Out Of Stock
` – Todd Mar 24 '11 at 21:08

@Todd From the example code you just gave, I am able to successfully extract the code between the tags. I think you have a typo somewhere, or perhaps are fetching differently encoded text into your script... – Shad Mar 24 '11 at 21:12

@Shad - can you look at Edit 3? – Todd Mar 24 '11 at 21:21

shad - what is the `Array=array()`? – Todd Mar 24 '11 at 22:16

@Todd forgot to escape the `$` in heredoc, sorry. Fixed now. – Shad Mar 24 '11 at 22:20

@shad - OK, I think I've got it now..testing! – Todd Mar 24 '11 at 22:20

@shad, yeah I can manage from this, very helpful advice today. I will give you a shout-out on the example I am posting on Hacker News later today. – Todd Mar 24 '11 at 22:24

Matt Wonlaw · Answer 3 · 2011-03-24T17:29:15.887

As has been pointed out in other posts, if you are trying to process HTML you shouldn't use regular expressions.

To handle finding `<p><strong>` you could use DOMDocument:

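$doc = new DOMDocument();
$doc->loadHTML($html);
$pTags = $doc->getElementsByTagName('p');
$data = array();
foreach ($pTags as $pTag) {
  // Only keep paragraphs whose first child is a <strong> element
  $first = $pTag->firstChild;
  if ($first !== null && $first->nodeName === 'strong') {
    $data[] = $first->nodeValue;
  }
}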

Or use XPath:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$matchingNodes = $xpath->query('//p/strong');

or you may even be able to use expat.

These methods are much clearer, proven, flexible and more failsafe than using regular expressions.
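As a sketch of how the XPath results might be consumed: the sample markup below is an assumption, and `libxml_use_internal_errors()` is used here so that sloppy real-world HTML does not flood the output with parse warnings.

```php
<?php
// Sample markup is an assumption; real pages will differ.
$html = '<html><body>'
      . '<p><strong>San Jose College Park</strong><br/>2.33 miles<br/>Out Of Stock</p>'
      . '<p>No strong tag here</p>'
      . '</body></html>';

// Collect libxml parse warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Text of every <strong> that is a direct child of a <p>.
$names = array();
foreach ($xpath->query('//p/strong') as $node) {
    $names[] = trim($node->textContent);
}

// Whole paragraphs that contain a <strong> child.
$paragraphs = array();
foreach ($xpath->query('//p[strong]') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($names);
print_r($paragraphs);
```

Note that `textContent` drops the `<br/>` elements without inserting whitespace, so the paragraph text may need further splitting if the individual lines matter.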

My personal favorite for pulling data out of XML-style docs is XPath. Here is a good set of XPath examples: http://msdn.microsoft.com/en-us/library/ms256086.aspx

Edit: *Note: if you are trying to process very large XML/HTML documents you will not want to use DOMDocument or XPath as they can be slow for large documents. For these cases, go with an event driven XML parser. We have had cases at work where parsing a large XML file with XPath took a few minutes and parsing the same file with an event driven parser took just a few seconds.
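For illustration, an event-driven parse with PHP's expat-based XML Parser extension might look roughly like this. It is a sketch that assumes well-formed XML input; the `<stores>`, `<store>` and `<name>` elements are made up for the example.

```php
<?php
// Assumption: the input is well-formed XML; the element names are hypothetical.
$xml = '<stores><store><name>San Jose College Park</name></store>'
     . '<store><name>Palo Alto</name></store></stores>';

$names = array();
$insideName = false;

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attributes) use (&$names, &$insideName) {
        if ($name === 'name') {
            $insideName = true;
            $names[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$insideName) {
        if ($name === 'name') {
            $insideName = false;
        }
    }
);
xml_set_character_data_handler(
    $parser,
    // Text may arrive in several chunks, so append rather than assign.
    function ($parser, $data) use (&$names, &$insideName) {
        if ($insideName) {
            $names[count($names) - 1] .= $data;
        }
    }
);

$ok = xml_parse($parser, $xml, true);
print_r($names);
```

For a genuinely large file one would typically feed `xml_parse()` in chunks read from a stream, passing `true` as the last argument only for the final chunk, so the whole document never has to sit in memory.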

ok so changing to `foreach` triggers a bunch of HTML parse errors — Todd, Mar 24 '11 at 19:48
maybe it is not `loadHTML`? Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag header invalid in Entity, line: 83 in /Library/WebServer/Documents/ipadcheck.php on line 110 — Todd, Mar 24 '11 at 19:49
@Todd I'll compile and run this stuff when I get home. Maybe my php is rusty... Anyway, loadHTML expects a well formed document... so if the document isn't valid html you'll get errors. — Matt Wonlaw, Mar 24 '11 at 20:13

score 0 · Answer 4 · answered Mar 24 '11 at 16:46


Regular expressions will be your friend here. strpos, substr, and explode are useful php functions.
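A minimal sketch of the `strpos`/`substr` route, assuming the delimiters are fixed literal strings. `extractBetween` is a hypothetical helper written for this example, not a PHP built-in.

```php
<?php
// Hypothetical helper: returns the text between each $start marker
// and the next $end marker that follows it.
function extractBetween(string $haystack, string $start, string $end): array
{
    $results = array();
    $offset = 0;
    while (($from = strpos($haystack, $start, $offset)) !== false) {
        $from += strlen($start);
        $to = strpos($haystack, $end, $from);
        if ($to === false) {
            break; // unterminated block; stop searching
        }
        $results[] = substr($haystack, $from, $to - $from);
        $offset = $to + strlen($end);
    }
    return $results;
}

$buffer = 'x<p><strong>hello1!</strong></p>y<p><strong>hello2!</strong> test</p>z';
$found = extractBetween($buffer, '<p><strong>', '</p>');
print_r($found);
```

This only works when the markers appear exactly as typed (same case, no extra whitespace or attributes), which is the same fragility the other answers warn about.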

answered Mar 24 '11 at 16:46

Ryre



This is totally incorrect! - Do not use Regular Expressions, who told you HTML was *regular* ? – RobertPitt Mar 24 '11 at 16:50
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – RobertPitt Mar 24 '11 at 16:53

As an answer to the given question, regular expressions are a good fit. – Ryre Mar 24 '11 at 16:53
Toast, regular expressions are the worst possible choice any programmer could make when parsing html – RobertPitt Mar 24 '11 at 16:55

score 0 · Answer 5 · answered Mar 24 '11 at 16:58


Well, if the positions aren't relevant for the result you're expecting, you could try merging the array into a single string, and perform a regex in there...

Here's the code

<?php

$data = array(
    'DONT MATCH THISDONT MATCH THIS<p><strong>hello1!</strong></p>DONT MATCH THISDONT MATCH THISDONT MATCH THIS',
    'DONT MATCH THISDONT MATCH THIS<p><strong>hello2!</strong></p>DONT MATCH THISDONT MATCH THISDONT MATCH THIS',
    'DONT MATCH THISDONT MATCH THIS<p><strong>hello3!</strong></p>DONT MATCH THISDONT MATCH THISDONT MATCH THIS',
    '<p><strong>hello4!</strong></p>DONT MATCH THISDONT MATCH THIS<p><strong>hello5!</strong> test test</p>DONT MATCH THISDONT MATCH THISDONT MATCH THIS',
    'DONT MATCH THISDONT MATCH THIS<p><strong>hello6!</strong></p>DONT MATCH THISDONT MATCH THISDONT MATCH THIS',
);

// Join the array into one string; implode() takes the separator first (required in PHP 8)
preg_match_all('/<p><strong>.*?<\/p>/',implode('',$data),$results);

print_r($results);


?>

Let me know if this works for you. Cheers!

answered Mar 24 '11 at 16:58

fsodano


This of course, expects the HTML you're parsing to be well formed, otherwise your results might not be accurate. Cheers! – fsodano Mar 24 '11 at 17:00
Perhaps you could try to be helpful instead of criticising everyone else @RobertPitt. Go troll somewhere else. – fsodano Mar 24 '11 at 17:01
This answer solves the problem for the user, but it is also spreading a disease, now just because something can be solved does not mean it's the correct way to do it, this is called bad programming ! – RobertPitt Mar 24 '11 at 17:04
I think this is far from a "spreading disease" and that this is a correct, agile solution, far from "bad programming". But at least you followed my comment and tried to be helpful. Cheers. – fsodano Mar 24 '11 at 17:09
