I am struggling with PHP a bit.

I created an array and filled a few positions with some curl return data.

I don't see how I would search each array position for <p><strong> and return every character from there up to </p>.

From a terminal I might do something like this:

grep -A 2 strong | sed -e 's/<p><strong>//' -e 's/<\/strong><br\/>//' -e 's/<br \/>//' -e 's/<\/p>//' -e 's/--//' -e 's/^[ \t]*//;s/[ \t]*$//'

but I am lost doing this in PHP.

Any advice?

Edit: I want the contents of every <p><strong> up to the closing </p>.

Edit 2: Here is the code I am trying:

$m=array();
preg_match_all('/<p><strong>(.*?)<\/p>/',$buffer,$m);
$sizeM = count($m);

for ( $counter2 = 0; $counter2 <= $sizeM; $counter2++)
{
    $displayString.= $m[$counter2];
}

This gives me ArrayArrayArray... as my $displayString.

Edit 3: I am doing this:

$curl_handle=curl_init();
curl_setopt($curl_handle,CURLOPT_URL, $url);
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.04 (lucid) Firefox/3.6.15");
curl_setopt($curl_handle, CURLOPT_HEADER, 0);
curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1);

$buffer = curl_exec($curl_handle);

curl_close($curl_handle);

$m=array();
preg_match_all('/<p>.*?<strong>(.*?)<\/p>/i',$buffer,$m);

foreach($m[1] as $mnum=>$match) {
    $displayString.='Match '.$mnum.' is: '.$match."\n";
}
Todd
  • Please clarify your question. You want the contents of every `<p>` element which starts with a `<strong>` element? – vbence Mar 24 '11 at 16:53

5 Answers

In PHP, as in many other languages, it's preferable not to use string functions or regular expressions to match HTML: HTML is not a regular language, and that approach gets buggy quickly.

What you should be looking at is a DOM parser, which lets you iterate through the HTML as an object, in the same way JavaScript accesses the DOM.

You should look at the following native PHP library to get you started: http://php.net/manual/en/class.domdocument.php

You can use it like so:

$xml = new DOMDocument();

// Suppress parser warnings caused by malformed real-world HTML
libxml_use_internal_errors(true);

// Load the URL's contents into the DOM
$xml->loadHTMLFile($url);

// Loop through each <a> tag in the DOM and print its href attribute
foreach($xml->getElementsByTagName('a') as $link)
{
    echo $link->getAttribute('href') . "\n";
}

This would find all the links in the document.

Also, please see a post I created and the great answer from Gordon: How do you parse and process HTML/XML in PHP?
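
Applied to the question itself, the same DOM approach might look like the sketch below. It is only an illustration: the sample markup in `$html` is an assumption, guessed from the sed expression in the question, and it stands in for the real curl response. The XPath expression selects every `<p>` whose first child element is a `<strong>`, then collects the non-empty text pieces inside it.

```php
<?php
// Hypothetical sample markup (an assumption), standing in for the curl response.
$html = <<<'HTML'
<html><body>
<p><strong>San Jose College Park</strong><br/>2.33 miles<br />Out Of Stock</p>
<p>No strong element here</p>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$results = array();

// Select <p> elements whose first child element is <strong>.
foreach ($xpath->query('//p[*[1][self::strong]]') as $p) {
    $parts = array();
    // Gather every non-empty text node inside this <p>.
    foreach ($xpath->query('.//text()', $p) as $text) {
        $piece = trim($text->nodeValue);
        if ($piece !== '') {
            $parts[] = $piece;
        }
    }
    $results[] = $parts;
}

print_r($results);
```

Each entry of `$results` is then an array of text pieces for one location, with the tags already stripped, so the sed-style cleanup is not needed.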

RobertPitt

preg_match_all()

$m=array();
$displayString=''; // initialise before appending
preg_match_all('/<p>\s*<strong>([\s\S]*?)<\/p>/i',$string,$m);
foreach($m[1] as $mnum=>$match){
    $displayString.='Match '.$mnum.' is: '.$match."\n";
}

$m now contains all matches. $m[0] holds the full matches, $m[1] holds the parenthesized (captured) matches.
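
To make the shape of `$m` concrete, here is a small sketch using a made-up input string (the contents of `$string` are an assumption for illustration only). It also shows why appending `$m[0]` or `$m[1]` directly to a string produces "Array": each of them is itself an array of strings.

```php
<?php
// Hypothetical input (an assumption), standing in for the fetched page.
$string = '<p><strong>First</strong> one</p><p><strong>Second</strong> two</p>';

$m = array();
$count = preg_match_all('/<p>\s*<strong>([\s\S]*?)<\/p>/i', $string, $m);

// $m[0] lists the full matches, $m[1] lists what group 1 captured.
// Both are arrays, so loop over $m[1] rather than over $m itself.
$displayString = '';
foreach ($m[1] as $mnum => $match) {
    // strip_tags() drops the leftover </strong> from the captured text.
    $displayString .= 'Match ' . $mnum . ' is: ' . strip_tags($match) . "\n";
}

echo $displayString;
```

Note that `count($m)` is the number of groups plus one, not the number of matches; the return value of `preg_match_all()` gives the match count.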

Shad
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – RobertPitt Mar 24 '11 at 16:53
  • preg_match_all is a perfect example of using regular expressions to solve this problem. RobertPitt, your link does not address the author's question as directly as this posted solution. – Ryre Mar 24 '11 at 16:55
  • @Toast, yes it does. He is using curl to fetch contents and then using things like explode to get them into an array; if he replaced his current code with a DOM object, that would solve his issue. Do you mind clicking the link again and reading it in full, as well as the comments? – RobertPitt Mar 24 '11 at 16:58
  • A rant on why regular expressions are bad, without providing any solution, is NOT a solution to this question. The second answer down supports regular expressions, and is the first answer that addresses this question. Please take your personal vendetta elsewhere, and let us help him. – Ryre Mar 24 '11 at 16:59
  • You're not helping him whatsoever, and I'm looking out for all programmers who come across this thread. This is the wrong way to parse HTML. There is no rant, there is no vendetta, just someone who fails to learn from others! – RobertPitt Mar 24 '11 at 17:03
  • @RobertPitt - the OP has already stated that he would ordinarily use a regular expression to solve this problem if using the command-line. It is reasonable, therefore, for answers to be given that show him how to do a similar thing inside PHP; that is what he's asking for. No, regex is not ideal for parsing HTML, but the OP's regex skills appear to be good enough to do a reasonable job, and you don't know the circumstance; he may have a very specific HTML snippet to parse with known properties, in which case regex is actually a reasonable tool. – Spudley Mar 24 '11 at 17:04
  • @chad = so I could access $m in a for loop for each position and pull out the data, massage it and display it. I just get ArrayArrayArray. see my edit above if you have time. I am misunderstanding something – Todd Mar 24 '11 at 17:32
  • @Todd `foreach($m[1] as $mnum=>$match){ $displayString.='Match '.$mnum.' is: '.$match."\n";}` – Shad Mar 24 '11 at 17:43
  • @todd each result is going to return an array, so you will need a nested loop to get the data from each of the arrays in the $m array – superultranova Mar 24 '11 at 17:43
  • @shad, using the above my displayString is empty. `$string` contains the results of my curl according to firebug. I don't understand obviously – Todd Mar 24 '11 at 19:13
  • It worked for me, throw a `var_dump` on your `$m` to see its structure – Shad Mar 24 '11 at 19:56
  • @shad - I get `array(2) { [0]=> array(0) { } [1]=> array(0) { } }` – Todd Mar 24 '11 at 19:59
  • @Todd that means it didn't find _any_ matches. `var_dump` your `$string` and make sure there are matches to be found. – Shad Mar 24 '11 at 20:15
  • @shad i get `string(13862) "` and if I look at the variable I am returning from curl it does have an html doc with the data I expect in it.. – Todd Mar 24 '11 at 20:21
  • @Todd well, then the empty `$m` arrays mean it can't find a match for the given pattern in the given string... Are you sure you are calling `preg_match_all()` ON the var that caught the CURL return? (Between the example and your initial question you are using `$string` and `$buffer`.) – Shad Mar 24 '11 at 20:46
  • @shad I am doing `$buffer = curl_exec($curl_handle);` and `$m=array(); preg_match_all('/<p><strong>(.*?)<\/p>/',$buffer,$m);` – Todd Mar 24 '11 at 20:53
  • @shad if I look in FireBug it shows data like: `San Jose College Park 2.33 miles Out Of Stock` – Todd Mar 24 '11 at 20:55
  • @shad the data is a whole HTML page being returned as a result of my curl call. the `

    ` are wrapped in `` with the start of the URL the same for each location..

    – Todd Mar 24 '11 at 20:56
  • @Todd Then there are no matches... I guess I would add the `i` flag to the pattern (for case-insensitivity), and double check the encoding of the response is the same as the encoding of the script (ISO, UTF-8?) `preg_match_all('/<p><strong>(.*?)<\/p>/i',$buffer,$m);` – Shad Mar 24 '11 at 20:58
  • @Todd If you look at the actual source, not Firebug, are there any spaces/characters between the opening `p` tag and the opening `strong` tag? That would require a different regexp. See my updated answer. – Shad Mar 24 '11 at 21:01
  • @shad if I paste my url into a browser and view the source, each entry I want is made up like this: `San Jose College Park 2.33 miles Out Of Stock` – Todd Mar 24 '11 at 21:08
  • @Todd From the example code you just gave, I am able to successfully extract the code between the tags. I think you have a typo somewhere, or perhaps are fetching differently encoded text into your script... – Shad Mar 24 '11 at 21:12
  • @Shad - can you look at Edit 3? – Todd Mar 24 '11 at 21:21
  • shad - what is the `Array=array()`? – Todd Mar 24 '11 at 22:16
  • @Todd forgot to escape the `$` in heredoc, sorry. Fixed now. – Shad Mar 24 '11 at 22:20
  • @shad - OK, I think I've got it now... testing! – Todd Mar 24 '11 at 22:20
  • @shad, yeah I can manage from this, very helpful advice today. I will give you a shout-out on the example I am posting on Hacker News later today. – Todd Mar 24 '11 at 22:24
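
One possible explanation for the empty `$m` arrays discussed in the comments, not confirmed anywhere in the thread, is that each entry spans several lines in the page source. By default `.` does not match a newline in PCRE, so `(.*?)` cannot reach a `</p>` that sits on a later line. The sketch below illustrates this with made-up markup (the contents of `$buffer` are an assumption, guessed from the sed expression in the question); the `s` modifier, or `[\s\S]` as in the answer above, lets the match cross line breaks.

```php
<?php
// Hypothetical multi-line markup (an assumption), standing in for the curl response.
$buffer = "<p><strong>San Jose College Park</strong><br/>\n2.33 miles<br />\nOut Of Stock</p>";

// Without the s modifier, "." stops at a newline, so nothing matches.
$plain = preg_match_all('/<p><strong>(.*?)<\/p>/i', $buffer, $m1);

// With the s modifier (PCRE_DOTALL), "." also matches newlines.
$dotall = preg_match_all('/<p>\s*<strong>(.*?)<\/p>/is', $buffer, $m2);

var_dump($plain, $dotall);
```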