I am having trouble getting a detailed preg_match_all pattern to work. I keep getting an empty array.

Here is my code:

  <?php
  $remote_search = file_get_contents('http://wiki.seg.org/index.php?title=Special%3ASearch&search=drilling&button=');
  preg_match_all('%<li><div class=\'mw-search-result-heading\'><a href="(.*)" title="(.*)">(.*)</a>  </div> <div class=\'searchresult\'>(.*)</div>
  <div class=\'mw-search-result-data\'>(.*)</div></li>%si', $remote_search, $links);
  echo '<ul class=\'mw-search-results\'>';
  for($i = 0; $i < count($links[1]); $i++) {
  echo '<li><div class=\'mw-search-result-heading\'><a href="' . $links[5][$i] . '" title="' . $links[4][$i] . '">' . $links[3][$i] . '<\/a>  </div> <div class=\'searchresult\'>' . $links[2][$i] . '<\/div><div class=\'mw-search-result-data\'>' . $links[1][$i] . '<\/div><\/li>';
  }
  echo '</ul>';
  ?>

I am trying to grab the link details from markup like the following:

<li><div class='mw-search-result-heading'><a href="/index.php/Dictionary:Cable_drilling" title="Dictionary:Cable drilling">Dictionary:Cable drilling</a> </div> <div class='searchresult'>{{lowercase}}{{#category_index:C|cable <span class='searchmatch'>drilling</span>}} </div> <div class='mw-search-result-data'>132 B (22 words) - 19:58, 20 December 2011</div></li>

When I perform a var_dump($links); I get Array as the result.

The code below does grab the contents of the section I am trying to pull the variables from.

  <?php
  $remote_search = file_get_contents('http://wiki.seg.org/index.php?title=Special%3ASearch&search=drilling&button=');
  preg_match_all('%<ul class=\'mw-search-results\'>(.*)</ul>%si', $remote_search, $links);
  $bar = $links[0];
  echo '<ul class=\'mw-search-results\'>';
  echo $bar;
  echo '</ul>';
  var_dump($links);
  ?>

The echo $bar; prints only Array and no other output.

The var_dump($links); in this snippet outputs the content of the ul.

Does anyone see the error in my top snippet that is preventing me from parsing the code the way I am intending it?

Himanshu
jwestyp

2 Answers


Don't try to parse HTML with regular expressions. Use DOMDocument instead. In your case, to get the links from the page you can do something like:

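// URL taken from the question
$url = 'http://wiki.seg.org/index.php?title=Special%3ASearch&search=drilling&button=';

// Real-world HTML is rarely well-formed; collect parser warnings instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTMLFile($url); // load() expects XML; loadHTMLFile() parses HTML

// Collect the href attribute of every <a> element on the page
$elements = $dom->getElementsByTagName('a');
$links = array();
foreach ($elements as $element) {
    $links[] = $element->getAttribute('href');
}

var_dump($links);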
Community
Leri
  • Thank you. The page has so many more links that this will not target the links that I need. I like the concept, though. Is it possible to target the div as shown in the snippet, and get the full div? – jwestyp Oct 11 '12 at 07:57
  • @jwestyp I've provided a link to the manual, which has full information and samples. As for your question: yes, it's possible. `$dom->getElementsByTagName('div');` will give you a [`DOMNodeList`](http://www.php.net/manual/en/class.domnodelist.php) containing all the `div`s, and you can loop through it and do whatever you want with them. – Leri Oct 11 '12 at 10:30
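To target only the search-result blocks rather than every link or every `div`, one option is DOMXPath. The following is a minimal sketch, not the answerer's code: it parses an inline sample copied from the question instead of fetching the page, and the result keys (`href`, `title`, `snippet`, `data`) are names chosen here for illustration.

```php
<?php
// Sample markup copied from the question; a real page would be loaded with loadHTMLFile().
$html = <<<'HTML'
<ul class='mw-search-results'>
<li><div class='mw-search-result-heading'><a href="/index.php/Dictionary:Cable_drilling" title="Dictionary:Cable drilling">Dictionary:Cable drilling</a> </div> <div class='searchresult'>{{lowercase}}{{#category_index:C|cable <span class='searchmatch'>drilling</span>}} </div> <div class='mw-search-result-data'>132 B (22 words) - 19:58, 20 December 2011</div></li>
</ul>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = array();

// Visit each <li> of the result list and read its three parts relative to that <li>.
foreach ($xpath->query("//ul[@class='mw-search-results']/li") as $li) {
    $a       = $xpath->query(".//div[@class='mw-search-result-heading']/a", $li)->item(0);
    $snippet = $xpath->query(".//div[@class='searchresult']", $li)->item(0);
    $data    = $xpath->query(".//div[@class='mw-search-result-data']", $li)->item(0);
    if ($a === null) {
        continue; // not a search result entry
    }
    $results[] = array(
        'href'    => $a->getAttribute('href'),
        'title'   => $a->getAttribute('title'),
        'snippet' => $snippet ? trim($snippet->textContent) : '',
        'data'    => $data ? trim($data->textContent) : '',
    );
}

print_r($results);
```

Because each query is scoped to one `li`, the other links on the page are ignored, and the nested `span` inside the snippet needs no special handling.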

Try:

preg_match_all('@<li><div\s*class=\'mw-search-result-heading\'><a\s*href=.([^"]*).\s*title=.([^"]*).>([^<]*)<\/a>\s*<\/div>\s*<div\s*class=\'searchresult\'>(.*?)<\/div>\s*<div\s*class=.mw-search-result-data.>([^<]*)<\/div><\/li>@sim', $remote_search, $links);
print_r($links);

The logic error in your code was the way you were matching `<div class=\'searchresult\'>(.*)</div>` against `<div class='searchresult'>{{lowercase}}{{#category_index:C|cable <span class='searchmatch'>drilling</span>}}</div>`. This doesn't work well with regular expressions because there is a nested tag (the span), and a greedy `.*` runs on to the last `</div>` it can find. So I changed your matching logic to non-greedy: `.*?`.

Also notice how I changed the modifiers for the regular expression to `sim`: `s` lets `.` match newlines, `i` makes the match case-insensitive, and `m` makes `^` and `$` match at line breaks. I always use these three modifiers whenever I toss a regular expression against HTML. I use them so often that I arranged the modifier letters into a word, "sim", as a memory aid.
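The difference between greedy and non-greedy matching can be seen on a small input. This is an illustrative sketch using two shortened, made-up result entries (the second one, "rotary", is not from the real page):

```php
<?php
// Two simplified entries on separate lines, each with a nested <span>.
$html = "<li><div class='searchresult'>cable <span class='searchmatch'>drilling</span></div></li>\n"
      . "<li><div class='searchresult'>rotary <span class='searchmatch'>drilling</span></div></li>";

// Greedy: with the s modifier, .* crosses the newline and stops at the LAST </div>.
$greedyCount = preg_match_all("%<div class='searchresult'>(.*)</div>%s", $html, $greedy);

// Non-greedy: .*? stops at the FIRST </div> after each opening tag.
$lazyCount = preg_match_all("%<div class='searchresult'>(.*?)</div>%s", $html, $lazy);

echo $greedyCount, "\n"; // 1 (both entries swallowed into a single match)
echo $lazyCount, "\n";   // 2 (one match per entry)
echo $lazy[1][0], "\n";  // cable <span class='searchmatch'>drilling</span>
```

The non-greedy version works here only because the snippet contains no nested `div`; if it did, the match would end at the inner `</div>`.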

Happy coding!

Ultimater
  • This worked great. I am displaying the first three links with `echo $segLinks[0][0];` `echo $segLinks[0][1];` `echo $segLinks[0][2];` Is there a way to force a base href into this entry? The links pulled from the page are relative, so I need to define an absolute base for just this section. – jwestyp Oct 11 '12 at 07:57
  • The regular expression only matches what is contained within the HTML. If you want to link to their server instead of your own, then just toss a `<base href>` tag pointing at their server into your page's `<head>`. – Ultimater Oct 11 '12 at 10:22
  • In my original code above, I was trying to pull each item as a separate variable. I could then insert the items individually. Is there a way to do something similar? I can't set the base in the head because it will set all of the embedded links on my page to the other site. And in case you were wondering, we do have permission for this aggregation. – jwestyp Oct 16 '12 at 02:34
  • Ultimater, I put the code in before and after my code, but it still is not affecting the URLs. They stay relative to their original location, so the current site's hostname gets prepended. – jwestyp Oct 16 '12 at 23:21
  • I got something to work. Here is my solution. I placed this code before the preg_match_all. `$segLinks_abs = str_replace('href="/', 'href="http://wiki.seg.org/', $remote_search_seg);` – jwestyp Oct 17 '12 at 05:32
  • This works for me: http://pastie.org/private/faoi6bzavko8hhdpjagasg For future reference, whenever you create a string bound by single quotes, remember that a single quote is the only character you NEED to escape. Before any other character, the backslash stays intact and does not escape the character after it. The exception is escaping another backslash, as in `echo '\\\'';`, which echoes `\'`. But something like `echo '\n';` echoes `\n` literally. Lonely backslashes stay intact and do not escape the character after them. – Ultimater Oct 19 '12 at 07:20
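For the follow-up about pulling each item as a separate variable and making the links absolute, one possible approach is `PREG_SET_ORDER` plus a base prefix applied only to the captured `href`. This is a hedged sketch, not the code behind the pastie link: the pattern is a tightened variant of the one in the answer, `$base` is assumed from the URL in the question, and the array keys are illustrative names.

```php
<?php
// Sample markup copied from the question; a real page would come from file_get_contents().
$remote_search = <<<'HTML'
<li><div class='mw-search-result-heading'><a href="/index.php/Dictionary:Cable_drilling" title="Dictionary:Cable drilling">Dictionary:Cable drilling</a> </div> <div class='searchresult'>{{lowercase}}{{#category_index:C|cable <span class='searchmatch'>drilling</span>}} </div> <div class='mw-search-result-data'>132 B (22 words) - 19:58, 20 December 2011</div></li>
HTML;

$base = 'http://wiki.seg.org'; // assumption: the host the results were fetched from

$pattern = '@<li><div\s+class=\'mw-search-result-heading\'><a\s+href="([^"]*)"\s+title="([^"]*)">([^<]*)</a>\s*</div>\s*'
         . '<div\s+class=\'searchresult\'>(.*?)</div>\s*<div\s+class=\'mw-search-result-data\'>([^<]*)</div></li>@si';

// PREG_SET_ORDER groups the captures per match, so each $m holds one complete result.
preg_match_all($pattern, $remote_search, $matches, PREG_SET_ORDER);

$results = array();
foreach ($matches as $m) {
    $results[] = array(
        'href'    => $base . $m[1], // relative path made absolute
        'title'   => $m[2],
        'text'    => $m[3],
        'snippet' => trim($m[4]),
        'data'    => $m[5],
    );
}

// Closing tags need no backslash inside single-quoted strings.
foreach ($results as $r) {
    echo '<li><a href="' . htmlspecialchars($r['href']) . '" title="' . htmlspecialchars($r['title']) . '">'
       . htmlspecialchars($r['text']) . '</a></li>' . "\n";
}
```

Prefixing only the captured path leaves every other link on the embedding page untouched, which a page-wide `<base>` tag would not.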