My puzzle: as a PHP newbie, I am trying to extract some data from a string using a regular expression, but I cannot find the correct syntax.

The string contains the scraped HTML of several images from a website. I want the final output to be 3 separate variables: "$Number1", "$Number2" and "$Status".

An example of the content of the input string $html:

<div id="system">         
<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt=".5" height="35" src="/images/numbers/point5.jpg" style="margin-left: -4px" width="26" /><img alt="system statusA" height="35" src="/images/numbers/statusA.jpg" width="37" /><img alt="2" height="35" src="/images/numbers/2.jpg" width="18" /><img alt="1" height="35" src="/images/numbers/1.jpg" width="18" /><img alt=".0" height="35" src="/images/numbers/point0.jpg" style="margin-left: -4px" width="26" />
</div>

The possible values which can appear in this string are:

  • 0.jpg
  • 1.jpg
  • 2.jpg
  • 3.jpg
  • 4.jpg
  • 5.jpg
  • 6.jpg
  • 7.jpg
  • 8.jpg
  • 9.jpg
  • point0.jpg
  • point5.jpg
  • statusA.jpg
  • statusB.jpg
  • statusC.jpg
  • statusD.jpg
  • statusE.jpg
  • statusF.jpg

The result should be variables:

  • "Number1" (XX.X) based upon the first two numbers (0-9) and .0 or .5
  • "Status" (statusX) based upon the status
  • "Number2" (XX.X) based upon the last two numbers (0-9) and .0 or .5

Code so far:

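// The pattern needs delimiters, and the alt values are wrapped in double quotes.
$regex = '/\balt="(.*?)"/';
// preg_match_all() collects every match; preg_match() stops after the first one.
preg_match_all($regex, $html, $matches);
var_dump($matches[1]); // the captured alt values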

I probably have to do this in multiple steps or use another function. Who can help me?

JERO79

3 Answers

The very first thing you should ask yourself is: "What format is my input data in?" Since in this case it is clearly a snippet of HTML, you should feed that snippet to an HTML parser, not to a regular expression engine.

I don't know the exact function names (htmlparser_parse and find_all below are placeholders), but your code should look roughly like this:

$htmltext = '<div id="system">[...]</div>';
$htmltree = htmlparser_parse($htmltext);
$images = $htmltree->find_all('img');
foreach ($images as $image) {
  echo $image->src;
}

So you need to find an HTML parser that parses a string into a tree of nodes. The nodes should have methods for finding nodes inside them based on CSS classes, element names or node IDs. For Python this library is called BeautifulSoup, for Java it is JSoup, and I'm sure that there is something similar for PHP.

The examples provided with simplehtmldom look promising.

Roland Illig

Possibly the DOM extension: http://www.php.net/manual/en/book.dom.php

See also "Robust and Mature HTML Parser for PHP".
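
As a rough illustration of what the DOM extension offers, here is a minimal sketch that lists the src attribute of every image. It assumes PHP's bundled DOM extension is enabled; the sample input is a shortened copy of the markup from the question, and the variable name $sources is only illustrative.

```php
<?php
// Shortened sample input, copied from the question.
$html = '<div id="system">'
      . '<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" />'
      . '<img alt=".5" height="35" src="/images/numbers/point5.jpg" width="26" />'
      . '<img alt="system statusA" height="35" src="/images/numbers/statusA.jpg" width="37" />'
      . '</div>';

// Parse the fragment into a DOM tree.
$doc = new DOMDocument();
$doc->loadHTML($html);

// Walk all <img> elements in document order and collect their src values.
$sources = [];
foreach ($doc->getElementsByTagName('img') as $image) {
    $sources[] = $image->getAttribute('src');
}

print_r($sources);
```

Real-world pages are often not well-formed, in which case loadHTML() may emit warnings; the snippet from the question is clean enough that this should not matter here.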

Gilles Quénot

Do you want just the alt values? Try this XPath example:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach($xpath->query('//img/@alt') as $node){
    echo $node->nodeValue."\n";
}
pguardiario
  • Thank you, this works. With a | as separator it returns: '1|3|.5|statusB|8|5|.0|'. But how can I get this value into a string? I need this to split it further into 3 strings using explode. – JERO79 Nov 22 '11 at 20:02
  • Solved using: foreach($xpath->query('//img/@alt') as $node){ $input[]=$node->nodeValue; } – JERO79 Nov 22 '11 at 21:50
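
Building on the array collected in the last comment, one possible way to arrive at the three variables from the question is sketched below. The function name parseSystemHtml is hypothetical and not from the thread. The sketch assumes PHP 7.1 or later, that exactly one alt value contains the word "status", and that this status image always sits between the two numbers.

```php
<?php
/**
 * Hypothetical helper (not from the original thread): turns the scraped
 * HTML into the array [Number1, Status, Number2].
 */
function parseSystemHtml(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Collect every alt value in document order, as in the comment above.
    $alts = [];
    foreach ($xpath->query('//img/@alt') as $node) {
        $alts[] = $node->nodeValue;
    }

    // Assumption: the first alt containing "status" separates the two numbers.
    $statusIndex = null;
    foreach ($alts as $i => $alt) {
        if (strpos($alt, 'status') !== false) {
            $statusIndex = $i;
            break;
        }
    }
    if ($statusIndex === null) {
        throw new RuntimeException('No status image found');
    }

    // Glue the digit and decimal parts together, e.g. "2", "2", ".5" => "22.5".
    $number1 = implode('', array_slice($alts, 0, $statusIndex));
    $number2 = implode('', array_slice($alts, $statusIndex + 1));

    // "system statusA" => "statusA": keep only the last word of the alt text.
    $words  = explode(' ', $alts[$statusIndex]);
    $status = end($words);

    return [$number1, $status, $number2];
}

// The example input from the question.
$html = '<div id="system">'
      . '<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" />'
      . '<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" />'
      . '<img alt=".5" height="35" src="/images/numbers/point5.jpg" style="margin-left: -4px" width="26" />'
      . '<img alt="system statusA" height="35" src="/images/numbers/statusA.jpg" width="37" />'
      . '<img alt="2" height="35" src="/images/numbers/2.jpg" width="18" />'
      . '<img alt="1" height="35" src="/images/numbers/1.jpg" width="18" />'
      . '<img alt=".0" height="35" src="/images/numbers/point0.jpg" style="margin-left: -4px" width="26" />'
      . '</div>';

[$Number1, $Status, $Number2] = parseSystemHtml($html);
echo "$Number1 $Status $Number2\n"; // prints "22.5 statusA 21.0"
```

Slicing the array around the status entry avoids joining the values with | and calling explode again, and it keeps working if a number ever has one or three digits instead of two.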