
I need to extract digits and a dot that are encoded in a text as image file names. The number of digits and the presence of the dot are unpredictable.

The string would look like this:

beginningspeedstring-"./gifs/4.jpg"-"./gifs/1.jpg"-"./gifs/dot.jpg"-"./gifs/3.jpg"-endspeedstring-beginningtempstring-"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"-endtempstring-beginningforce-"./gifs/5.jpg"-"./gifs/3.jpg"-"./gifs/3.jpg"-endforce

What I expect as output, from a single pattern match, is:

18.8

Can I get this through a single regexp?

Thanks

EDIT: Changed the example, as the main point is not HTML but capturing multiple occurrences at once.

EDIT2

beginningtempstring-(?:.*?gifs\/(.*?)\.jpg.*)*-endtempstring

This is the best I could come up with so far, but it retrieves only the first occurrence (and does not pick up the dot).

lui
  • So, basically you want to parse HTML with a regexp? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 You should parse it as HTML/XML and loop through the image nodes. – pozs Mar 13 '14 at 13:16
  • Actually it doesn't really matter whether it's HTML or not. For my purpose it can be treated as a simple string. The real issue is getting the whole number as one single item. It could be like: beginning-"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"-end – lui Mar 13 '14 at 13:32

2 Answers


For an html file:

$html = <<<EOD
<tr>
<td valign="middle">
<img src="./gifs/1.jpg" height="62" width="20">
<img src="./gifs/8.jpg" height="62" width="20">
<img src="./gifs/dot.jpg" height="62" width="10">
<img src="./gifs/8.jpg" height="62" width="20">
<img src="gifs/unit-of-measure.jpg">
</td> 
</tr>
EOD;

A clean way to do this is to use DOMDocument and XPath:

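$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings about imperfect markup

$xpath = new DOMXPath($dom);
// src attributes of the 62px-high images inside the wanted cell
$query = '//tr/td[@valign = "middle"]/img[@height = "62"]/@src';

$srcNodes = $xpath->query($query);

$result = ''; // initialise before appending to avoid an "undefined variable" notice
foreach ($srcNodes as $srcNode) {
    // "./gifs/8.jpg" -> "8": skip the 7-character "./gifs/" prefix and drop ".jpg"
    $tmp = substr($srcNode->textContent, 7, -4);
    if ($tmp === 'dot') $tmp = '.';
    $result .= $tmp;
}
print_r($result);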

A regex way (assuming that the format is always the same):

$pattern = '~<img src="\./gifs/(?|(\d)\.|dot(\.))jpg" height="62" width="[12]0">~';
preg_match_all($pattern, $html, $matches);
$result = implode($matches[1]);

Note: if you want to be sure that the <img> tags are contiguous, you can add this at the beginning of the pattern:

(?:<td valign="middle">|\G)\s*

This ensures that each match starts either right after the <td> tag or at the end of the previous match.
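As a rough sketch of that contiguity check, the snippet below simply prepends the anchor to the regex pattern shown earlier. The second `<td>` cell with its `7.jpg` image is an invented addition to the sample markup, there only to show that images outside the `valign="middle"` cell are skipped.

```php
<?php
// Sample markup; the second <td> is a made-up addition for illustration.
$html = <<<EOD
<tr>
<td valign="middle">
<img src="./gifs/1.jpg" height="62" width="20">
<img src="./gifs/8.jpg" height="62" width="20">
<img src="./gifs/dot.jpg" height="62" width="10">
<img src="./gifs/8.jpg" height="62" width="20">
<img src="gifs/unit-of-measure.jpg">
</td>
<td>
<img src="./gifs/7.jpg" height="62" width="20">
</td>
</tr>
EOD;

// Either start right after the wanted <td>, or continue (\G) exactly where
// the previous match ended, so only an unbroken run of <img> tags is kept.
$pattern = '~(?:<td valign="middle">|\G)\s*'
         . '<img src="\./gifs/(?|(\d)\.|dot(\.))jpg" height="62" width="[12]0">~';

preg_match_all($pattern, $html, $matches);
$result = implode('', $matches[1]);

echo $result, PHP_EOL; // prints 18.8
```

The run stops at the unit-of-measure image because it does not fit the pattern, and the `7.jpg` image is never reached because no match ends directly in front of it.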

For a text file:

$text = 'beginningspeed-"./gifs/4.jpg"-"./gifs/1.jpg"-"./gifs/dot.jpg"-"./gifs/3.jpg"-endspeed
beginningtemp-"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"-endtemp
beginningforce-"./gifs/5.jpg"-"./gifs/3.jpg"-"./gifs/3.jpg"-endforce';

$pattern = '~^[^-]+-|[^-]+$|(?<!t)\.?jpg"-|"\./gifs/|dot~m';

$tmp = preg_replace($pattern, '', $text);

$results = preg_split('/\R/', $tmp); // \R splits on any newline sequence, even if it differs from PHP_EOL
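When only one named value is wanted, the `\G` anchor mentioned in the HTML note could also be tried on the text format. This is a minimal sketch, assuming the markers really are `beginning<name>` and `end<name>` as in the sample; the helper `extractReading` is a hypothetical name introduced here for illustration.

```php
<?php
/**
 * Hypothetical helper: collects the digits (and dot) of one named block.
 * Assumes the block starts with "beginning<name>" and that its image
 * names follow each other separated by a single "-".
 */
function extractReading(string $text, string $name): string
{
    // Start at the block marker, or continue (\G) where the previous match
    // ended; (?!\A) stops \G from also matching at the very start of $text.
    $pattern = '~(?:beginning' . preg_quote($name, '~') . '|\G(?!\A))'
             . '-"\./gifs/(?|(\d)\.|dot(\.))jpg"~';

    preg_match_all($pattern, $text, $matches);

    return implode('', $matches[1]);
}

$text = 'beginningspeed-"./gifs/4.jpg"-"./gifs/1.jpg"-"./gifs/dot.jpg"-"./gifs/3.jpg"-endspeed
beginningtemp-"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"-endtemp
beginningforce-"./gifs/5.jpg"-"./gifs/3.jpg"-"./gifs/3.jpg"-endforce';

echo extractReading($text, 'temp'), PHP_EOL; // prints 18.8
```

It is still several matches joined with `implode`, not one capture holding the whole number, but a single pattern ties the digits to the chosen block.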
Casimir et Hippolyte
  • Thanks Casimir, but unfortunately I cannot parse the document that way because it would break the logic I use for other cases. Let's not stick to HTML; it can even be considered a string like: beginning-"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"-end – lui Mar 13 '14 at 13:38
  • @lui: perhaps it is time to change the logic you use for the other cases? As you can see, this method is independent of all the quirks you can find in an HTML document and lets you obtain the content you want quickly with a simple query. – Casimir et Hippolyte Mar 13 '14 at 13:43
  • Unfortunately, in 50% of the cases I get text files and in the other 50% I get HTML pages. I'd like to find a way to match the multiple occurrences with one regexp... – lui Mar 13 '14 at 13:47

If "in one match" means extracting the desired result in one actual regex match, then I think it is not possible, or at least it is complicated. But if you want to use one regex to match all the required parts, then you might use the following approach:

$input = '<tr><td valign="middle"><img src="./gifs/1.jpg" height="62" width="20"><img src="./gifs/8.jpg" height="62" width="20"><img src="./gifs/dot.jpg" height="62" width="10"><img src="./gifs/8.jpg" height="62" width="20"><img src="gifs/unit-of-measure.jpg"></td> </tr>';
// or, for the text format:
$input = '"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"';

$pattern = '%gifs/(?:dot)?([0-9]|\.)(?:\.)?jpg%';
preg_match_all($pattern, $input, $matches, PREG_PATTERN_ORDER);
$result = implode('',$matches[1]);
echo $result;

Tested with both $input strings.
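If several parameters share one input, one possible way to keep their digits apart is a two-pass sketch: first isolate each `beginning<name>` ... `end<name>` block, then apply the pattern above inside that block only. The marker format is assumed from the question's example, and `parseReadings` is a hypothetical helper name, not part of the original code.

```php
<?php
/**
 * Hypothetical helper: returns one reading per named block.
 * Assumes blocks look like beginning<name>-...-end<name>.
 */
function parseReadings(string $text): array
{
    $readings = [];

    // Pass 1: one match per block; \1 forces the end marker to carry the
    // same name as the beginning marker.
    preg_match_all('~beginning(\w+)-(.*?)-end\1~s', $text, $blocks, PREG_SET_ORDER);

    foreach ($blocks as $block) {
        // Pass 2: the digit/dot pattern, applied to this block's body only.
        preg_match_all('%gifs/(?:dot)?([0-9]|\.)(?:\.)?jpg%', $block[2], $m);
        $readings[$block[1]] = implode('', $m[1]);
    }

    return $readings;
}

$text = 'beginningspeed-"./gifs/4.jpg"-"./gifs/1.jpg"-"./gifs/dot.jpg"-"./gifs/3.jpg"-endspeed
beginningtemp-"./gifs/1.jpg"-"./gifs/8.jpg"-"./gifs/dot.jpg"-"./gifs/8.jpg"-endtemp
beginningforce-"./gifs/5.jpg"-"./gifs/3.jpg"-"./gifs/3.jpg"-endforce';

print_r(parseReadings($text)); // speed => 41.3, temp => 18.8, force => 533
```

Splitting the work this way keeps each regex short, at the cost of not being a single-pattern solution.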

M.S.
  • Thanks M.S., yes, that's a good option as I can probably insert it into my actual logic. Still, if a solution with just one regexp existed, it would be great. – lui Mar 13 '14 at 13:56
  • Also because there are multiple jpgs and dots belonging to different parameters (imagine speed, humidity, ... in the same page), so I need to tie down the position as much as possible. – lui Mar 13 '14 at 13:58
  • You mean you have an input which should yield multiple results? If you give a more extensive example, I might come up with a better solution for such examples as well. – M.S. Mar 13 '14 at 14:02
  • Or did you mean that the "middle" is more important? If so, then I got confused by your text-only example. – M.S. Mar 13 '14 at 14:04
  • Ok, I'm adding a more complete example in the explanation – lui Mar 13 '14 at 14:05
  • either you have to know something about the surrounding text OR it needs to have some regularity. So either you know "beginningspeed" etc. in advance, OR some regular thing like "" is always around. – M.S. Mar 13 '14 at 14:12
  • Correct: I know the beginning and end of the string. For simplicity I called them "beginningspeed" and "endspeed". Those substrings can be used for the sake of the example. – lui Mar 13 '14 at 14:17
  • Not exactly as requested, but very useful as well. I ended up writing a specific parser using the code from M.S. Thanks a lot! – lui Mar 13 '14 at 19:03