The string

<div id="main">
   content (is INT)
   <div>some more content (is not INT) other content (also INT)</div>
</div>

I need to get the content which is an INT. A simple "strip everything that is not an INT" function will not work, since the other content is sometimes also an INT. I cannot use a select-child solution, since the wanted content is always outside the inner div, and selecting the content of <div id="main"> will also select the other div.

So, is there a solution that can search the string from the start for the first < and remove the rest of the string once it is found?

(The structure cannot be altered)

Joseph

1 Answer

If that's the exact format, you could just use substr and strpos, something like:

$html = '<div id="main">
   12345
   <div>foobar6789</div>
</div>
';

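// Text between the opening <div id="main"> (15 characters) and the inner <div>: the first INT
$content_1 = trim(substr($html, 15, strpos($html, '<div>') - 15));
// Contents of the inner <div>, with the closing </div> tags stripped
$subdiv = str_replace("</div>", "", substr($html, strpos($html, '<div>') + 5));

// Split the inner content into its leading non-digit text and the digits that follow
preg_match('/(?P<noint>[^0-9]+)(?P<digit>\d+)/', $subdiv, $matches);
echo $matches['noint']; // the non-INT content
echo $matches['digit']; // the second INT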

It's not a good idea to parse HTML using regexp... but maybe you could do it using only preg_match.
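As a rough sketch of the preg_match-only idea, the three values can be captured with one pattern, assuming the markup looks exactly like the sample above. The pattern and the group name `first` are my own illustration, and it will stop matching as soon as the structure changes.

```php
<?php
$html = '<div id="main">
   12345
   <div>foobar6789</div>
</div>
';

// One pattern for all three values: the leading INT, then the inner <div>
// holding non-digit text followed by digits. Assumes this exact layout.
$pattern = '~<div id="main">\s*(?P<first>\d+)\s*<div>(?P<noint>\D+)(?P<digit>\d+)\s*</div>~';

if (preg_match($pattern, $html, $matches)) {
    echo $matches['first'], "\n"; // 12345
    echo $matches['noint'], "\n"; // foobar
    echo $matches['digit'], "\n"; // 6789
}
```

Using `~` as the delimiter avoids having to escape the `/` in `</div>`.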

good luck!

pleasedontbelong
  • It might not always be the exact format, but I thought there was a PHP function that could go through a string, stop at the first occurrence of e.g. "<", and return the part of the string up to that "<"? That would solve it, but I do not know how to do it. – Joseph Apr 09 '11 at 11:08
  • Well... to do that you have to use substr and strpos together, like I did to retrieve the first INT. I had to use preg_match to separate the text from the digits because I didn't know whether there was a separator. I guess that what you need is a parser. – pleasedontbelong Apr 09 '11 at 11:13
  • OK. If I understand correctly: there is a `:` after the wanted INT (not always, but often enough that relying on it would be a good solution). Does that help simplify the solution? Also, I could step into the first div so that the INT comes first in the string, and would then only need to read from the start of the string up to the first `<`. – Joseph Apr 09 '11 at 11:17
  • I don't think it will simplify it. In the end I gave you a 3-line solution to retrieve 3 values. You could try to rack your brain retrieving the 3 values with a single regexp; I've tried, but I had no luck. Check this: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – pleasedontbelong Apr 09 '11 at 11:31
  • The question was not meant as criticism of your solution. The problem is rather that, since the structure is not always the same, it would be better to ignore the structure and simply step through the string and stop at a separator, e.g. the `:` which I said comes after the INT I want. The problem occurs for me because the INT is inside a div together with another div whose content I do not want, and thus I cannot search for main and handle the children. – Joseph Apr 09 '11 at 11:35
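
The approach discussed in these comments (read from the start of the string and stop at the first `<` or `:`) could be sketched with strcspn, which returns the length of the leading part of a string that contains none of the given characters. The helper name `leadingValue` and the sample input are hypothetical, and the sketch assumes the opening <div id="main"> has already been removed, so that the wanted INT comes first.

```php
<?php
// Hypothetical helper: return the text before the first separator character.
function leadingValue(string $text, string $separators = '<:'): string
{
    // Length of the initial segment that contains none of the separators.
    $length = strcspn($text, $separators);

    // Cut there and drop the surrounding whitespace and newlines.
    return trim(substr($text, 0, $length));
}

// Assumed input: the inner content of <div id="main">, opening tag already removed.
$inner = "\n   12345: \n   <div>foobar6789</div>\n";

echo leadingValue($inner), "\n"; // 12345
```

If neither separator occurs, strcspn returns the full length, so the whole trimmed string comes back. When only `<` should end the value, `strstr($text, '<', true)` is a similar option, though it returns false when no `<` is present.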