
I'm using PHP Simple HTML DOM Parser and everything runs fine until I reach this div content. I have tried every way I can think of to get the src attribute (finding the a tags, finding the img tags), and all of them fail. I can get the img tag, but I can only read the width, height and alt attributes (and for alt, only the part where "some text" appears, not the other strings).

<img width="656" height="370" 
alt="some text " .="" othertetx="" anothertext="" anothertext="" anothertext="" anothertext'="" title="same text in the alt attr " src="http://siteurl/getattach/somedir/somefile.aspx">

I think the problem is in the alt attribute: all the text with the .="" symbols seems to confuse the parser. Browsers display this tag fine, so it must be "standard".

Edit:

The linked answer does not resolve the problem. I know how to get the src; the problem is with this particular tag. Please take the time to read the question fully before marking it as a duplicate. The code provided in the suggested answer does not work with the sample I show.

This code:

$img_src = $element->src;
if(!strstr($img_src, 'http://')) {
    $img_src = $v . $img_src;
}

does not extract the src attribute from this tag:

<img width="656" height="370" 
    alt="some text " .="" othertetx="" anothertext="" anothertext="" anothertext="" anothertext'="" title="same text in the alt attr " src="http://siteurl/getattach/somedir/somefile.aspx">
José Romero
  • Parse dom??? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Gadonski May 22 '14 at 23:43
  • It is the apostrophe ' that is breaking the element. I'm not sure how to resolve it though. – Andy G May 22 '14 at 23:45
  • Probably get the innerHTML of its parent and either search for 'src', or remove the apostrophe and append this as a new (hidden) element and read its 'src'. (I'm assuming the parser can do this.) – Andy G May 22 '14 at 23:52
  • @hek2mgl - You marked this question a duplicate of a question that was, itself, a duplicate. You should try to fix that. – pguardiario May 23 '14 at 03:03
  • The HTML is wrong. The best option would be to try to correct that. Garbage in is garbage out. – GolezTrol May 23 '14 at 09:01

1 Answer


The <img> element is not valid HTML. It has several problems in its attribute declarations. I suggest using a validation service like the W3C online validator to see those errors. I've wrapped the img tag from your question into this document for validation.

However, while the <img> tag isn't valid, the DOMDocument class is able to parse it. Like this:

$string = <<<EOF
<img width="656" height="370"
alt="some text " .="" othertetx="" anothertext="" anothertext="" anothertext="" anothertext'="" title="same text in the alt attr " src="http://siteurl/getattach/somedir/somefile.aspx">
EOF;

$doc = new DOMDocument();
@$doc->loadHTML($string);

$images = $doc->getElementsByTagName('img');
echo $images->item(0)->getAttribute('src');

Output:

http://siteurl/getattach/somedir/somefile.aspx
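
If you would rather not silence warnings with the @ operator, a possible variation is to let libxml collect the parse errors internally. This is only a sketch: the function name extractImageSources is made up for illustration and is not part of the DOM extension.

```php
<?php
// Hypothetical helper (name invented for this example): returns the src of
// every <img> found in an HTML fragment, even if the markup is malformed.
function extractImageSources(string $html): array
{
    // Collect libxml parse warnings internally instead of using @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // Skip images that have no src attribute at all.
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

$html = <<<'EOF'
<img width="656" height="370"
alt="some text " .="" othertetx="" anothertext="" anothertext="" anothertext="" anothertext'="" title="same text in the alt attr " src="http://siteurl/getattach/somedir/somefile.aspx">
EOF;

print_r(extractImageSources($html));
```

Returning an array keeps the calling code the same whether the fragment contains zero, one, or many images.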

Note that the simplehtmldom class is not as powerful as the built-in DOM extension. It was written at a time when PHP had no built-in DOM extension. In most cases its usage can be considered deprecated nowadays.

hek2mgl