
I'm parsing some HTML that I generate from a form. It is a token system. I'm trying to extract the information with a regexp later on, but somehow it only turns up the first of the matches. I found a regexp on the web that does almost what I need, except that it cannot process multiple occurrences.

I want to replace each match with content generated from the matched string.

So, here is my code:

$result = preg_replace_callback("/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>\[\*.*\*\]\<\/[a]\>/i", array(get_class($this), 'embed_video'), $str);

public function embed_video($matches)
{
    print_r($matches);
    return $matches[1] . 'foo';
}

I really only need the attributes, since they contain all of the valuable information. The contents of the tag are used only to find the token. This is an example of what needs to happen:

<a type="TypeOfToken1" id="IdOfToken1">[*SomeTokenTitle1*]</a>
<a type="TypeOfToken2" id="IdOfToken2">[*SomeTokenTitle2*]</a>

After the preg_replace_callback() this should be returned:

type="TypeOfToken1" id="IdOfToken1" type="TypeOfToken2" id="IdOfToken2"

However, the callback function prints the matches but its return value does not replace them, so $result stays the same after preg_replace_callback(). What could be the problem?


An example with real data:

Input:

<p><a id="someToken1" rel="someToken1">[*someToken1*]</a> sdfsdf <a id="someToken2" rel="someToken2">[*someToken2*]</a></p>

returned $result:

id="someToken1" rel="someToken1"foo

Output of print_r() in the callback function:

Array ( [0] => [*someToken1*] sdfsdf [*someToken2*] [1] => id="someToken1" rel="someToken1" [2] => rel="someToken1" [3] => rel [4] => ="someToken1" ) 

I think that it is not returning both of the strings it should.

Janis Peisenieks
  • possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) – Gordon Mar 07 '11 at 12:53
  • Your code snippet is working fine (apart from stripping out the start of the tags). It sounds likely that you looked in `$str` for results instead of `$result`. – mario Mar 07 '11 at 12:57
  • I don't think the question deserves a downvote. However, how to parse X from HTML has been answered many times before. Try to search with the keyword "DOM". – Gordon Mar 07 '11 at 12:58
  • @Janis it may take *slightly* longer, but it will not use a regexp, and work much, much more reliably. – Pekka Mar 07 '11 at 13:15
  • But what about replacing? As you can see from the callback function, I am trying to replace the match with some content that depends on the match made. – Janis Peisenieks Mar 07 '11 at 13:17
  • It's not quite clear why you want to depend on HTML attributes **and** a `[*token*]`. Usually template libraries use something like `[*token attr1 attr2*]` for which regular expressions are more clearly the right fit. -- As for the result replacing, the callback only does what you told it to. It removes everything the regex matches, but you only return `$matches[1]`. You need to group the parts you want to save and include them in the return $matches[1], [2], [3]... Or for testing just return `$matches[0]` which includes everything. – mario Mar 07 '11 at 13:41
  • I got that. Quite a lot of hate here :) It seems that I may have misunderstood what the callback function should do. I thought it should return a string that replaces the string found. As for the tag, I understand that what I was asking was too much. I wanted to use a tag so that I could add some variables to the token without them being visible inside a WYSIWYG editor like TinyMCE. I have since reverted to syntax like this: [*Token::SomeNeededInfo*], for which I have created a regex. – Janis Peisenieks Mar 07 '11 at 13:45
  • No, that's exactly what the callback does. But the string it returns will replace the **entire** match, which includes the leading `?\w+` and the trailing `[a]>`. So you would either `(` group `)` them for later returning. Or use the simpler approach of having a secondary regex in the callback to match/break it up more fine-grained. -- But I think you should go with the attributes in the token if that is feasible. – mario Mar 07 '11 at 14:01
  • Thanks. It's people like You, who really try to understand questions like mine, that make this site so great. Thanks for the help, I've gotten it. – Janis Peisenieks Mar 07 '11 at 14:10
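
The point made in the comments, that the callback's return value replaces the entire match and that the parts worth keeping have to be captured in groups, can be sketched as follows. This is only an illustration under assumptions: the pattern is a simplified stand-in for the original regex (it only handles `<a ...>[*...*]</a>`), and the anonymous callback is hypothetical.

```php
<?php
// Input modelled on the "real data" example in the question.
$str = '<p><a id="someToken1" rel="someToken1">[*someToken1*]</a> sdfsdf '
     . '<a id="someToken2" rel="someToken2">[*someToken2*]</a></p>';

// Simplified pattern (an assumption, not the original regex):
// group 1 captures the attribute string, group 2 the token title.
// [^>]* and .*? keep each match inside a single <a>...</a> element.
$pattern = '/<a\s+([^>]*)>\[\*(.*?)\*\]<\/a>/i';

$result = preg_replace_callback($pattern, function (array $m) {
    // Whatever is returned here replaces $m[0], the whole matched element,
    // so only the captured attribute string survives.
    return $m[1];
}, $str);

echo $result, "\n";
// <p>id="someToken1" rel="someToken1" sdfsdf id="someToken2" rel="someToken2"</p>
```

The result has to be read from `$result`; `$str` itself is left unchanged by `preg_replace_callback()`.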

1 Answer


For anyone else stumbling into a problem like this, try checking your regexp and its modifiers.
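
One thing worth checking, and this is an assumption since the answer does not name the exact cause, is greediness: the `print_r()` output in the question shows a single match stretching from the first token to the last, which is what a greedy `.*` produces. A minimal sketch with a made-up input string:

```php
<?php
// Hypothetical input with two tokens on one line.
$str = '<a id="t1">[*one*]</a> text <a id="t2">[*two*]</a>';

// Greedy: .* runs on to the LAST "*]", so both tokens collapse into one match.
$greedyCount = preg_match_all('/\[\*.*\*\]/', $str, $greedy);

// Lazy: .*? stops at the FIRST "*]", giving one match per token.
$lazyCount = preg_match_all('/\[\*.*?\*\]/', $str, $lazy);

// The U modifier inverts greediness for the whole pattern, with the same effect here.
$ungreedyCount = preg_match_all('/\[\*.*\*\]/U', $str, $ungreedy);

printf("greedy: %d, lazy: %d, U modifier: %d\n", $greedyCount, $lazyCount, $ungreedyCount);
// greedy: 1, lazy: 2, U modifier: 2
```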

Regarding the parsing of the document, I'm still doing it, just not with HTML tags. I have instead gone with something more text-like that can be parsed more easily. In my case: [*TokeName::TokenDetails*].
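
A rough sketch of how such a token could be replaced with `preg_replace_callback()`. The pattern, the `render_token()` function, and the sample token names are all hypothetical and not the regex actually used:

```php
<?php
// Hypothetical callback: builds the replacement from the two captured parts.
function render_token(array $m): string
{
    // $m[0] is the whole token, $m[1] the name, $m[2] the details.
    return sprintf('{%s: %s}', $m[1], $m[2]);
}

// Made-up input containing two tokens.
$str = 'Intro [*Video::abc123*] middle [*Image::42*] end';

// Assumed pattern: a word-character name, "::", then lazily matched details.
$pattern = '/\[\*(\w+)::(.*?)\*\]/';

$result = preg_replace_callback($pattern, 'render_token', $str);

echo $result, "\n";
// Intro {Video: abc123} middle {Image: 42} end
```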

Janis Peisenieks