I know that in PHP I can refer to the captured groups of a regex in the replacement string:

preg_replace('/(test1)(test2)(test3)/',"$3$2$1",$string);

(Something like this; I'm not sure it is exactly right, but it's not what I'm looking for anyway.)

I want to do the same thing inside the regex itself, like:

preg_match_all("~<(.*)>.*</$1>~",$string,$matches);

The part between "<" and ">" is dynamic (so that every existing HTML tag, and even custom XML tags, can be found), and I want to refer to it again later in the same regex.

But it doesn't work for me. Is this even possible? My server runs PHP 5.3.

/edit:

My final goal is this:

I have an HTML page with, e.g., the following source code:

<html>
  <head>
    <title>Titel</title>
  </head>
  <body>
    <div>
      <p>
        p-test<br />
        br-test
      </p>
      <div>
        <p>
          div-p-test
        </p>
      </div>
    </div>
  </body>
</html>

After processing, it should look like this:

$htmlArr = array(
    'html' => array(
        'head' => array('title' => 'Titel'),
        'body' => array(
            'div0' => array(
                'p0' => 'p-test<br />br-test',
                'div1' => array(
                    'p1' => 'div-p-test'
                )
            )
        )
    )
);
hakre
Mohammer
  • You must not process HTML or XML with regular expressions. [There are tools for this kind of work.](http://php.net/manual/en/class.domdocument.php) Use them. – Tomalak Apr 07 '12 at 14:14
  • [Never use regex to parse HTML/XML](http://stackoverflow.com/a/1732454/383609). It's not a regular language. Use an [HTML/XML parser](http://php.net/manual/en/class.domdocument.php) instead. – Bojangles Apr 07 '12 at 14:15
  • The structure you propose does not make sense. What's wrong with parsing the file into a DOM and using that? (Except that you think arrays are easier than a DOM, which is not a good enough reason.) – Tomalak Apr 08 '12 at 19:38
  • @Tomalak I think I'm just afraid to use it because it's new to me. I'm going to work through the documentation and maybe some tutorials. Thank you! – Mohammer Apr 08 '12 at 19:41
  • If you'd explain what you intend to do with the parsed data, I could help you better. – Tomalak Apr 08 '12 at 19:43
  • @Tomalak Do you have Skype/ICQ or some other messenger? Or should I just go into more detail in the main question? (Sorry, I am new to Stack Overflow; I don't know how to proceed the correct way and didn't find any way to send you a message.) – Mohammer Apr 08 '12 at 19:49
  • We could set up [a chat room](http://chat.stackoverflow.com) for this, but I'd like it better if you stated the problem you are trying to solve here, as this is less timezone trouble (I take it you are not located anywhere near GMT). – Tomalak Apr 09 '12 at 10:48
  • @Tomalak I am in CET, Germany. I'll write down the whole thing I want to achieve and open a new thread, I guess. – Mohammer Apr 09 '12 at 11:31
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9861/discussion-between-tomalak-and-mohammer) – Tomalak Apr 09 '12 at 14:48

1 Answer

Placeholders in the replacement string use the $1 syntax. Inside the regex itself they are called backreferences and use the syntax \1: a backslash followed by the group number.
http://www.regular-expressions.info/brackets.html

So in your case:

preg_match_all("~<(.*?)>.*?</\\1>~",$string,$matches);

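The backslash is doubled because the pattern sits in a double-quoted PHP string, where the backslash is itself an escape character: "\\1" yields the two characters \1 that the regex engine needs, whereas "\1" would be read as an octal escape and turn into the control character with code 1. In a single-quoted string, '\1' can be written with a single backslash.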
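
To make the difference between the two notations concrete, here is a minimal sketch. The sample input and the use of `[^>]+` for the tag name are assumptions for illustration, not part of the question; the `preg_replace` line is the one from the question.

```php
<?php
// Sample input (an assumption, chosen to include one mismatched pair).
$string = '<b>bold</b> and <i>italic</i> but <u>mismatched</i>';

// Single-quoted pattern, so \1 reaches the regex engine unchanged.
// [^>]+ stops the tag name at the first ">"; the s modifier lets "."
// match newlines as well.
preg_match_all('~<([^>]+)>(.*?)</\1>~s', $string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the tag name, $m[2] the text between the tags.
    echo $m[1], ' => ', $m[2], "\n";
}

// In the replacement string the same groups are written as $1, $2, $3.
$swapped = preg_replace('/(test1)(test2)(test3)/', '$3$2$1', 'test1test2test3');
echo $swapped, "\n";
```

Only the `b` and `i` pairs match, because `</\1>` requires the closing tag to repeat the captured name. Note that a pattern like this cannot pair nested tags of the same name (a `<div>` inside a `<div>`) correctly, since the lazy `.*?` stops at the first closing tag it finds.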

mario
  • ...and sure enough someone writes a regular expression to process HTML for somebody who doesn't know regular expressions well enough to *not* have to ask questions about them. – Tomalak Apr 07 '12 at 14:22
  • I'm always tempted to exchange angle for curly brackets as to not violate the sacred SO parsing feelings. But the actual question here was about backreferences. Rubbing joke pages onto newbies doesn't accomplish much. – mario Apr 07 '12 at 14:29
  • Giving them the feeling that they can parse HTML with regex, after all, and this is all an obsessive exaggeration isn't helping either. My point is that if the OP is not smart enough with regex to figure it out on his own, it is actually dangerous advice to fix the regex for him. Oh, and I did not rub in any joke page, I linked to `DOMDocument`. – Tomalak Apr 07 '12 at 14:42
  • I am thankful to both of you. Yes, I am not very skilled, and I didn't know anything about DOMDocument in PHP (just in JS). Now I know that backreferences exist and I know about DOMDocument. Though it will be a pain in my brain to go through that class to convert an HTML string to an array. But that's only a matter of time and effort. – Mohammer Apr 07 '12 at 14:58
  • @Mohammer If you post some HTML and describe exactly what you want to do to it, I'd give you a head-start on how to do it with `DOMDocument`. – Tomalak Apr 07 '12 at 15:00
  • @Mohammer: You should avoid DOMDocument in PHP. Rather use phpQuery or [QueryPath](http://querypath.org/) for simplicity (jQuery-like extracting, not #%!$-"parsing"). Or even SimpleXML for converting XML into an array structure, if you have plain XML. Despite the frowning on SO, a regex is sufficient and most commonly workable for stripping single bits from coherent HTML input. – mario Apr 07 '12 at 15:01
  • To both of you: I want to write a class that uses curl to extract some parts of web pages for an API, so that I can write an app for my Android phone. Some of the websites I want to build my own API for (from their frontend source code) have invalid HTML (like the same IDs on several lists etc.), so I want to have the HTML as an array. It won't work anymore if the source code/frontend changes somehow, but until then it works for me, and those apps will be only for me. – Mohammer Apr 07 '12 at 15:10
  • @Mohammer: Then don't write any custom class, use QueryPath with e.g. `qp($url)->find("h3")->text();` and you're done. – mario Apr 07 '12 at 15:38
  • @Mohammer Just post some sample HTML to your question. QueryPath does not sound like a bad option, either. – Tomalak Apr 07 '12 at 15:40
  • @Tomalak sample HTML and sample solution in main question. Thx so far Tomalak and mario, you helped me a lot! – Mohammer Apr 08 '12 at 19:33
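
The `DOMDocument` route suggested in the comments could be sketched roughly as follows. This is only an illustration under assumptions: `elementToArray` is a hypothetical helper, the key scheme (tag name plus a per-parent counter) is not the exact one from the question, and the sample markup is shortened and leaves out the `<br />`.

```php
<?php
// Hypothetical helper: turns an element into a nested array.
function elementToArray(DOMElement $element)
{
    $children = array();
    $counts = array();
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMElement) {
            // Key = tag name plus a counter that restarts for each parent.
            $name = $child->tagName;
            $index = isset($counts[$name]) ? $counts[$name] : 0;
            $counts[$name] = $index + 1;
            $children[$name . $index] = elementToArray($child);
        }
    }
    // Elements without child elements collapse to their trimmed text.
    return $children ? $children : trim($element->textContent);
}

// Shortened sample markup (an assumption for illustration).
$html = '<html><head><title>Titel</title></head>'
      . '<body><div><p>p-test</p><div><p>div-p-test</p></div></div></body></html>';

$doc = new DOMDocument();
// Collect parser warnings for invalid markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$htmlArr = array('html' => elementToArray($doc->documentElement));
print_r($htmlArr);
```

A limitation of this sketch: text that sits next to child elements (such as `p-test<br />br-test`) is dropped, because an element with child elements becomes an array. Keeping such mixed content would need a different representation.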