The title of my question is a bit complicated, I know, but here is basically what I want to do:

Say I have this piece of text:

[table]
[tr]
[td]test str 1[/td]
[td]test str 2[/td]
[/tr]
[/table]

Is there a regex that lets me find:

  • A string that is between the [td] and [/td] tags
  • Where the entire part from [td] to [/td] is itself between the [table] and [/table] tags
  • And the text between the [table] and [td] tags does not contain the [/table] tag
  • And the text between the [/td] and [/table] tags does not contain the [table] tag

It might sound obvious, but the regex has to be safe, because it will be used to handle user input. All the tags are converted to HTML, so if a user were to enter a [td] outside of a table, it could affect the tables used for the layout of my site's page.

So it should match "test str 1" first, and on the next go "test str 2", but only if that string is within the td tags, which should in turn be within the table tags between which may not be another table tag.

This is as close as I've gotten:

/\[table(.*?)\]((?!\[\/table\]).*?)\[td(.*?)\](.*?)\[\/td\]((?!\[table(.*?)\]).*?)\[\/table\]/si

But I think I'm missing something in the parts where the table tags should not be there, so between the table and td tags.
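
One possible direction, sketched in Python purely for illustration (the pattern uses only lookaheads, which PCRE also supports): match each [td]...[/td] pair and require, via a lookahead, that the next table tag after it is a closing [/table] rather than an opening [table]. This is a sketch under assumptions that are not guaranteed by the question: tags carry no attributes, tables are never nested, and there is no stray [/table] outside a table. The name `find_cells` is hypothetical.

```python
import re

# Assumptions (not from the original post): no tag attributes, no nested
# tables, and no stray [/table] outside a table.
CELL = re.compile(
    r"\[td\]"
    # Cell content: any characters that do not start a td or table tag.
    r"((?:(?!\[/?td\]|\[/?table\]).)*)"
    r"\[/td\]"
    # After the cell, [/table] must appear before any new [table].
    r"(?=(?:(?!\[table\]).)*?\[/table\])",
    re.DOTALL | re.IGNORECASE,
)

def find_cells(text):
    """Return the contents of every [td] that is followed by a closing table tag."""
    return CELL.findall(text)

sample = """[td]stray before[/td]
[table]
[tr]
[td]test str 1[/td]
[td]test str 2[/td]
[/tr]
[/table]
[td]stray after[/td]"""

print(find_cells(sample))
```

The lookahead only inspects what follows the cell, so it cannot by itself prove that an opening [table] came before it; that is why the assumptions above matter.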

Brian Tompsett - 汤莱恩
Qub1
  • [Don't regex html](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Just write a parser, or use a library. – zellio Sep 01 '12 at 01:00
  • You have a better way to parse this **non** HTML stuff? @zellio – PeeHaa Sep 01 '12 at 01:03
  • If I were going to parse a non-regular language, I would use a parser. And this is just HTML; changing the `<>` to `[]` doesn't change that. It takes user input and is converted to HTML. – zellio Sep 01 '12 at 01:04
  • Although I agree with the point about not the right tool for the job, but for some things there are no parsers ^^ (don't know whether that is the case here though because I have no idea where that stuff is coming from) – PeeHaa Sep 01 '12 at 01:05
  • Well, I can't use a parser because I need to do this in a PHP environment that only accepts regex. I'm writing it in a plugin for forum software, which will only accept regex written in a PHP environment. It's a real pain, but I know it should be possible and I think my regex is really close to a solution; I just can't find the missing link. – Qub1 Sep 01 '12 at 01:06
  • Very true, though this is the kind of thing that could very easily be adapted from an HTML parser or from a simple grammar. – zellio Sep 01 '12 at 01:09
  • @zellio Yes but the entire problem here is that I can't use a parser as the environment accepts nothing but regex to match... I also use the matched regex string to extract the parts I need into the new html string... So while the correct parts are found, it is also converted to html at the same time. – Qub1 Sep 01 '12 at 01:12
  • If this is a limited set of tags you have, you could try `(?=\[table\]).*?\[td\](.*?)\[\/td\].*?(?<=\[\/table\])`. – Kash Sep 01 '12 at 02:25
  • Yes these are the only tags that will be available, but for some reason the regex you supplied does not find any matches. I am totally lost, I thought this would be simpler. – Qub1 Sep 01 '12 at 09:40
  • Why can't you use a parser? What is a PHP environment that only accepts regex? Just add a function or two that handles parsing. – zellio Sep 07 '12 at 19:40
  • Possible duplicate of [Regex matching table rows in HTML](https://stackoverflow.com/questions/7289181/regex-matching-table-rows-in-html) – Brian Tompsett - 汤莱恩 Oct 29 '17 at 08:50

1 Answer

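HTML is a context-free language, whereas regular expressions describe regular languages. If you look at the Chomsky hierarchy of formal languages, you'll see that regular languages are a strict subset of context-free ones, so what you're trying to do isn't possible in any reliable way: tags that can nest require tracking depth, which a regular expression cannot do.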
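
To make the distinction concrete, here is a minimal sketch in Python (chosen only for illustration) of the kind of small scanner a parser-based approach would use. A regex is used only to find the tags; a counter tracks how deep the [table] nesting is, which is the part a regular expression cannot express for arbitrary depth. The name `cells_inside_tables` is hypothetical, and the sketch assumes tags without attributes and cells that do not themselves contain table tags.

```python
import re

# Tokenizer for the tags we care about (assumption: tags have no attributes).
TOKEN = re.compile(r"\[(/?)(table|td)\]", re.IGNORECASE)

def cells_inside_tables(text):
    """Collect [td] contents that occur while at least one [table] is open."""
    cells = []
    depth = 0          # current [table] nesting depth
    cell_start = None  # index where the currently open cell's content begins
    for m in TOKEN.finditer(text):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name == "table":
            # Never let a stray [/table] push the depth below zero.
            depth = max(depth - 1, 0) if closing else depth + 1
            cell_start = None  # assumption: a table tag inside a cell voids it
        elif not closing:
            # Opening [td]: only remember it when we are inside a table.
            cell_start = m.end() if depth > 0 else None
        elif cell_start is not None:
            # Closing [/td] for a cell that was opened inside a table.
            cells.append(text[cell_start:m.start()])
            cell_start = None
    return cells

nested = ("[td]x[/td][table][td]a[/td][table][td]b[/td][/table]"
          "[td]c[/td][/table][td]y[/td]")
print(cells_inside_tables(nested))
```

With this input the cells outside any table ("x" and "y") are skipped, while cells at either nesting level are kept.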

Richard