I love regexp, but I find it confusing that there is no "match all" special character. For example, if I wanted to select an HTML tag and its contents, I would write:

re = r"<tag>([\s\S]*)</tag>"

You see, `[\s\S]` is a workaround for the absence of a match-all special character. Is there a reason why a match-all is missing from the spec? I know about `.`, but it's not that pretty either: `re = "<tag>([.\n]*)</tag>"`

Karveiani
  • If you set the `DOTALL` flag, `.` will match all. – Barmar Jul 14 '20 at 01:59
  • Actually, `[^]` is another workaround – theX Jul 14 '20 at 01:59
  • `[^]` would be neat but it caused an error in my python code. Something like "no closing brackets detected" if I remember correctly – Karveiani Jul 14 '20 at 02:00
  • We can't answer "why" questions like this. They are how they are. – Barmar Jul 14 '20 at 02:00
  • @theX "Is there a reason why a match-all is missing from the spec?" – Karveiani Jul 14 '20 at 02:00
  • @Barmar I guess you could say "it is what it is". I was just assuming that there would be a historical reason. Or that people use regex in a different style (compared to mine), so, that a match-all would be considered an anti-pattern. – Karveiani Jul 14 '20 at 02:02
  • @Karveiani it’s probably because then, you’d have to type in `[^\S\s]` instead of `.` – theX Jul 14 '20 at 02:03
  • @theX `]` appearing right after `[` or `[^` includes `]` into the character class instead of ending it (in PCRE, at least); e.g. `\[[^][]*\]` to match something in brackets. (Also: this breaks regexr.com. Funsies!) – HTNW Jul 14 '20 at 02:04
  • `[.\n]` would be a `.` or new line, for new line or any character it would be `(?:.|\n)` – user3783243 Jul 14 '20 at 02:11
  • Find the team that developed the particular flavor of regex you're using and ask them. We can't speculate on why they did or did not provide a specific feature. The better question is why you're attempting to use a regex to parse HTML or XML when you can use a DOM parser instead. Obligatory link about the [futility of trying to parse X/HTML with a regex](https://stackoverflow.com/a/1732454/62576). – Ken White Jul 14 '20 at 02:12
  • @KenWhite Before posting this question I was wondering whether match-all behaviour is considered bad practice, since it's not implemented in regex. Barmar responded that it would come with performance issues. So I really don't understand why you downvoted this question. – Karveiani Jul 14 '20 at 02:24
  • Also I'm not going to parse XML with regex so don't worry. It was an example – Karveiani Jul 14 '20 at 02:26
  • Who said I downvoted? I posted a comment. You should be careful about making accusations without proof. – Ken White Jul 14 '20 at 02:49
  • @KenWhite Someone downvoted my question and you were the only one coming after me in the comments. – Karveiani Jul 14 '20 at 02:55
  • I made basically the same comment as Barmar did, six comments above mine. Again, you should be careful about making accusations without any evidence. I didn't *come after you*. If you feel like I did, you should develop a less sensitive personality when participating here. Not every comment is an *attack*. – Ken White Jul 14 '20 at 03:02
  • The answer is to use `.` and follow https://stackoverflow.com/a/45981809/3832970 post. Sometimes, `.` matches just any char. – Wiktor Stribiżew Jul 14 '20 at 07:43
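
The claims made in the comments above can be checked with a short Python sketch. It assumes the standard `re` module; the sample string and variable names are made up for illustration and are not from the thread.

```python
import re

# Hypothetical sample input with a newline and a literal dot inside the tag.
text = "<tag>line one\nline two. end</tag>"

# Inside a character class "." is a literal dot, so [.\n] matches only
# a dot or a newline, and the pattern from the question finds nothing.
dot_class_match = re.search(r"<tag>([.\n]*)</tag>", text)

# An alternation of "." and "\n" covers every character.
alternation_match = re.search(r"<tag>((?:.|\n)*)</tag>", text)

# The [\s\S] workaround from the question captures the same text.
workaround_match = re.search(r"<tag>([\s\S]*)</tag>", text)

# Python's re module rejects [^]: the "]" right after "[^" is taken as a
# literal member of the class, so the class is never closed.
try:
    re.compile(r"[^]")
    empty_negated_class_ok = True
except re.error:
    empty_negated_class_ok = False
```

So `[^]` is not portable: it is accepted by some engines, but in Python the pattern fails to compile.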

1 Answer

`.` is the match-all character. By default it doesn't match newlines, but if you set the `DOTALL` flag it will match all characters. In Python you write:

re.search(r"<tag>(.*)</tag>", string, re.DOTALL)

Why isn't this the default? Probably because most regexp applications want to limit matches to within a line (especially for performance reasons). And having two separate characters, one for "match all" and another for "match all except newline", would have been a waste of characters.
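
As a minimal sketch of the difference, here is the same pattern run with and without the flag. The sample string and variable names are invented for illustration; `(?s)` is the inline spelling of `re.DOTALL`.

```python
import re

# Hypothetical input whose tag content spans two lines.
html = "<tag>first line\nsecond line</tag>"
pattern = r"<tag>(.*)</tag>"

# Default mode: "." stops at the newline, so there is no match.
default_match = re.search(pattern, html)

# With DOTALL, "." also matches "\n" and the whole content is captured.
dotall_match = re.search(pattern, html, re.DOTALL)

# The inline flag (?s) at the start of the pattern has the same effect.
inline_match = re.search(r"(?s)<tag>(.*)</tag>", html)

print(default_match)                # None
print(repr(dotall_match.group(1)))  # 'first line\nsecond line'
```

The inline form is handy when the pattern is passed to code that does not expose a flags argument.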

Barmar
  • `.` **is** matching **any chars including newlines** in POSIX based regex flavors. I think [this answer](https://stackoverflow.com/a/45981809/3832970) is enough, and there is no need to re-post the same. – Wiktor Stribiżew Jul 14 '20 at 07:47