Regex: Find first word after semicolon if semicolon doesn't belong to XML entity

Question

I have this string and need to get word2 and word3 but not word1

this &gt;word1 is a special ;word2 with ;word3

So far I have this regex but it simply selects all three words

(;[a-z0-9]+)

I only want to match word2 and word3, because the semicolon before word1 belongs to an XML entity.

Smells like [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — ctwheels, Jan 12 '18 at 14:20
Since you haven't specified a language, I'll assume any language is possible? So variable-width lookbehinds (.NET allows this) can be used: [`(?<!&#?[^\s;]+);(\w+)`](http://regexstorm.net/tester?p=%28%3f%3c!%26%23%3f%5b%5e%5cs%3b%5d%2b%29%3b%28%5cw%2b%29&i=this+%26gt%3bword1+is+a+special+%3bword2+with+%3bword3+a%26%23768%3bword) – ctwheels, Jan 12 '18 at 14:24
If you're using PHP you can use `html_entity_decode()` and then simply use `(?<=;)\w+` – ctwheels, Jan 12 '18 at 14:55
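The decode-first idea from the comment above carries over to other languages. A minimal sketch in Python, assuming `html.unescape` as a stand-in for PHP's `html_entity_decode()` (the variable names are illustrative):

```python
import html
import re

text = "this &gt;word1 is a special ;word2 with ;word3"

# Decode entities first, so "&gt;" becomes ">" and its semicolon disappears.
decoded = html.unescape(text)

# Every semicolon still present is a literal one; take the word right after it.
words = re.findall(r"(?<=;)\w+", decoded)
print(words)  # ['word2', 'word3']
```

One caveat: an entity that itself decodes to a semicolon (for example `&#59;`) would be treated as a literal semicolon after decoding.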

ricardo silva · Accepted Answer · 2018-01-12T16:48:46.113

1

Have you tried this?

(?<!&[^ ]+)(;[a-z0-9]+)

It's kind of "hardcoded", but it will only get words after a semicolon if that semicolon isn't preceded by a string starting with &.

Edit: if this approach doesn't work because your regex engine lacks variable-length lookbehinds, replace it with

(?<!&[^ ]\w{1,20})(;[a-z0-9]+)

It does effectively the same thing, using a bounded quantifier to work around that lookbehind limitation.
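Some engines accept neither form. Python's built-in `re` module, for instance, only allows fixed-width lookbehinds. A rough sketch of emulating the `(?<!&[^ ]+)` check by hand in that case; the function and constant names here are hypothetical, not part of the answer:

```python
import re

# The part of the pattern that needs no lookbehind.
WORD = re.compile(r";[a-z0-9]+")
# Stand-in for the lookbehind body: "&" plus non-spaces, ending exactly
# where the candidate semicolon begins.
ENTITY_TAIL = re.compile(r"&[^ ]+\Z")


def find_plain_semicolon_words(text):
    """Return ';word' matches whose semicolon does not close an '&...' run."""
    hits = []
    for m in WORD.finditer(text):
        # Inspect the text to the left of the semicolon, as the lookbehind would.
        if ENTITY_TAIL.search(text[:m.start()]) is None:
            hits.append(m.group())
    return hits


print(find_plain_semicolon_words("this &gt;word1 is a special ;word2 with ;word3"))
# [';word2', ';word3']
```

Slicing the prefix for every candidate is quadratic in the worst case, which should be fine for short strings but is worth keeping in mind for long inputs.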

edited Jan 12 '18 at 16:48

answered Jan 12 '18 at 14:31

ricardo silva

1

`&DoubleDot;`, `&NonBreakingSpace;`, `&DiacriticalGrave;`, `&DiacriticalAcute;`, `·`, `&Cedilla;`, `&circledR;`, etc. – ctwheels Jan 12 '18 at 14:33
I see, thank you. Well, you could just increase the size of the string you're searching, but I guess that can lead to incorrect answers eventually. I'll edit my answer anyways – ricardo silva Jan 12 '18 at 14:35
Also, this uses a variable-length lookbehind, which isn't widely supported. Only .NET and JGsoft currently support this. Java also supports this but not `*` or `+`, only `{x,y}` in the lookbehind. – ctwheels Jan 12 '18 at 14:39
Since there weren't any restrictions in the question, I didn't take that into account either. You can replace the `+` with a `{1,20}` but that is getting pretty (w)hacky – ricardo silva Jan 12 '18 at 15:09
@ricardosilva: It's a common way to do that in Java and co. You can often see things like `(?<=x.{1,1000})` in place of `(?<=x.+)`. – Casimir et Hippolyte Jan 12 '18 at 15:14
This selects the right words, but includes the semicolon. It is actually the right solution for my problem though, because I need to remove the word from the text including the semicolon; I just left that part out of my question. After reading all the answers and comments, yours and @ctwheels' comment above are the best solutions I got :) – Martin Weber Jan 15 '18 at 07:23

Faibbus · Answer 2 · 2018-01-15T12:21:44.543

0

I'd say :

And you just have to check if group 1 exists.

Or, depending on the language you are using regexes in, you might as well split on any entity (&[^\s;]+;), and then find words in each chunk.
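The split-then-search variant could look roughly like this in Python. It is a sketch only: the entity pattern is the one quoted above, while the `;(\w+)` word pattern and the variable names are assumptions made for illustration.

```python
import re

text = "this &gt;word1 is a special ;word2 with ;word3"

# Split on entities. The group is left out on purpose: with a capturing
# group, re.split would return the entities as list items too.
chunks = re.split(r"&[^\s;]+;", text)

# No chunk contains an entity any more, so each remaining ';' is a literal one.
words = [w for chunk in chunks for w in re.findall(r";(\w+)", chunk)]
print(words)  # ['word2', 'word3']
```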

If you only want to remove the words together with their semicolon, you can use ([^ ]+?;)|;\w+ and replace each match with the first group.
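A small Python sketch of that replacement, using the pattern exactly as written above. The helper name is hypothetical, and the callback simply substitutes an empty string when group 1 did not take part in the match:

```python
import re

# First alternative keeps "something;" runs such as entities (group 1);
# second alternative matches the ";word" pieces to drop.
PATTERN = re.compile(r"([^ ]+?;)|;\w+")


def strip_semicolon_words(text):
    """Remove ';word' pieces while leaving entity-like 'xxx;' runs untouched."""
    # group(1) is None when the second alternative matched, so fall back to "".
    return PATTERN.sub(lambda m: m.group(1) or "", text)


cleaned = strip_semicolon_words("this &gt;word1 is a special ;word2 with ;word3")
print(repr(cleaned))
```

The spaces around the removed words stay in place, so the result may need whitespace normalisation afterwards.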

edited Jan 15 '18 at 12:21

answered Jan 12 '18 at 14:34

Faibbus

I tried this; it selects word2 and word3, but the match includes the > part – Martin Weber Jan 15 '18 at 07:27
In the full match, yes, but not in group #1. – Faibbus Jan 15 '18 at 12:19
