-2

I have this string and need to match word2 and word3, but not word1:

this &gt;word1 is a special ;word2 with ;word3

So far I have this regex, but it simply selects all three words:

(;[a-z0-9]+)

I want to match only word2 and word3, because the semicolon before word1 belongs to an XML entity.

Mad Physicist
Martin Weber
  • 4
    Worst tag combination ever. – Mad Physicist Jan 12 '18 at 14:19
  • 1
    is the whole thing within an XML element? – diginoise Jan 12 '18 at 14:20
  • 3
    Smells like [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ctwheels Jan 12 '18 at 14:20
  • 2
    Since you haven't specified a language, I'll assume any language is possible? So variable width lookbehinds (.net allows this) can be used: [`(?<!&#?[^\s;]+);(\w+)`](http://regexstorm.net/tester?p=%28%3f%3c!%26%23%3f%5b%5e%5cs%3b%5d%2b%29%3b%28%5cw%2b%29&i=this+%26gt%3bword1+is+a+special+%3bword2+with+%3bword3+a%26%23768%3bword) – ctwheels Jan 12 '18 at 14:24
  • If you're using [tag:php] you can use `html_entity_decode()` and then simply use `(?<=;)\w+` – ctwheels Jan 12 '18 at 14:55

2 Answers

1

Have you tried this?

(?<!&[^ ]+)(;[a-z0-9]+)

It's kind of "hardcoded", but it will only get words after a semicolon if that semicolon isn't preceded by a string starting with &.

Edit: if this approach doesn't work because your regex engine doesn't support variable-length lookbehinds, replace it with:

(?<!&[^ ]\w{1,20})(;[a-z0-9]+)

It does effectively the same thing, using a bounded quantifier as a workaround for the lookbehind restriction.

ricardo silva
  • 1
    `&DoubleDot;`, `&NonBreakingSpace;`, `&DiacriticalGrave;`, `&DiacriticalAcute;`, `&CenterDot;`, `&Cedilla;`, `&circledR;`, etc. – ctwheels Jan 12 '18 at 14:33
  • I see, thank you. Well, you could just increase the size of the string you're searching, but I guess that can lead to incorrect answers eventually. I'll edit my answer anyways – ricardo silva Jan 12 '18 at 14:35
  • 1
    Also, this uses a variable length lookbehind which hasn't much support. Only [tag:.net] and [tag:jgsoft] currently support this. Java also supports this but not `*` or `+`, only `{x, y}` in the lookbehind. – ctwheels Jan 12 '18 at 14:39
  • 1
    Since there wasn't any restrictions on the part of the question I also didn't take that into account. You can replace the `+` with a `{1,20}` but that is getting pretty (w)hacky – ricardo silva Jan 12 '18 at 15:09
  • @ricardosilva: It's a common way to do that in Java and co. You can often see things like `(?<=x.{1,1000})` in place of `(?<=x.+)`. – Casimir et Hippolyte Jan 12 '18 at 15:14
  • This selects the right words, but includes the semicolon. It is the actual right solution for my problem though, because I need to remove the word from the text including the semicolon, I just missed that part in my question. But after reading all answers and comments, yours and @ctwheels comment above are the best solutions I got :) – Martin Weber Jan 15 '18 at 07:23
0

I'd say:

(?:&[^ ]+?;)|;(\w+)

Then you just have to check whether group 1 matched.
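As a rough illustration of the group-1 check, here is a minimal Python sketch. Python is only my assumption, since the question doesn't name a language, and I'm also assuming the sample string contains the entity `&gt;` in front of word1, as the tester link in the comments suggests.

```python
import re

# Pattern from this answer: the first alternative swallows whole entities
# without capturing, so only semicolons outside entities fill group 1.
PATTERN = re.compile(r"(?:&[^ ]+?;)|;(\w+)")

# Assumed sample input (entity written out as &gt;).
text = "this &gt;word1 is a special ;word2 with ;word3"

# Keep only the matches in which group 1 actually participated.
words = [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]
print(words)  # ['word2', 'word3']
```

The entity match is simply thrown away, which is why this works without any lookbehind.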

Or, depending on the language you are using, you could split on any entity (&[^\s;]+;) and then find the words in each chunk.
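The split variant could look roughly like this in Python (again just a sketch; the helper name `words_outside_entities` is made up for illustration, and the input string assumes the entity is written as `&gt;`):

```python
import re

ENTITY = re.compile(r"&[^\s;]+;")  # entity pattern from this answer
WORD = re.compile(r";(\w+)")       # a semicolon followed by a word

def words_outside_entities(text):
    """Split the text on entities, then collect ;words from each chunk."""
    found = []
    for chunk in ENTITY.split(text):
        found.extend(WORD.findall(chunk))
    return found

print(words_outside_entities("this &gt;word1 is a special ;word2 with ;word3"))
# ['word2', 'word3']
```

Since the entities are removed before the word search runs, their semicolons can never be picked up.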

If you only want to remove the words together with their semicolon, you can use ([^ ]+?;)|;\w+ and replace each match with the first group.
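For the replacement, a hedged Python sketch might be the following. It uses the pattern exactly as written above, and a replacement function instead of a `\1` backreference so that an unmatched group 1 becomes an empty string explicitly. The input string is assumed to contain `&gt;`.

```python
import re

# Group 1 captures anything ending in a semicolon (such as an entity) and is
# put back unchanged; a bare ;word has no group 1 and is therefore dropped.
PATTERN = re.compile(r"([^ ]+?;)|;\w+")

text = "this &gt;word1 is a special ;word2 with ;word3"

cleaned = PATTERN.sub(lambda m: m.group(1) or "", text)
print(repr(cleaned))  # 'this &gt;word1 is a special  with '
```

Note that the spaces around the removed words stay in place, so you may want to normalize whitespace afterwards.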

Faibbus