2

I have this regex filter: <+>|\P{L}

Numbers and HTML tags are deleted.

My problem is that spaces are also deleted and I don't want spaces to be deleted.

For example, I need to change this text "(0) Ship Out" to this "Ship Out". Now it returns "ShipOut".

How can I fix it?

Wiktor Stribiżew
Nomi

2 Answers

3

You want to keep matching what \P{L} matches (any character that is not a Unicode letter), but without matching spaces.

Use the opposite shorthand class, \p{L}, together with \s inside a negated character class: [^\p{L}\s] matches any character that is neither a letter nor whitespace.

I am not sure <+> is working for you, since it only matches one or more < characters followed by >. You might be looking for <[^<]*>.

So, my suggestion is

Regex.Replace(str, @"<[^<]*>|[^\p{L}\s]", string.Empty).Trim();

See demo

Trim() will get rid of leading and trailing whitespace.
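
To check the suggestion against the sample from the question, here is a minimal sketch. It assumes C# 9 or later (for top-level statements), and the helper name Clean is made up for illustration; only the pattern and the Trim() call come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes tag-like runs and every character
// that is neither a Unicode letter nor whitespace, then trims the ends.
static string Clean(string input)
{
    return Regex.Replace(input, @"<[^<]*>|[^\p{L}\s]", string.Empty).Trim();
}

Console.WriteLine(Clean("(0) Ship Out"));        // prints "Ship Out"
Console.WriteLine(Clean("<b>(0) Ship Out</b>")); // prints "Ship Out"
```

Without Trim(), the first call would return " Ship Out", because the space after "(0)" is preserved by \s.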

Wiktor Stribiżew
  • @Nomi were you using `match` or `replace` earlier? – vks Jul 16 '15 at 06:37
  • Glad to be of help. One remark: if you are *parsing* HTML, you'd better use an HTML parser instead of just removing tags with a regex. This will save you lots of trouble. Have a look at [my answer describing how to get clean text from HTML document](http://stackoverflow.com/questions/31385985/c-sharp-regular-expressions-get-second-number-not-first/31386196#31386196). – Wiktor Stribiżew Jul 16 '15 at 06:39
0
 <+>|\P{L}|\P{Z}

You can use this filter for that.

See demo.

You can also use

\p{L}|(?<=\p{L})\p{Z}(?=\p{L})

if you only want to preserve the spaces between words.
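
This second pattern keeps text rather than removing it, so it only makes sense with matching: collect the matches and join them. A minimal sketch, assuming C# 9 or later and a made-up helper name (KeepLettersAndInnerSpaces). Note that letters inside HTML tags would also be kept, so the tags would have to be stripped first.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps Unicode letters, plus any separator
// that sits directly between two letters, and joins the matches.
static string KeepLettersAndInnerSpaces(string input)
{
    var matches = Regex.Matches(input, @"\p{L}|(?<=\p{L})\p{Z}(?=\p{L})");
    return string.Concat(matches.Cast<Match>().Select(m => m.Value));
}

Console.WriteLine(KeepLettersAndInnerSpaces("(0) Ship Out")); // prints "Ship Out"
```

The space after "(0)" is dropped because it is preceded by ")" rather than a letter, so no Trim() is needed here.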

vks
  • This [will remove everything in the sample input](http://regexstorm.net/tester?p=%3c%2b%3e%7c%5cP%7bL%7d%7c%5cP%7bZ%7d&i=%3cb%3e(0)+Ship+Out%3c%2fb%3e&r=). – Wiktor Stribiżew Jul 16 '15 at 06:33
  • @stribizhev The OP is probably using a match instead of a replace, as `\p{L}` will match `Ship Out` too. – vks Jul 16 '15 at 06:36