C# regex filters

Question

I have this regex filter: <+>|\P{L}

Numbers and HTML tags are deleted.

My problem is that spaces are also deleted and I don't want spaces to be deleted.

For example, I need to change this text "(0) Ship Out" to this "Ship Out". Now it returns "ShipOut".

How can I fix it?

score 3 · Accepted Answer · answered Jul 16 '15 at 06:29


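You are probably looking for a way to keep matching \P{L} (any character that is not a Unicode letter) without also matching whitespace.

Use the opposite shorthand class \p{L} inside a negated character class: [^\p{L}\s] matches any character that is neither a letter nor whitespace.

I'm not sure <+> is working for you: it matches one or more < characters followed by >, not a tag such as <b>. You may be looking for <[^<]*> instead.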

So, my suggestion is

Regex.Replace(str, @"<[^<]*>|[^\p{L}\s]", string.Empty).Trim();


Trim() will get rid of leading and trailing whitespace.
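As a minimal sketch of the suggestion above, the replacement can be wrapped in a small helper. The class and method names (TextCleaner, StripTagsAndNonLetters) are hypothetical and chosen only for illustration; the pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TextCleaner
{
    public static string StripTagsAndNonLetters(string input)
    {
        // <[^<]*>     : a simple tag-like span such as <b> or </b>
        // [^\p{L}\s]  : any character that is neither a Unicode letter nor whitespace
        string cleaned = Regex.Replace(input, @"<[^<]*>|[^\p{L}\s]", string.Empty);

        // Removing "(0)" leaves a leading space behind, so trim both ends.
        return cleaned.Trim();
    }
}
```

With the sample from the question, TextCleaner.StripTagsAndNonLetters("(0) Ship Out") should give "Ship Out", and the same holds when the text is wrapped in <b>...</b>.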


Wiktor Stribiżew

@Nomi were you using `match` or `replace` earlier? – vks Jul 16 '15 at 06:37
Glad to be of help. One remark: if you are *parsing* HTML, you'd better use an HTML parser instead of just removing tags with a regex. This will save you lots of trouble. Have a look at [my answer describing how to get clean text from HTML document](http://stackoverflow.com/questions/31385985/c-sharp-regular-expressions-get-second-number-not-first/31386196#31386196). – Wiktor Stribiżew Jul 16 '15 at 06:39

vks · Answer 2 · 2015-07-16T06:37:23.583

0

 <+>|\P{L}|\P{Z}

You can use this filter for that.

If you want to preserve only the spaces between words, you can also use

\p{L}|(?<=\p{L})\p{Z}(?=\p{L})
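A rough sketch of how this second pattern might be applied, assuming it is meant for matching rather than replacing: collect every match and join the pieces. The names LetterMatcher and KeepLettersAndInnerSpaces are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, assuming the pattern is used with Matches, not Replace.
public static class LetterMatcher
{
    public static string KeepLettersAndInnerSpaces(string input)
    {
        // \p{L}                       : a single Unicode letter
        // (?<=\p{L})\p{Z}(?=\p{L})    : a separator with a letter on both sides
        MatchCollection matches =
            Regex.Matches(input, @"\p{L}|(?<=\p{L})\p{Z}(?=\p{L})");

        // Join the matched characters back into one string.
        return string.Concat(matches.Cast<Match>().Select(m => m.Value));
    }
}
```

For "(0) Ship Out" this should yield "Ship Out". Note that it does not handle tags: letters inside a tag such as <b> are kept as well.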

edited Jul 16 '15 at 06:37

answered Jul 16 '15 at 06:29

vks

This [will remove everything in the sample input](http://regexstorm.net/tester?p=%3c%2b%3e%7c%5cP%7bL%7d%7c%5cP%7bZ%7d&i=%3cb%3e(0)+Ship+Out%3c%2fb%3e&r=). – Wiktor Stribiżew Jul 16 '15 at 06:33
@stribizhev OP is probably using a match instead of a replace, as `\p{L}` will match `Ship Out` too – vks Jul 16 '15 at 06:36
