I need a regex so that I can avoid chains of basic Replace calls, loops and "if" statements everywhere.

I am looking for an expression that, applied to a full text like the following, gives me the result described below:

\n\n\Lorem\n\n\t\n\r\n\Ipsum \t\t\t Lorem\t\t\tHello Stackoverflow!
Lorem\r\t\nTest lorem ipsum V++ \n\rO+\n V2.0

The result I expect is all the words, without the \n, \r and \t. I need Lorem, Ipsum, Test and Hello Stackoverflow from the first sentence, as well as V++ and O+, but not V2.0.

Is there a better way than removing the bad characters and then extracting the words with a regex?

Lenny32
  • Why don't you need `Lorem` and `Ipsum`? Also, if you're "quite bad at regex", why do you find it easier than an `If`? – Tim Schmelter Jun 12 '15 at 08:42
  • possible duplicate of [How can I remove "\r\n" from a string in c#? Can I use a regEx?](http://stackoverflow.com/questions/1981947/how-can-i-remove-r-n-from-a-string-in-c-can-i-use-a-regex) – d.popov Jun 12 '15 at 08:43
  • Have you googled it first? There are plenty of SO questions about that. – d.popov Jun 12 '15 at 08:45
  • @TimSchmelter: I actually do need the Lorem and Ipsum. I also forgot to say that the text can contain things like 'V2.0', which I do not want either. – Lenny32 Jun 12 '15 at 08:56
  • @d.popov: I have been browsing the web for the past 3 hours so actually I did. – Lenny32 Jun 12 '15 at 08:58
  • This question can't really be answered. There is no logic behind what you want to have and what not. Why don't you want `V2.0`? What about `V3.0`? Do you want that? – Daniel Hilgarth Jun 12 '15 at 09:17
  • @Daniel Hilgarth Actually I don't want it either, because of some special characters: (.,\/?:;'[]{}!@#$%^&*()). This is more or less the list of characters I can't have. – Lenny32 Jun 12 '15 at 09:19
  • But `V`, `2` and `0` are not in that list. So, maybe you want to exclude *words* that contain one of these characters with a word being anything that is delimited by whitespace? – Daniel Hilgarth Jun 12 '15 at 09:21
  • Yeah this is what I actually need. – Lenny32 Jun 12 '15 at 09:23
  • What about the single backslashes in front of the first Lorem and the first Ipsum? Are they an error in your question? – Daniel Hilgarth Jun 12 '15 at 09:35

2 Answers

\s matches a whitespace character in a regular expression.

From http://www.regular-expressions.info/shorthand.html:

\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a line break, or a form feed.

So you could just write a Regex for \s and replace all matches with string.Empty.
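
A minimal sketch of that suggestion in C#, assuming the text contains real control characters and not the literal two-character sequences \n, \r and \t. The class and method names are made up for illustration. Note that replacing with string.Empty glues neighbouring words together, so the sketch also shows a variant that collapses each run of whitespace into a single space instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Removes every whitespace character (space, tab, CR, LF, form feed, ...).
    public static string StripWhitespace(string text)
    {
        return Regex.Replace(text, @"\s", string.Empty);
    }

    // Collapses each run of whitespace into one space so words stay separated.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // A regular (non-verbatim) literal, so \n, \r and \t are real control characters.
        string sample = "Lorem\n\n\t\r\nIpsum \t\t Hello";
        Console.WriteLine(StripWhitespace(sample));    // LoremIpsumHello
        Console.WriteLine(CollapseWhitespace(sample)); // Lorem Ipsum Hello
    }
}
```

If the words have to stay apart, the second variant is probably the one to start from.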

bgs264

I don't see an easy way to achieve what you really want using regex.

I would go with ordinary C# code:

using System;
using System.Linq;

var input = @"\n\n\Lorem\n\n\t\n\r\n\Ipsum \t\t\t Lorem\t\t\tHello Stackoverflow!
Lorem\r\t\nTest lorem ipsum V++ \n\rO+\n V2.0";
// Split on real control characters, on their literal two-character forms, and on stray backslashes.
var separators = new [] {"\r", "\n", "\t", "\\n", "\\t", "\\r", "\\" };
// Any chunk that contains one of these characters is rejected.
var invalidCharacters = @".,\/?:;'[]{}!@#$%^&*()".ToCharArray();
var rawWords = input.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                    .Select(x => x.Trim()).Where(x => !string.IsNullOrEmpty(x));
var words = rawWords.Where(x => !invalidCharacters.Any(y => x.Contains(y)));

Please note that this removes Hello Stackoverflow! because it contains one of the invalid characters: !

This is the content of rawWords:

  • Lorem
  • Ipsum
  • Lorem
  • Hello Stackoverflow!
  • Lorem
  • Test lorem ipsum V++
  • O+
  • V2.0

And this is the content of words:

  • Lorem
  • Ipsum
  • Lorem
  • Lorem
  • Test lorem ipsum V++
  • O+

As your requirements are still unclear - and frankly, I think that your example text contains errors - this is the best I can do. From here, you can take this code and adapt it to what you actually need.
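
For comparison, here is a rough sketch of a single-regex variant. It is a different technique from the Split code above, and it rests on two assumptions: the text contains real control characters (not literal backslash sequences), and a "word" is anything delimited by whitespace, as clarified in the comments. The names WordFilterDemo, CleanToken and ExtractWords are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilterDemo
{
    // One or more characters that are neither whitespace nor on the forbidden list,
    // bounded on both sides by whitespace or by the start/end of the text.
    private static readonly Regex CleanToken = new Regex(
        @"(?<!\S)[^\s.,\\/?:;'\[\]{}!@#$%^&*()]+(?!\S)");

    public static string[] ExtractWords(string text)
    {
        return CleanToken.Matches(text)
                         .Cast<Match>()
                         .Select(m => m.Value)
                         .ToArray();
    }

    public static void Main()
    {
        // Regular literal: \n, \r and \t are real control characters here.
        string sample = "\n\nLorem\n\t\r\nIpsum \t Hello Stackoverflow!\nTest V++ \n\rO+\n V2.0";
        Console.WriteLine(string.Join("|", ExtractWords(sample)));
        // Lorem|Ipsum|Hello|Test|V++|O+
    }
}
```

Unlike the Split version, this works word by word: Hello survives, while Stackoverflow! and V2.0 are dropped because each contains a forbidden character. Whether that matches what is actually wanted depends on the requirements.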

Daniel Hilgarth