1

I have a parsing question. I have a paragraph that contains instances of :  word  . Each instance is a colon, two spaces, a word (which could be anything), then two more spaces.

So when I have those instances I want to convert the string so I have

  1. A new line character after : and the word.
  2. Remove the double space after the word.
  3. Replace all double spaces with new line characters.

I don't know exactly how to go about this. I'm using C#. Bullet point 2 above is the part I'm having a hard time with.

Thanks

MindGame

5 Answers

3

Assuming your original string is exactly in the form you described, this will do:

var newString = myString.Trim().Replace("  ", "\n");

The Trim() call removes leading and trailing whitespace, which takes care of the spaces at the end of the string.

Then the Replace() call replaces each remaining pair of space characters ("  ") with a "\n" newline character.

The result is assigned to the newString variable. This is needed because myString itself will not change: strings in .NET are immutable.

I suggest you read up on the String class and all its methods and properties.
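As a minimal sketch of the behaviour described above, the following wraps the same Trim().Replace() chain in a small program. The class and method names (TrimReplaceSketch, Convert) are made up for illustration, and the input is assumed to be exactly one ":  word  " instance rather than a whole paragraph.

```csharp
using System;

public static class TrimReplaceSketch
{
    // Trims both ends, then turns every remaining pair of spaces into a newline.
    // Assumption: the input is a single ":  word  " instance, not a full paragraph.
    public static string Convert(string input)
    {
        return input.Trim().Replace("  ", "\n");
    }

    public static void Main()
    {
        string myString = ":  word  ";
        string newString = Convert(myString);

        Console.WriteLine(newString == ":\nword");    // True
        Console.WriteLine(myString == ":  word  ");   // True: the original string is untouched
    }
}
```

Applied to a full paragraph, the same call would also convert every other double space, so it only fits the case where that is acceptable.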

Oded
  • You would have to use a regex to isolate these instances of mystring from the paragraph first, if I've understood the question correctly – Rich Jun 08 '11 at 18:23
  • A blanket trim and replace will give a lot of unwanted results... This definitely is not the solution that Hitesh is looking for. – Zhais Jun 08 '11 at 18:24
  • Zhais you are correct. There is a third part, which I thought was not that important to include but now that I think about it does matter. I added that above. I was thinking some type of regular expression. Anybody good at doing those? Also this is a paragraph that has multiple instances of this stuff. – MindGame Jun 08 '11 at 18:28
  • I set up a good start in an answer below. If you give a more detailed example of the before and after strings that you are looking for, I can make it a bit more detailed – Zhais Jun 08 '11 at 18:30
2

Using RegularExpressions will give you exact matches on what you are looking for.

The regex match for a colon, two spaces, a word, then two more spaces is:

Dim reg As New Regex(":  [a-zA-Z]*  ")

[a-zA-Z] will look for any character within the alphabetical range. You can append 0-9 as well if you accept digits within the word. The * afterwards indicates that there can be 0 or more instances of the preceding value.

[a-zA-Z]* will attempt to do a full match of any set of contiguous alpha characters.

Upon further reading, you may use \w in place of [a-zA-Z0-9] if that's what you are looking for. This will match any 'word' character, which also includes the underscore.

source: http://msdn.microsoft.com/en-us/library/ms972966.aspx

You can retrieve all the matches using reg.Matches(inputString).

Review http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx for more information on regular expression replacements and your options from there.

edit: Before I was using \s to search for spaces. This will match any whitespace character including tabs, new lines and other. That is not what we want, so I reverted it back to search for exact space characters.
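Since the question is about C#, here is a rough C# sketch of retrieving the matches with reg.Matches. The names MatchListingSketch and FindWords are hypothetical. It also departs from the pattern above in two ways that are assumptions on my part: it uses + instead of * so that an empty word is not matched, and it wraps the word in a capture group so it can be read back on its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchListingSketch
{
    // Returns every word that sits between ":  " and "  " (literal spaces, not \s).
    public static string[] FindWords(string input)
    {
        // Assumption: '+' instead of '*', plus a capture group around the word.
        Regex reg = new Regex(":  ([a-zA-Z]+)  ");
        MatchCollection matches = reg.Matches(input);

        string[] words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            words[i] = matches[i].Groups[1].Value;
        }
        return words;
    }

    public static void Main()
    {
        string[] words = FindWords("Where:  fizz  and then:  buzz  done");
        Console.WriteLine(string.Join(", ", words));   // fizz, buzz
    }
}
```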

Zhais
2

You can try

var str = ":  first  :  second  ";
var result = Regex.Replace(str, ":\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2}",
                           ":\n${word}\n");
Bala R
  • Oooh that's neat. I like this. Except I had to change out the `\\s` to just spaces to get the proper matches. – Zhais Jun 08 '11 at 18:51
  • Thanks for the post. What does this exactly do? Does it do 1-3 from above. – MindGame Jun 08 '11 at 18:52
  • It will replace any occurrence of ':  word  ' with a newline after the colon, and a newline after the word. No double spaces around the word – Zhais Jun 08 '11 at 18:55
  • I tried this one out, did not work. Zhais you were saying to replace \\s with space in the regular expression. So i just put a space " " instead of "\\s"? Wait i'm guessing that is just \s. I think I just answered my own question. – MindGame Jun 08 '11 at 19:34
  • Oh i replace \\s with \s, I need the extra \ to escape \s. – MindGame Jun 08 '11 at 19:36
  • @Hitesh yes the regex expression has to be `\s` so the extra slash is required; it has to be `\\s` or you can use a verbatim string like this `@":\s{2}(?<word>[a-zA-Z0-9]+)\s{2}"` – Bala R Jun 08 '11 at 19:44
  • How do I make it so that a-zA-Z0-9 includes any characters meaning . (period), *, etc? – MindGame Jun 08 '11 at 21:01
  • That worked thanks. Also I noticed at times after the word there is sometimes only one space. Curious how I can change this :\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2} so it checks for both instances. So change s{2} so it checks for one space or two spaces. – MindGame Jun 08 '11 at 21:08
  • @Hitesh `{1,2}` instead of `{2}` – Bala R Jun 08 '11 at 21:11
  • [a-zA-Z0-9.*] That worked for a .,but it did not pick up -. Ex. Left-sided would not work. – MindGame Jun 08 '11 at 22:05
  • I guess that expression is anything but space. I noticed its not picking up characters like \, -, commas. From the ones I'm seeing. I guess its anything but a space. Ex. of things not working: "left-sided", "fever,", "john\doe", and "Comparison:". – MindGame Jun 08 '11 at 22:20
  • Oh i figured it out. I replace it with [^\\s]. So anything but a space. – MindGame Jun 08 '11 at 22:38
  • Thanks for all your help. This was the final solution I used. Regex.Replace(stry, ":\\s{1}(?<word>[^\\s]+)\\s{1,2}", ":\n${word} ").Replace("  ", "\n"); – MindGame Jun 08 '11 at 23:09
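Pulling the comment thread together, the sketch below is one possible reading of where the discussion ended up, not necessarily the exact code the asker shipped. It assumes two spaces after the colon (as in the question), any run of non-whitespace as the word (\S+, equivalent to [^\s]+), and one or two spaces after it ({1,2}). The names ThreadSummarySketch and Convert are hypothetical, and a verbatim string is used so the backslashes need no doubling.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreadSummarySketch
{
    public static string Convert(string input)
    {
        // Colon, exactly two spaces, a non-whitespace word, then one or two spaces.
        // Assumption: a newline is wanted after both the colon and the word.
        string step1 = Regex.Replace(input, @": {2}(?<word>\S+) {1,2}", ":\n${word}\n");

        // Item 3 from the question: any double space that is left becomes a newline.
        return step1.Replace("  ", "\n");
    }

    public static void Main()
    {
        string input = "Finding:  left-sided  Note:  fever, more text  end";
        string expected = "Finding:\nleft-sided\nNote:\nfever,\nmore text\nend";
        Console.WriteLine(Convert(input) == expected);   // True
    }
}
```

The sample input reuses the "left-sided" and "fever," cases mentioned in the comments, which a plain [a-zA-Z0-9]+ class would not match.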
1

You can use string.TrimEnd - http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx - to trim spaces at the end of the string.

Piotr Perak
1

The following is an example using Regular Expressions. See also this question for more info.

Basically the pattern string tells the regex to look for a colon followed by two spaces. Then whatever non-whitespace text comes next is saved in a capture group named "word". Finally two more spaces are specified to finish the pattern.

The replace uses a lambda which says for every match, replace it with a colon, a new line, the "lone" word, and another newline.

// Requires: using System; using System.Text.RegularExpressions;
string Paragraph = "Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.";
// Colon, two spaces, a run of non-whitespace captured as "word", then two spaces.
string Pattern = @":  (?<word>\S*)  ";
string Result = Regex.Replace(Paragraph, Pattern, m =>
    {
        // Read the capture by its name rather than by index.
        var LoneWord = m.Groups["word"].Value;
        return ":" + Environment.NewLine + LoneWord + Environment.NewLine;
    });

Input

Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.

Output

Jackdaws love my big sphinx of quartz:
fizz
The quick onyx goblin jumps over the lazy dwarf. Where:
buzz
The crazy dogs.

Note, for item 3 on your list, if you also want to replace individual occurrences of two spaces with newlines, you could do this:

Result = Result.Replace("  ", Environment.NewLine);
JYelton
  • Using your code from above. What does "m =>" mean. Is this .net? Maybe it was not translated proper after you posted. I'm using C# VS 2005 – MindGame Jun 08 '11 at 19:42
  • The `m =>` is the [lambda expression](http://msdn.microsoft.com/en-us/library/bb397687.aspx) - basically it is an anonymous method that means for every match, we'll name the variable `m`. Using that variable, call the method enclosed in brackets. – JYelton Jun 08 '11 at 20:38
  • Interesting, did not know you can do that. Thanks. – MindGame Jun 08 '11 at 23:10
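Regarding the VS 2005 question in the comments: lambda expressions arrived with C# 3.0, so a C# 2.0 compiler will not accept `m =>`. A rough equivalent, sketched here with an anonymous delegate (which C# 2.0 does support), passes the same logic to Regex.Replace as a MatchEvaluator. The names DelegateReplaceSketch and Convert are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelegateReplaceSketch
{
    public static string Convert(string paragraph)
    {
        // Same pattern as the answer above: colon, two spaces, the word, two spaces.
        return Regex.Replace(paragraph, @":  (?<word>\S*)  ",
            delegate(Match m)
            {
                // Colon, newline, the captured word, newline.
                return ":" + Environment.NewLine
                     + m.Groups["word"].Value + Environment.NewLine;
            });
    }

    public static void Main()
    {
        Console.WriteLine(Convert("Where:  buzz  The crazy dogs."));
    }
}
```

The anonymous delegate is converted to the MatchEvaluator delegate type that Regex.Replace expects, so the call behaves like the lambda version.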