2

I am very new to C#.

I want a program that, given input like this:

input : There are 4 numbers in this string 40, 30, and 10

output :

there = string
are = string
4 = number
numbers = string
in = string
this = string
40 = number
, = symbol
30 = number
, = symbol
and = string
10 = number

I tried this:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "There are 4 numbers in this string 40, 30, and 10.";
        // Split on one or more non-digit characters.
        string[] numbers = Regex.Split(input, @"(\D+)(\s+)");
        foreach (string value in numbers)
        {
            Console.WriteLine(value);
        }
    }
}

But the output is different from what I want. Please help me, I am stuck :((

Toto
Isyar Harun
  • Yeah, please add the output that you get to your question. On a side-note: you could use an NLP library to help with tokenization and classification of the parts of the sentence. I have only worked with OpenNLP for Java, but this library is a C# port: http://www.codeproject.com/Articles/12109/Statistical-parsing-of-English-sentences – Christoffer Mar 21 '12 at 15:16
  • this one doesn't help? http://stackoverflow.com/questions/521146/c-sharp-split-string-but-keep-split-chars-separators – gbianchi Mar 21 '12 at 15:16
  • In js I'd write `str.match(/\w+|\d+|[^\s]+/g)` instead of split. – kirilloid Mar 21 '12 at 15:27
  • @kirilloid: I can't fathom what that is meant to do. `\d+` will never match as the string will already have been consumed by `\w+`. – Borodin Mar 21 '12 at 15:50
  • Hmm, yes. It works on certain example but looks wrong in general. Then `\w+` needs to be replaced with `[a-zA-Z]+` – kirilloid Mar 21 '12 at 15:59
  • If you are absolutely certain you want to club needles with a hammer (i.e. tokenize text with regex), I suppose you could pull it off with assertions. For example, to find the boundary between a number and punctuation, you could put a lookbehind assertion for a number, and a lookahead for a non-number, non-alphabetic character. This is a tortured approach, though, and unless you can explain why you think you should use regex for this task, I will not develop this idea further. – tripleee Mar 21 '12 at 16:04
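
The assertion-based idea from the last comment can be sketched roughly as follows. This is only an illustration under assumptions, not code from the commenter: the class name `LookaroundTokenizer`, the helper `Tokenize`, and the exact pattern are hypothetical. It splits on whitespace, or at the zero-width boundary where a digit is followed by a character that is neither a digit, a letter, nor whitespace, and then classifies each token by its first character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundTokenizer
{
    // Hypothetical helper: split on whitespace, or at the zero-width boundary
    // between a digit (lookbehind) and a following punctuation character (lookahead).
    public static List<string> Tokenize(string input)
    {
        return Regex.Split(input, @"\s+|(?<=\d)(?=[^\d\sA-Za-z])")
                    .Where(t => t.Length > 0)   // drop empty pieces at the edges, if any
                    .ToList();
    }

    public static void Main()
    {
        string input = "There are 4 numbers in this string 40, 30, and 10";
        foreach (string token in Tokenize(input))
        {
            // Classify by the first character only (an assumption of this sketch).
            string type = char.IsDigit(token[0]) ? "number"
                        : char.IsLetter(token[0]) ? "string"
                        : "symbol";
            Console.WriteLine("{0} = {1}", token, type);
        }
    }
}
```

Note that this sketch only detaches punctuation that follows a number, as the comment describes; punctuation directly after a word (for example `string,`) would stay attached to that word.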

5 Answers

2

The .NET regex engine supports an if conditional and named capture groups; I will demonstrate both.

Here is an example where the pattern looks for symbols first (only a comma here; add more symbols to the set [,] as needed), then numbers, and drops everything else into words.

using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = @"There are 4 numbers in this string 40, 30, and 10";
string pattern = @"
(?([,])            # If a comma is found (add other symbols to the set), it's a symbol
  (?<Symbol>[,])   # Then match the symbol
 |                 # else it's not a symbol
  (?(\d+)             # If a number
    (?<Number>\d+)    # Then match the number
   |                  # else it's not a number
    (?<Word>[^\s]+)   # So it must be a word.
   )
)
";

// IgnorePatternWhitespace lets us lay out and comment the pattern; it does not
// affect how the text is processed.
Regex.Matches(text, pattern, RegexOptions.IgnorePatternWhitespace)
     .OfType<Match>()
     .Select(mt =>
     {
         if (mt.Groups["Symbol"].Success)
             return "Symbol found:     " + mt.Groups["Symbol"].Value;

         if (mt.Groups["Number"].Success)
             return "Number found:  " + mt.Groups["Number"].Value;

         return "Word found:     " + mt.Groups["Word"].Value;
     })
     .ToList() // Only needed so that ForEach can print the results; remove otherwise
     .ForEach(rs => Console.WriteLine(rs));

/* Result
Word found:     There
Word found:     are
Number found:  4
Word found:     numbers
Word found:     in
Word found:     this
Word found:     string
Number found:  40
Symbol found:     ,
Number found:  30
Symbol found:     ,
Word found:     and
Number found:  10
*/

Once the regex has tokenized the text into matches, we use LINQ to extract the tokens by checking which named capture group succeeded. In this example we take the successful capture group and project it into a string to print out for viewing.

For more information, see my blog post "Regular Expressions and the If Conditional", where I discuss the regex if conditional.

ΩmegaMan
1

You could split using this pattern: @"(,)\s?|\s"

This splits on a comma, but preserves it since it is within a group. The \s? serves to match an optional space but excludes it from the result. Without it, the split would include the space that occurred after a comma. Next, there's an alternation to split on whitespace in general.

To categorize the values, we can take the first character of the string and check for the type using the static Char methods.

string input = "There are 4 numbers in this string 40, 30, and 10";
var query = Regex.Split(input, @"(,)\s?|\s")
                 .Select(s => new
                 {
                     Value = s,
                     Type = Char.IsLetter(s[0]) ?
                             "String" : Char.IsDigit(s[0]) ?
                             "Number" : "Symbol"
                 });
foreach (var item in query)
{
    Console.WriteLine("{0} : {1}", item.Value, item.Type);
}

To use the Regex.Matches method instead, this pattern can be used: @"\w+|,"

var query = Regex.Matches(input, @"\w+|,").Cast<Match>()
                 .Select(m => new
                 {
                     Value = m.Value,
                     Type = Char.IsLetter(m.Value[0]) ?
                             "String" : Char.IsDigit(m.Value[0]) ?
                             "Number" : "Symbol"
                 });
Ahmad Mageed
0

If you only want the numbers:

var reg = new Regex(@"\d+");
var matches = reg.Matches(input);
var numbers = matches
        .Cast<Match>()
        .Select(m => Int32.Parse(m.Groups[0].Value));

To get your output:

var regSymbols = new Regex(@"(?<number>\d+)|(?<string>\w+)|(?<symbol>(,))");
var sMatches = regSymbols.Matches(input );
var symbols = sMatches
    .Cast<Match>()
    .Select(m=> new
    {                  
       Number = m.Groups["number"].Value,
       String = m.Groups["string"].Value,
       Symbol = m.Groups["symbol"].Value
     })
    .Select(
      m => new 
      {
        Match = !String.IsNullOrEmpty(m.Number) ? 
                    m.Number : !String.IsNullOrEmpty(m.String) 
                            ? m.String : m.Symbol,
        MatchType = !String.IsNullOrEmpty(m.Number) ? 
                    "Number" : !String.IsNullOrEmpty(m.String) 
                            ? "String" : "Symbol"
      }
    );

Edit: If there are more symbols than just a comma, you can group them in a character class, as @Bogdan Emil Mariesan did, and the regex becomes:

@"(?<number>\d+)|(?<string>\w+)|(?<symbol>[,.\?!])"

Edit 2: To get the output lines in the "value = type" form:

var outputLines = symbols.Select(m=>
                            String.Format("{0} = {1}", m.Match, m.MatchType));
Adrian Iftode
0

Well, to match all the numbers you could use:

[\d]+

For the strings:

[a-zA-Z]+

And for some of the symbols, for example:

 [,.?\[\]\\\/;:!\*]+
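
As a rough sketch of how these three character classes might be used together, they can be combined into one alternation with named groups, and each match labelled by the group that succeeded. The class name `PatternClassifier`, the method `Classify`, and the group names are assumptions made for illustration; they are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternClassifier
{
    // Assumed combination of the three character classes into one alternation.
    // The group names "number", "string" and "symbol" are illustrative.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<number>[\d]+)|(?<string>[a-zA-Z]+)|(?<symbol>[,.?\[\]\\\/;:!\*]+)");

    // Returns one "token = type" line per match, in order of appearance.
    public static List<string> Classify(string input)
    {
        var lines = new List<string>();
        foreach (Match m in TokenPattern.Matches(input))
        {
            string type = m.Groups["number"].Success ? "number"
                        : m.Groups["string"].Success ? "string"
                        : "symbol";
            lines.Add(m.Value + " = " + type);
        }
        return lines;
    }

    public static void Main()
    {
        string input = "There are 4 numbers in this string 40, 30, and 10";
        foreach (string line in Classify(input))
        {
            Console.WriteLine(line);
        }
    }
}
```

Characters outside all three classes, such as whitespace, are simply skipped between matches. Because the symbol class ends in `+`, consecutive symbols like `?!` would be reported as a single token.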
Bogdan Emil Mariesan
0

You can very easily do this like so:

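string[] tokens = Regex.Split(input, " ");

foreach (string token in tokens)
{
    if (token.Length > 1)
    {
        // Multi-character token: a number if it parses as an int, otherwise a string.
        int parsed;
        if (Int32.TryParse(token, out parsed))
        {
            Console.WriteLine(token + " = number");
        }
        else
        {
            Console.WriteLine(token + " = string");
        }
    }
    else
    {
        // Single-character token: report it only if it is neither a letter nor a digit.
        if (token.Length == 1 && !Char.IsLetter(token[0]) && !Char.IsDigit(token[0]))
        {
            Console.WriteLine(token + " = symbol");
        }
    }
}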

I do not have an IDE handy to test that this compiles. Essentially, what you are doing is splitting the input on spaces and then performing some comparisons to determine whether each token is a symbol, a string, or a number.

Woot4Moo
  • 1
    This solution will lose the commas. One of the tokens will be `40,`. – kirilloid Mar 21 '12 at 15:23
  • @kirilloid true, ideally the OP posts the output he is getting sometime this year. – Woot4Moo Mar 21 '12 at 15:25
  • I don't think this would work for the first `4`. I think the individual string chunk checks need to be more comprehensive. – Matthew Mar 21 '12 at 15:26
  • Using match `/\w+|\d+|[^\s]+/g` and then checking only 1st character of each string with `isLetter` and `isDigit` would give a solution. – kirilloid Mar 21 '12 at 15:31