I am trying to find the most efficient way to create a generic tokenizer that retains complex delimiters/separators as extra tokens.

And yes, I looked at some SO questions such as "How can i use string#split to split a string with the delimiters + - * / ( ) and space and retain them as an extra token?", but so far they are too specific. I need a solution that works against a generic string.

In my case, I am looking to tokenize strings such as

"   A brown bear     A red firetruck  A white horse   "

and as a result, I expect the following tokens:

"   ",              //3 spaces
"A brown bear",
"     ",            //5 spaces
"A red firetruck",
"  ",               //2 spaces
"A white horse",
"   "               //3 spaces

So here is the code I came up with. It works as expected for this input, but I am wondering if there is any way to improve on it...

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StringExtension
{
    public static List<string> TokenizeUsingRegex(this string input, string separatorRegexPattern, bool includeSeparatorsAsToken = true)
    {
        // Split on the separator pattern and drop empty / whitespace-only pieces
        var tokens = Regex.Split(input, separatorRegexPattern).Where(t => !string.IsNullOrWhiteSpace(t)).ToList();

        if (!includeSeparatorsAsToken)
            return tokens;

        // Reinstate the removed separators
        var newTokens = new List<string>();
        var startIndex = 0;
        for (int i = 0, l = tokens.Count; i < l; i++)
        {
            var token = tokens[i];
            // Search from startIndex so that a repeated token is found at its own position,
            // not at the position of an earlier identical token
            var endIndex = input.IndexOf(token, startIndex, StringComparison.Ordinal);

            if (startIndex < endIndex)
            {
                // Add back the separator as a new token
                newTokens.Add(input.Substring(startIndex, endIndex - startIndex));
            }
            // Then add the token afterward
            newTokens.Add(token);

            startIndex = endIndex + token.Length;
        }

        // Add the last separator, if any
        if (startIndex < input.Length)
        {
            newTokens.Add(input.Substring(startIndex));
        }

        return newTokens;
    }
}

Live example at: https://dotnetfiddle.net/l3mesr

Community
Jimmy Chandra
  • Read the characters one by one and create a new item in your final string token array when the token character first appears and then when the next non token character appears. Searching on one token at a time in my example. – Sql Surfer Jun 05 '15 at 01:56
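
One possible reading of the comment above, sketched as a single pass over the characters. This is only an illustration under assumptions: the class name `ScanTokenizer`, the method `Tokenize`, and the `minRun` parameter are hypothetical, and the separator is hard-coded as "a run of at least `minRun` whitespace characters" rather than an arbitrary regex pattern.

```csharp
using System;
using System.Collections.Generic;

public static class ScanTokenizer
{
    // Hypothetical helper: splits input into text tokens and separator tokens,
    // where a separator is a run of at least minRun whitespace characters.
    public static List<string> Tokenize(string input, int minRun = 2)
    {
        var tokens = new List<string>();
        int tokenStart = 0; // start of the text token currently being built
        int i = 0;

        while (i < input.Length)
        {
            if (!char.IsWhiteSpace(input[i]))
            {
                i++;
                continue;
            }

            // Measure the whitespace run that starts at i
            int runStart = i;
            while (i < input.Length && char.IsWhiteSpace(input[i]))
                i++;

            if (i - runStart >= minRun)
            {
                // Emit the text before the run (if any), then the run itself
                if (runStart > tokenStart)
                    tokens.Add(input.Substring(tokenStart, runStart - tokenStart));
                tokens.Add(input.Substring(runStart, i - runStart));
                tokenStart = i;
            }
            // A shorter run stays inside the current text token
        }

        // Emit whatever text remains after the last separator
        if (tokenStart < input.Length)
            tokens.Add(input.Substring(tokenStart));

        return tokens;
    }
}
```

Calling `ScanTokenizer.Tokenize("   A brown bear     A red firetruck  A white horse   ")` should produce the seven tokens listed in the question. A single leading or trailing space is kept attached to the neighbouring text token.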

1 Answer

What about this?

using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var str = "    Invisible Pty. Ltd.     1 Nowhere St.  Sydney  2000  AUSTRALIA   ";
        //str = " A teddy bear   A red firetruck ";

        //tokenize the input delimited by 2 or more whitespaces
        var tokens = Regex.Matches(str, @"\s{2,}|(\s?[^\s]+(\s[^\s]+)*(\s$)?)").Cast<Match>().ToArray(); 

        foreach(var token in tokens)
        {
            Console.WriteLine("'{0}' - {1}", token, token.Length);
        }
    }
}

I profiled both versions with Visual Studio's Performance and Diagnostics tools: this one took 40 ms versus 80 ms for the existing one. dotnetfiddle.net reported it as slower(?). I would probably trust VS more, but I wanted to mention it.

Basically, it works by matching either a run of two or more whitespace characters OR any other text with no more than one whitespace character between words.
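
To make the two alternatives concrete, here is a minimal sketch that wraps the same pattern in a helper returning plain strings. The class name `RegexTokenizerDemo` and the method `Tokenize` are hypothetical names chosen for illustration. The pattern itself is the one from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexTokenizerDemo
{
    // Alternative 1: \s{2,}  matches a separator (2+ whitespace characters).
    // Alternative 2: words joined by single whitespace characters, optionally
    // preceded by one whitespace character and, at the very end of the input,
    // followed by one.
    private const string Pattern = @"\s{2,}|(\s?[^\s]+(\s[^\s]+)*(\s$)?)";

    public static List<string> Tokenize(string input)
    {
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value) // keep only the matched text
                    .ToList();
    }
}
```

For the sample string from the question, `RegexTokenizerDemo.Tokenize("   A brown bear     A red firetruck  A white horse   ")` should return the seven expected tokens, separators included. Because `\s{2,}` is tried first at every position, the second alternative only ever starts on a word or on a lone leading space.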

Jimmy Chandra
SunsetQuest
  • Almost there, but if I have a single space before Invisible and a single space after AUSTRALIA, ideally the single space should be added to the trailing or preceding word like ` Invisible...` and `AUSTRALIA ` but this is nice and very compact :) – Jimmy Chandra Jun 05 '15 at 02:59
  • I think this is it: `@"\s{2,}|(\s?[^\s]+(\s[^\s]+)*(\s$)?)"`...? – Jimmy Chandra Jun 05 '15 at 03:10
  • That would work. Good use of the ^ and $ to provide the special first and last case. – SunsetQuest Jun 05 '15 at 03:15