
I have to work through a large file (several MB) and remove comments from it that are marked by a time. An example :

blablabla  12:10:40 I want to remove this
blablabla some more
even more bla

After filtering, I would like it to look like this :

blablabla
blablabla some more
even more bla

The nicest way to do it should be using a Regex:

Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);

Now this works perfectly for my purposes, but it's a bit slow. I assume this is because the first two character classes, [012] and [0123456789], match a lot of the data (it's an ASCII file containing hexadecimal data, such as "0045ab0123"). So the Regex gets a partial match on the first two characters far too often.

When I change the Regex to

Dataout = Regex.Replace(Datain, ":[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);

This gives an enormous speedup, probably because there are not many ':' characters in the file at all. Good! But I still need to check that the two characters before the first ':' are digits, and then delete the rest of the line.

So my question boils down to :

  • How can I make the Regex first search for the least frequent character, ':', and only after finding a match check the two characters before it?

Or maybe there's even a better way?

wvl_kszen
  • Are there always spaces before a date? – Casimir et Hippolyte Apr 28 '14 at 19:52
  • Would lookbehind work here? I'm not sure if lookbehind gets evaluated after a potential match is found, or before it checks for a match. – codebreaker Apr 28 '14 at 19:53
  • No, unfortunately there's not always a space in front, so it could look like: "0A0B1216:43:11 blabla". I agree that if there were a space, searching would be easier. – wvl_kszen Apr 28 '14 at 20:25
  • You really should read this http://stackoverflow.com/questions/513412/how-does-regexoptions-compiled-work to better understand where `RegexOptions.Compiled` helps your speed, and where it hurts it, so you're properly taking advantage of it. – hatchet - done with SOverflow Apr 28 '14 at 21:38
  • You can use `:(?<=[0-2][0-9]:)[0-5][0-9]:[0-5][0-9].*` to perform a global search and afterwards, looping backwards over the MatchCollection, remove the substrings using each match's Index and Length. AdrianHHH has written an answer along these lines. – Casimir et Hippolyte Apr 28 '14 at 23:33
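
The lookbehind idea from the last comment could be sketched in C# roughly as follows. This is only an illustration under assumptions: the class and method names (TimeCommentStripper, Strip) are hypothetical, the whole file is assumed to fit in one string, and whether this beats the plain replace in .NET would have to be measured.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the question or the answers.
public static class TimeCommentStripper
{
    // The pattern starts at the rare ':' character; the lookbehind then
    // verifies that the two characters before that colon are hour digits.
    private static readonly Regex TimeTail = new Regex(
        @":(?<=[0-2][0-9]:)[0-5][0-9]:[0-5][0-9].*", RegexOptions.Compiled);

    public static string Strip(string data)
    {
        MatchCollection matches = TimeTail.Matches(data);
        if (matches.Count == 0)
            return data;

        var sb = new StringBuilder(data);
        // Walk backwards so that earlier match indexes stay valid
        // after each removal.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            // The match begins at the first ':'; the hour digits sit two
            // characters earlier, so remove from there to the end of the line.
            sb.Remove(m.Index - 2, m.Length + 2);
        }
        return sb.ToString();
    }
}
```

Calling TimeCommentStripper.Strip on the full file contents removes each time and the rest of its line, leaving the line breaks in place because `.` does not match a newline.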

3 Answers


You could use both of the regular expressions in the question. First, match with the leading-colon expression to quickly find or exclude candidate lines. If that succeeds, apply the full replace expression.

MatchCollection mc = Regex.Matches(Datain, ":[012345][0123456789]:[012345][0123456789].*");

if ( mc.Count > 0 )
{
    Dataout = Regex.Replace(Datain, "[012][0123456789]:[012345][0123456789]:[012345][0123456789].*", string.Empty, RegexOptions.Compiled);
}
else
{
    Dataout = Datain;
}

A variation might be

Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");
Regex changer = new Regex("[012][0123456789]:[012345][0123456789]:[012345][0123456789].*");

if ( finder.Match(Datain).Success)
{
    Dataout = changer.Replace(Datain, string.Empty);
}
else
{
    Dataout = Datain;
}

Another variation would be to use the finder as above. If the string is found then just check whether the previous two characters are digits.

Regex finder = new Regex(":[012345][0123456789]:[012345][0123456789].*");

Match m = finder.Match(Datain);
if ( m.Success && m.Index > 1)
{
    if ( char.IsDigit(Datain[m.Index-1]) && char.IsDigit(Datain[m.Index-2]) )
    {
        Dataout = m.Index-2 == 0 ? string.Empty : Datain.Substring(0, m.Index-2);
    }
    else
    {
        Dataout = Datain;
    }
}
else
{
    Dataout = Datain;
}

In the second and third ideas the finder and changer should be declared and given values before reading any lines. There is no need to execute the new Regex(...) inside the line reading loop.
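
As a rough sketch of that last point, the two Regex objects can be created once as static fields and reused for every line. The names LineFilter, CleanLine and CleanLines are hypothetical, and the sketch assumes the file is processed one line at a time through a TextReader.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical wrapper illustrating "construct once, use per line".
public static class LineFilter
{
    // Built a single time, before any line is read.
    private static readonly Regex Finder =
        new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Compiled);
    private static readonly Regex Changer =
        new Regex("[0-2][0-9]:[0-5][0-9]:[0-5][0-9].*", RegexOptions.Compiled);

    // Cheap colon-first test; the full replace only runs on candidate lines.
    public static string CleanLine(string line)
    {
        return Finder.IsMatch(line) ? Changer.Replace(line, string.Empty) : line;
    }

    // Streams cleaned lines without constructing any Regex inside the loop.
    public static IEnumerable<string> CleanLines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return CleanLine(line);
    }
}
```

For a file on disk, something like LineFilter.CleanLines(new StreamReader(path)) would yield the filtered lines one by one.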

AdrianHHH
  • About the "Another variation": I have tested the same way with PHP and the pattern `:(?<=[0-2][0-9]:)[0-5][0-9]:[0-5][0-9].*`, and obtain a result between 9X and 10X faster than a simple replace with the pattern `[0-2][0-9]:[0-5][0-9]:[0-5][0-9].*`. However, I don't know the gain in time for a C# version. – Casimir et Hippolyte Apr 28 '14 at 23:19

You could use DateTime.TryParseExact to check whether or not a word is a time and take all words before. Here's a LINQ query to clean all lines from the path, maybe it's more efficient:

string format = "HH:mm:ss";
DateTime time;
var cleanedLines = File.ReadLines(path)
    .Select(l => string.Join(" ", l.Split().TakeWhile(w => w.Length != format.Length
       ||  !DateTime.TryParseExact(w, format, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))));

If performance is very critical you could also create a specialized method that is optimized for this task. Here is one approach that should be much more efficient:

public static string SubstringBeforeTime(string input, string timeFormat = "HH:mm:ss")
{
    if (string.IsNullOrWhiteSpace(input))
        return input;
    DateTime time;

    if (input.Length == timeFormat.Length && DateTime.TryParseExact(input, timeFormat, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))
    {
        return ""; // full text is time
    }
    char[] wordSeparator = {' ', '\t'};
    int lastIndex = 0;
    int spaceIndex = input.IndexOfAny(wordSeparator);
    if(spaceIndex == -1)
        return input;
    char[] chars = input.ToCharArray();
    while(spaceIndex >= 0)
    {
        int nonSpaceIndex = Array.FindIndex<char>(chars, spaceIndex + 1, x => !wordSeparator.Contains(x));
        if(nonSpaceIndex == -1)
            return input;
        string nextWord = input.Substring(lastIndex, spaceIndex - lastIndex);
        if( nextWord.Length == timeFormat.Length 
         && DateTime.TryParseExact(nextWord, timeFormat, CultureInfo.InvariantCulture, DateTimeStyles.None, out time))
        {
            return input.Substring(0, lastIndex);
        }
        lastIndex = nonSpaceIndex;
        spaceIndex = input.IndexOfAny(wordSeparator, nonSpaceIndex + 1);
    }
    return input;
}

Sample data and test:

string[] lines = { "blablabla  12:10:40 I want to remove this", "blablabla some more", "even more bla  ", "14:22:11" };
foreach(string line in lines)
{
    string newLine = SubstringBeforeTime(line, "HH:mm:ss");
    Console.WriteLine(string.IsNullOrEmpty(newLine) ? "<empty>" : newLine);
}

Output:

blablabla  
blablabla some more
even more bla  
<empty>
Tim Schmelter
  • wouldn't using an AND operation be better than OR here? Either way I definitely want to know if it's faster – Jonesopolis Apr 28 '14 at 19:43
  • @Jonesy: it would be incorrect, this query takes all words as long as the word-length is != 8 or (if the word-length is exactly 8) as long as this word is not a time. If one of both is true the word will be taken. – Tim Schmelter Apr 28 '14 at 19:46
  • Oh I see, I was thinking about it kind of reversed – Jonesopolis Apr 28 '14 at 19:48

In the end I went for this:

        bool MeerCCOl = true;
        int startpositie = 0;
        int CCOLfound = 0; // number of times terminal output was found

        // Construct both regexes once, outside the loop
        Regex rgx = new Regex(":[0-5][0-9]:[0-5][0-9]", RegexOptions.Multiline | RegexOptions.Compiled);
        Regex rgx2 = new Regex("[012][0-9]:[0-5][0-9]:[0-5][0-9].*", RegexOptions.Multiline);

        while(MeerCCOl)
        {
            Match GevondenColon = rgx.Match(VlogDataGefilterd,startpositie);

            MeerCCOl = GevondenColon.Success; // CCOL terminal data found, there may be more

            if (!MeerCCOl)
                break;

            // By default continue just past this colon, so that a rejected
            // candidate is not found again on the next iteration
            startpositie = GevondenColon.Index + 1;

            if (GevondenColon.Index >= 2)
            {
                CCOLfound++;
                int GevondenUur = 10 * (VlogDataGefilterd[GevondenColon.Index - 2] - '0') +
                                        VlogDataGefilterd[GevondenColon.Index - 1] - '0';
                if (VlogDataGefilterd[GevondenColon.Index - 2] >= '0' && VlogDataGefilterd[GevondenColon.Index - 2] <= '2' &&
                    VlogDataGefilterd[GevondenColon.Index - 1] >= '0' && VlogDataGefilterd[GevondenColon.Index - 1] <= '9' &&
                    GevondenUur>=0 && GevondenUur<=23)
                {
                    VlogDataGefilterd = rgx2.Replace(VlogDataGefilterd, string.Empty, 1, (GevondenColon.Index - 2));
                    startpositie = GevondenColon.Index - 2; // start the next match from the position where we ...
                }
            }
        }

It first searches for a match to :xx:xx and then checks the two characters before it. If they are recognized as an hour, it removes the time and the rest of the line. A bonus is that by checking the hours separately, I can make sure the hours read 00-23 instead of 00-29. The number of matches is also counted this way.
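
The separate hour check can also be isolated in a small helper, which makes the 00-23 rule easy to test on its own. This is a minimal sketch; HourCheck and IsValidHour are hypothetical names, and colonIndex is assumed to be the index of the first ':' of a candidate time.

```csharp
// Hypothetical helper showing only the hour validation step.
public static class HourCheck
{
    // Returns true when the two characters directly before colonIndex
    // form an hour in the range 00-23.
    public static bool IsValidHour(string data, int colonIndex)
    {
        // There must be two characters available before the colon.
        if (colonIndex < 2 || colonIndex > data.Length)
            return false;

        char tens = data[colonIndex - 2];
        char ones = data[colonIndex - 1];
        if (tens < '0' || tens > '2' || ones < '0' || ones > '9')
            return false;

        // Rejects 24-29, which the character classes alone would accept.
        return 10 * (tens - '0') + (ones - '0') <= 23;
    }
}
```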

The original simple regex took about 550ms. This code (while messier) takes about 12ms for the same data file. That's more than a 40x speedup :-)

Thanks all!

wvl_kszen