6

Problem (Check edit for clarifications)

I have a list of about 1500 strings and for each of these strings I have to check if any of the files in a directory (and subdirectories) contains that string (there are about 4000 files).

Code

What I have now are these two variants:

Original

foreach(var str in stringList)
{
    allFiles.Any(f => File.ReadAllText(f).Contains(str));
}

Second variant (using ReadLines instead of ReadAllText, as suggested by VladL in this question)

foreach(var str in stringList)
{
    allFiles.SelectMany(File.ReadLines).Any(line => line.Contains(str));
}

I only tested the complete program execution of the original variant, and it took 21 minutes to finish. I then tested a single statement (checking whether 1 string is contained in any file), searching for a string that I knew wasn't contained, to check the worst-case scenario. These are my timings (each executed 3 times):

Original: 1285, 1369, 1336 ms

Second variant: 2718, 2804, 2831 ms

I also tried replacing ReadAllText with ReadAllLines in the original statement (without changing anything else), but saw no performance change.

Question

Is there any faster way to check whether a string is contained in any file (a large number of large files)?

Edit

I admit I didn't express myself as clearly as I wanted by saying I have a list of strings. What I actually have is a list of csv files; I iterate through those and then through each line of these files (ignoring the first line). From each line I compose a string out of some of its fields, and then check whether any file contains that string.

foreach(var csvFile in csvFiles)
{
    var lines = File.ReadAllLines(csvFile);
    foreach(var line in lines)
    {
        if (IsHeader(line)) continue;
        var str = ComposeString(line);
        var found = allFiles.Any(f => File.ReadAllText(f).Contains(str));
        // do stuff with the line and found
    }
}

Edit 2

public void ExecuteAhoCorasick()
{
    var table = CreateDataTable();
    var allFiles = GetAllFiles();
    var csvFiles = GetCsvFiles();
    var resList = new List<string>();

    foreach (var file in csvFiles)
    {
        if (file.Contains("ValueList_")) continue;
        var lines = File.ReadAllLines(file);
        foreach (var line in lines)
        {
            if (line == HeaderLine) continue;
            var res = line.Split(';');
            if (res.Length <= 7) continue;
            var resPath = $"{res[0]}.{res[1]}.{res[2]}".Trim('.');
            resList.Add(resPath);

            var row = table.NewRow();
            row[0] = res[0]; // Group
            row[1] = res[1]; // Type
            row[2] = res[2]; // Key
            row[3] = res[3]; // Global
            row[4] = res[4]; // De
            row[5] = res[5]; // Fr
            row[6] = res[6]; // It
            row[7] = res[7]; // En
            row[8] = resPath; // Resource Path
            row[9] = false;
            row[10] = ""; // Comment
            row[11] = file; // File Path

            table.Rows.Add(row);
        }
    }

    var foundRes = new List<string>();

    foreach (var file in allFiles)
    {
        // var chars = File.ReadLines(file).SelectMany(line => line);
        var text = File.ReadAllText(file);

        var trie = new Trie();
        trie.Add(resList);

        foundRes.AddRange(trie.Find(text));
        // foundRes.AddRange(trie.Find(chars));
    }

    // update row[9] to true foreach res in foundRes
}
Century
  • I would at first turn the loops around: for each **file** check if any **string** is contained. Currently you read each file 1500 times; even if the file system may cache something, this is clearly slower than reading each file only once and checking all strings. – René Vogt Sep 21 '17 at 08:34
  • check edit for clarifications – Century Sep 21 '17 at 08:48
  • If you are comparing a large block of text with a large number of strings, then it might be much faster to use [the Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to do so. An implementation can be found [here](https://github.com/pdonald/aho-corasick). (Note: You should also check each file only once as per the other answers.) If you decide to try this, remember to benchmark it in a release build rather than a debug build, otherwise you will be comparing apples with oranges. – Matthew Watson Sep 21 '17 at 08:57
  • (With reference to your comments to my answer) OK, so to fix the problem you just need to add `trie.Build()` after the `trie.Add()`. Also note that you can move the initialisation of the trie before the loop, so you only need the `trie.Find()` itself in the loop. That should also reduce overhead. (In other words, move the `var trie = new Trie(); trie.Add(resList); trie.Build();` to before the loop.) – Matthew Watson Sep 21 '17 at 12:10
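Applying that last comment to the Edit 2 code above gives roughly the following shape (a sketch, assuming the Add/Build/Find API of the linked Trie implementation, with the same resList, allFiles and foundRes as in Edit 2):

var trie = new Trie();
trie.Add(resList);
trie.Build(); // required after Add() and before the first Find()

foreach (var file in allFiles)
{
    var text = File.ReadAllText(file);
    foundRes.AddRange(trie.Find(text)); // the trie is built once and reused for every file
}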

5 Answers

6

I think the fastest way to do this will be to:

  1. Read each file completely into memory. This simplifies the code.
  2. Use the Aho-Corasick algorithm to search for the keywords in the text for each file.

An implementation of Aho-Corasick is available here.

I've written a simple program using that implementation from GitHub that tests the worst-case performance (that is, when none of the keywords are present in the text) to compare Aho-Corasick with Contains():

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using ConsoleApp1;

namespace Demo
{
    class Program
    {
        static void Main()
        {
            string[] needles = createNeedles(1500).ToArray();
            string haystack = createHaystack(100000);

            var sw = Stopwatch.StartNew();
            anyViaContains(needles, haystack);
            Console.WriteLine("anyViaContains() took " + sw.Elapsed);

            sw.Restart();
            anyViaAhoCorasick(needles, haystack);
            Console.WriteLine("anyViaAhoCorasick() took " + sw.Elapsed);
        }

        static bool anyViaContains(string[] needles, string haystack)
        {
            return needles.Any(haystack.Contains);
        }

        static bool anyViaAhoCorasick(string[] needles, string haystack)
        {
            var trie = new Trie();
            trie.Add(needles);
            trie.Build();
            return trie.Find(haystack).Any();
        }

        static IEnumerable<string> createNeedles(int n)
        {
            for (int i = 0; i < n; ++i)
                yield return i + "." + i + "." + i;
        }

        static string createHaystack(int n)
        {
            var sb = new StringBuilder();

            for (int i = 0; i < n; ++i)
                sb.AppendLine("Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text");

            return sb.ToString();
        }
    }
}

The results I get for a 64-bit RELEASE build (run outside the debugger) are as follows:

anyViaContains() took 00:00:09.8216836

anyViaAhoCorasick() took 00:00:00.4002765

For this test case, it appears that Aho-Corasick is around 25 times faster than using Contains(). However, this is a somewhat artificial test case and your actual results may vary; you should benchmark with your actual data to see if it really does help.

Note that you can actually avoid loading the entire file into memory when using the Aho-Corasick implementation, because its Find() method accepts an IEnumerable<char>.

You can turn the contents of a file into an IEnumerable<char> as follows:

var chars = File.ReadLines(filename).SelectMany(line => line);

That actually removes all the newline characters, which is probably OK for your application. If you wanted to keep the newline characters, you'd have to put them back like so:

IEnumerable<char> newline = Enumerable.Repeat('\n', 1);
var chars = File.ReadLines(filename).SelectMany(line => Enumerable.Concat(line, newline));

It would be worth comparing loading each file completely into memory with enumerating the chars in each file (as above) to see if there is any difference. For very large files, it may be important to avoid loading their entire contents into memory.
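One rough way to make that comparison (a sketch, not from the answer; it assumes the trie has already been built via Add() and Build() as above, and that allFiles holds the file paths):

var foundInMemory = new List<string>();
var foundStreaming = new List<string>();

var sw = Stopwatch.StartNew();
foreach (var file in allFiles)
    foundInMemory.AddRange(trie.Find(File.ReadAllText(file))); // whole file loaded into memory
Console.WriteLine("ReadAllText: " + sw.Elapsed);

sw.Restart();
foreach (var file in allFiles)
    foundStreaming.AddRange(trie.Find(File.ReadLines(file).SelectMany(line => line))); // streamed as IEnumerable<char>
Console.WriteLine("ReadLines + SelectMany: " + sw.Elapsed);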

Matthew Watson
  • I tried this solution, but I get a NullReferenceException during the execution of Find(), with both `IEnumerable` and loading into memory – Century Sep 21 '17 at 10:58
  • @Century OK sorry, it was my fault - somehow I deleted a line when posting the code. The line `trie.Build();` is required after all calls to `trie.Add()`. See my updated post. – Matthew Watson Sep 21 '17 at 12:05
  • OK, now it works and it's running; I'll let you know whether there is any difference between loading into memory and not as soon as I have the results – Century Sep 21 '17 at 12:09
  • Removing the overhead (moving `trie.Add()` and `trie.Build()` outside the loop), it takes around 12.3 seconds in memory, and around 15.1 seconds with `IEnumerable`. – Century Sep 21 '17 at 12:23
  • @Century (Just in case) Make sure you're timing a released build; a debug build will be significantly slower. – Matthew Watson Sep 21 '17 at 12:28
  • I used a release build, and took the average of three runs. Now the whole program takes about 16 seconds to complete instead of minutes. Thanks! – Century Sep 21 '17 at 12:30
3

Does the file contain any of the strings?

private static bool ContainsLine(string file, List<string> wordsToFind) {
  return File
    .ReadLines(file)
    .Any(line => wordsToFind.Any(word => line.Contains(word))); 
}

Do we have any file which contains any of the strings?

bool result = allFiles
  .AsParallel() // worth trying: we have a lot of files to be proceed
  .Any(file => ContainsLine(file, stringList));

Edit: .AsParallel() is often worth trying for problems like this (many files to test); however, in case AsParallel() doesn't bring any gain, just comment it out:

bool result = allFiles
  //.AsParallel() // comment out in case of no gain
  .Any(file => ContainsLine(file, stringList));
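
If parallel reads do help on your hardware (e.g. on an SSD), the degree of parallelism can also be capped explicitly; a sketch reusing the same ContainsLine, with a purely illustrative cap of 4:

bool result = allFiles
  .AsParallel()
  .WithDegreeOfParallelism(4) // illustrative; tune for your disk
  .Any(file => ContainsLine(file, stringList));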
Dmitry Bychenko
  • @Century would be interesting to see the benchmark result of this answer – fubo Sep 21 '17 at 08:40
  • I'm not sure if `AsParallel()` has any positive impact here...at least when the files are all on the same hard disk. And the file sizes aren't well predictable. – René Vogt Sep 21 '17 at 08:41
  • @RenéVogt `AsParallel()` could help in case of a SSD where you have low access time and high bandwidth – fubo Sep 21 '17 at 08:42
  • You'd have to add code to make sure you didn't run the AsParallel() version on spinning rust - it'd be likely to make it a lot slower. – Matthew Watson Sep 21 '17 at 08:46
  • I edited my question for some clarification, I didn't choose my words too carefully. I guess I could retrieve all the strings before searching (one time effort), to then invert the search (search in a file for all strings), and see if it's an improvement – Century Sep 21 '17 at 08:50
  • Ok. I tested inverting the search, and these are my findings: the overhead of creating the string list first is about 100ms, execution of your ContainsLine is 9ms. If I multiply the 9ms by the number of files I have I get about 36000ms. – Century Sep 21 '17 at 09:22
2

You are reading all the files for each string.

How about doing it the other way around? Loop through all the files just once:

bool exists = 
    allFiles.SelectMany(File.ReadLines).Any(l => stringList.Any(str => l.Contains(str)));

After the OP's edit:

As you've mentioned in the comment, you should first accumulate all the strings from the CSV files and then continue as suggested:

var stringList =
  csvFiles.SelectMany(f => File.ReadAllLines(f).Where(l => !IsHeader(l)).Select(ComposeString))
          .ToList();

You might also want to add a .Distinct() in case some of these strings are not unique, to make it faster. Whether it helps depends on the size of the list and on how many strings are indeed repeated.

var stringList =
  csvFiles.SelectMany(f => File.ReadAllLines(f).Where(l => !IsHeader(l)).Select(ComposeString))
          .Distinct()
          .ToList();
Ofir Winegarten
  • I expressed myself badly, sorry. Check the edit. What I could do though is retrieve all the strings before searching (one time effort), to then invert the search (search in a file for all strings), and see if it's an improvement – Century Sep 21 '17 at 08:51
1

This strongly depends on your exact use case. If you are trying to match whole words that can be discriminated easily, you could build some sort of hashed index (e.g. a Dictionary<string, WhatEver>) that you can search quickly. Anyway, depending on the size, this can be very RAM intensive.

The following code gives an idea of how to structure this:

class FileReference
{
    // elided 

    string File { get; } // may be set in constructor
    IEnumerable<int> Indices { get; } // will return the contents of _indices

    public void Add(int index)
    {
        _indices.Add(index);
    }
}

class ReferenceIndex
{
    Dictionary<string, FileReference> _fileReferences = new Dictionary<string, FileReference>();

    public void Add(string fileName, int index)
    {
        if(!_fileReferences.ContainsKey(fileName))
        {
            _fileReferences.Add(fileName, new FileReference(fileName));
        }
        _fileReferences[fileName].Add(index);
    }

    // elided
}

FileReference tracks the indices of a string within a single file; ReferenceIndex holds the FileReferences for a single string. Since Dictionary<TKey, TValue> is hash-based, accessing it is blazingly fast. You can use these classes to build up a Dictionary<string, ReferenceIndex> that keeps track of all strings in the files and the file references to these strings:

Dictionary<string, ReferenceIndex> stringIndex = BuildIndex(fileName);
foreach(var s in searchStrings)
{
    if(stringIndex.ContainsKey(s))
    {
        // do something
    }
}
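
BuildIndex itself is not shown in the answer; here is a hypothetical sketch, assuming whole-word matching with simple separator splitting and line numbers as the indices (the separators and the bookkeeping are illustrative, and ReferenceIndex/FileReference are assumed to be completed as above):

static Dictionary<string, ReferenceIndex> BuildIndex(string fileName)
{
    var index = new Dictionary<string, ReferenceIndex>();
    var lines = File.ReadAllLines(fileName);

    for (int lineNumber = 0; lineNumber < lines.Length; lineNumber++)
    {
        foreach (var word in lines[lineNumber].Split(' ', '\t', ';', ','))
        {
            if (word.Length == 0) continue;

            ReferenceIndex refIndex;
            if (!index.TryGetValue(word, out refIndex))
            {
                refIndex = new ReferenceIndex();
                index.Add(word, refIndex);
            }
            refIndex.Add(fileName, lineNumber); // remember where the word occurred
        }
    }

    return index;
}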
Paul Kertscher
1

I recently came across a problem similar to yours. Represent each searchable file as follows:

public class SearchableFile {
    private readonly HashSet<string> _uniqueLines;
    //private readonly HashSet<string> _uniqueString;

    public string FilePath { get; }

    public SearchableFile(string filePath) {
        _uniqueLines = new HashSet<string>(File.ReadAllLines(filePath));
        //↑You can also split each line if you have many repeating words in each line.
        //_uniqueString = new HashSet<string>(File.ReadAllLines(filePath).SelectMany(singleLine => singleLine.Split(' ')));
        FilePath = filePath;
    }

    public bool ContainsCompositeString(string compositeString) {
        return _uniqueLines.Any(singleLine => singleLine.Contains(compositeString));
        //return _uniqueString.Contains(compositeString);
    }
}

Then you can use it like this:

    private static readonly List<SearchableFile> FilesOnHdd = new List<SearchableFile>();

    private static void Main(string[] args) {
        var filePaths = new List<string> { "c://temp.txt" };

        foreach (var filePath in filePaths) {
            FilesOnHdd.Add(new SearchableFile(filePath));
        }
        var csvFiles = new List<string> { "c://temp.csv" };
        foreach (var csvFile in csvFiles) {
            var lines = File.ReadAllLines(csvFile);
            foreach (var line in lines) {
                if (IsHeader(line)) {
                    continue;
                }
                var str = ComposeString(line);

                foreach (var singleFileOnHdd in FilesOnHdd) {
                    var result = singleFileOnHdd.ContainsCompositeString(str);
                    if (result) {
                        // do stuff with the line and bool
                    }
                }
            }
        }
    }
Hasan Emrah Süngü