
I have a folder with many CSV files in it, which are around 3MB each in size.

example of content of one CSV:

afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...

Now I have a console app written in C# that searches each CSV file for certain strings. The strings to search for are in a txt file.

example of search txt file:

-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd

Then I call the method that searches for each search string in all files in the given folder:

private static object _lockObject = new object();
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
    var result = new List<string>();

    var files = Directory.EnumerateFiles(searchFolder);
    Parallel.ForEach(files, (file) =>
    {
        var fileContent = File.ReadLines(file);

        if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
        {
            lock (_lockObject)
            {
                foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
                {
                    result.Add(searchFound);
                }
            }
        }
    });

    return result;
}

The question now is: can I improve the performance of this operation in any way? I have around 100GB of files to search through. It takes approximately 1 hour to search all ~30,000 files for around 25 search strings, on an SSD and a good i7 CPU.

Would it make a difference to have larger or smaller CSV files? I just want this search to be as fast as possible.

UPDATE

I have tried every suggestion you wrote, and this is what performed best for me (removing ToLower from the LINQ yielded the biggest performance boost; search time went from 1 hour to 16 minutes!):

public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
    var result = new BlockingCollection<string>();

    var files = Directory.EnumerateFiles(searchFolder);
    Parallel.ForEach(files, (file) =>
    {
        var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());

        if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
        {
            foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
            {
                result.Add(searchFound);
            }
        }
    });

    return result;
}
Rumplin
  • You’re doing the search twice in the worst case, locking on the second pass, and that will stop any parallelism. You’d be better off doing all of it in parallel, only once, and only locking at the end while combining the results (see the sketch after these comments). That's assuming this isn't a CPU-bound operation already. – Sami Kuhmonen Jan 31 '18 at 13:39
  • What is the rationale for the lock? – Alex K. Jan 31 '18 at 13:40
  • @AlexK. I really don't know if I need the lock or not... but this only happens when I find a search string, which is rarely in the same file as others. – Rumplin Jan 31 '18 at 13:46
  • You could use a Dictionary for your result; it's threadsafe, so you don't need the lock. Also, I'd use a Dictionary for the searchList and test whether the key exists for inputLine[1] after splitting on ';'. – Gnor Jan 31 '18 at 13:47
  • @Gnor dictionary is threadsafe? – Sushil Mate Jan 31 '18 at 14:30
  • @SushilMate Good point, it's not. That's why there's a `ConcurrentDictionary` https://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx – Matías Fidemraizer Jan 31 '18 at 14:34
  • @MatíasFidemraizer: that's what I thought; we have plenty of thread-safe collections, but Dictionary is definitely not one of them. – Sushil Mate Jan 31 '18 at 14:53
  • @Rumplin: you can try this link. it has some good suggestions https://stackoverflow.com/questions/13993530/better-search-for-a-string-in-all-files-using-c-sharp – Sushil Mate Jan 31 '18 at 14:54
  • @SushilMate Yeah, I guessed that you had already met concurrent collections. Clearly your comment was ironic :D – Matías Fidemraizer Jan 31 '18 at 15:09
  • @SushilMate You're right, Dictionary is not threadsafe, I meant ConcurrentDictionary. Sorry... – Gnor Jan 31 '18 at 15:53
  • @Gnor :thumbs up: – Sushil Mate Jan 31 '18 at 16:27
  • @Rumplin, I updated my answer; it will probably be more efficient. Please measure the time on your data if you can. Thanks. – gabba Feb 02 '18 at 18:00
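
A minimal, untested sketch of the single-pass idea from these comments, collecting matches into a thread-safe ConcurrentBag<string> so that no explicit lock is needed (the class name and the case-insensitive comparison are illustrative choices, not from the original post):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class SinglePassSearch
{
    // Each line is inspected exactly once; matches go into a ConcurrentBag,
    // so the parallel loop never blocks on a shared lock.
    public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
    {
        var result = new ConcurrentBag<string>();

        Parallel.ForEach(Directory.EnumerateFiles(searchFolder), file =>
        {
            foreach (var line in File.ReadLines(file))
            {
                // case-insensitive comparison without allocating a lowered copy of the line
                if (searchList.Any(s => line.IndexOf(s, StringComparison.OrdinalIgnoreCase) >= 0))
                    result.Add(line);
            }
        });

        return result;
    }
}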

3 Answers


Probably something like Lucene could be a performance boost: why don't you index your data so you can search it easily?

Take a look at Lucene .NET

You'll avoid searching the data sequentially. In addition, you can model many indexes based on the same data to get to certain results at light speed.
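
A rough sketch of what this could look like with Lucene.NET 4.8 (the CsvIndex class, the field names, and the choice to index only the second CSV field as an exact term are illustrative assumptions, not part of the original answer; a PrefixQuery or similar would be needed for partial matches):

using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class CsvIndex
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Index the second CSV field of every line. This is done once (or incrementally as
    // new files arrive); afterwards each lookup is an index query instead of a 100GB scan.
    public static void Build(string csvFolder, string indexPath)
    {
        using (var dir = FSDirectory.Open(indexPath))
        using (var analyzer = new StandardAnalyzer(Version))
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(Version, analyzer)))
        {
            // System.IO is fully qualified to avoid clashing with Lucene.Net.Store.Directory
            foreach (var file in System.IO.Directory.EnumerateFiles(csvFolder))
            {
                foreach (var line in System.IO.File.ReadLines(file))
                {
                    var value = line.Split(',')[1].TrimEnd(';');              // the searched CSV field
                    var doc = new Document();
                    doc.Add(new StringField("value", value, Field.Store.NO)); // indexed as a single term
                    doc.Add(new StoredField("line", line));                   // full line, returned to the caller
                    writer.AddDocument(doc);
                }
            }
            writer.Commit();
        }
    }

    // Look up one search string as an exact term on the indexed field.
    public static IList<string> Find(string indexPath, string searchValue)
    {
        using (var dir = FSDirectory.Open(indexPath))
        using (var reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            var hits = searcher.Search(new TermQuery(new Term("value", searchValue)), 100);
            return hits.ScoreDocs
                .Select(hit => searcher.Doc(hit.Doc).Get("line"))
                .ToList();
        }
    }
}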

Matías Fidemraizer
  • Indexing 100GB of data will generate an index of size 100GB? – Rumplin Jan 31 '18 at 15:23
  • @Rumplin No, because you don't index the entire data, only the chunks that are useful for doing the search and getting certain results. – Matías Fidemraizer Jan 31 '18 at 15:25
  • Yes, but all the data is unique, so I would have an index with each occurrence exactly once... it makes no sense to index. – Rumplin Jan 31 '18 at 18:04
  • @Rumplin Even in this case an index would work better, because indexing implements algorithms such as [B-trees](https://en.wikipedia.org/wiki/B-tree) to speed things up – Matías Fidemraizer Feb 01 '18 at 09:47
  • @Rumplin I doubt that your accepted answer could beat Lucene performance. – Matías Fidemraizer Feb 01 '18 at 09:48
  • Might be, but in the time I would invest rewriting the code for Lucene I might as well keep searching with my own code. Apart from that, I found very little documentation on how to start with my use case. – Rumplin Feb 01 '18 at 11:06
  • @Rumplin Any docs for the Java version apply to the .NET one, and the .NET version has tutorials everywhere too... From my own experience, you wouldn't need to invest too much time and you would get instant results. – Matías Fidemraizer Feb 01 '18 at 11:13
  • Indexing takes twice the space on HDD as the actual files I want to search. Indexing takes as long as my search does. This might be a good library for certain use cases, but this is not one. – Rumplin Feb 01 '18 at 12:13
  • @Rumplin Twice? You would store two fields: the file location and the text to search (not the entire data). And indexing wouldn't be done from scratch every time: you would need a file watcher so files get indexed as they're stored in certain directories. // You're also missing that Lucene can compress fields for you: https://lucene.apache.org/core/5_2_1/core/org/apache/lucene/codecs/compressing/CompressingStoredFieldsFormat.html – Matías Fidemraizer Feb 01 '18 at 12:24
  • @Rumplin It's up to you. Full-text search on documents is the natural field of Lucene. I understand that either you don't have enough resources (time) to implement Lucene or you don't want to learn new tech right now. Again, it's up to you. I wouldn't search in 100GB of files manually. But this is my case. – Matías Fidemraizer Feb 01 '18 at 12:26
  • Indexing and a ready-to-use solution are usually better than a self-made one, but the efficiency depends on how many searches you do; if you only search once (or twice), a scan will be better. – gabba Feb 02 '18 at 18:03

Try to:

  1. Call .ToLower() once per line instead of once for every element of searchList.

  2. Scan each file only once instead of making two passes (Any and then Where). Collect the matches first and only take the lock if anything was found; in your sample you waste time on the second pass and block all threads while searching and adding.

  3. If you know the position where the value appears (in your sample you do), you can scan from that position instead of the whole string.

  4. Use the producer/consumer pattern, for example with BlockingCollection<T>, so there is no need for a lock.

  5. If you need a strict match on the field, build a HashSet from searchList and call searchHash.Contains(fieldValue); this will speed up the process dramatically (see the short sketch right after this list).
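
For points 1, 3 and 5 combined, a minimal, untested sketch might look like this (the FieldMatcher class is hypothetical; it assumes each line has the key,value; layout from the question and that an exact match on the field is really what is needed):

using System;
using System.Collections.Generic;

public static class FieldMatcher
{
    // The case-insensitive comparer on the HashSet replaces the per-line ToLower (point 1),
    // the comma position tells us where the field starts (point 3), and Contains on a
    // HashSet is an O(1) exact match instead of scanning the whole line per search string (point 5).
    public static bool LineMatches(string line, HashSet<string> searchHash)
    {
        var comma = line.IndexOf(',');
        if (comma < 0)
            return false;

        var field = line.Substring(comma + 1).TrimEnd(';');   // second CSV field, e.g. "-1dskjdl4kjff"
        return searchHash.Contains(field);
    }
}

// usage:
// var searchHash = new HashSet<string>(searchList, StringComparer.OrdinalIgnoreCase);
// if (FieldMatcher.LineMatches(line, searchHash)) results.Add(line);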

And here is a fuller sample of the producer/consumer approach from points 2 and 4 (not tested):

using(var searcher = new FilesSearcher(
    searchFolder: "path", 
    searchList: toLookFor))
{
    searcher.SearchContentListInFiles();
}

here is the searcher:

public class FilesSearcher : IDisposable
{
    private readonly BlockingCollection<string[]> filesInMemory;
    private readonly string searchFolder;
    private readonly string[] searchList;

    public FilesSearcher(string searchFolder, string[] searchList)
    {
        // reader thread stores lines here
        this.filesInMemory = new BlockingCollection<string[]>(
            // limit the number of files kept in memory; if the processing threads fall behind, the reader waits
            boundedCapacity: 100);

        this.searchFolder = searchFolder;
        this.searchList = searchList;
    }

    public IEnumerable<string> SearchContentListInFiles()
    {

        // start reading;
        // we don't need many threads here, one thread per storage device is probably the optimum
        var filesReaderTask = Task.Factory.StartNew(ReadFiles, TaskCreationOptions.LongRunning);

        // at least one processing thread, because the reader thread is IO bound
        var taskCount = Math.Max(1, Environment.ProcessorCount - 1);

        // start search threads
        var tasks = Enumerable
            .Range(0, taskCount)
            .Select(x => Task<string[]>.Factory.StartNew(Search, TaskCreationOptions.LongRunning))
            .ToArray();

        // wait for the search tasks to finish
        Task.WaitAll(tasks);

        // combine results
        return tasks
            .SelectMany(t => t.Result)
            .ToArray();
    }

    private string[] Search()
    {
        // if results are always unique, a List is enough
        var results = new List<string>();
        //var results = new HashSet<string>();

        foreach (var content in this.filesInMemory.GetConsumingEnumerable())
        {
            // one pass per file
            var currentFileMatches = content
                .Where(sourceLine =>
                {
                    // lower-case once per line instead of keeping a lowered copy of the whole file
                    var lower = sourceLine.ToLower();

                    return this.searchList.Any(lower.Contains);
                });

            // store current file matches
            foreach (var currentMatch in currentFileMatches)
            {
                results.Add(currentMatch);
            }                
        }

        return results.ToArray();
    }

    private void ReadFiles()
    {
        var files = Directory.EnumerateFiles(this.searchFolder);

        try
        {
            foreach (var file in files)
            {
                var fileContent = File.ReadLines(file);

                // add the file, or wait if filesInMemory is full
                this.filesInMemory.Add(fileContent.ToArray());
            }
        }
        finally
        {
            this.filesInMemory.CompleteAdding();
        }
    }

    public void Dispose()
    {
        if (filesInMemory != null)
            filesInMemory.Dispose();
    }
}
gabba
  • @Rumplin, you are welcome, but try to pay attention to all the points above, not just BlockingCollection. BlockingCollection alone doesn't help enough. – gabba Jan 31 '18 at 14:58
  • @Rumplin, I would be glad to hear about the results – gabba Jan 31 '18 at 15:04
  • I will post my results once I find the optimum code – Rumplin Jan 31 '18 at 15:10
  • On a 156MB file set, my code executed in 37 seconds and yours in 1 minute and 6 seconds; yours used around 300MB of memory, whereas mine used 16MB. I used a 4-core CPU. – Rumplin Feb 02 '18 at 22:41
  • @Rumplin, thank you. You can adjust memory consumption by changing the files-in-memory capacity. But sad anyway. – gabba Feb 02 '18 at 22:46
  • Thank you for your effort. I refactored the code into a disposable class as you did, which was also an improvement. Your code searches with one thread, whereas I'm using all available threads. – Rumplin Feb 03 '18 at 08:44
  • @Rumplin, my apologies, I didn't run the code and made a mistake. I have corrected it now; it uses Environment.ProcessorCount threads in parallel, so it should work much faster. – gabba Feb 06 '18 at 16:23
  • Generate five files with many lines of `Guid.NewGuid()`, for example, like I wrote in my initial question, and then put the "shorter" GUIDs into the search list. Try my code and then yours and time them with `Stopwatch`. – Rumplin Feb 07 '18 at 15:30

This operation is first and foremost disk bound. Disk-bound operations do not benefit from multithreading; all you will do is swamp the disk controller with a ton of conflicting requests at the same time, which a feature like NCQ then has to straighten out again.

If you had loaded all the files into memory first, your operation would be memory bound. Memory-bound operations do not benefit from multithreading either (usually; it comes down to the details of the CPU and memory architecture).

While a certain amount of multitasking is mandatory in programming, true multithreading only helps with CPU-bound operations. Nothing here looks remotely CPU bound, so multithreading that search (one thread per file) will not make it faster, and will likely make it slower due to all the thread switching and synchronization overhead.

Christopher
  • What I observe is that `foreach (var file in files)` is much slower than `Parallel.ForEach(files, (file) =>`, so I can't agree that there is no performance gain from multithreading. – Rumplin Jan 31 '18 at 13:53
  • That LINQ query might actually be highly CPU bound (or rather CPU inefficient), so that might account for it. For multithreading to work, the P part of the IPO principle must be a lot of work. But you are not actually multithreading the whole operation - just the search within each file, once it is loaded into memory. – Christopher Jan 31 '18 at 13:59
  • In the case of a 3MB file, it is quickly read into memory and then processed by multiple cores, so I suppose parallelism will help. But the code needs to be improved. – gabba Jan 31 '18 at 14:55
  • @gabba: The file size does not really matter. Disk and network are glacially slow compared to everything else we programmers work with. The reason multithreading helped was the very complicated (CPU-heavy) processing. Normally the I step of the IPO model should take the longest, but in this case it is the P step. As he improves the P step, the I step might become the worst again. – Christopher Feb 01 '18 at 10:50
  • @Christopher, you are right, IO operations are very slow. Reading files in multiple threads is a bad idea, but I mean that multithreading will help because the CPU will not wait while data is being read. PS: the size does matter, because throughput, latency and total speed differ a lot depending on the internal buffer sizes of SSDs and HDDs. – gabba Feb 01 '18 at 12:05
  • A simple enumerator can deal with the CPU sitting idle while all files are read. The main thing was that this operation was *surprisingly* CPU bound, due to the LINQ. As the search efficiency is improved with the other posters' ideas, that might change. You can make something so efficient that it stops being CPU bound; it could become literally too fast for multithreading. But those are details that only the actual programmer and user can figure out. – Christopher Feb 01 '18 at 13:07
  • @Christopher, I updated my answer. I believe that splitting IO reads and CPU processing into separate threads is better than a single-threaded variant. Of course it's not perfect code, and I could be completely wrong, so feel free to correct me. – gabba Feb 02 '18 at 17:54