I have a folder with many CSV files in it, which are around 3MB each in size.
example of content of one CSV:
afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...
Now I have a Console app written in C#, that searches each CSV file for a certain string. And those strings to search for are in a txt file.
example of search txt file:
-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd
then I call the method to search each search string in all files in given folder:
private static object _lockObject = new object();
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
var result = new List<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file);
if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
{
lock (_lockObject)
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
{
result.Add(searchFound);
}
}
}
});
return result;
}
Question now is, can I anyhow improve performance of this operation? I have around 100GB of files to search trough. It takes aproximatly 1 hour to search all ~30.000 files with around 25 search strings, on a SSD disk and a good i7 CPU.
Would it make a difference to have larger CSV files or smaller CSV? I just want this search to be as fast as possible.
UPDATE
I have tried every suggestion that you wrote, and this is now what best performed for me (Removing ToLower from the LINQ yielded best performance boost. Search time from 1hour is now 16minutes!):
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
var result = new BlockingCollection<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());
if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
{
result.Add(searchFound);
}
}
});
return result;
}