I am working on a project (in .NET 3.5) that reads in 2 files, then compares them and finds the missing objects.
Based on this data, I need to parse it further and find where each object is located. I'll try to explain in more detail:
I have 2 lists. List 1 is a very long list of all files on a server, along with their physical address on that server or on another server. This file is a little over 1 billion lines long and continuously growing (a little ridiculous, I know); its size is currently around 160MB. List 2 is a report that shows missing files on the server. It is minuscule compared to list 1 and is usually under 1MB in size.
I have to intersect list 2 with list 1 and determine where the missing objects are located. The items in the list look like this (unfortunately it is space separated and not a CSV document): filename.extension rev rev# source server:harddriveLocation\|filenameOnServer.extension origin
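To make the line layout concrete, here is a minimal sketch of splitting one such line into its space-separated fields. It assumes file names never contain spaces; `ObjectLine`, `SplitFields` and the sample line in the usage note are hypothetical and not taken from the project code.

```csharp
using System;

static class ObjectLine
{
    // Splits one space-separated line into its fields, skipping the empty
    // entries that repeated spaces would otherwise produce.
    // Assumption: no field (including the file name) contains a space.
    public static string[] SplitFields(string line)
    {
        return line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With a made-up line such as `report.ext1 rev 3 source srv01:D|report.ext1 origin`, `SplitFields(line)[0]` would be the filename.extension field that has to be matched against the report.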
Using a stream, I read both files into separate string lists. I then use a regex to parse the items from list 2 into a third list that contains the filename.extension, rev and rev#. All of this works fantastically; it's the performance that is killing me.
I am hoping there is a much more efficient way to do what I am doing.
    int i = 0; // index of the current entry in slMissingObjectReport
    foreach (String item in slMissingObjectReport)
    {
        // keep only entries with a tracked extension that carry no location part ("|")
        if (item.Contains(".ext1") || item.Contains(".ext2") || item.Contains(".ext3"))
        {
            // the two entries that follow (rev, version) must exist
            if (!item.Contains("|") && i + 2 < slMissingObjectReport.Count)
            {
                slMissingObjects.Add(item + "," + slMissingObjectReport[i + 1] + "," + slMissingObjectReport[i + 2]); // object, rev, version
            }
        }
        i++;
    }
    int j = 1; // debug only
    foreach (String item in slMissingObjects)
    {
        IEnumerable<String> found = Enumerable.Empty<String>();
        Stopwatch matchTime = new Stopwatch(); // used for debugging
        matchTime.Start(); // start the stopwatch
        // scans every line of slAllObjects for the object name (the part of item before the first comma)
        foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
        {
            slFoundInAllObjects.Add(item);
        }
        matchTime.Stop();
        tsStatus.Text = "Missing Object Count: " + slMissingObjects.Count + " | " + "All Objects count: " + slAllObjects.Count + " | Time elapsed: " + (taskTime.ElapsedMilliseconds) * 0.001 + "s | Items left: " + (slMissingObjects.Count - j).ToString();
        j++;
    }
    taskTime.Stop();
    lstStatus.Items.Add(("Time to complete all tasks: " + (taskTime.ElapsedMilliseconds) * 0.001) + "s");
This works, but with the 1300 items currently in my missing objects list, it takes 8 to 12 minutes on average to complete. The part that takes the longest is:
    foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
    {
        slFoundInAllObjects.Add(item);
    }
I just need a pointer in the right direction, and maybe a hand with improving this code. The LINQ query itself doesn't seem to be the killer; it's adding the results to a list that seems to kill the performance.
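One direction that might fit, sketched under assumptions and not measured against the real data: put the missing file names into a `HashSet<string>` (available in .NET 3.5) and walk the big list once, so that each line costs one hash lookup instead of one substring scan per missing object. The sketch assumes the file name is the first space-separated token of each line and must match exactly, which is stricter than the `Contains` check above. It also records the matching line from the big list, on the assumption that this is what carries the location. `MissingObjectLookup`, `FirstToken` and `FindLocations` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class MissingObjectLookup
{
    // Hypothetical helper: returns the text before the first space,
    // assumed here to be the filename.extension field.
    static string FirstToken(string line)
    {
        int space = line.IndexOf(' ');
        return space < 0 ? line : line.Substring(0, space);
    }

    // Returns every line of allObjects whose first token is one of the
    // missing object names. The big list is enumerated exactly once.
    public static List<string> FindLocations(IEnumerable<string> allObjects,
                                             IEnumerable<string> missingObjects)
    {
        // Entries look like "file.ext,rev,rev#"; keep only the name part.
        var wanted = new HashSet<string>(StringComparer.Ordinal);
        foreach (string entry in missingObjects)
        {
            int comma = entry.IndexOf(',');
            wanted.Add(comma < 0 ? entry : entry.Substring(0, comma));
        }

        var found = new List<string>();
        foreach (string line in allObjects)
        {
            // Hash lookup per line instead of a scan per missing object.
            if (wanted.Contains(FirstToken(line)))
            {
                found.Add(line);
            }
        }
        return found;
    }
}
```

Calling `MissingObjectLookup.FindLocations(slAllObjects, slMissingObjects)` would then stand in for the nested loops. If the name can also appear elsewhere in a line, the exact-match assumption would need to be revisited.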