Large Text File 1 > GB Frequency of KeyValuePair using File.ReadLine

Question

I'm new to C# and object-oriented programming in general. I have an application which parses a very large text file.

I have two dictionaries:

Dictionary<string, string> parsingDict // key: original value, value: replacement
Dictionary<int, string> Frequency      // key: count, value: counted string

I am finding the frequency of each key. I am able to get the desired output which is:

System1 has been replaced with MachineA 5 time(s)

System2 has been replaced with MachineB 7 time(s)

System3 has been replaced with MachineC 10 time(s)

System4 has been replaced with MachineD 19 time(s)

Following is my code:

String[] arrayofLine = File.ReadAllLines(File);
           foreach (var replacement in parsingDict.Keys)
        {
            for (int i = 0; i < arrayofLine.Length; i++)
            {
                if (arrayofLine[i].Contains(replacement))
                {
                    countr++;

                    Frequency.Add(countr, Convert.ToString(replacement));
                }
            }

        }


        Frequency = Frequency.GroupBy(s => s.Value)
                .Select(g => g.First())
                .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);  //Get only the distinct records.

        foreach (var freq in Frequency)
        {
            sbFreq.AppendLine(string.Format("The text {0} was replaced {2} time(s) with {1} \n",
            freq.Value, parsingDict[freq.Value],
            arrayofLine.Where(x => x.Contains(freq.Value)).Count())); 
        }

Using String[] arrayofLine = File.ReadAllLines(File); increases memory utilization.

How can the equivalent of arrayofLine.Where(x => x.Contains(freq.Value)).Count() be achieved using File.ReadLine, reading one line at a time, since that is more memory friendly?

What is the purpose of the second foreach? You are never using the line. – mybirthname, Jul 07 '17 at 12:36
Possible duplicate of [Reading large text files with streams in C#](https://stackoverflow.com/questions/2161895/reading-large-text-files-with-streams-in-c-sharp) – Owen Pauling, Jul 07 '17 at 12:36
You are reading the file too many times, once for each `Frequency` entry. Use a `StreamReader` and reorder your foreach loops. – Cleptus, Jul 07 '17 at 12:38

Berin Loritsch · Answer 1 · 2017-07-07T12:57:10.277

0

You can read lines one at a time rather easily (ref).

The relevant code would look like this:

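Dictionary<string,int> lineCount = new Dictionary<string,int>();
string line;

// Read the file one line at a time so only the current line is held in memory.
using(System.IO.StreamReader file =
   new System.IO.StreamReader("c:\\test.txt"))
{
   while((line = file.ReadLine()) != null)
   {
      // DiscoverFreq is a placeholder for whatever logic maps a line to the value being counted.
      string value = DiscoverFreq(line);
      // Simplified: this throws KeyNotFoundException if value has no entry yet (see NOTE 2).
      lineCount[value] += 1;
   }
}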

NOTE: it is important that you think about other bits of information you are storing as well. Appending lines from a large file into a string is essentially the same as reading the whole file at once, but with more garbage collection.

NOTE 2: I simplified the part where you update the counts. You'll have to check whether the count entry is present, adding it if it is missing or incrementing it if it is there. Alternatively, you can initialize lineCount with all the freq.Values set to 0 before scanning the file.
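The check-then-increment step from NOTE 2 could be sketched as below. This is only an illustration: the `CountDemo` class, the `Increment` helper, and the sample values are hypothetical names introduced here, not part of the original answer. It relies on `Dictionary.TryGetValue` leaving the `out` variable at 0 when the key is missing.

```csharp
using System;
using System.Collections.Generic;

public static class CountDemo
{
    // Hypothetical helper: adds 1 to the count for 'key',
    // starting from 0 when the key has no entry yet.
    public static void Increment(Dictionary<string, int> counts, string key)
    {
        // TryGetValue sets 'current' to default(int), i.e. 0, if the key is absent.
        counts.TryGetValue(key, out int current);
        counts[key] = current + 1;
    }

    public static void Main()
    {
        var lineCount = new Dictionary<string, int>();

        // Sample values standing in for what DiscoverFreq would return per line.
        foreach (var value in new[] { "System1", "System2", "System1" })
        {
            Increment(lineCount, value);
        }

        foreach (var pair in lineCount)
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

Inside the `while` loop, `lineCount[value] += 1;` would then become `Increment(lineCount, value);`, which avoids the exception without needing to know the values in advance.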

If the number of unique words is high enough, then you may need to use a small database like SQLite to store the counts for you. That lets you query the information quickly without thinking about how to store and read a custom file you wrote yourself.

edited Jul 07 '17 at 12:57

answered Jul 07 '17 at 12:38


How can arrayofLine.Where(x => x.Contains(freq.Value)).Count() be achieved using File.ReadLine? – Tango Jul 07 '17 at 12:48
The problem is that you are trying to search all lines at once. Perhaps set up a dictionary of counts and, for each line processed, discover which `freq.Value` you're dealing with and increment its count. After that, you can use that dictionary to get your final counts. – Berin Loritsch Jul 07 '17 at 12:52
Updated my post with the entire code which I am using to find the frequency. – Tango Jul 07 '17 at 12:57
About "Alternatively you can initialize your lineCounts with all the freq.Values set to 0 before scanning the file"... That would only work if the line values are already known/expected. – Cleptus Jul 07 '17 at 13:03

Cleptus · Answer 2 · 2017-07-07T12:56:21.920

0

string line = string.Empty;
// key: the text being counted, value: number of times it was seen
Dictionary<string, int> found = new Dictionary<string, int>();
using(System.IO.StreamReader file = new System.IO.StreamReader(@"c:\Path\To\BigFile.txt"))
{
   while(!file.EndOfStream)
   {
      line = file.ReadLine();
      // Matches found logic
      if (!found.ContainsKey(line)) found.Add(line, 1);
      else found[line] = found[line] + 1;
   }
}
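Applied to the question's `parsingDict`, a single streaming pass might look like the sketch below. It assumes the goal is the same number that `arrayofLine.Where(x => x.Contains(key)).Count()` produces, that is, the number of lines containing each key as a substring. The class name `StreamingFrequency`, the method `CountLinesContaining`, and the sample data written to a temporary file are hypothetical and only serve the illustration. `File.ReadLines` is used because it yields lines lazily instead of loading the whole file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StreamingFrequency
{
    // For each key, counts how many lines contain it as a substring.
    // 'lines' is enumerated exactly once, so it can be a lazy sequence.
    public static Dictionary<string, int> CountLinesContaining(
        IEnumerable<string> lines, IEnumerable<string> keys)
    {
        // Start every key at 0 so keys that never match still get reported.
        var counts = keys.ToDictionary(k => k, k => 0);
        // Snapshot of the keys, so counts can be updated while looping.
        var keyList = counts.Keys.ToList();

        foreach (var line in lines)
        {
            foreach (var key in keyList)
            {
                if (line.Contains(key)) counts[key]++;
            }
        }
        return counts;
    }

    public static void Main()
    {
        // Sample data only; the real parsingDict and file come from the application.
        var parsingDict = new Dictionary<string, string>
        {
            { "System1", "MachineA" },
            { "System2", "MachineB" }
        };
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "System1 up", "System2 down", "System1 again" });

        // File.ReadLines streams the file; only one line is held at a time.
        var counts = CountLinesContaining(File.ReadLines(path), parsingDict.Keys);

        foreach (var pair in parsingDict)
        {
            Console.WriteLine(string.Format(
                "The text {0} was replaced {2} time(s) with {1}",
                pair.Key, pair.Value, counts[pair.Key]));
        }
        File.Delete(path);
    }
}
```

As in the question's code, `Contains` matches substrings, so a key such as `System1` would also be counted on a line that only mentions `System10`. The file is read once regardless of how many keys there are, and memory use depends on the number of keys rather than the file size.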

edited Jul 07 '17 at 12:56

answered Jul 07 '17 at 12:46


How can arrayofLine.Where(x => x.Contains(freq.Value)).Count() be achieved using File.ReadLine? – Tango Jul 07 '17 at 12:48
@Tango Edited to add the frequency part. By the way, it looks like the key and value data types were wrong (switched). I find all the grouping unnecessary; you should keep the code as simple as possible. – Cleptus Jul 07 '17 at 12:57
