
I am writing a program to fetch a large number of .csv files, read them, and find the product with the highest frequency. I want to read the files with a Parallel.ForEach loop and store the products in a ConcurrentDictionary (since it's thread-safe). What I want to know is how to count the number of times a particular product is read and store that frequency as its key, with the name of the product as the value. Any help, please?

Here is my code:


string[] files = File.ReadAllLines(@"C:\Users\Samuel Hendrix\Desktop\StoreData\StoreData\" + selectedStore + @"_" + selectedWeek + "_" + selectedYear + @".csv");

decimal cost;

//Splitting the content of the files into arrays, which are then stored into variable to be added to lists
Parallel.ForEach(files, file =>
{
    string[] orderSplit = file.Split(',');

    string items = orderSplit[0];

    Products.TryAdd(items, items.Count());
});

Theraot
Mighty
  • What about two strings appearing with the same frequency? – Theraot Nov 23 '19 at 23:37
  • Change the value, in case it should ever happen – Mighty Nov 23 '19 at 23:52
  • I'm confused about a couple of things: are you reading a single file and want to process its rows in parallel, or are you reading multiple files and want to process each one in parallel? Also, are you counting frequency across all columns or only in a particular one (say, the first one)? – Theraot Nov 24 '19 at 00:09
  • What I'm doing is accessing the .csv files, reading all lines, splitting each line on ',' into an array, and taking the array[0] member, which is added to the concurrent dictionary, all of that in parallel. My problem is that I want to know how to count the number of times a particular item is read and store that value as the key of the item – Mighty Nov 24 '19 at 05:56

1 Answer


Every time a text is found, you would have to find what key it has in the dictionary (which is not what dictionaries are designed for), and then move it to the next key. Aside from searching by value, that has two extra problems:

  • Moving an item from one key to another is not atomic. If the same text is found twice in parallel at the same time... both will try to move it from the same key to the next. At least one will fail.

  • There can only be one value per key. You do not know the total counts yet, so you could be replacing a value that is more frequent with one that is less frequent. For example, let us say that "A" appears 10 times and "B" appears 20 times. However, both have only been found once so far... the key 1 can only point to "A" or "B", and the other count is lost.

Thus, you need a dictionary that has a string key and an int value. The value is the frequency... Once it is populated, you can move the data to a dictionary that has the frequency as key.
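A minimal sketch of that two-step approach, assuming the product name is the first column of each line. The names ProductFrequency, CountProducts and GroupByFrequency, as well as the sample rows, are made up for illustration and are not from the question. Because two products can end up with the same frequency, the second dictionary here maps each frequency to a list of products rather than to a single one.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ProductFrequency
{
    // Step 1: count how often each product name (first CSV column) occurs.
    public static ConcurrentDictionary<string, int> CountProducts(IEnumerable<string> lines)
    {
        var counts = new ConcurrentDictionary<string, int>();

        Parallel.ForEach(lines, line =>
        {
            string product = line.Split(',')[0];

            // AddOrUpdate stores 1 for a new key, or atomically replaces the
            // stored count with count + 1 for an existing key.
            counts.AddOrUpdate(product, 1, (key, count) => count + 1);
        });

        return counts;
    }

    // Step 2: once the counts are final, invert them:
    // frequency -> all products that occurred that many times.
    public static Dictionary<int, List<string>> GroupByFrequency(IDictionary<string, int> counts)
    {
        return counts
            .GroupBy(pair => pair.Value, pair => pair.Key)
            .ToDictionary(group => group.Key, group => group.ToList());
    }

    public static void Main()
    {
        // Hypothetical sample rows, for illustration only.
        string[] lines =
        {
            "apple,1.20", "pear,0.80", "apple,1.25",
            "plum,0.50", "pear,0.85", "apple,1.10"
        };

        var counts = CountProducts(lines);
        var byFrequency = GroupByFrequency(counts);

        // Look up the most frequent product(s) by the highest frequency key.
        int highest = byFrequency.Keys.Max();
        Console.WriteLine($"{highest}: {string.Join(", ", byFrequency[highest])}");
    }
}
```

The inversion runs on a single thread after Parallel.ForEach has returned, so it needs no concurrent collection. If you only need the single most frequent product, you could probably skip the second dictionary and take the entry with the largest value from the first one.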

Theraot
  • So what you are trying to say is, instead of storing the frequency as the key, I should rather store it as an int value? – Mighty Nov 24 '19 at 12:27
  • @Hendrix I suppose you want to be able to easily query by frequency. However, you are not going to be able to concurrently populate a dictionary with the frequency as key directly from the file. Instead, populate a dictionary with the string as key and the frequency as value, and when it is done, move the data from that dictionary to the one you want. – Theraot Nov 24 '19 at 12:37