Summing duplicate values while reading in data

Question

I am reading in 5000 rows of data from a stream as follows from top to bottom and store it in a new CSV file.

ProductCode |Name   | Type  | Price
ABC | Shoe  | Trainers  | 3.99
ABC | Shoe  | Trainers  | 4.99
ABC | Shoe  | Trainers  | 5.99 
ABC | Shoe  | Heels | 3.99
ABC | Shoe  | Heels | 4.99
ABC | Shoe  | Heels | 5.99
...

Instead of having duplicate entries, I want the CSV to have one row but with the Price summed:

ProductCode |Name   | Type  | Price
ABC | Shoe  | Trainers  | 14.97
ABC | Shoe  | Heels | 14.97

I store each row as a Product:

public class Product
    {
        public string ProductCode { get; set; }
        public string Name { get; set; }
        public string Type { get; set; }
        public string Price { get; set; }
    }

After reading the data from the stream I end up with an IEnumerable<Product>.

My code is then:

string fileName = Path.Combine(directory, string.Format("{0}.csv", name));            
var results = Parse(stream).ToList(); //Parse returns IEnumerable<Product>
if (results.Any())
            {
                using (var streamWriter = File.CreateText(fileName))
                {
                    //writes the header line out
                    streamWriter.WriteLine("{0},{1}", header, name);

                    results.ForEach(p => { streamWriter.WriteLine(_parser.ConvertToOutputFormat(p)); });
                    streamWriter.Flush();
                    streamWriter.Close();
                }

                Optional<string> newFileName = Optional.Of(SharpZipWrapper.ZipFile(fileName, RepositoryDirectory));
                //cleanup
                File.Delete(fileName);
                return newFileName;
            }

I don't want to go through the 5000 rows again to remove the duplicates but would like to check if the entry already exists before I add it to the csv file.

What is the most efficient way to do this?

Jon Skeet · Accepted Answer · 2015-05-08T14:10:32.480

2

That sounds like you just need an appropriate LINQ transformations:

results = results
    .GroupBy(p => p.ProductCode)
    .Select(g => new Product {
        ProductCode = g.Key,
        Name = g.First().Name,
        Type = g.First().Type,
        Price = g.Sum(p => p.Price)
    })
    .ToList();

Or if your ProductCode isn't a unique ID for some odd reason:

results = results
    .GroupBy(p => new { p.ProductCode, p.Name, p.Type })
    .Select(g => new Product {
        ProductCode = g.Key.ProductCode,
        Name = g.Key.Name,
        Type = g.Key.Type,
        Price = g.Sum(p => p.Price)
    })
    .ToList();

This is assuming you've already changed your Product type to have a decimal type for the Price property, however. Prices aren't text, so shouldn't be stored as strings.

edited May 08 '15 at 14:10

answered May 08 '15 at 13:58

Jon Skeet

1,421,763
867
9,128
9,194

Wouldn't this sum up all Prices for all products with the same ProductCode even if they have different Types? – user3710760 May 08 '15 at 14:02
@user3710760 you can change the key to an anonymous type, eg: `new {p.ProductCode, p.Name, p.Type}` and use `g.Key.ProductCode`, `g.Key.Name`, `g.Key.Type` in the `Select` statement – Panagiotis Kanavos May 08 '15 at 14:08
1

@user3710760: I was expecting the ProductCode to be the unique key. Why would there be two different products with the same ProductCode? But yes, you could include the name and the type within the grouping if you needed to. Will edit... – Jon Skeet May 08 '15 at 14:09
They would have the same product code because it is the same product. e.g. you want to know how much you earned from this particular product. This worked. I have encountered another problem though, I actually have multiple types of csv outputs for the same Product class. For some I need to output ProductCode and Name, for others ProductCode and Type and for a third one all three. I deal with these different csv outputs by having 3 implementations of `ConvertToOutputFormat`. How can I define different keys for each possibility without replicating the code for each case? – user3710760 May 08 '15 at 16:20
@user3710760: If it's the same product, then it would have the same type and name, wouldn't it? That's contrary to your first objection. Now, as for your next question - you should ask with appropriate details in a new post. It's not helpful to Stack Overflow to keep asking extra questions in the same post. – Jon Skeet May 08 '15 at 16:28
I wasnt sure if I should create a new one or not, but thanks for clarifying: http://stackoverflow.com/questions/30129141/summing-duplicate-values-while-reading-in-data-for-different-types-of-outputs – user3710760 May 08 '15 at 17:01

General-Doomer · Answer 2 · 2015-05-08T14:05:54.653

2

List<Product> results = new List<Product>(new Product[]{
    new Product() { ProductCode="ABC ", Name="Shoe", Type="Trainers", Price="3.99" },
    new Product() { ProductCode="ABC ", Name="Shoe", Type="Trainers", Price="4.99" },
    new Product() { ProductCode="ABC ", Name="Shoe", Type="Trainers", Price="5.99" },
    new Product() { ProductCode="ABC ", Name="Shoe", Type="Heels", Price="3.99" },
    new Product() { ProductCode="ABC ", Name="Shoe", Type="Heels", Price="4.99" },
    new Product() { ProductCode="ABC ", Name="Shoe", Type="Heels", Price="5.99" },
});

results = (from e in results
           group e by new { e.ProductCode, e.Name, e.Type } into g
           select new Product
           {
               ProductCode = g.Key.ProductCode,
               Name = g.Key.Name,
               Type = g.Key.Type,
               Price = g.Sum(p => double.Parse(p.Price, CultureInfo.InvariantCulture)).ToString("0.00", CultureInfo.InvariantCulture)
           }).ToList();

edited May 08 '15 at 14:05

answered May 08 '15 at 14:01

General-Doomer

2,681
13
13

1

No need to create a tuple, you can use an anonymous type, eg: `new {e.ProductCode, e.Name, e.Type}`. This will make the rest of the code cleaner – Panagiotis Kanavos May 08 '15 at 14:05
2

I would recommend against using `double` for currency values. I've edited my answer to suggest an appropriate change to the `Product` type to use `decimal`. – Jon Skeet May 08 '15 at 14:11
Of course, doubles would be better, but original Product class has string `Price` property. – General-Doomer May 08 '15 at 14:15
@JonSkeet could you please elaborate shortly why you recommend `decimal` over `double` for currency values? Are not both suited for floating point numbers? – sceiler May 08 '15 at 14:22
@sceiler: As an example, the value 0.01 is a pretty useful one to be able to represent in currency values - but it can't be represented exactly as a `double`... See http://stackoverflow.com/questions/618535/difference-between-decimal-float-and-double-in-net/618596#618596 for more details. – Jon Skeet May 08 '15 at 14:24
Thanks that worked too but the other solution looked a bit cleaner. – user3710760 May 08 '15 at 17:30

idipous · Answer 3 · 2015-05-08T14:11:14.873

You could create a class with a dictionary with keys the product code and values the product.

Also you can read the stream line by line try to add to the dictionary a new key/value pair. but before you add the value you check if it contains the key (product code) and if yes you get the Product object for that key and you update the price.

Then you iterate over the dictionary and write to the csv. This way you don't nee to read twice to find duplicates before writing the CSV.

Ferdinando Santacroce · Answer 4 · 2015-05-08T14:09:59.280

-1

I don't want to go through the 5000 rows again to remove the duplicates but would like to check if the entry already exists before I add it to the csv file.

To achieve this, you can override Equals() on Product object, then check if Product exists in the list before adding it twice, then sum up Price instead.
Here you can find some guidelines while overriding Equals():
Guidelines for Overloading Equals() and Operator == (C# Programming Guide)

edited May 08 '15 at 14:09

answered May 08 '15 at 14:07

Ferdinando Santacroce

1,047
3
12
31

The OP wants to *sum* the prices, not simply find duplicates – Panagiotis Kanavos May 08 '15 at 14:09
Yes, you're right. I just edited my answer, but you have been faster than me :) – Ferdinando Santacroce May 08 '15 at 14:10
You don't need to override `Equals` to group and sum. – Panagiotis Kanavos May 08 '15 at 14:12
Yes, I know; I only suggested a way to add meaning to 'Product' object, then write the appropriate code to check if already present and eventually sum 'Price'. Anyway, a Linq transformation is more effective, for sure. – Ferdinando Santacroce May 08 '15 at 14:47

Summing duplicate values while reading in data

4 Answers4