Data set got merged incorrectly/randomly?

Question

I have a .csv file that looks something like this:

123,1-1-2020,[Apple]
123,1-2-2020,[Apple]
123,1-2-2020,[Beer]
345,1-3-2020,[Bacon]
345,1-4-2020,[Cheese]
345,1-4-2020,[Sausage]
345,1-5-2020,[Bacon]

I made a function that finds if both the number and date of any line is similar, the items of the lines will appended with a number next to it to show how many items are there in the Set<String> of items:

123,1-1-2020,1,[Apple]
123,1-2-2020,2,[Apple, Beer]
345,1-3-2020,1,[Bacon]
345,1-4-2020,2,[Cheese,Sausage]
345,1-5-2020,1,[Bacon]

While this is the intended result, the actual result with a larger set of data using my algorithm, items are sometimes randomly not counted and goes missing (the entire line disappears). The above example sometimes would becomes:

123,1-1-2020,1,[Apple]
123,1-2-2020,2,[Apple, Beer]
345,1-3-2020,1,[Bacon]
345,1-4-2020,2,[Cheese,Sausage]
345,1-5-2020,1,[Bacon]
// Any one of those output lines would sometimes disappear entirely.

I am very confused why is this happening. Below is the algorithm I implemented:

protected List<Receipt> convert(List<Receipt> list) {

        List<Receipt> receipts = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            List<Receipt> temp = new ArrayList<>();
            temp.add(list.get(i));
            for (int j = i + 1; j < list.size(); j++) {
                if (list.get(i).getNumber().equals(list.get(j).getNumber())) {
                    temp.add(list.get(j));
                }
            }

            Map<String, Set<String>> map = new HashMap<>();
            for (Receipt r : temp) {
                if (!map.containsKey(r.getDate())) {
                    map.put(r.getDate(), r.getItems());
                } else {
                    map.replace(r.getDate(), merge(map.get(r.getDate()), r.getItems()));
                }
            }

            for (Map.Entry<String, Set<String>> m : map.entrySet()) {
                receipts.add(new Receipt(list.get(i).getNumber(), m.getKey(), m.getValue()));
            }
            i = i + temp.size();
        }
        return receipts;
    }

// Merge function used above to append two item sets.
private static Set<String> merge(Set<String> a, Set<String> b) {

        return new HashSet<>() {
            {
                addAll(a);
                addAll(b);
            }
        };
    }

Where each line in the .csv is made into a Receipt object that has (String) getNumber(), (String) getDate(), and (Set<String>) getItems() method.

For a .csv with a similar format that has hundreds of thousands of lines, in the original data set, for example, the item Bacon was found 1334 times but the output of my algorithm, Bacon only appears randomly somewhere between 1160 - 1240 times. This happens to all other items as well. There are also some strange behaviors where a few random lines got the items appended wrong as well (same number, different date but still appended).

What could possibly be the cause of this randomness?

Edit : as suggestions in the comment section mentioned, the cause seems to be at i++ and i = i + temp.size() incrementing the i value incorrectly.

Shouldn't there be only one `1-4-2020` line, like this? `345,1-4-2020,3,[Cheese,Sausage,Bacon]` — Andreas, Nov 18 '20 at 01:26
@Andreas thanks for pointing out. It's a typo, I already fixed the example. — Mint, Nov 18 '20 at 01:28
@Andreas yes, that as well. I've been working for too many hours and my brain hardly functions any more. Thanks for pointing out again. — Mint, Nov 18 '20 at 01:45
also the definition of your data. e.g. what is the first number at the beginning of each line? — Mr. Brickowski, Nov 18 '20 at 02:16
Are all of the lines with a given `number` guaranteed to be contiguous? If not, `i = i + temp.size()` will skip the wrong lines sometimes. — tgdavies, Nov 18 '20 at 02:19
@Mr.Brickowski The first number is just something like customer ID. This is the link to download the whole project if you would want to take a look: https://mega.nz/file/zrwCzSqC#vzCRk3tY4gZBImBqwPUmk76_j23HrbbpJpaP1PF63X0 — Mint, Nov 18 '20 at 02:20
@tgdavies I'm also a little bit suspicious on that one as well but it's so difficult for me to track even in debug mode because of the large data set. I have the download link of the project here if you care to take a look: https://mega.nz/file/zrwCzSqC#vzCRk3tY4gZBImBqwPUmk76_j23HrbbpJpaP1PF63X0 — Mint, Nov 18 '20 at 02:22
In the line `temp.add(list.get(j));`, if j != i + temp.size() I think you have a problem. (I'm probably off-by-one, but I think that's useful to check.) You could run your input through `sort -t , -k 1`. — tgdavies, Nov 18 '20 at 02:59

score 1 · Accepted Answer · answered Nov 18 '20 at 17:12

First, you're using double-brace syntax in the merge() method, which is highly discouraged. See e.g. this answer: What is Double Brace initialization in Java?

Second, your code is iterating the list too much, you can (and should) do it in a single iteration.

Before going on, we need to look at the Receipt class, which is not shown in the question, but we can infer that it looks like this:

public class Receipt {
    private final String number;
    private final String date;
    private final Set<String> items;
    public Receipt(String number, String date, Set<String> items) {
        this.number = number;
        this.date = date;
        this.items = items;
    }
    public String getNumber() {
        return this.number;
    }
    public String getDate() {
        return this.date;
    }
    public Set<String> getItems() {
        return this.items;
    }
}

The goal is to merge receipts that share number and date. To do that in a single iteration, we need a Map, keyed by the combination of number and date. There are 3 ways to do that:

Implement equals() and hashCode() to be based on only those two fields, ignoring the items field. This is the easiest and simplest solution, but requires modifying the Receipt class, and to define "equality" to not include the item list, which is a debatable decision, so let us not do this.
Implement a dedicated class with just the two fields, and implement equals() and hashCode(). This has the advantage of leaving Receipt alone, but does require a new class.
Use the Receipt class as the key, but give the Map class a custom Comparator that only compares the two fields. With Java 8, we can do this without creating a new class, so let us try that. We'll need to use a TreeMap in order to supply a custom Comparator.

So, we'll create a TreeMap<Receipt, Set<String>>, using a custom Comparator, iterate the list once, merging the items as we go, then finally iterating the map to build the new list of merged receipts.

protected static List<Receipt> convert(List<Receipt> list) {
    Map<Receipt, Set<String>> map = new TreeMap<>(Comparator.comparing(Receipt::getDate)
                                                        .thenComparing(Receipt::getNumber));
    for (Receipt receipt : list) {
        map.merge(receipt, receipt.getItems(), (v1, v2) -> {
            Set<String> newSet = new TreeSet<>(v1);
            newSet.addAll(v2);
            return newSet;
        });
    }
    List<Receipt> result = new ArrayList<>();
    for (Entry<Receipt, Set<String>> entry : map.entrySet()) {
        result.add(new Receipt(entry.getKey().getNumber(),
                               entry.getKey().getDate(),
                               entry.getValue()));
    }
    return result;
}

It can of course also be done with Stream logic:

protected static List<Receipt> convert(List<Receipt> list) {
    return list.stream()
        .collect(Collectors.groupingBy(
                Function.identity(),
                () -> new TreeMap<>(Comparator.comparing(Receipt::getDate)
                                              .thenComparing(Receipt::getNumber)),
                Collectors.flatMapping(r -> r.getItems().stream(),
                        Collectors.toCollection(TreeSet::new))))
        .entrySet().stream()
        .map(e -> new Receipt(e.getKey().getNumber(), e.getKey().getDate(), e.getValue()))
        .collect(Collectors.toList());
}

Test
_{Uses Java 9 of methods.}

List<Receipt> list = List.of(
        new Receipt("123", "1-1-2020", Set.of("Apple")),
        new Receipt("123", "1-2-2020", Set.of("Apple")),
        new Receipt("123", "1-2-2020", Set.of("Beer")),
        new Receipt("345", "1-3-2020", Set.of("Bacon")),
        new Receipt("345", "1-4-2020", Set.of("Cheese")),
        new Receipt("345", "1-4-2020", Set.of("Sausage")),
        new Receipt("345", "1-5-2020", Set.of("Bacon")));

List<Receipt> converted = convert(list);
converted.forEach(r -> System.out.println(r.getNumber() + "," + r.getDate() + "," +
                                          r.getItems().size() + "," + r.getItems()));

Output

123,1-1-2020,1,[Apple]
123,1-2-2020,2,[Apple, Beer]
345,1-3-2020,1,[Bacon]
345,1-4-2020,2,[Cheese, Sausage]
345,1-5-2020,1,[Bacon]

Aboslutely beautiful solutions (even two of it!) and really opened my eyes. However, there's two thing I'd like to ask: 1/ If there are two `receipt` items with exact same number, date and item purchased (an edge case that actually happened), how should I handle this using your method? 2/ I'm really interested in knowing more about method 1 and 2 you mentioned as well, do you mind to enlighten me a bit more about it? — Mint, Nov 18 '20 at 22:07
@Mint Since your `items` is a `Set`, a duplicate item is ignored. --- As for `equals()` and `hashCode()`, you implement them in the standard way, except ignoring the `items` field. Whether as part of `Receipt` or as part of separate `ReceiptKey` class that doesn't even have an `items` field, they are the same. Then use `HashMap` instead of the `TreeMap` used here. — Andreas, Nov 18 '20 at 23:41

Data set got merged incorrectly/randomly?

1 Answers1