I'm experimenting with Java streams and came up with a solution to a problem that I'd like to share, to see whether my approach is correct.
I've downloaded a dataset from https://catalog.data.gov/dataset/consumer-complaint-database, which has 700k+ records of customer complaints. The fields I'm using are the following:
CompanyName ProductName
My objective is to get a result with:
The 10 companies with the most occurrences in the dataset
The 10 products with the most occurrences in the dataset
and to end up with something like:
Map<String, Map<String, Long>>
where the key of the outer map is the company name, the key of the inner map is the product name, and the inner value is the number of complaints about that product for that company.
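As a minimal sketch of that target shape, nested `groupingBy` collectors can build it directly; the counts are `Long` because `Collectors.counting()` produces `Long`. The `Complain` class below is a hypothetical stand-in with only the two getters used in this post, and `NestedCountSketch` and `countByCompanyAndProduct` are names made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NestedCountSketch {

    // Hypothetical stand-in for the real Complain class: only company and product.
    static final class Complain {
        private final String company;
        private final String product;

        Complain(String company, String product) {
            this.company = company;
            this.product = product;
        }

        String getCompany() { return company; }
        String getProduct() { return product; }
    }

    // Builds company -> (product -> number of complaints).
    static Map<String, Map<String, Long>> countByCompanyAndProduct(List<Complain> complains) {
        return complains.stream()
                .collect(Collectors.groupingBy(Complain::getCompany,
                        Collectors.groupingBy(Complain::getProduct, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<Complain> sample = Arrays.asList(
                new Complain("A", "Loan"),
                new Complain("A", "Loan"),
                new Complain("B", "Card"));
        System.out.println(countByCompanyAndProduct(sample));
    }
}
```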
The solution I've come up with is the following:
@Test
public void joinGroupingsTest() throws URISyntaxException, IOException {
    String path = CsvReaderTest.class.getResource("/complains.csv").toURI().toString();
    complains = CsvReader.readFileStreamComplain(path.substring(path.indexOf('/') + 1));

    Map<String, List<Complain>> byCompany = complains.parallelStream()
            .collect(Collectors.groupingBy(Complain::getCompany))
            .entrySet().stream()
            .sorted((f1, f2) -> Long.compare(f2.getValue().size(), f1.getValue().size()))
            .limit(10)
            .collect(Collectors.toMap(Entry::getKey, Entry::getValue));

    Map<String, List<Complain>> byProduct = complains.parallelStream()
            .collect(Collectors.groupingBy(Complain::getProduct))
            .entrySet().stream()
            .sorted((f1, f2) -> Long.compare(f2.getValue().size(), f1.getValue().size()))
            .limit(10)
            .collect(Collectors.toMap(Entry::getKey, Entry::getValue));

    Map<String, List<Complain>> map = complains.parallelStream()
            .filter((x) -> byCompany.get(x.getCompany()) != null
                    && byProduct.get(x.getProduct()) != null)
            .collect(Collectors.groupingBy(Complain::getCompany));

    Map<String, Map<String, Long>> map2 = map.entrySet().parallelStream()
            .collect(Collectors.toMap(
                    e -> e.getKey(),
                    e -> e.getValue().stream()
                            .collect(Collectors.groupingBy(Complain::getProduct, Collectors.counting()))
            ));

    System.out.println(map2);
}
As you can see, I go through a few steps to achieve this:
1) I get the 10 companies with the most occurrences, along with their associated complaints (records).
2) I get the 10 products with the most occurrences, along with their associated complaints (records).
3) I build a map keyed by company name, restricted to the top 10 companies calculated before, whose values are the complaints about products that are also in the top 10 products.
4) I apply the final transformation to get the map that I want.
Other than forking steps 1 and 2 into two separate threads, is there anything else I should consider to improve the performance, or to make better use of streams?
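For comparison, the same four steps can be sketched more compactly. This variant is a deliberate swap from the code above: it counts occurrences instead of keeping the full record lists for steps 1 and 2, keeps only the top keys in a `Set`, and does steps 3 and 4 in one nested collector. `Complain` is again a hypothetical stand-in, `topKeys` and `topCompaniesByTopProducts` are made-up helper names, and ties at the cut-off are broken arbitrarily. I have not measured whether it is actually faster on the real dataset.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopNSketch {

    // Hypothetical stand-in for the real Complain class: only company and product.
    static final class Complain {
        private final String company;
        private final String product;

        Complain(String company, String product) {
            this.company = company;
            this.product = product;
        }

        String getCompany() { return company; }
        String getProduct() { return product; }
    }

    // Steps 1 and 2: keys of the n most frequent values produced by the classifier.
    static <T> Set<String> topKeys(List<T> items, Function<? super T, String> classifier, int n) {
        return items.stream()
                .collect(Collectors.groupingBy(classifier, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    // Steps 3 and 4: keep records in both top sets, then count per company and product.
    static Map<String, Map<String, Long>> topCompaniesByTopProducts(List<Complain> complains, int n) {
        Set<String> topCompanies = topKeys(complains, Complain::getCompany, n);
        Set<String> topProducts = topKeys(complains, Complain::getProduct, n);
        return complains.stream()
                .filter(c -> topCompanies.contains(c.getCompany())
                        && topProducts.contains(c.getProduct()))
                .collect(Collectors.groupingBy(Complain::getCompany,
                        Collectors.groupingBy(Complain::getProduct, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<Complain> sample = Arrays.asList(
                new Complain("A", "Loan"), new Complain("A", "Loan"), new Complain("A", "Card"),
                new Complain("B", "Loan"), new Complain("B", "Mortgage"),
                new Complain("C", "Card"));
        // With n = 2, only the two most frequent companies and products are kept.
        System.out.println(topCompaniesByTopProducts(sample, 2));
    }
}
```

The design choice here is that the intermediate maps hold one `Long` per key rather than a list of every record, so the top-10 selection does not need to retain the 700k+ records a second and third time.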
Thanks!