I am new to Spark and am following some of the basic examples in the documentation.
I have a CSV file like this (a simplified version; the real one has nearly 40,000 lines):
date,category
19900108,apples
19900108,apples
19900308,peaches
19900408,peaches
19900508,pears
19910108,pears
19910108,peaches
19910308,apples
19910408,apples
19910508,apples
19920108,pears
19920108,peaches
19920308,apples
19920408,peaches
19920508,pears
This bit of Scala code works fine for counting category totals:
val textFile = sc.textFile("sample.csv")
textFile.filter(line => line.contains("1990")).filter(line => line.contains("peaches")).count()
textFile.filter(line => line.contains("1990")).filter(line => line.contains("apples")).count()
textFile.filter(line => line.contains("1990")).filter(line => line.contains("pears")).count()
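To avoid hard-coding a filter per year and category, I also tried counting every (year, category) pair in a single pass with reduceByKey. This is only a rough sketch; it assumes the header row is dropped and that every line splits cleanly into exactly two fields:
val lines = sc.textFile("sample.csv")
val header = lines.first()                      // "date,category"
val counts = lines
  .filter(_ != header)                          // drop the header row
  .map { line =>
    val Array(date, category) = line.split(",") // assumes exactly two fields
    ((date.take(4), category), 1)               // key by (year, category)
  }
  .reduceByKey(_ + _)                           // total per (year, category)
counts.collect().foreach(println)               // e.g. ((1990,peaches),2)
That gives me the pair totals, but I am stuck on turning them into one row per year with a column per category.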
What is the best approach for looping through each line and adding up the category totals by year, so that I end up writing a CSV file like this (my rough attempt is sketched after the sample output):
year,apples,peaches,pears
1990,2,2,1
1991,3,1,1
1992,1,2,2
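The closest I have come is the sketch below using the DataFrame API from the documentation. It is untested and assumes Spark 2.x, that the file matches the two-column layout above, and that the first four characters of date are always the year:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

val spark = SparkSession.builder.appName("category-totals").getOrCreate()

val totals = spark.read
  .option("header", "true")
  .csv("sample.csv")
  .withColumn("year", substring(col("date"), 1, 4)) // leading four digits = year
  .groupBy("year")
  .pivot("category")  // one column per distinct category: apples, peaches, pears
  .count()
  .na.fill(0)         // missing (year, category) pairs become 0 instead of null
  .orderBy("year")

totals.coalesce(1)    // collapse to a single part file
  .write.option("header", "true")
  .csv("totals")      // writes a directory "totals" containing the CSV
I am not sure whether the RDD route or the DataFrame route is the better fit here, or whether pivot is the right tool.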
Any help would be appreciated.