I have 4 sets of CSV files. Each file is just a list of double values (one value per row), and each set is split by month:
A: aJan.csv, aFeb.csv, aMarch.csv
B: bJan.csv, bFeb.csv, bMarch.csv
C: cJan.csv, cFeb.csv, cMarch.csv
D: DJan.csv, DFeb.csv, DMarch.csv
I want to calculate the pairwise Pearson correlation across A, B, C and D. PySpark has a correlation method:
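import numpy as np
from pyspark.mllib.stat import Statistics
# sc is an existing SparkContext; each column of the transposed array is one series
data = sc.parallelize(
    np.array([range(10000), range(10000, 20000), range(20000, 30000)]).transpose())
print(Statistics.corr(data, method="pearson"))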
My question is how to build a single RDD from the 3 files of one set, i.e. aJan.csv, aFeb.csv, aMarch.csv, and then do the same for the other sets. I know I could do something like what is described in "How to read multiple text files into a single RDD?", but I need the combined view to be appended in month order: first the data from Jan.csv, then Feb.csv, then March.csv.
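To make the goal concrete, here is a rough sketch of what I have in mind, not a tested solution. The helper names (`month_paths`, `load_series`, `pairwise_pearson`) and the `base_dir` parameter are made up for illustration. It assumes every file has one parseable number per line with no header, and that all four sets have the same total number of rows. The idea is to read each month separately, `union` them in month order, tag each value with its position via `zipWithIndex`, and join the sets on that position.

```python
def month_paths(prefix, months=("Jan", "Feb", "March"), base_dir="."):
    # Hypothetical helper: file paths for one set, listed in month order.
    return ["{}/{}{}.csv".format(base_dir, prefix, m) for m in months]


def load_series(sc, paths):
    # Read each month on its own, then union in the given order so that
    # Jan rows come first, then Feb, then March.
    parts = [sc.textFile(p).map(float) for p in paths]
    # zipWithIndex numbers rows by partition order, then by position inside
    # each partition, so the index follows the month-wise append order.
    return sc.union(parts).zipWithIndex().map(lambda vi: (vi[1], vi[0]))


def pairwise_pearson(sc, prefixes, base_dir="."):
    # Imported here so month_paths can be used without Spark installed.
    from pyspark.mllib.stat import Statistics

    keyed = [load_series(sc, month_paths(p, base_dir=base_dir)) for p in prefixes]
    rows = keyed[0].mapValues(lambda v: [v])
    for other in keyed[1:]:
        # Inner join on the row index; indices missing from a set are dropped.
        rows = rows.join(other).mapValues(lambda pair: pair[0] + [pair[1]])
    # One row per index, one column per set (in the order of `prefixes`).
    vectors = rows.sortByKey().values()
    return Statistics.corr(vectors, method="pearson")
```

With an existing SparkContext `sc`, I would expect `pairwise_pearson(sc, ["a", "b", "c", "D"])` to return a 4x4 correlation matrix with columns in that order. The joins shuffle the data, so I am not sure this is the most efficient way to do it.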