I have a CSV string, obtained from an RDD, that I need to convert into a Spark DataFrame.
I will explain the problem from the beginning.
I have this directory structure.
Csv_files (dir)
|- A.csv
|- B.csv
|- C.csv
All I have access to is Csv_files.zip, which is stored in HDFS.
I could have read each file directly if it had been stored as A.gz, B.gz ..., but instead the files sit inside a directory that was compressed as a whole.
With the help of an answer on SO (How to open/stream .zip files through Spark?), I was able to convert this zip file into a dictionary:
d = {
    'A.csv': 'A,B,C\n1,2,3\n4,5,6 ...',
    'B.csv': 'A,B,C\n7,8,9\n1,2,3 ...'
}
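For reference, the step that produces this dictionary looks roughly like the sketch below. This is only my understanding of the SO approach, not exact code: the path and the variable names (zip_path, zip_to_entries) are placeholders, and it assumes an existing SparkContext sc and that the zip is small enough to collect its contents back to the driver.

import io
import zipfile

zip_path = 'hdfs:///path/to/Csv_files.zip'  # placeholder path

def zip_to_entries(path_and_bytes):
    # sc.binaryFiles yields (path, raw bytes); expand each zip
    # into (member_name, csv_string) pairs.
    _, raw = path_and_bytes
    zf = zipfile.ZipFile(io.BytesIO(raw))
    try:
        for name in zf.namelist():
            yield name, zf.read(name)
    finally:
        zf.close()

d = sc.binaryFiles(zip_path).flatMap(zip_to_entries).collectAsMap()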
Now I need to convert this csv_string, 'A,B,C\n1,2,3\n4,5,6 ...', into a DataFrame. A rough sketch of the sort of approach I have in mind follows.
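This is only an illustrative sketch, not my exact attempt; it assumes an existing sqlContext, takes 'A.csv' as an example key, and ends up with every column typed as a string, which is part of why I am unsure it is the right way.

# Naive sketch: split the CSV string on the driver, then build a DataFrame.
csv_string = d['A.csv']

lines = csv_string.strip().split('\n')
header = lines[0].split(',')                    # ['A', 'B', 'C']
rows = [line.split(',') for line in lines[1:]]  # [['1', '2', '3'], ...]

# In Spark 1.6, createDataFrame accepts a list of column names as the schema.
df = sqlContext.createDataFrame(sc.parallelize(rows), header)
df.show()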
How can I efficiently convert csv_string into a meaningful DataFrame?
My Spark version is 1.6.2 and my Python version is 2.6.6.