
I am trying to read CSV data from a zip file. I know that .gz files are supported natively by spark.read.csv(), but this is a .zip file.

I checked the question "How to open/stream .zip files through Spark?" and tried its approach, but I am not sure how to parse the resulting RDD (a whole file of CSV data represented as a single row of text) into a CSV DataFrame.

This is the code I used to extract the data into an RDD:

import zipfile
import io

def zip_extract(row):
  # Each record from binaryFiles is a (file_path, file_content_bytes) pair
  file_path, content = row
  z_file = zipfile.ZipFile(io.BytesIO(content), "r")
  files = z_file.namelist()
  # Return the raw bytes of the first member in the archive
  return z_file.open(files[0]).read()


zips = sc.binaryFiles("/path/to/some/zipfiles.zip")
data_rdd = zips.map(zip_extract)

Passing this RDD to spark.read.csv() does not give the desired outcome.

Geethanadh

1 Answer


I am not sure I understand the question correctly. If you already have an RDD, isn't it a simple call to data_rdd.toDF() to convert it to a DataFrame?

df=data_rdd.toDF()

niuer
  • That would just convert the text data from the RDD to a DF, but the text data is CSV and I want it parsed into a DF – Geethanadh Jul 22 '19 at 21:39
  • I don't believe that kind of thing is supported, for now, at least. If you don't want a DF, what are you actually trying to do? – ASH Aug 11 '19 at 12:08
  • I don't think you understood me. I DO want a DF, but I want the CSV parsed into it as columns, not kept as plain text (spark.read.csv() does this for .gz files but not for .zip files) – Geethanadh Aug 16 '19 at 08:30
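
One possible way to get parsed CSV columns, offered as a sketch rather than a confirmed solution from this thread: in PySpark, spark.read.csv() also accepts an RDD of strings in which each string is one CSV row. So instead of returning the whole file as a single bytes value, the extraction step can decode each archive member and emit its lines with flatMap. The names zip_to_lines and read_zipped_csv are hypothetical helpers, and the code assumes UTF-8 text, a header row, and no quoted fields that contain line breaks.

```python
import io
import zipfile


def zip_to_lines(record):
    """Turn one (path, bytes) pair from sc.binaryFiles into CSV text lines."""
    _path, content = record
    lines = []
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            # Assumption: every member is UTF-8 encoded CSV text.
            text = archive.read(name).decode("utf-8")
            lines.extend(text.splitlines())
    return lines


def read_zipped_csv(spark, path):
    """Hypothetical helper: parse zipped CSV files under `path` into a DataFrame."""
    zips = spark.sparkContext.binaryFiles(path)
    # flatMap yields one RDD element per CSV line instead of one per file.
    lines_rdd = zips.flatMap(zip_to_lines)
    # Assumption: the data has a header row; drop header=True if it does not.
    return spark.read.csv(lines_rdd, header=True, inferSchema=True)
```

With an active SparkSession, this would be called as df = read_zipped_csv(spark, "/path/to/some/zipfiles.zip"). Each archive is read fully into memory on a single executor, so this approach is likely only suitable for zip files that fit comfortably in memory.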