I am trying to read CSV data from a zip file. I know that .gz files are supported natively by spark.read.csv(), but this is a .zip file.
I checked the question "How to open/stream .zip files through Spark?" and tried its approach, but I am not sure how to parse the resulting RDD (each element is a whole file of CSV data represented as a single row of text) into a CSV DataFrame.
This is the code I use to extract the data into an RDD:
import zipfile
import io

def zip_extract(row):
    # Each element from sc.binaryFiles is a (file_path, content) pair
    file_path, content = row
    z_file = zipfile.ZipFile(io.BytesIO(content), "r")
    files = z_file.namelist()
    # Return the raw bytes of the first file in the archive
    return z_file.open(files[0]).read()

zips = sc.binaryFiles("/path/to/some/zipfiles.zip")
data_rdd = zips.map(zip_extract)
Passing this RDD to spark.read.csv() does not give the desired outcome.
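
For reference, here is a minimal sketch of the direction I am considering, not a confirmed solution. As far as I can tell, spark.read.csv() in PySpark also accepts an RDD of strings where each element is one CSV row, so the idea is to decode each file and emit its lines with flatMap instead of returning the whole file as one element. The helper names zip_to_lines and read_zipped_csv are made up for illustration, and the sketch assumes UTF-8 encoded files, a header row, and no quoted fields containing embedded newlines.

```python
import io
import zipfile


def zip_to_lines(record):
    """Turn one (path, bytes) pair from sc.binaryFiles into decoded CSV lines."""
    _path, content = record
    lines = []
    with zipfile.ZipFile(io.BytesIO(content), "r") as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries inside the archive
            # Assumption: the CSV files are UTF-8 encoded
            text = archive.read(name).decode("utf-8")
            # Assumption: no quoted field spans several lines
            lines.extend(text.splitlines())
    return lines


def read_zipped_csv(spark, path, header=True):
    """Hypothetical helper: build a DataFrame from CSV files stored in zip archives.

    `spark` is an existing SparkSession; pyspark is not imported here so the
    line-extraction logic above can be exercised without a cluster.
    """
    lines_rdd = spark.sparkContext.binaryFiles(path).flatMap(zip_to_lines)
    # One CSV row per RDD element, instead of one whole file per element
    return spark.read.csv(lines_rdd, header=header)
```

With an existing session this would be called as df = read_zipped_csv(spark, "/path/to/some/zipfiles.zip"). Unlike zip_extract above, it reads every file in the archive rather than only the first one, and each archive is still loaded fully into memory on a single executor.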