
I have multiple zip files, each containing two files (A.csv and B.csv):

/data/jan.zip --> contains A.csv & B.csv
/data/feb.zip --> contains A.csv & B.csv

I want to read the contents of all the A.csv files inside all the zip files using pyspark.

 textFile = sc.textFile("hdfs://<HDFS loc>/data/*.zip")

Can someone tell me how to get the contents of A.csv files into an RDD?

zero323
Munesh

1 Answer


You want to read the CSV files inside every zip archive. sc.textFile cannot look inside a zip, so load each archive whole and unzip its contents in a flatMap:

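import org.apache.spark.input.PortableDataStream

// Open the archive and return the lines of the CSV entries of interest
def unzip(content: PortableDataStream): List[String] = {
  ...
}

// binaryFiles yields one (path, stream) pair per archive
val files = sc.binaryFiles("file://path/to/files/*.zip")
val lines = files.flatMap { case (name, content) =>
  unzip(content)
}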
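
Since the question asks about PySpark, here is a minimal Python sketch of the same load-then-unzip idea. It assumes the CSV files are UTF-8 encoded and that each archive is small enough to fit in an executor's memory, because `binaryFiles` hands over each file as a single bytes object. The helper names `member_lines` and `a_csv_lines` are hypothetical, introduced only for illustration.

```python
import io
import zipfile


def member_lines(zip_bytes, member="A.csv"):
    """Return the decoded lines of `member` from a zip archive held in memory.

    Returns an empty list when the archive has no entry with that name.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if member not in archive.namelist():
            return []
        # Assumption: the CSV content is UTF-8 text.
        return archive.read(member).decode("utf-8").splitlines()


def a_csv_lines(sc, pattern, member="A.csv"):
    """Build an RDD of lines taken from `member` in every archive matching `pattern`."""
    # binaryFiles yields one (path, bytes) pair per matching file.
    return sc.binaryFiles(pattern).flatMap(lambda kv: member_lines(kv[1], member))
```

With an existing SparkContext `sc`, something like `a_csv_lines(sc, "hdfs://<HDFS loc>/data/*.zip")` should give an RDD whose elements are the lines of every A.csv. Note that header rows are not stripped, so each archive contributes its own header line.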
Ramineni Ravi Teja