aws s3api list-objects-v2 --bucket cw-milenko-tests | grep 'tick_c'
The output shows:
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-50-22.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-52-59.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-55-08.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-57-30.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-59-59.json.gz",
"Key": "Json_gzips/tick_calculated_4_2020-05-27T09-14-28.json.gz",
"Key": "Json_gzips/tick_calculated_4_2020-05-27T11-35-38.json.gz",
Counting with wc -l:
aws s3api list-objects-v2 --bucket cw-milenko-tests | grep 'tick_c' | wc -l
457
I can read one file into a DataFrame:
val path = "tick_calculated_2_2020-05-27T00-01-21.json"
scala> val tick1DF = spark.read.json(path)
tick1DF: org.apache.spark.sql.DataFrame = [aml_barcode_canc: string, aml_barcode_payoff: string ... 70 more fields]
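Since a single file loads fine, I assume spark.read.json also accepts a glob and that Spark decompresses .gz input transparently, so all 457 objects could be loaded in one call. A sketch of what I expect to work (the glob pattern is my guess):

// Sketch, assuming the S3A connector is configured:
// spark.read.json accepts glob patterns, and gzipped input is
// decompressed transparently based on the file extension.
val allTicks = spark.read.json("s3a://cw-milenko-tests/Json_gzips/tick_calculated_*.json.gz")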
I was surprised to see negative votes. What I want to know is how to load all 457 files into an RDD. I saw this SO question. Is it possible at all? What are the limitations? This is what I have tried so far:
val rdd1 = sc.textFile("s3://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
If I go for s3a instead:
val rdd1 = sc.textFile("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
rdd1: org.apache.spark.rdd.RDD[String] = s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz MapPartitionsRDD[3] at textFile at <console>:27
That doesn't work either. Since textFile is lazy, the failure only surfaces when I try to inspect the RDD:
scala> rdd1.take(1)
java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
The FileSystem scheme was not recognized.
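My guess is that the S3 filesystem implementation is simply not on the classpath. A sketch of the configuration I believe is needed (the hadoop-aws version and the environment-variable credentials are assumptions):

// Start the shell with the S3A connector on the classpath, e.g.:
//   spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.4
// Then bind the s3a scheme to the S3A implementation and pass credentials.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))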
My GOAL:
s3://json.gz -> RDD -> Parquet
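A minimal sketch of that pipeline, assuming S3A is configured as above; it goes through an RDD[String] as in the goal, and the output prefix is hypothetical:

// Read all matching .json.gz objects as raw lines (RDD[String]).
val lines = sc.textFile("s3a://cw-milenko-tests/Json_gzips/tick_calculated_*.json.gz")
// Parse the JSON lines into a DataFrame, then write Parquet back to S3.
import spark.implicits._
val df = spark.read.json(lines.toDS)
df.write.mode("overwrite").parquet("s3a://cw-milenko-tests/ticks_parquet/")  // hypothetical output path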