
I'm trying to create a Spark job that can read in thousands of JSON files, perform some actions, then write the results back out to S3.

It is taking a long time and I keep running out of memory. I know that Spark tries to infer the schema if one isn't given. The obvious thing to do would be to supply the schema when reading in. However, the schema changes from file to file depending on many factors that aren't important here. There are about 100 'core' columns that are in all files, and those are the only ones I want.

Is it possible to write a partial schema in PySpark so that only the specific fields I want are read in?

fungie127
  • yes, you can provide partial schema. See this answer for example: https://stackoverflow.com/questions/63502556/pyspark-read-nested-json-from-a-string-type-column-and-create-columns/63503726#63503726 – Alex Ott Aug 21 '20 at 15:10
  • Thanks for your response @AlexOtt but that example doesn't do what I need. In the example, their dataframe is already read in. I need `df = spark.read.json('file_name', schema=partial_schema)`. The resulting dataframe should only have the partial schema. – fungie127 Aug 21 '20 at 15:25
  • that works as well... (see the sketch after these comments) – Alex Ott Aug 21 '20 at 15:31
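
A minimal sketch of the approach Alex Ott confirms above: pass an explicit partial schema to `spark.read.json`. The column names (`id`, `event_type`, `amount`) are placeholders standing in for the ~100 real core columns, and the S3 path is made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Only the columns listed here are read; everything else in the files is dropped.
partial_schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("amount", DoubleType(), True),
    # ... add the remaining core columns
])

# Supplying a schema means Spark skips the expensive schema-inference pass
# over all of the input files.
df = spark.read.json("s3://bucket/path/to/data", schema=partial_schema)
```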

1 Answer


First, it is recommended to use JSON Lines (jsonl) files, where each line contains a single JSON record. In general you can read just a specific set of fields from a big JSON document, but that should not be treated as Spark's job. Instead, have an initial method that converts your JSON input into an object of a serializable datatype, and feed that object into your Spark pipeline.
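A rough sketch of that pre-processing step, assuming each input file holds one JSON object; the directory and file names are made up for illustration:

```python
import json
from pathlib import Path

# Collapse many standalone JSON files into one JSON Lines file
# (one object per line), which Spark reads much more naturally.
with open("combined.jsonl", "w") as out:
    for path in Path("input_json_dir").glob("*.json"):
        with open(path) as f:
            record = json.load(f)              # parse the whole file
        out.write(json.dumps(record) + "\n")   # write it back as a single line
```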

Passing the schema is not an appropriate design here and only makes the problem worse. Instead, define a single method that extracts the specific fields after loading the data from the files. The following link shows how to extract specific fields and values from a JSON string in Python: How to extract specific fields and values from a JSON with python?
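A small sketch of that field-extraction idea in plain Python, outside Spark; the field names are placeholders:

```python
import json

WANTED = ["id", "event_type", "amount"]

def extract_core_fields(raw: str) -> dict:
    record = json.loads(raw)
    # Keep only the core keys; missing keys become None so every row
    # ends up with the same shape.
    return {key: record.get(key) for key in WANTED}

print(extract_core_fields('{"id": "a1", "event_type": "click", "extra": 42}'))
# {'id': 'a1', 'event_type': 'click', 'amount': None}
```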

  • I tell Spark which fields to extract after the data is read in, using `withColumn`. The problem is there can be 100,000+ JSON files and it takes too long to do `df = spark.read.json('path/to/data')`. – fungie127 Aug 21 '20 at 14:37