
I want to use PySpark to read in hundreds of CSV files and create a single dataframe that is (roughly) the concatenation of all the CSVs. Since each CSV fits in memory, but not more than one or two at a time, this seems like a good fit for PySpark. My strategy is not working, and I think it is because I'm trying to create a PySpark dataframe inside the kernel function of my map, which results in an error:

from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2] ## list of string path variables

# make an RDD object so I can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f) 

# see first dataframe (so excited) 
rdd2.take(2)[0].show(3) 

This throws an error:

PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.

It seems related to this post but I don't understand the answer.

Questions:

  1. I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
  2. I (think I) could use the plain-old Python (not PySpark) map function to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe (see the sketch after this list), but that doesn't seem to leverage PySpark much. Does this seem like a good route?
  3. Can you teach me a good, ideally a "tied-for-best" way to do this?
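
For concreteness, here is a rough sketch of what I mean in question 2 (not tested; I'm assuming unionByName is the right way to concatenate dataframes with identical schemas):

from functools import reduce

# plain-Python loop over the paths; each read returns a Spark dataframe handle on the driver
dfs = [sc.read.options(delimiter=",", header=True).csv(p) for p in file_path_list]

# concatenate pairwise; unionByName matches columns by name
big_df = reduce(lambda a, b: a.unionByName(b), dfs)
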
travelingbones

1 Answer


I don't have a definitive answer, but here are some comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
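
Something along these lines, as a rough sketch (the glob pattern is just a placeholder for wherever your CSVs live; .csv() also accepts an explicit list of paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# one read over a glob pattern; Spark unions all matching files into a single dataframe
df = spark.read.options(delimiter=",", header=True).csv("data/*.csv")

# equivalently, pass the list of paths you already have
df = spark.read.options(delimiter=",", header=True).csv(file_path_list)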

  1. A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work back to the scheduler, which would be an anti-pattern in a lot of use cases.

The design pattern is also odd here because you would not actually be passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read is not happening immediately. I feel like if it worked, it would pass a DAG back, not data.
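
For example, reusing the session from the sketch above (the path is just a placeholder):

# nothing is read here; this only builds the query plan
df = spark.read.options(delimiter=",", header=True).csv("data/part1.csv")

# the file is only scanned when an action runs, e.g. count() or show()
df.count()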

  2. It doesn't sound good, because you want the loading of files to be lazy. Given you can't use Spark to read on a worker, you'd have to use Pandas/Python, which evaluates immediately. You would run out of memory even faster trying this.

Speaking of memory, Spark lets you perform computation on data larger than memory, but there are limits to how much larger than the available memory the data can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.

  3. I think you should use the wildcard read as shown above.
Kevin Kho