
I am using the following code to read the CSVs.

import pyspark
from pyspark.sql import SparkSession

# Local session; note that local[1] limits Spark to a single core
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate()

# 'path' is the folder containing the CSVs; the wildcard matches all of them
df = spark.read.csv(path + '*.csv')

spark.read.csv is taking a lot of time to read just 2 CSVs (each contains only around 1,200 rows).

Can someone tell me where I went wrong?

I found a solution here: pyspark read multiple csv files at once, but it is time consuming. I am running on my local system.

#UPDATE

I tried the following approach, which reads the 2 files, but what if I want to read more than 3,000 CSVs?


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

# Explicit list of the files to read
path = ['files/data1.csv',
        'files/data2.csv']

files = spark.read.csv(path, sep=',',
                       inferSchema=True, header=True)

# Bring the result back to the driver as a pandas DataFrame
df1 = files.toPandas()
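
For the 3,000-plus-file case, one option would be to build the path list programmatically instead of writing it out by hand. This is only a sketch and assumes all the CSVs sit under a single files/ directory and share the same schema:

import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

# Collect every CSV under files/ (assumed layout) instead of listing 3,000+ paths by hand
paths = glob.glob('files/*.csv')

df = spark.read.csv(paths, sep=',', inferSchema=True, header=True)

# Spark also accepts a glob pattern or a directory directly:
# df = spark.read.csv('files/*.csv', sep=',', inferSchema=True, header=True)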

Thanks

  • How long does it take? What's the size of the CSVs and the number of columns? – Jonathan Lam Oct 13 '22 at 03:48
  • I stopped the operation because it should not take that long. The size is 187 KB each (2 columns each). These are the same files I am using in Dask. – Coder Oct 13 '22 at 13:01
  • @JonathanLam If I go with just one file, `spark.read.csv("file.csv")`, it works – Coder Oct 13 '22 at 13:26
  • It doesn't make sense that reading 2 small CSVs takes a long time. How long does it take? Could you share the CSVs so that we can test? Your first method of reading the CSVs makes sense, as long as those CSVs have the same schema. Most of the time the cause is schema inference or a single executor scanning the table. – Jonathan Lam Oct 14 '22 at 01:45
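
Following up on the schema-inference point in the last comment: one thing to try is declaring the schema up front so Spark does not have to scan the files to infer it. A minimal sketch, with column names and types assumed since the real files are not shown:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema for the two columns; replace names/types with the real ones
schema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', DoubleType(), True),
])

df = spark.read.csv(path, sep=',', header=True, schema=schema)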
