I am using the following code to read the CSVs:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# path is the directory containing the CSV files
df = spark.read.csv(path + '*.csv')
spark.read.csv is taking a lot of time to read just 2 CSVs (each containing only around 1200 rows). Can someone tell me where I went wrong?
I found a solution here (pyspark read multiple csv files at once), but it is time-consuming. I am running on a local system.
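One thing I suspect is the master("local[1]") setting, since that pins Spark to a single core. Here is a minimal sketch of the same read using all local cores (local[*] is standard Spark syntax; path is the same directory variable as in my first snippet):

from pyspark.sql import SparkSession

# local[*] lets Spark use every available core instead of just one
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Same wildcard read as above; path is the directory holding the CSVs
df = spark.read.csv(path + '*.csv')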
#UPDATE
I tried the following approach, which reads the 2 files, but what if I want to read more than 3000 CSVs?
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

path = ['files/data1.csv',
        'files/data2.csv']

files = spark.read.csv(path, sep=',',
                       inferSchema=True, header=True)
df1 = files.toPandas()
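For scaling past a handful of files, I assume the path list can be built programmatically instead of typed out by hand. An untested sketch using Python's glob module (the 'files/*.csv' pattern is an assumption about my directory layout):

import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

# Gather every CSV under files/ into one list of paths
# ('files/*.csv' is an assumed layout, not confirmed)
path = glob.glob('files/*.csv')

files = spark.read.csv(path, sep=',',
                       inferSchema=True, header=True)

As I understand it, spark.read.csv also accepts the wildcard string directly (e.g. spark.read.csv('files/*.csv', ...)), so the explicit list may not even be necessary. Also, inferSchema=True triggers an extra pass over the data, which presumably gets more expensive as the file count grows; supplying an explicit schema would avoid that.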
Thanks