I have a problem reading CSV files with PySpark and/or Spark.
My goal is to read all CSV files from a certain directory into a single Pandas data frame.
I am most familiar with Python, including Pandas, so that is my preferred language. The files are rather small, so it should be no problem to process them in memory.
For privacy reasons I have adapted some file and path names, which is why they look "strange".
My first step is to import Pandas and to check the content of the folder with the CSV files.
%python
import pandas as pd
%sh
hdfs dfs -ls /dbm/ast-gbm/ntsf
Which results in:
Found 140 items
/dbm/ast-gbm/ntsf/ast1234.csv
[... list of the found files - omitted here for reasons of brevity]
So far so good!
Next I am trying to read in one example CSV file using Python. Here the problems start.
df = pd.read_csv("/dbm/ast-gbm/ntsf/ast1234.csv")
Which results in problem 1:
[... - omitted here for reasons of brevity]
FileNotFoundError: [Errno 2] File b'/dbm/ast-gbm/ntsf/ast1234.csv' does not exist: b'/dbm/ast-gbm/ntsf/ast1234.csv'
Since I am able to list all the files with the shell, I do not understand the error.
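My current guess for problem 1 (I am not sure this is right): pandas resolves the path against the local filesystem of the machine it runs on, while `hdfs dfs -ls` queries HDFS, so the same path can be listable in the shell yet invisible to pandas. A quick check of that guess:

```python
import os

# The path from my example above; if this prints False while
# `hdfs dfs -ls` finds the file, the path exists only in HDFS,
# not on the local filesystem that pandas reads from.
path = "/dbm/ast-gbm/ntsf/ast1234.csv"
print(os.path.exists(path))
```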
As a workaround I tried to load the CSV files into a Spark data frame and convert that into a Pandas data frame, similar to what the following Stack Overflow post suggests.
%spark.spark
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
val ast_all = spark.read
.format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("/dbm/ast-gbm/ntsf/*.csv")
ast_all.createOrReplaceTempView("ast_all")
df = ast_all.select("*").toPandas()
Which results in problem 2:
console:40: error: value toPandas is not a member of org.apache.spark.sql.DataFrame
df = ast_all.select("*").toPandas()
Ideally I would find a solution for problem 1 or problem 2. Alternatively, a different way to load the 140 CSV files into a Pandas data frame would be fine as well.
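For reference, the pattern I would expect to work once the files are reachable from Python's local filesystem (I am not sure whether such a mount exists in my environment) is glob plus pd.concat. A minimal sketch with a temporary stand-in directory, since I cannot share the real files:

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for the real folder (/dbm/ast-gbm/ntsf), which I cannot share:
# two small semicolon-separated CSV files in a temporary directory.
tmpdir = tempfile.mkdtemp()
samples = {"ast0001.csv": "a;b\n1;2\n3;4\n", "ast0002.csv": "a;b\n5;6\n7;8\n"}
for name, content in samples.items():
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write(content)

# Read every CSV in the directory and stack them into one data frame.
paths = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
df = pd.concat((pd.read_csv(p, sep=";") for p in paths), ignore_index=True)
print(df.shape)  # (4, 2): four data rows across the two files, two columns
```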
Any ideas? Thank you!