
I have a problem with reading CSV files in PySpark and/or Spark.

My goal is to read all CSV files from a certain directory into a Pandas data frame.

I am most familiar with Python, including Pandas, so that is my preferred language. The files are rather small, so handling them in memory should be no problem.

For privacy reasons I have adapted some file and path names, which is why they look "strange".

My first step is to import Pandas and check the content of the folder containing the CSV files.

%python
import pandas as pd
%sh
hdfs dfs -ls /dbm/ast-gbm/ntsf

Which results in:

Found 140 items

/dbm/ast-gbm/ntsf/ast1234.csv

list of the found files - omitted here for reasons of brevity

So far so good!

Next I am trying to read in one example CSV file using Python. Here the problems start.

df = pd.read_csv("/dbm/ast-gbm/ntsf/ast1234.csv")

Which results in problem 1:

[... - omitted here for reasons of brevity]

FileNotFoundError: [Errno 2] File b'/dbm/ast-gbm/ntsf/ast1234.csv' does not exist: b'/dbm/ast-gbm/ntsf/ast1234.csv'

Since I am able to list all the files with the shell I do not understand the error.
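(Editor's note: a likely explanation for problem 1 is that pd.read_csv resolves paths against the local filesystem of the machine running Python, not against HDFS, so a path that exists only in HDFS raises exactly this error even though hdfs dfs -ls can list it. A minimal sketch reproducing the behavior, using the adapted path from above:)

```python
import pandas as pd

# pd.read_csv looks on the *local* disk of the driver, not in HDFS.
# The path below exists only in HDFS (and is adapted for privacy),
# so pandas raises FileNotFoundError.
try:
    pd.read_csv("/dbm/ast-gbm/ntsf/ast1234.csv")
except FileNotFoundError as err:
    print("pandas searched the local filesystem:", err)
```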

As a workaround I tried to load the CSV files into a Spark dataframe and convert that into a Pandas dataframe, similar to what the following Stack Overflow post suggests.

%spark.spark
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

val ast_all = spark.read
    .format("csv")
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("/dbm/ast-gbm/ntsf/*.csv")

ast_all.createOrReplaceTempView("ast_all")

df = ast_all.select("*").toPandas()

Which results in problem 2:

<console>:40: error: value toPandas is not a member of org.apache.spark.sql.DataFrame

df = ast_all.select("*").toPandas()

Ideally I would find a solution for problem 1 or problem 2. Alternatively, a different way to load the 140 CSV files into a Pandas data frame would be fine as well.

Any ideas? Thank you!

Max

1 Answer


Trying to fix problem 2:

1 - This statement is wrong: val ast_all = spark.read. That code is Scala, not Python. Replace it with ast_all = spark.read

2 - You can remove ast_all.createOrReplaceTempView("ast_all"). This statement is not needed.

3 - df = ast_all.select("*").toPandas() requires Pandas to be imported, so add import pandas as pd at the top of your file.

  • Okay, thank you! 1 and 2 - I have done that! 3 - Pandas is imported above the toPandas statement. That was not clear in my question - sorry, I have edited it. Unfortunately there is no change in the error – Max Mar 10 '20 at 15:18