Pyspark read multiple csv files into a dataframe (OR RDD?)

Question

I've got a Spark 2.0.2 cluster that I'm hitting via Pyspark through Jupyter Notebook. I have multiple pipe delimited txt files (loaded into HDFS. but also available on a local directory) that I need to load using spark-csv into three separate dataframes, depending on the name of the file.

I see three approaches I can take - either I can use python to somehow iterate through the HDFS directory (haven't figured out how to do this yet, load each file and then do a union.

I also know that there exists some wildcard functionalty (see here) in spark - I can probably leverage

Lastly, I could use pandas to load the vanilla csv file from disk as a pandas dataframe and then create a spark dataframe. The downside here is that these files are large, and loading into memory on a single node could take ~8gb. (that's why this is moving to a cluster in the first place).

Here is the code I have so far and some pseudo code for the two methods:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pandas as pd

sc = pyspark.SparkContext(appName = 'claims_analysis', master='spark://someIP:7077')

spark = SparkSession(sc)

#METHOD 1 - iterate over HDFS directory
for currFile in os.listdir(HDFS:///someDir//):
    if #filename contains 'claim':
        #create or unionAll to merge claim_df
    if #filename contains 'pharm':
        #create or unionAll to merge pharm_df
    if #filename contains 'service':
        #create or unionAll to merge service_df

#Method 2 - some kind of wildcard functionality
claim_df = spark.read.format('com.databricks.spark.csv').options(delimiter = '|',header ='true',nullValue ='null').load('HDFS:///someDir//*<claim>.csv')
pharm_df = spark.read.format('com.databricks.spark.csv').options(delimiter = '|',header ='true',nullValue ='null').load('HDFS:///someDir//*<pharm>.csv')
service_df = spark.read.format('com.databricks.spark.csv').options(delimiter = '|',header ='true',nullValue ='null').load('HDFS:///someDir//*<service>.csv')


#METHOD 3 - load to a pandas df and then convert to spark df
for currFile in os.listdir(HDFS:///someDir//)
    pd_df = pd.read_csv(currFile, sep = '|')
    df = spark.createDataFrame(pd_df)
    if #filename contains 'claim':
        #create or unionAll to merge claim_df
    if #filename contains 'pharm':
        #create or unionAll to merge pharm_df
    if #filename contains 'service':
        #create or unionAll to merge service_df

Does anyone know how to implement method 1 or 2? I haven't been able to figure these out. Also, I was surprised that there isn't a better way to get csv files loaded into a pyspark dataframe - using a third party package for something that seems like it should be a native feature confused me (did I just miss the standard use case for loading csv files into a dataframe?) Ultimately, I'm going to be writing a consolidated single dataframe back to HDFS (using .write.parquet() ) so that I can then clear the memory and do some analytics using MLlib. If the approach I've highlighted isn't best practice, I would appreciate a push in the right direction!

I think you're on the right track with #2. Did you run into an error or something? Is there something about what you tried that didn't work? — santon, Dec 13 '16 at 23:26
I kept getting a file not found error, so I think the problem was in my wildcard implementation. Secondly, will all the files that match the wildcard be unioned automatically? I'm a little confused still about the spark wildcard functionality here. — flyingmeatball, Dec 14 '16 at 15:05
Yes, Spark will union all the records in all the files that match the wildcard. If you're getting a file-not-found, try with just a hard-coded URI to a single file. — santon, Dec 15 '16 at 18:59

Ramzy · Accepted Answer · 2016-12-14T16:52:55.683

Approach 1 :

In python you cannot directly refer to HDFS location. You need to take help of another library like pydoop. In scala and java, you have API. Even with pydoop, you will be reading the files one by one. It is bad to read files one by one and not use the parallel reading option provided by spark.

Approach 2 :

You should be able to point the multiple files with comma separated or with wild card. This way spark takes care of reading files and distribute them into partitions. But if you go with union option with each data frame there is one edge case when you dynamically read each file. When you have lot of files, the list can become so huge at driver level and can cause memory issues. Main reason is that, the read process is still happening at driver level.

This option is better. The spark will read all the files related to regex and convert them into partitions. You get one RDD for all the wildcard matches and from there you dont need to worry about union for individual rdd's

Sample code cnippet :

distFile = sc.textFile("/hdfs/path/to/folder/fixed_file_name_*.csv")

Approach 3 :

Unless you have some legacy application in python which uses the features of pandas, I would better prefer using spark provided API

Thanks for the reply - So it sounds like you're recommending option 2. I'm less worried about the number of files than the size of the files. Will the wildcard natively append the files together? For example, if there are 3 files that fit the wildcard, does it automatically union them for me, or does it return a list of 3 separate files? — flyingmeatball, Dec 14 '16 at 15:03

score 0 · Answer 2 · answered Jan 29 '20 at 18:52

I landed here trying to accomplish something similar. I have one function that will read HDFS and return a dictionary of lists.

def get_hdfs_input_files(hdfs_input_dir):
    """Returns a dictionary object with a file list from HDFS
    :rtype: dict
    """
    import subprocess
    sub_proc_cmd = "hdfs dfs -ls " + hdfs_input_dir + " | awk '{print $8}'"
    process = subprocess.run(sub_proc_cmd, shell=True, stdout=subprocess.PIPE)
    decoded_process = process.stdout.decode('utf-8')
    file_list = decoded_process.split("\n")
    claim_list, pharma_list, service_list = [], [], []
    for file in file_list:
        if file[-4:] == 'claim':
            claim_list.append(file)
        elif file[-4:] == 'pharma':
            pharma_list.append(file)
        elif file[-3:] == 'service':
            service_list.append(file)
    ret_dict = {'claim': claim_list, 'pharma': pharma_list, 'service': service_list}
    return ret_dict

Once you have a list of the CSV files, you can read them all into an RDD with Pyspark. The docs state that it the CSV DataFrameReader will accept a, "string, or list of strings, for input path(s), or RDD of Strings storing CSV rows". Just pass the method a list of files.

file_list = get_hdfs_input_files('/some/hdfs/dir')
claim_df = spark.read.csv(my_list.get('claim'), 
               delimiter = '|',header ='true',nullValue ='null')  
pharma_df = spark.read.csv(my_list.get('pharma'), 
               delimiter = '|',header ='true',nullValue ='null')
service_df = spark.read.csv(my_list.get('service'), 
               delimiter = '|',header ='true',nullValue ='null')

Pyspark read multiple csv files into a dataframe (OR RDD?)

2 Answers2