2

How can I read multiple CSV files that have different columns and different file paths, and combine them into a single dataframe?

Here is an example with just 2 paths; however, there are many paths, varying by frn and filename.

"wasbs://containername@accountname.blob.core.windows.net/blobname/frn=xyz123/filename=file1/myfile34.csv"

"wasbs://containername@accountname.blob.core.windows.net/blobname/frn=abc567/filename=file2/myfile84.csv"


  1. Columns should be concatenated, with NA filled in where a value is not present.

  2. New columns frn and filename should be added to each CSV file's dataframe, taken from its path as shown above.

  3. Each filename folder contains only a single CSV file, but there are multiple filename folders under each frn folder.

For example:

myfile34.csv
   a  b frn      filename
0  3  4 xyz123   file1   
1  4  5 xyz123   file1
2  5  6 xyz123   file1

myfile84.csv    
   a  c frn      filename
0  1  3 abc567   file2
1  2  4 abc567   file2
2  3  5 abc567   file2

final df
   a  b   c   frn      filename
0  3  4   NA  xyz123   file1
1  4  5   NA  xyz123   file1
2  5  6   NA  xyz123   file1
3  1  NA  3   abc567   file2
4  2  NA  4   abc567   file2
5  3  NA  5   abc567   file2


I tried this:

import databricks.koalas as ks

path = "wasbs://containername@accountname.blob.core.windows.net/blobname/*/*/"
df = ks.read_csv(path, sep=",", header='infer')

But how can I concatenate these different CSV files and also create the new columns (like frn and filename)?

vsb
  • 428
  • 6
  • 15
  • Does this answer your question? [How to perform union on two DataFrames with different amounts of columns in spark?](https://stackoverflow.com/questions/39758045/how-to-perform-union-on-two-dataframes-with-different-amounts-of-columns-in-spar) – blackbishop Dec 16 '21 at 16:12
  • @blackbishop Not completely! Actually there are multiple CSV files which are stored in different paths. – vsb Dec 16 '21 at 16:55

2 Answers

1

My recommendation/workaround would be to use the Parquet file format, as it's one of the best formats for cases of schema evolution.

Of course, I don't deny that this can also be done with a few more steps in PySpark itself, though it won't be straightforward.
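As a rough sketch of those extra PySpark steps (my own illustration, assuming Spark 3.1+ for `allowMissingColumns` and a list of per-file paths like the two from the question), each CSV can be read on its own, tagged with the frn and filename parsed from its path, and then unioned with null-filling for columns missing on either side:

```python
from functools import reduce
from pyspark.sql import functions as F

# Hypothetical list of the per-file paths (taken from the two examples in the question);
# in practice it could be built by listing the container, e.g. with dbutils.fs.ls.
paths = [
    "wasbs://containername@accountname.blob.core.windows.net/blobname/frn=xyz123/filename=file1/myfile34.csv",
    "wasbs://containername@accountname.blob.core.windows.net/blobname/frn=abc567/filename=file2/myfile84.csv",
]

dfs = []
for p in paths:
    # Parse frn and filename out of the "frn=.../filename=..." folder names in the path
    parts = dict(seg.split("=", 1) for seg in p.split("/") if "=" in seg)
    dfs.append(
        # spark is the active SparkSession (available by default in Databricks)
        spark.read.option("header", True).option("inferSchema", True).csv(p)
        .withColumn("frn", F.lit(parts["frn"]))
        .withColumn("filename", F.lit(parts["filename"]))
    )

# unionByName with allowMissingColumns=True (Spark 3.1+) fills the non-shared columns with nulls
final_df = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)
final_df.show()
```

This produces the layout shown in the question's "final df", with nulls in b or c wherever a file did not have that column.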

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging


# Write both dataframes in same location, utilizing append, in parquet format.
df_1.write.mode('append').parquet('<URI_HERE>/out')
df_2.write.mode('append').parquet('<URI_HERE>/out')

# Note: mergeSchema will merge files with different structures, combining the common
# columns and filling nulls for the non-intersecting columns.
spark.read.option('mergeSchema', True).parquet('<URI_HERE>/out').show()


Happy learning!

Jim Todd
  • 1,488
  • 1
  • 11
  • 15
  • I have many csv files (not just 2) and there is no option to modify format of file from csv to parquet. – vsb Dec 16 '21 at 17:05
  • @vsb, I see that you use Koalas in Databricks. So, of course there is an option to convert the dataframe to parquet https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html – Jim Todd Dec 17 '21 at 11:03
  • @vsb, as you know, the file structures are different from one another, so you have to create the dataframes from each CSV file and just iterate through a loop. If you create it on top of the root folder, how would it be able to decide what structure it needs to follow? – Jim Todd Dec 17 '21 at 11:04
0

Maybe this?

val myDFCsv = spark.read.format("csv")
   .option("sep","|")
   .option("inferSchema","true")
   .option("header","false")
   .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.head()
myDFCsv.count()


//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name

val myDFCsv = spark.read.format("csv")
   .option("sep","|")
   .option("inferSchema","true")
   .option("header","false")
   .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
   .withColumn("file_name",input_file_name())


myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()
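For the Python/Koalas context of the question, a rough PySpark counterpart of the `input_file_name()` idea is below (a sketch only: the path and regexes are illustrative, assuming the frn=.../filename=... folder layout from the question). It shows how frn and filename can be pulled out of the captured path; files with different schemas would still need to be read and unioned separately.

```python
from pyspark.sql import functions as F

# Illustrative: read one folder's CSV and keep each row's source path
df = (spark.read.option("header", True).option("inferSchema", True)
      .csv("wasbs://containername@accountname.blob.core.windows.net/blobname/frn=xyz123/filename=file1/*.csv")
      .withColumn("path", F.input_file_name()))

# Pull frn and filename out of the "frn=.../filename=..." folder names in the path
df = (df
      .withColumn("frn", F.regexp_extract("path", r"frn=([^/]+)", 1))
      .withColumn("filename", F.regexp_extract("path", r"filename=([^/]+)", 1))
      .drop("path"))

df.show(truncate=False)
```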
ASH
  • 20,759
  • 19
  • 87
  • 200