How can I read multiple CSV files, with different columns and different file paths, into a single dataframe?
Here is an example with just two paths; in practice there are many paths, varying by frn and filename:
"wasbs://containername@accountname.blob.core.windows.net/blobname/frn=xyz123/filename=file1/myfile34.csv"
"wasbs://containername@accountname.blob.core.windows.net/blobname/frn=abc567/filename=file2/myfile84.csv"
The columns should be concatenated, with NA filled in wherever a value is not present. I also want to add new frn and filename columns to each CSV's dataframe, taken from the values in its path. Each filename folder contains only a single CSV file, but there are multiple filename folders under each frn folder.
For example:
myfile34.csv
a b frn filename
0 3 4 xyz123 file1
1 4 5 xyz123 file1
2 5 6 xyz123 file1
myfile84.csv
a c frn filename
0 1 3 abc567 file2
1 2 4 abc567 file2
2 3 5 abc567 file2
final df
a b c frn filename
0 3 4 NA xyz123 file1
1 4 5 NA xyz123 file1
2 5 6 NA xyz123 file1
3 1 NA 3 abc567 file2
4 2 NA 4 abc567 file2
5 3 NA 5 abc567 file2
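The behavior I want can be reproduced with plain pandas on in-memory frames (shown here only to illustrate the target output; the real data lives in blob storage):

```python
import pandas as pd

# Contents of myfile34.csv, tagged with the frn/filename from its path
df1 = pd.DataFrame({"a": [3, 4, 5], "b": [4, 5, 6]})
df1["frn"] = "xyz123"
df1["filename"] = "file1"

# Contents of myfile84.csv, tagged likewise
df2 = pd.DataFrame({"a": [1, 2, 3], "c": [3, 4, 5]})
df2["frn"] = "abc567"
df2["filename"] = "file2"

# concat takes the union of the columns; cells missing from a file become NaN
final = pd.concat([df1, df2], ignore_index=True, sort=False)
final = final[["a", "b", "c", "frn", "filename"]]  # match the column order above
```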
I tried this:
import databricks.koalas as ks

path = "wasbs://containername@accountname.blob.core.windows.net/blobname/*/*/"
df = ks.read_csv(path, sep=",", header='infer')
but how can I concatenate these different CSV files and create the new columns (frn and filename) as well?