
In PySpark, I am using a mounted data lake container with the following content:

dbutils.fs.ls("/mnt/adlslirkov/hudl-layers/raw/BIMS/2022/01/")
Out[91]: [FileInfo(path='dbfs:/mnt/adlslirkov/hudl-layers/raw/BIMS/2022/01/01/', name='01/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/02/', name='02/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/03/', name='03/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/04/', name='04/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/05/', name='05/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/06/', name='06/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/07/', name='07/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/08/', name='08/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/09/', name='09/', size=0),
 FileInfo(path='dbfs:/mnt/adlslirkov/layers/raw/BM/2022/01/10/', name='10/', size=0)]

Using a loop, I would like to create a DataFrame for each of the files within these folders. I would like to have 10 DataFrames with names like df_bm_01012022, df_bm_02012022, etc., where the first two digits are the name of the folder the file is in. This is what I have right now:

df_default_name = "df_bm_"
df_default_path = "/mnt/adlslir/layers/raw/BM/2022/01/"
df_dict = {}

for i in dbutils.fs.ls("/mnt/adlslir/layers/raw/BM/2022/01/"):
    # convert each element to list
    lst_paths = list(i)
    # create dictionary
    df_dict[df_default_name + lst_paths[-2].replace('/', '') + "012022"] = df_default_path + lst_paths[-2].replace('/', '') + '/'
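If the listing matches the output above, df_dict should end up looking roughly like this (illustrative, not actual output):

# Expected shape of df_dict after the loop (illustrative):
{
    "df_bm_01012022": "/mnt/adlslirkov/layers/raw/BM/2022/01/01/",
    "df_bm_02012022": "/mnt/adlslirkov/layers/raw/BM/2022/01/02/",
    # ... one entry per day folder, up to "df_bm_10012022"
}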

    
for i, y in df_dict.items():
    # "header" is a CSV option and has no effect on parquet files
    i = spark.read.parquet(y)
    i.display()
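
I suspect the names never materialize because assigning to `i` inside the loop only rebinds the loop variable; it does not create a variable named after the dictionary key. A minimal plain-Python illustration:

d = {"df_a": 1, "df_b": 2}
for i, y in d.items():
    i = y * 10  # rebinds the loop variable `i` only
print(i)        # 20 -- the last value; no variable named "df_a" exists
# print(df_a)   # would raise NameError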

The last for loop displays all 10 DataFrames at once. However, I would like to be able to access each one of them by its name. For example, if in the next cell I call display(df_bm_07012022), I would like to get the DataFrame for that particular day. How should I do that?

  • Use `df_dict[i] = spark.read.parquet(y)`, then access the DataFrames through the dict: `display(df_dict["df_bm_07012022"])` – blackbishop Jan 15 '22 at 15:52
  • Check how to access a dict using its keys; see https://stackoverflow.com/questions/11041405/why-dict-getkey-instead-of-dictkey – Indrajit Swain Jan 16 '22 at 03:57
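
Putting the two comments together, a minimal sketch of the suggested approach (assuming a Databricks notebook where spark and display are predefined):

# Store each DataFrame back into the dict under its generated name,
# instead of rebinding the loop variable.
for name, path in df_dict.items():
    df_dict[name] = spark.read.parquet(path)

# Later cells can then look up a DataFrame by name:
display(df_dict["df_bm_07012022"])

# df_dict.get("df_bm_07012022") works too and returns None instead of
# raising a KeyError when the key is missing.

If real top-level names like df_bm_07012022 are strictly required, they could be written into globals() (e.g. globals()[name] = spark.read.parquet(path)), but keeping the DataFrames in a dict avoids polluting the namespace and is the approach usually recommended.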

0 Answers