
I have made a loop where I iterate over (csv) files in a folder, read them into a dictionary of dataframes and name them after the csv file (e.g. file1.csv becomes file1_df). I do some work on the data and generate new rows, then I try to subset part of my dataframes into a new dataframe (file1_df2). I would like to later reference these dataframes outside of the dictionary.

    df_dict = {}
    for file in os.listdir(datadir):  # Loop over the files in that folder (only has CSV files)
        df_name = file[:-4] + '_df'  # Trim off .csv to name the dataframe
        df_dict[df_name] = pd.read_csv(os.path.join(datadir, file))

Is it possible to reference these dataframes by name? So later I can just call file1_df2 instead of df_dict["file1_df2"]?

In essence I am asking the same question as here. It doesn't look like that question was answered either, so I suspect this might not be possible, but I have yet to find an answer that explicitly says so.


I know this is possible in languages like SAS and Stata, but I have never figured out how to do it in Python. In those languages, you can plug your placeholder variable directly into the name of something.

    /* In SAS */
    %let param = test1;
    libname path "C:\User\&param.";

    proc sql;
    create table &param._df as
    select * from path.&param.;
    quit;

    /* In Stata */
    foreach i in file1 file2 {
        import delimited "`i'.csv", clear
        save "`i'.dta", replace
    }

etc. If this is not possible, I would like to know that with certainty. Thank you!

Zoupah
  • It is possible but not a good idea; see https://stackoverflow.com/a/6181959/11220780 – Benoit de Menthière Oct 04 '19 at 21:00
  • Possible: yes, see e.g. the ```exec``` function: https://docs.python.org/3.5/library/functions.html#exec Recommended: rather not, because how will you know programmatically which names exist and which don't? – Grzegorz Skibinski Oct 04 '19 at 21:00
  • Knowing it is possible and not recommended (and why it is not recommended) is also great information. Thank you for the resources. – Zoupah Oct 04 '19 at 21:19

1 Answer


The lack of answers is likely because nobody can really tell WHY you want to do this. The question seems to stem from carrying a SAS / Stata workflow over to Python, where it doesn't really fit.

However, I think this does what you're asking:

    import pandas as pd

    my_csvs = ["name1.csv", "name2.csv", "name3.csv"]
    my_dfs = [pd.read_csv(csv) for csv in my_csvs]
    # Key each dataframe by its file name, with ".csv" swapped for "_df"
    df_dict = {name.replace(".csv", "_df"): df for name, df in zip(my_csvs, my_dfs)}

    # Access dataframes through the dictionary (this is the advisable method!)
    csv2 = df_dict["name2_df"]

Then we can add these keys to our namespace with an exec() call:

    # Now add them to the namespace
    for k in df_dict:
        exec(f"{k} = df_dict['{k}']")
        # or use "{k} = df_dict['{k}']".format(k=k) on Python < 3.6 (no f-strings)

    # Now does this work?
    print(name2_df)

And this actually does work, at least at module level. However, most IDEs and linters will flag the last line, because nothing in the source visibly declares that variable.
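
One concrete reason for caution: the exec() trick only creates names reliably at module level. Inside a function, the assignment ends up in a separate view of the local namespace and the function body never sees the new name. Here is a minimal sketch of that behaviour. It uses plain lists as stand-ins for dataframes so it runs without pandas or CSV files, and the names `demo_frames`, `inject_inside_function`, `sales_df` and `costs_df` are hypothetical, not from the code above.

```python
# Stand-in data: plain lists instead of DataFrames, so the sketch runs
# without pandas or CSV files. All names here are hypothetical.
demo_frames = {"sales_df": [1, 2, 3], "costs_df": [4, 5, 6]}


def inject_inside_function(frames):
    """Try the exec() trick inside a function and report whether it worked."""
    for key in frames:
        # The assignment is written to exec's view of the local namespace.
        exec(f"{key} = frames['{key}']")
    try:
        # This name was compiled as a global lookup, so the assignment made
        # by exec() above is not visible here.
        return sales_df
    except NameError:
        return None


# Called before any module-level injection, so no global sales_df exists yet.
result_in_function = inject_inside_function(demo_frames)

# At module level the same loop does create the names.
for key in demo_frames:
    exec(f"{key} = demo_frames['{key}']")

# The injected name is a reference to the same object, not a copy.
same_object = sales_df is demo_frames["sales_df"]
```

In this sketch `result_in_function` ends up as `None`, while the module-level loop binds `sales_df` and `costs_df` to the very objects held in the dictionary.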

I strongly advise against using this.
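
If the goal is mainly to avoid typing `df_dict["..."]`, one possible middle ground (a swapped-in alternative, not something the code above does) is to wrap the dictionary in a `types.SimpleNamespace` from the standard library. A minimal sketch, again with plain lists standing in for dataframes and hypothetical names (`frames_by_name`, `dfs`); it assumes every key is a valid Python identifier:

```python
from types import SimpleNamespace

# Stand-in data: plain lists instead of DataFrames. Keys must be valid
# Python identifiers for attribute access to work.
frames_by_name = {"file1_df": [1, 2, 3], "file1_df2": [2, 3]}

# Wrap the dictionary so each key becomes an attribute on one object.
dfs = SimpleNamespace(**frames_by_name)

# Attribute access instead of frames_by_name["file1_df2"].
subset = dfs.file1_df2

# The available names stay discoverable programmatically.
available = sorted(vars(dfs))
```

This keeps all the dataframes behind a single, explicitly declared variable, so an IDE has something to resolve, and the set of names can still be inspected at runtime.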

Jwely
  • I was about to post something like that. The beauty of it is that Python just creates a reference and not a new object, which is easily checked using ```csv2 is df_dict['name2_df']```. – accdias Oct 04 '19 at 21:03
  • Thank you, knowing that it is possible but not recommended is also very helpful. – Zoupah Oct 04 '19 at 21:20