
I have scores of SAS datasets I want to export to pandas DataFrames. The saspy module has a sd2df method for this purpose. The issue I am having is described by this SO post, which has links explaining why strings cannot be substituted and used as variable names when executing code.

I'm defining a mk_df function to call the sd2df method and then using a dictionary to pass the key/value pairs.

import os
import glob
from pathlib import Path
import saspy
import pandas as pd

p = Path('/home/trb/sasdata/export_2_df')
sas_datasets = []
df_names     = []
pya_tables   = []
sep = '.'

for i in p.rglob('*.sas7bdat'):
    sas_datasets.append(i.name.split(sep,1)[0])
    df_names.append('df_' + i.name.split(sep,1)[0])

sd_2_df_dict = dict(zip(sas_datasets,df_names))
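# sd_2_df_dict now maps each dataset name to its target DataFrame name,
# e.g. {'cars': 'df_cars', 'failure': 'df_failure', ...} (see the printed pairs below)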

sas = saspy.SASsession(results='HTML')

Returning:

Using SAS Config named: default
SAS Connection established. Subprocess id is 27752

Code continues...

# tell sas where to find the dataset
sas_code='''
   libname out_df "~/sasdata/export_2_df/";
'''
libref = sas.submit(sas_code)
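# sas.submit() returns a dict holding the SAS LOG and LST output;
# printing the log is a quick way to confirm the libref was assigned
print(libref['LOG'])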

# define the mk_df function
def mk_df(sas_name, df_name):
    df_name = sas.sd2df(table = sas_name, libref = 'out_df', method='CSV')
    return df_name

# call the mk_df function
for key, value in sd_2_df_dict.items():
    print(key, value)
    mk_df(key, value)

Returns:

cars df_cars
failure df_failure
airline df_airline
prdsale df_prdsale
retail df_retail
stocks df_stocks

However, none of the dataframes are created.

print(df_cars)

NameError                                 Traceback (most recent call last)
<ipython-input-18-aa21e263bad6> in <module>()
----> 1 print(df_cars)

NameError: name 'df_cars' is not defined

I verified the mk_df function works:

mk_df('stocks', 'df_stocks')

    Stock   Date    Open    High    Low     Close   Volume  AdjClose
0   IBM     2005-12-01  89.15   89.92   81.56   82.20   5976252.0   81.37
1   IBM     2005-11-01  81.85   89.94   80.64   88.90   5556471.0   88.01
2   IBM     2005-10-03  80.22   84.60   78.70   81.88   7019666.0   80.86
3   IBM     2005-09-01  80.16   82.11   76.93   80.22   5772280.0   79.22
4   IBM     2005-08-01  83.00   84.20   79.87   80.62   4801386.0   79.62

Printing the key and value shows that they are strings:

print(key, value)
   stocks df_stocks

How do I iterate the call to the mk_df function? Or is there a different approach I should consider?

@Python R SAS, that is a helpful observation. So I changed the mk_df function to include more information and to attempt to explicitly name the output DataFrame.

def mk_df(sas_name, out_df):
    out_df = sas.sd2df(table = sas_name, libref = 'out_df', method='CSV')
    out_df.df_name = out_df
    name =[x for x in globals() if globals()[x] is out_df]
    print("Dataframe Name is: ",  name, "Type: ", type(out_df))
    return out_df

The call to the function is now:

j = 0

for key, value in sd_2_df_dict.items():
     mk_df(key, value).name=df_names[j]
     j += 1

Returns:

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  This is separate from the ipykernel package so we can avoid doing imports until

Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
RandyB
  • `print(df_cars)` For this statement to work, the variable should have been initialized. However, I don't see that in the code you pasted. – ranka47 Jul 21 '19 at 02:41

2 Answers


I didn't run this code, so my response could be incorrect. What I see is that your mk_df function returns the dataframe created from the SAS dataset. However, you are not assigning it to anything when you invoke the function. So df_name is not available outside the mk_df function.
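
For example (a minimal sketch using the names from your own code), assigning the return value keeps the DataFrame around after the call:

# bind the returned DataFrame to a name in the calling scope
df_cars = mk_df('cars', 'df_cars')
print(df_cars.head())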

Changes based on your edits

About this piece of code - name = [x for x in globals() if globals()[x] is out_df]: you don't have any variable by the name df_cars, df_failure, etc. created anywhere. Therefore there is nothing in the globals() dictionary that will give you a hit, so the list is always empty - which is what you see in your output.

I think what you are trying to do is create a dataframe by the name contained in the variable 'value' in the iteration. So if value is df_cars, you are trying to create a dataframe called df_cars.

The issue is that when you do something like df_name = ... inside the function, it only rebinds the local name, so the caller's reference is never changed and the result is lost. See this excellent discussion for details: How do I pass a variable by reference?
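
A minimal sketch of that rebinding behaviour (the names here are purely illustrative):

def rebind(df_name):
    # rebinding the parameter only changes the local name;
    # the caller's variable is untouched
    df_name = "something else"

value = "df_cars"
rebind(value)
print(value)   # still prints 'df_cars'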

Also, I think you have mixed up df_name and out_df, or I didn't fully get what you are trying to do.

One way to do this would be to use exec, which is generally not recommended for various reasons:


def mk_df(sas_name):
    this_df = sas.sd2df(table = sas_name, libref = 'out_df', method='CSV')
    this_df.name = "df_" + sas_name
    return this_df

for key, value in sd_2_df_dict.items():
     exec(value + " = mk_df('" + key + "')")   # quote key so it is passed as a string literal

But I think you are better off just maintaining the different dataframes through a dictionary, i.e.

dfs = dict()
for key, value in sd_2_df_dict.items():
     dfs[key] = mk_df(key)
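
You can then look up any individual frame by its SAS dataset name, for example:

dfs['stocks'].head()   # the DataFrame built from the stocks dataset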

SAS2Python
  • Thank you @Python R SAS. Clearly I was approaching this from the SAS approach of `SELECT var into :macvar` semantics to inject a collection of variables into the PDV. This helped to improve my understanding of the role dictionaries have in Python. The link you provided on how to pass a variable by reference was also very informative! – RandyB Jul 22 '19 at 13:28

Yes, that posted answer seems correct in that the data frame object created by sd2df isn't ever assigned to anything, so it's lost once you're out of the scope of the mk_df routine. I see you had declared another list which you didn't use: pya_tables.

I think all you need to do is use it for what I guess you were thinking of originally: assign the returned data frame into that list, and then you have a list of data frames you can reference. You might prefer a dictionary instead; either way solves your problem. It's just a matter of how you want to reference the frames after the fact, by index number or by name.

Change the last line in your call to the mk_df routine (in my case I had 6 data sets in my library):

>>> for key, value in sd_2_df_dict.items():
...     print(key, value)
...     pya_tables.append(mk_df(key, value))
...
xpxout df_xpxout
scoredata df_scoredata
x2 df_x2
a df_a
tktg df_tktg
boo3f df_boo3f
>>> len(pya_tables)
6
>>> pya_tables[1]
   DURATION PROTOCOL SERVICE FLAG  SRC_BYTE  DST_BYTE LAND  WR_FRAG  URGENT  HOT  ...  I_ATTACK U_ATTACK  P_ATTACKu2r  P_ATTACKr2l  P_ATTACKprobe  P_ATTACKnormal  P_ATTACKdos  EM_EVENTPROBABILITY  EM_PROBABILITY  EM_CLASSIFICATION
0       0.0      tcp    http   SF     256.0    1169.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
1       0.0      tcp    http   SF     248.0    2129.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
2       0.0      tcp    http   SF     214.0   14959.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
3       0.0      tcp    http   SF     235.0    6627.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
4       0.0      tcp    http   SF     313.0     293.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
5       0.0      tcp    http   SF     309.0   17798.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
6       0.0      tcp    http   SF     231.0    2281.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
7       0.0      tcp    http   SF     227.0    1247.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
8       0.0      tcp    http   SF     233.0    3609.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
9       0.0      tcp    http   SF     224.0    3609.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL

I used only your code, with my path, and just changed that last line assigning the returned df to the list, so it was accessible.

Tom
  • You might want to change the following: `pya_tables = {}` and `pya_tables[value] = mk_df(key, value)`; then `pya_tables['df_scoredata']` returns the DataFrame (DURATION PROTOCOL SERVICE FLAG SRC_BYTE DST_BYTE LAND WR_FRAG URGENT HOT ... I_ATTACK U_ATTACK P_ATTACKu2r P_ATTACKr2l P_ATTACKprobe P_ATTACKnormal P_ATTACKdos EM_EVENTPROBABILITY EM_PROBABILITY EM_CLASSIFICATION) ... can't seem to format this – Tom Jul 24 '19 at 17:36
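
Formatted, the dictionary variant suggested in that comment would look roughly like this (a sketch reusing the names from the answer):

pya_tables = {}
for key, value in sd_2_df_dict.items():
    # key the stored DataFrames by their target name rather than list position
    pya_tables[value] = mk_df(key, value)

pya_tables['df_scoredata']   # look up a frame by name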