
I have scores of SAS datasets I want to export to pandas DataFrames. The saspy module has a sd2df method for this purpose. The issue I am having is described by this SO post, which has links explaining why strings cannot be substituted and used as variable names when executing code.

I'm defining a mk_df function to call the sd2df method and then using a dictionary to pass the key/value pairs.

import os
import glob
from pathlib import Path
import saspy
import pandas as pd

p = Path('/home/trb/sasdata/export_2_df')
sas_datasets = []
df_names     = []
pya_tables   = []
sep = '.'

for i in p.rglob('*.sas7bdat'):
    sas_datasets.append(i.name.split(sep,1)[0])
    df_names.append('df_' + i.name.split(sep,1)[0])

sd_2_df_dict = dict(zip(sas_datasets,df_names))
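# sd_2_df_dict now maps each dataset name to its target DataFrame name,
# e.g. {'cars': 'df_cars', 'failure': 'df_failure', ...} (see the printed pairs below)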

sas = saspy.SASsession(results='HTML')

Returning:

Using SAS Config named: default
SAS Connection established. Subprocess id is 27752

Code continues...

# tell sas where to find the dataset
sas_code='''
   libname out_df "~/sasdata/export_2_df/";
'''
libref = sas.submit(sas_code)
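# sas.submit() returns a dict holding the SAS LOG and LST output;
# printing the log is a quick way to confirm the libref was assigned
print(libref['LOG'])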

# define the mk_df function
def mk_df(sas_name, df_name):
    df_name = sas.sd2df(table = sas_name, libref = 'out_df', method='CSV')
    return df_name

# call the mk_df function
for key, value in sd_2_df_dict.items():
    print(key, value)
    mk_df(key, value)

Returns:

cars df_cars
failure df_failure
airline df_airline
prdsale df_prdsale
retail df_retail
stocks df_stocks

However, none of the dataframes are created.

print(df_cars)

NameError                                 Traceback (most recent call last)
<ipython-input-18-aa21e263bad6> in <module>()
----> 1 print(df_cars)

NameError: name 'df_cars' is not defined

I verified the mk_df function works:

mk_df('stocks', 'df_stocks')

    Stock   Date    Open    High    Low     Close   Volume  AdjClose
0   IBM     2005-12-01  89.15   89.92   81.56   82.20   5976252.0   81.37
1   IBM     2005-11-01  81.85   89.94   80.64   88.90   5556471.0   88.01
2   IBM     2005-10-03  80.22   84.60   78.70   81.88   7019666.0   80.86
3   IBM     2005-09-01  80.16   82.11   76.93   80.22   5772280.0   79.22
4   IBM     2005-08-01  83.00   84.20   79.87   80.62   4801386.0   79.62

Printing the key and value shows that they are strings:

print(key, value)
   stocks df_stocks

How do I iterate the call to the mk_df function? Or is there a different approach I should consider?

@Python R SAS, that is a helpful observation. So I changed the mk_df function to include more information and to attempt to explicitly name the output DataFrame.

def mk_df(sas_name, out_df):
    out_df = sas.sd2df(table = sas_name, libref = 'out_df', method='CSV')
    out_df.df_name = out_df
    name =[x for x in globals() if globals()[x] is out_df]
    print("Dataframe Name is: ",  name, "Type: ", type(out_df))
    return out_df

The call to the function is now:

j = 0

for key, value in sd_2_df_dict.items():
     mk_df(key, value).name=df_names[j]
     j += 1

Returns:

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  This is separate from the ipykernel package so we can avoid doing imports until

Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
Dataframe Name is:  [] Type:  <class 'pandas.core.frame.DataFrame'>
RandyB
  • `print(df_cars)` For this statement to work, the variable should have been initialized. However, I don't see that in the code you pasted. – ranka47 Jul 21 '19 at 02:41

2 Answers


I didn't run this code, so my response could be incorrect. What I see is that your mk_df function returns the dataframe created from the SAS dataset. However, you are not assigning it to anything when you invoke the function. So df_name is not available outside the mk_df function.
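
For example (a minimal sketch using the names from your own code), assigning the return value keeps the DataFrame around after the call:

# bind the returned DataFrame to a name in the calling scope
df_cars = mk_df('cars', 'df_cars')
print(df_cars.head())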

Changes based on your edits

About this piece of code - name = [x for x in globals() if globals()[x] is out_df]: you don't have any variable by the name df_cars, df_failure, etc. created anywhere. Therefore there is nothing in the globals() dictionary that will give you a hit, so the list is always empty - which is what you see in your output.

I think what you are trying to do is create a dataframe by the name contained in the variable 'value' in the iteration. So if value is df_cars, you are trying to create a dataframe called df_cars.

The issue is that when you do something like df_name = ... inside the function, it only rebinds the local name, so the caller's reference is never changed and the result is lost. See this excellent discussion for details: How do I pass a variable by reference?
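
A minimal sketch of that rebinding behaviour (the names here are purely illustrative):

def rebind(df_name):
    # rebinding the parameter only changes the local name;
    # the caller's variable is untouched
    df_name = "something else"

value = "df_cars"
rebind(value)
print(value)   # still prints 'df_cars'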

Also, I think you have mixed up df_name and out_df, or I didn't fully get what you are trying to do.

One way to do this would be to use exec, which is generally not recommended for various reasons:


def mk_df(sas_name):
    this_df = sas.sd2df(table = sas_name, libref = 'out_df', method='CSV')
    this_df.name = "df_" + sas_name
    return this_df

for key, value in sd_2_df_dict.items():
     exec(value + " = mk_df('" + key + "')")   # quote key so it is passed as a string literal

But I think you are better off just maintaining the different dataframes through a dictionary, i.e.

dfs = dict()
for key, value in sd_2_df_dict.items():
     dfs[key] = mk_df(key)
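
You can then look up any individual frame by its SAS dataset name, for example:

dfs['stocks'].head()   # the DataFrame built from the stocks dataset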

SAS2Python
  • Thank you @Python R SAS. Clearly I was approaching this from the SAS approach of `SELECT var into :macvar` semantics to inject a collection of variables into the PDV. This helped to improve my understanding of the role dictionaries have in Python. The link you provided on how to pass a variable by reference was also very informative! – RandyB Jul 22 '19 at 13:28

Yes, that posted answer seems correct in that the data frame object created by sd2df isn't ever assigned to anything, so it's lost once you're out of the scope of the mk_df routine. I see you had declared another list which you didn't use: pya_tables.

I think all you need to do is use it for what I guess you were thinking of originally: assign the returned data frame into that list, and then you have a list of data frames you can reference. You might prefer a dictionary instead; either way solves your problem. It's just a matter of how you want to reference the frames after the fact, by index number or by name.

Change the last line in your call to the mk_df routine (in my case I had 6 data sets in my library):

>>> for key, value in sd_2_df_dict.items():
...     print(key, value)
...     pya_tables.append(mk_df(key, value))
...
xpxout df_xpxout
scoredata df_scoredata
x2 df_x2
a df_a
tktg df_tktg
boo3f df_boo3f
>>> len(pya_tables)
6
>>> pya_tables[1]
   DURATION PROTOCOL SERVICE FLAG  SRC_BYTE  DST_BYTE LAND  WR_FRAG  URGENT  HOT  ...  I_ATTACK U_ATTACK  P_ATTACKu2r  P_ATTACKr2l  P_ATTACKprobe  P_ATTACKnormal  P_ATTACKdos  EM_EVENTPROBABILITY  EM_PROBABILITY  EM_CLASSIFICATION
0       0.0      tcp    http   SF     256.0    1169.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
1       0.0      tcp    http   SF     248.0    2129.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
2       0.0      tcp    http   SF     214.0   14959.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
3       0.0      tcp    http   SF     235.0    6627.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
4       0.0      tcp    http   SF     313.0     293.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
5       0.0      tcp    http   SF     309.0   17798.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
6       0.0      tcp    http   SF     231.0    2281.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
7       0.0      tcp    http   SF     227.0    1247.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
8       0.0      tcp    http   SF     233.0    3609.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL
9       0.0      tcp    http   SF     224.0    3609.0    0      0.0     0.0  0.0  ...    NORMAL   normal     0.000983     0.028647       0.027498        0.942848     0.000024             0.000983        0.942848             NORMAL

I used only your code, with my path, and just changed that last line assigning the returned df to the list, so it was accessible.

Tom
  • You might want to change the following: `pya_tables = {}` and `pya_tables[value] = mk_df(key, value)`; then `pya_tables['df_scoredata']` returns the DataFrame (DURATION PROTOCOL SERVICE FLAG SRC_BYTE DST_BYTE LAND WR_FRAG URGENT HOT ... I_ATTACK U_ATTACK P_ATTACKu2r P_ATTACKr2l P_ATTACKprobe P_ATTACKnormal P_ATTACKdos EM_EVENTPROBABILITY EM_PROBABILITY EM_CLASSIFICATION) ... can't seem to format this – Tom Jul 24 '19 at 17:36
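
Formatted, the dictionary variant suggested in that comment would look roughly like this (a sketch reusing the names from the answer):

pya_tables = {}
for key, value in sd_2_df_dict.items():
    # key the stored DataFrames by their target name rather than list position
    pya_tables[value] = mk_df(key, value)

pya_tables['df_scoredata']   # look up a frame by name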