I would like to retrieve the variable name of a DataFrame instance that I pass as an argument to a function, so that I can use this name inside the function. Example in a script:

display(df_on_step_42)

I would like to retrieve the string "df_on_step_42" inside the display function (which displays the content of the DataFrame).

As a last resort, I can pass both the DataFrame and its name as arguments:

display(df_on_step_42, "df_on_step_42")

But I would prefer to do without this second argument.

PySpark DataFrames are immutable, and every transformation returns a new DataFrame, so in our data pipeline we cannot systematically attach a name attribute to each DataFrame derived from another one.

ajonvill
  • There is no reasonable way to do this. This is fundamentally a bad design. Variable names should not carry data. If you want to associate a string with another object, then make that association explicit, e.g., make it a pair, create a custom class with both, etc etc. Or in this case, simply pass the string as an argument to `display` – juanpa.arrivillaga Feb 15 '23 at 20:08
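The explicit association suggested in this comment could look like the sketch below. `NamedFrame` and this `display` signature are hypothetical names chosen for illustration, and a plain list stands in for the DataFrame so that the snippet runs without PySpark.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class NamedFrame:
    """Hypothetical wrapper that pairs a DataFrame-like object with a label."""
    name: str
    frame: Any


def display(named: NamedFrame) -> str:
    # The label travels with the data, so no introspection is needed.
    header = f"--- {named.name} ---"
    print(header)
    print(named.frame)
    return header


# A list stands in for a real DataFrame here (assumption for illustration).
df_on_step_42 = NamedFrame("df_on_step_42", [1, 2, 3])
display(df_on_step_42)
```

Because the wrapper is created at the point where each new DataFrame is produced, the name survives even though the DataFrame itself cannot be modified.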

2 Answers

You can use the globals() dictionary to search for your variable by matching it using eval.

As @juanpa.arrivillaga mentions, this is fundamentally bad design, but if you need to, here is one way to do this inspired by this old SO answer for python2 -

import pandas as pd

df_on_step_42 = pd.DataFrame()

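def get_var_name(var):
    # Scan this module's global names and return the first one bound to `var`.
    for k in list(globals().keys()):
        try:
            # `is` compares identity, so equal-but-distinct objects do not match.
            if eval(k) is var:
                return k
        except Exception:
            # Skip names that cannot be evaluated instead of aborting the search.
            pass
    return None

get_var_name(df_on_step_42)
'df_on_step_42'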

Your display would then look like -

display(df_on_step_42, get_var_name(df_on_step_42))

Caution

This cannot distinguish between aliases. When two names refer to the same object, they point to the same memory, so the function returns whichever name occurs first while iterating over the keys of the global dictionary. If the original variable occurs first, its name is returned instead of the alias.

a = 123
b = a

get_var_name(b)
'a'
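One way to make this ambiguity visible is to return every name bound to the object rather than only the first. The sketch below is a variant of the function above, not part of the original answer: `get_var_names` and its `namespace` parameter are hypothetical, and it compares values by identity directly instead of calling eval.

```python
def get_var_names(var, namespace):
    """Return every name in `namespace` that is bound to the object `var`."""
    # `is` checks identity, so all aliases of the same object match.
    return [name for name, value in namespace.items() if value is var]


a = object()
b = a  # alias: a second name for the same object
print(get_var_names(b, globals()))
```

A caller that receives more than one name then knows the result is ambiguous and can fall back to an explicit label.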
Akshay Sehgal
  • Thank you Akshay for your response. Unfortunately, this doesn't give me any results so far. It is in the display function which is on another module, that I want to use the name of the dataframe passed as an argument. Testing your solution, when I print globals().keys() in the body of the display() function, I get: globals().keys()=dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__file__', '__cached__', '__builtins__', 'tk', 'Table' , 'DataFrame', 'get_var_name', 'display']) The function get_var_name(df) returns me tk, which does not correspond to my need. – ajonvill Feb 16 '23 at 16:22
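The behaviour described in this comment follows from globals() returning the namespace of the module in which the function is defined, not that of the caller. A possible workaround, sketched below, is to search the caller's frame instead. `name_in_caller` and its `depth` parameter are hypothetical, a list stands in for the DataFrame, the aliasing caveat still applies, and `inspect.currentframe()` may return None on interpreters without stack frame support.

```python
import inspect


def name_in_caller(obj, depth=1):
    """Return the first name bound to `obj` `depth` frames above this call, or None."""
    frame = inspect.currentframe()
    for _ in range(depth):
        if frame is None:
            return None
        frame = frame.f_back
    if frame is None:
        return None
    # Check that frame's local names first, then its module globals.
    for namespace in (frame.f_locals, frame.f_globals):
        for name, value in list(namespace.items()):
            if value is obj:
                return name
    return None


def display(df):
    # depth=2 skips display's own frame and searches whoever called display.
    name = name_in_caller(df, depth=2)
    print(f"{name}:")
    print(df)
    return name


df_on_step_42 = [1, 2, 3]  # stand-in for a real DataFrame (assumption)
display(df_on_step_42)
```

Since the lookup goes through the calling frame, it keeps working when display lives in a different module from the script that calls it.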

I finally found a solution to my problem using the inspect and re libraries.

Here is how the display() function uses them:

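import inspect
import re

def display(df):
    # Frame record of the caller: index 0 is display() itself, index 1 is its caller.
    frame = inspect.getouterframes(inspect.currentframe())[1]
    # code_context holds the source line of the call (None if no source is available).
    line = frame.code_context[0] if frame.code_context else ""
    # Capture the first argument name inside the parentheses of the call.
    match = re.search(r"display\(\s*(\w+)", line)
    name = match[1] if match else None
    print(name)

display(df_on_step_42)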

The inspect library gives me the calling context of the function. In this context, the code_context attribute holds the text of the line where the function is called, and the re library then lets me isolate the name of the DataFrame passed as a parameter.
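A regular expression only handles the simplest call shapes. As a possible refinement, the line taken from code_context can be parsed with the standard ast module, which also copes with extra whitespace, additional arguments, or a module prefix. This is a sketch rather than the code from this answer: `first_arg_name` is a hypothetical helper, and it still assumes the whole call fits on one source line.

```python
import ast


def first_arg_name(source_line, func_name="display"):
    """Return the name of the first positional argument passed to `func_name`
    in `source_line`, or None if it cannot be determined."""
    try:
        tree = ast.parse(source_line.strip())
    except SyntaxError:
        return None  # e.g. the line is only part of a multi-line statement
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both display(...) and module.display(...).
        called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if called == func_name and node.args and isinstance(node.args[0], ast.Name):
            return node.args[0].id
    return None


print(first_arg_name("display(df_on_step_42)\n"))
print(first_arg_name("    utils.display( df_on_step_42 , 10)\n"))
```

Both calls print df_on_step_42. Inside display(), the line to parse would be frame.code_context[0], guarded against code_context being None when no source is available.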

It’s not optimal but it works.

ajonvill