
I currently have this Python code (I'm using Apache Spark, but I'm pretty sure that doesn't matter for this question).

import numpy as np
import pandas as pd
from sklearn import feature_extraction
from sklearn import tree
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "My Spark Application"
df = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

def train_tree():
    # Do more stuff with the data, call other functions
    pass

def main(sc):
    cat_columns = ["Sex", "Pclass"]

    # PROBLEM IS HERE
    cat_dict = df[cat_columns].to_dict(orient='records')

    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()

    df_vector = pd.DataFrame(cat_vector)
    vector_columns = vec.get_feature_names()
    df_vector.columns = vector_columns
    df_vector.index = df.index

    # train data

    df = df.drop(cat_columns, axis=1)
    df = df.join(df_vector)

    train_tree()

if __name__ == "__main__":
    # Configure Spark    
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc   = SparkContext(conf=conf)

    # Execute Main functionality
    main(sc)

When I run it, I get this error:

cat_dict = df[cat_columns].to_dict(orient='records')
UnboundLocalError: local variable 'df' referenced before assignment

I find this puzzling because I am defining the variable df outside of the main function's scope, at the top of the file. Why would using this variable inside the function trigger this error? I have also tried putting the df variable definition inside the if __name__ == "__main__": block (before the main function is called), which made no difference.

Now, obviously there are lots of ways I could solve this, but this is more about helping me to understand Python better. So I want to ask:

a) Why does this error even occur?

b) How best to solve it, given that:

- I don't want to put the df definition inside the main function, because I want to access it in other functions.
- I don't want to use a class.
- I don't want to use a global variable.
- I don't want to pass df around in function parameters.

cs_stackX
  • For b) you're going to have to pick one! Have you read any of the numerous other `UnboundLocalError` questions? – jonrsharpe Jun 06 '15 at 15:11
  • @jonrsharpe is there really no other option? I basically just want to have access to a variable in all the functions. Seems strange that I can't do that without additional complexity. – cs_stackX Jun 06 '15 at 15:14
  • Why don't you just make it global inside `main`, with `global df`? – Nikos M. Jun 06 '15 at 15:18
  • where do you use cs? – Padraic Cunningham Jun 06 '15 at 15:18
  • @PadraicCunningham could you elaborate a little more? Or do you mean sc? – cs_stackX Jun 06 '15 at 15:21
  • @cs_stackX, yes I meant sc, also why don't you want to pass the dataframe? – Padraic Cunningham Jun 06 '15 at 15:22
  • @PadraicCunningham sc is the Spark Context, which is required for using Python files with Apache Spark. This code is simplified, so I actually have quite a few functions and would need to pass the dataframe to all of them...I just thought there could be a less repetitive way of writing the code – cs_stackX Jun 06 '15 at 15:25
  • If you want to share a common variable I would either pass it to the functions or use a class. Using global is another option but a pretty ugly solution. – Padraic Cunningham Jun 06 '15 at 15:26
  • I think the confusion is mainly due to the fact that global variables aren't 'trivially' re-assignable in functions; you have to be a bit 'explicit' about using them. So a simple change like `global df` in your `def main` should suffice, or you pass it explicitly to `main` as `main(sc, df)`; the choice is yours! I agree this is a bit unintuitive. – gabhijit Jun 06 '15 at 15:45
  • @cs_stackX you don't have to write `global df` inside every function, because it is already accessible for reading. The problem arises when you re-assign another value to `df` inside a function; only then do you get the error. So the reason you get the error is the line `df = df.drop(cat_columns, axis=1)`, which you could change to `df.drop(cat_columns, axis=1, inplace=True)`, but then the line below it also re-assigns `df`. So in functions that re-assign the global df you need to declare `global df` or pass df in as an argument; functions that don't can just use df straight away. – dopstar Jun 06 '15 at 16:56

2 Answers


You can read the variable df in your main() (or any other function) and it will work just fine, but if you try to assign a value to it inside a function (like you are doing in main() under # train data), Python treats df as a local variable throughout that function and will therefore throw the UnboundLocalError exception.

Using the global keyword with df in main() will solve your problem.
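Applied to the question's code, that is a one-line change at the top of main() (a sketch; everything else stays as in the question):

def main(sc):
    global df                          # re-assignments below now rebind the module-level df
    cat_columns = ["Sex", "Pclass"]

    cat_dict = df[cat_columns].to_dict(orient='records')

    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()

    df_vector = pd.DataFrame(cat_vector)
    df_vector.columns = vec.get_feature_names()
    df_vector.index = df.index

    df = df.drop(cat_columns, axis=1)  # no longer raises UnboundLocalError
    df = df.join(df_vector)

    train_tree()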

sparky
  • Upvote for correctly identifying the problem. However, as I've mentioned in my answer and others have said in the comments, I'm not sure using the global keyword is the best option (although it would work). – cs_stackX Jun 07 '15 at 06:32

I think it's worth summarizing the comments into a detailed answer for future readers of this question.

The reason the UnboundLocalError is thrown here is the way Python function scope works. Although my df variable is defined outside of the main function, at module scope, attempting to re-assign it in the main function creates the error. This excellent answer puts it nicely; to paraphrase:

Now we get to df = df.drop(cat_columns, axis=1). When Python scans the function body, it sees that line and says "aha, there's a variable named df being assigned to, I'll put it into my local scope dictionary." Then, when it goes looking for a value for the df on the right-hand side of the assignment, it finds its local variable named df, which has no value yet, and so throws the error.
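A minimal, self-contained illustration of the rule (nothing here is specific to pandas or Spark):

x = 10

def read_only():
    print(x)    # fine: x is never assigned in this function,
                # so Python looks it up in the module scope

def rebind():
    x = x + 1   # UnboundLocalError: the assignment makes x local to
                # rebind() for the whole function body, so the x on the
                # right-hand side is the still-unbound local x
    print(x)

read_only()     # prints 10
rebind()        # raises UnboundLocalError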

To fix my code I made the following change:

def main(sc):

    cat_columns = ["Sex", "Pclass", "SibSp"]
    cat_dict = df[cat_columns].to_dict(orient='records')

    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()

    df_vector = pd.DataFrame(cat_vector)
    vector_columns = vec.get_feature_names()
    df_vector.columns = vector_columns
    df_vector.index = df.index

    # train data

    df_updated = df.drop(cat_columns, axis=1) # This used to be df = df.drop(cat_columns, axis=1) 
    df_updated = df_updated.join(df_vector)

    train_tree(df_updated) # passing the df_updated to the function
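Since train_tree now receives the DataFrame as an argument instead of reading a module-level one, its signature changes accordingly (a sketch; the body is still the stub from the question):

def train_tree(df):
    # Do more stuff with the data, call other functions
    pass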

This removes the UnboundLocalError. To keep using the df variable in other functions, I pass it in as a parameter (albeit under a different name). This could get confusing, so, as suggested by @Padraic Cunningham, you could instead pass the variable into the main function:

if __name__ == "__main__":
    # Configure Spark

    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc   = SparkContext(conf=conf)
    df = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # df.Age = df.Age.astype(int)
    # test.Age = test.Age.astype(int)

    # Execute Main functionality
    main(sc,df)
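For that call to work, main has to accept the DataFrame as a parameter, so the def line in the block further up becomes:

def main(sc, df):
    # df now arrives as an explicit parameter; the rest of the
    # function body is unchanged from the block above
    cat_columns = ["Sex", "Pclass", "SibSp"]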

Other options would be to use a class or a global variable. I felt that these two were overkill (a class) or inelegant (a global variable). However, this is purely my personal taste.
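For completeness, the class option would look roughly like this (a hypothetical sketch; the class and method names are mine, not from the discussion above):

import pandas as pd

class Trainer:
    def __init__(self, train_path, test_path):
        # Load the data once; every method can then reach it through self
        self.df = pd.read_csv(train_path)
        self.test = pd.read_csv(test_path)

    def drop_categoricals(self, cat_columns):
        # Attribute assignment never triggers the local-variable scoping
        # rule, so re-assigning self.df here is fine
        self.df = self.df.drop(cat_columns, axis=1)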

cs_stackX