I currently have this Python code (I'm using Apache Spark, but I'm pretty sure that doesn't matter for this question):
import numpy as np
import pandas as pd
from sklearn import feature_extraction
from sklearn import tree
from pyspark import SparkConf, SparkContext
## Module Constants
APP_NAME = "My Spark Application"
df = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
def train_tree():
    # Do more stuff with the data, call other functions
    pass

def main(sc):
    cat_columns = ["Sex", "Pclass"]
    # PROBLEM IS HERE
    cat_dict = df[cat_columns].to_dict(orient='records')
    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()
    df_vector = pd.DataFrame(cat_vector)
    vector_columns = vec.get_feature_names()
    df_vector.columns = vector_columns
    df_vector.index = df.index
    # train data
    df = df.drop(cat_columns, axis=1)
    df = df.join(df_vector)
    train_tree()

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Execute Main functionality
    main(sc)
When I run it, I get this error:

    cat_dict = df[cat_columns].to_dict(orient='records')
UnboundLocalError: local variable 'df' referenced before assignment
I find this puzzling, because I define the variable df outside the scope of the main function, at the top of the file. Why would using this variable inside the function trigger this error? I have also tried moving the df definition inside the if __name__ == "__main__": block (before main is called), but I get the same error.
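The same behavior also reproduces without any of the Spark/pandas machinery, so I assume it is core Python scoping behavior. Here is a stripped-down example (x is just a placeholder name for this sketch):

x = 10  # module-level definition, like df in my script

def broken():
    print(x)   # UnboundLocalError is raised here...
    x = x + 1  # ...apparently because x is also assigned later in the function

broken()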
Now, obviously there are lots of ways I could solve this, but this is more about understanding Python better. So I want to ask:

a) Why does this error occur at all?
b) How best to solve it, given that:

- I don't want to put the df definition inside the main function, because I want to access it from other functions.
- I don't want to use a class.
- I don't want to use a global variable (see the sketch after this list for the kind of fix this rules out).
- I don't want to pass df around in function parameters.
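For reference, the workaround that the "no global variable" constraint rules out would look like this on the stripped-down example above (df here is a stand-in integer, not my real DataFrame):

df = 10  # stand-in for the real module-level DataFrame

def main():
    global df    # assignments below now rebind the module-level df
    print(df)    # no UnboundLocalError any more
    df = df + 1  # rebinds the module-level name

main()
print(df)  # prints 11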