Doing this in pandas is certainly a dupe. However, it seems that you are converting a Spark DataFrame to a pandas DataFrame.
Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the Spark side using select():
# Read the CSV into a Spark DataFrame, then select only the desired columns
# before converting to pandas.
df1 = sqlContext.read.csv(input_path + '/' + lot_number + '.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()
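As an aside, sqlContext suggests an older Spark API; on Spark 2.x and later the same pattern works through a SparkSession. A minimal sketch, assuming an existing session named spark and the same input_path, lot_number, and include_cols variables:
# Same read-and-select pattern with the Spark 2.x+ SparkSession API
# (the session name spark is an assumption here).
df1 = spark.read.csv(input_path + '/' + lot_number + '.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()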
You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in Spark.
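For example, a typical pandas-style group aggregation can be done directly on the Spark DataFrame. This is just a rough sketch with made-up column names (part_id and value are not from your data):
# Hypothetical example: a groupby-mean done in Spark instead of pandas.
# pandas equivalent would be: pandas_df.groupby('part_id')['value'].mean()
from pyspark.sql import functions as F
df1.groupBy('part_id').agg(F.mean('value').alias('mean_value')).show()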
EDIT
I misunderstood your question originally. Based on your comments, I think this is what you're looking for:
# Keep only the columns whose name contains at least one of the strings in include_cols.
selected_columns = [c for c in df1.columns if any(x in c for x in include_cols)]
pandas_df = df1.select(selected_columns).toPandas()
Explanation:
Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() function returns True if at least one of the conditions is True.
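To make that concrete, here is a small, self-contained illustration of the substring matching (the column names and include_cols values are made up):
# Made-up column names, just to show how the matching behaves.
columns = ['lot_number', 'measurement_a', 'measurement_b', 'timestamp']
include_cols = ['measurement', 'lot']
selected = [c for c in columns if any(x in c for x in include_cols)]
# selected -> ['lot_number', 'measurement_a', 'measurement_b']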