1
include_cols_path = sys.argv[5]
with open(include_cols_path) as f:
include_cols = f.read().splitlines()

include_cols is a list of strings

df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas()

df1 is a dataframe of a large file. I would like to only retain the columns with names that contain any of the strings in include_cols.

pault
  • 41,343
  • 15
  • 107
  • 149
Cody
  • 389
  • 3
  • 4
  • 11
  • Possible duplicate of [python pandas selecting columns from a dataframe via a list of column names](https://stackoverflow.com/questions/38969267/python-pandas-selecting-columns-from-a-dataframe-via-a-list-of-column-names) – zero323 Jan 08 '18 at 19:44
  • Possible duplicate of [Selecting/Excluding sets of columns in Pandas](https://stackoverflow.com/questions/14940743/selecting-excluding-sets-of-columns-in-pandas) – Brandon Barney Jan 08 '18 at 19:52

3 Answers3

0

Doing this in pandas is certainly a dupe. However, it seems that you are converting a spark DataFrame to a pandas DataFrame.

Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the spark side using select():

df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()

You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in spark.

EDIT

I misunderstood your question originally. Based on your comments, I think this is what you're looking for:

selected_columns = [c for c in df1.columns if any([x in c for x in include_cols])]
pandas_df = df1.select(selected_columns).toPandas()

Explanation: Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() functions returns True if at least one of the conditions is True.

pault
  • 41,343
  • 15
  • 107
  • 149
  • Thank you so much! If I want to search where the columns in df1 contain the string (%string%) from include_cols, would this be the correct syntax? – Cody Jan 08 '18 at 20:17
  • Do you mean filter the list of columns so that you only select the columns that contain the string? If so, try this: `include_cols = [c for c in include_cols if 'string' in c]` (I'm assuming you're using the `%` as a wildcard.) If you're asking how to filter the rows, that's an entirely different question. – pault Jan 08 '18 at 20:19
  • For instance, if include_cols contains 2 values, 'apple' and 'orange', I would like to filter df1 for all columns with column names containing the strings, 'apple' and 'orange'. I think it should be something close to this but not sure: pandas_df = df1.select(col for col in df1.columns if include_cols in col] – Cody Jan 08 '18 at 20:32
  • Try this: `pandas_df = df1.select([c for c in df1.columns if any([x in c for x in include_cols])])` – pault Jan 08 '18 at 20:58
0
final_cols = [col for col in df.columns.values if col in include_cols]
df = df[final_cols]
Myles Hollowed
  • 546
  • 4
  • 16
  • 1
    While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Isma Jan 09 '18 at 10:51
0
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]

For example:

df1 = pd.DataFrame(data=np.random.random((5, 5)), columns=list('ABCDE'))
include_cols = ['A', 'C', 'Z']
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
>>>           A         C
    0  0.247271  0.761153
    1  0.390240  0.050055
    2  0.333401  0.823384
    3  0.821196  0.929520
    4  0.210226  0.406168

The '|'.join(include_cols) part will create an or condition with all elements of the input list. In the above example A|C|Z. This conditions will be True if one of the element is contained in the column names using the .contains() method on the column names.

Prikers
  • 858
  • 1
  • 9
  • 24