Remove columns where all items in column are identical (excluding header) and match a specified string

Question

My question is an extension of Delete Column in Pandas based on Condition, but I have headers and the information isn't binary. Instead of removing a column containing all zeros, I'd like to be able to pass a variable "search_var" (containing a string) to filter out columns containing only that string.

I initially thought I should read in the df and iterate across each column, read each column in as a list, and print columns where len(col_list) > 2 and search_var not in col_list. The solution provided to the previous post involving a boolean dataframe (df != search_var) intrigued me there might be a simpler way, but how could I go around the issue that the header will not match and therefore cannot purely filter on True/False?

What I have (non-working):

import pandas as pd
df = pd.read_table('input.tsv', dtype=str)
with open('output.tsv', 'aw') as ofh:
    df['col_list'] = list(df.values)
    if len(col_list) < 3 and search_var not in col_list:
        df.to_csv(ofh, sep='\t', encoding='utf-8', header=False)

Example input, search_var = 'red'

Name  Header1 Header2 Header3
name1 red     red     red
name2 red     orange  red
name3 red     yellow  red
name4 red     green   red
name5 red     blue    blue

Expected Output

Name  Header2 Header3
name1 red     red
name2 orange  red
name3 yellow  red
name4 green   red
name5 blue    blue

score 2 · Accepted Answer · answered Sep 01 '16 at 22:53

You can check the number of non-red item in the column, if it is not zero then select it using loc:

df.loc[:, (df != 'red').sum() != 0]

#    Name   Header2   Header3
# 0 name1       red       red
# 1 name2    orange       red
# 2 name3    yellow       red
# 3 name4     green       red
# 4 name5      blue      blue

Remove columns where all items in column are identical (excluding header) and match a specified string

Example input, search_var = 'red'

Expected Output

1 Answers1