I want to filter through my column headers and pull out columns that match specific strings. Currently I do this using lines in my code that go like this:

import re

word_possibilities = ['word1', 'word2', 'word3']

new_df = df.filter(regex='|'.join(re.escape(x) for x in word_possibilities)).columns.to_list()

This works fine, except that it also pulls out columns with headers like 'word111'. I would like it to select only columns whose headers match one of the word possibilities exactly, not ones that merely contain the string.

Is there a way to modify the line for this?
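
A minimal sketch of one way the line could be changed, assuming "exactly" means the whole header must equal one of the listed words. `df.filter(regex=...)` matches with `re.search`, so an unanchored pattern accepts any header that contains a word. Wrapping the alternation in `^(?:...)$` makes the pattern cover the full header instead. The sample dataframe and the name `matched_columns` are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; only the column names matter here.
df = pd.DataFrame({'word1': [1], 'word111': [2], 'word2': [3], 'other': [4]})

word_possibilities = ['word1', 'word2', 'word3']

# ^(?: ... )$ forces the whole header to equal one of the alternatives,
# so 'word111' no longer matches via the substring 'word1'.
pattern = '^(?:' + '|'.join(re.escape(x) for x in word_possibilities) + ')$'

matched_columns = df.filter(regex=pattern).columns.to_list()
print(matched_columns)  # ['word1', 'word2']
```

'word3' is simply skipped because no column has that header.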

  • What are your column headers like? Are the "word possibilities" in the headers separated from other words by something - spaces, underscores etc? – not_speshal Apr 13 '22 at 14:02
  • The word possibilities are all quite different to each other, sometimes it is the variable name, sometimes it is the variable name but shortened, sometimes it is just the unit of the variable, sometimes it is a combination of the above, sometimes with underscores etc and sometimes not. So each of the variations of how the column lists the variable is very different – jcat Apr 13 '22 at 14:06
  • It is not as simple as there sometimes being punctuation in the way, so I don't think the already-answered thread is helpful – jcat Apr 13 '22 at 14:08
  • Then there's no way to say "word1" but not "word11" if there's nothing differentiating them. – not_speshal Apr 13 '22 at 14:08
  • It is impossible to filter based on an exact string? – jcat Apr 13 '22 at 14:09
  • No idea what you mean. If you do `df[word_possibilities]`, you get exact specified columns but seems like you want columns that *contain* these possibilities. "word11" does contain "word1" so no idea how you expect to differentiate them. – not_speshal Apr 13 '22 at 14:11
  • I mean, for example, if my data frame has many rows and many columns, of which 3 are 'apple', 'apple_notreally', 'fruit', and I want to filter to create a new dataframe which has just the columns 'apple' and 'fruit' and all their values, but not any of the other columns, including 'apple_notreally', how would I do this? At the moment it is done by creating a list specifying e.g. 'apple', 'fruit', but this includes 'apple_notreally' in the filtering. I then have many, many dataframes, with many, many variations on columns, but want only the info I specified in my list to be pulled out. – jcat Apr 13 '22 at 14:15
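
The apple/fruit case from the last comment can be sketched without a regex, assuming exact header equality is what is wanted and that some listed names may be missing from a given dataframe. `Index.isin` builds a boolean mask over the headers, and unlike `df[word_possibilities]` it does not raise a `KeyError` when a listed name is absent. The helper name `select_exact`, the extra entry 'pear', and the sample frames are hypothetical.

```python
import pandas as pd

def select_exact(df, wanted):
    """Return only the columns whose header equals one of `wanted`."""
    return df.loc[:, df.columns.isin(wanted)]

# 'pear' is listed but appears in neither frame; it is silently ignored.
wanted = ['apple', 'fruit', 'pear']

# Two made-up dataframes with different column variations.
df_a = pd.DataFrame({'apple': [1, 2], 'apple_notreally': [3, 4], 'fruit': [5, 6]})
df_b = pd.DataFrame({'fruit': [7], 'veg': [8]})

new_a = select_exact(df_a, wanted)
new_b = select_exact(df_b, wanted)

print(new_a.columns.to_list())  # ['apple', 'fruit']
print(new_b.columns.to_list())  # ['fruit']
```

The same helper can be applied in a loop over many dataframes, each keeping whichever of the listed columns it happens to have.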

0 Answers