
I am trying to filter a large dataframe and want to drop rows that contain certain values in the column 'Product Description'.

I have looked at "How can I remove multiple rows with different labels in one command in pandas?"

and

Remove rows not .isin('X')

and applied the code. However,

  df[-df['label'].isin(List)] 

is not working for me and I am not sure what to do.

Here is my exact code:

List2 = ['set up','setup','and install',....etc etc]

(I also tried List2 = ( ..etc ) with parentheses instead of brackets and it didn't work)

Computers_No_UNSPSC = Computers_No_UNSPSC[-Computers_No_UNSPSC['Product Description'].isin(List2)]

(I also tried using ~ instead of - which didn't work)

Is there something that I am doing wrong or missing? When I look at my Computers_No_UNSPSC dataframe, I see rows that still contain words from the list I created. It doesn't seem to filter out what I don't want.

Thanks for the help!

**I believe List2 is working. I have rows of data where people describe their computer purchases. I want all the computers bought, not 'computer repair' or 'computer software'. So I created a list that seems to capture the peripherals and other things I don't want. When I say

print List2 

I get

['set up', 'setup', 'and install', ' server', 'labor', 'services', 'processing', 'license', 'renewal', 'repair', 'case', 'speakers', 'cord', 'support', 'cart', 'docking station', 'components', 'accessories', 'software', ' membership', ' headsets ', ' keyboard', ' mouse', ' peripheral', ' part', ' charger', ' battery', ' drive', ' print', ' cable', ' supp', ' usb', ' shelf', 'disk', 'memory', 'studio', 'training', 'adapter', 'wiring', 'mirror']

Does this mean that it recognizes each string as a word, so that when I apply the filter it will filter against each of the words in my List2?

A =A[-A['Product Description'].isin(List2)] 

This seems to be the part that isn't working but again, I am not sure where I went wrong.

Alexis
  • Can you post sample data where this doesn't work and the list where it fails to match? – EdChum Feb 18 '14 at 20:45
  • What part exactly is not working? You're mentioning two parts: the - and the `.isin(List2)`. So are both parts not working or just the one? – KodyVanRy Feb 18 '14 at 20:46

1 Answer

I don't think you understand how that works. It's checking whether label == anything in that list, not whether label contains anything in that list.

It sounds like a label might look like

label = "set up computer"

isin looks for exact matches, not partial matches:

label in ["set", "up", "computer"]  # False: the whole string is not an element of the list
"set" in ["set", "up", "computer"]  # True: "set" equals one element exactly

Note: this is plain Python `in` on a list, not pandas isin, but the matching works the same way.
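
The same exact-match behaviour shows up in pandas itself. Here is a minimal sketch, assuming a small made-up Series; the sample descriptions and the name `unwanted` are hypothetical and not taken from the question's data:

```python
import pandas as pd

# Hypothetical sample data standing in for the 'Product Description' column
descriptions = pd.Series(["set up computer", "setup", "laptop"])
unwanted = ["set up", "setup"]

# isin compares each whole cell against the list entries (exact equality)
exact = descriptions.isin(unwanted)

# str.contains checks whether the substring occurs anywhere in the cell
partial = descriptions.str.contains("set up", regex=False)

print(exact.tolist())    # [False, True, False]
print(partial.tolist())  # [True, False, False]
```

Only the cell that is exactly "setup" passes `isin`, while "set up computer" is caught only by the substring check.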

To do what you want, you need to check each word in the list against label:

any(word in label for word in blacklisted_words)

This is going to be much slower than an exact-match lookup.
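
One possible vectorized alternative, offered as a sketch rather than a definitive fix, is to join the unwanted terms into a single regular expression, pass it to `Series.str.contains`, and invert the resulting mask with `~`. The DataFrame contents and the shortened term list below are made up for illustration. `case=False` and `na=False` are assumptions about how letter case and missing values should be treated.

```python
import re
import pandas as pd

# Hypothetical data; the real frame in the question is much larger
df = pd.DataFrame({"Product Description": [
    "Dell laptop",
    "Laptop set up and install",
    "Computer REPAIR",
    "HP desktop computer",
]})
unwanted = ["set up", "setup", "and install", "repair", "software"]

# Escape each term so it is matched literally, then OR the terms together
pattern = "|".join(re.escape(term) for term in unwanted)

# True where the description contains any unwanted term
has_unwanted = df["Product Description"].str.contains(pattern, case=False, na=False)

# Keep only the rows that contain none of the terms
filtered = df[~has_unwanted]
print(filtered["Product Description"].tolist())
# ['Dell laptop', 'HP desktop computer']
```

The mask is inverted with `~` rather than `-` because recent NumPy and pandas versions reject unary minus on boolean data.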

Joran Beasley
  • Thanks for that explanation. I didn't understand exactly what was going on. I did try to do ... C2 = C2[ (-C2['Product Description'].str.contains('set up')) | (-C2['Product Description'].str.contains('setup')) ...etc etc but this didn't work and I believe it's for the same reason – Alexis Feb 18 '14 at 20:55
  • Yeah :) I think you're starting to understand now though :) Go ahead and accept if this answers your question (even though it doesn't solve your problem). – Joran Beasley Feb 18 '14 at 20:59
  • Is there no easier vectorized way to do this with pandas? Do I have to use a for loop to filter out everything that has these words in it? – Alexis Feb 18 '14 at 21:19
  • Try this - df[df.label.apply(lambda s: numpy.any([t in s for t in List2]))] – user1827356 Feb 18 '14 at 23:03