
For my question, I have found quite a few posts that explain how to drop rows based on specific column values; however, I have not been able to find one (I know a post might be out there) that addresses how to drop rows in a dataframe based on specific column values across multiple columns (34 in this case).

  1. How to drop rows of Pandas DataFrame whose value in certain columns is NaN
  2. Drop Rows by Multiple Column Criteria in DataFrame
  3. Drop rows in pandas dataframe based on columns value

baddata

zip      age  item1  item2  item3  item4  item5  item6  item7  item34
12345     10      1      0      1      1      0      0      1       0
23456     20     10    111     11      1      0      1      9       8
45678     60      1      0      1      1      0      1      0       1

I want to retain only those rows that have values of '1' or '0' in the item columns (i.e. drop every row for which any of the 34 item columns holds a value other than '1' or '0'). This is what I have tried so far:

baddata = pd.DataFrame(data=dirtydata, columns=['zip','age','item1','item2'...'item34'])

gooddata=baddata.dropna() # some rows have NaN; drops rows with NaN values

option-1:

gooddata[gooddata[['item1','item2'...'item34']].isin([0,1])] #this makes values for zip and age NaN; not sure why?

option-2:

gooddata[(gooddata[['item1','item2'...'item34']].map(len) < 2).any(axis=1)]  # also tried replacing 'any' with 'all'; did not work

option-3:

cols_of_interest = ['item1','item2'...'item34']
gooddata[gooddata.drop(gooddata[cols_of_interest].map(len) < 2)]  # doubtful about the syntax and usage of functions
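A note on option-1: indexing a DataFrame with a boolean DataFrame masks element-wise rather than filtering rows, and columns missing from the mask ('zip', 'age') align to False, so those cells come back as NaN. A minimal sketch of that behaviour, using a made-up toy frame rather than the real 34 columns:

import pandas as pd

toy = pd.DataFrame({'zip': [12345, 23456], 'age': [10, 20], 'item1': [1, 10]})
mask = toy[['item1']].isin([0, 1])   # mask covers only 'item1'

# toy[mask] behaves like toy.where(mask): cells where the aligned mask is
# False or missing become NaN, so 'zip' and 'age' are blanked out instead
# of whole rows being dropped.
print(toy[mask])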

  • Let me be clear, you want to drop all rows where the value in `item34` is not `0` or `1`? Is this what you want? That's it? – Joe T. Boka Jun 05 '16 at 02:22
  • Joe R - I want to only retain those rows which have values of '0' or '1' for the various items, i.e. remove all those rows that have values other than '0' or '1' in cols item1, item2, item3, item4,...item34. – ads Jun 05 '16 at 02:48
  • Expected Result: zip age item1 item2 item3 item4 item5 item6 item7 item34 12345 10 1 0 1 1 0 0 1 0 45678 60 1 0 1 1 0 1 0 1 – ads Jun 05 '16 at 02:54
  • @Merlin how do I get the expected result with good data as stated in row1 and row3. row2 is an example of how different items with values other than 1 or 0 must be dropped or not retained in the dataframe. Hope I am not making it too confusing. – ads Jun 05 '16 at 03:00
  • I _think_ I have answered this question below. – John Karasinski Jun 05 '16 at 03:03
  • All the rows have values other than `0` or `1` in column `age`. Is it possible that you mean removing `columns` not `rows`? – Joe T. Boka Jun 05 '16 at 03:05
  • @JoeR If I remove columns, then wouldn't my dataFrame lose columns (item)? I want to keep **item1** to **item34**, but drop those records (relational db perspective) that contain values other than **1** or **0** in columns (item). Apologize, if I'm misinterpreting. – ads Jun 05 '16 at 03:13
  • @Merlin All the columns have to be retained. Due to lack of space, I iterated that item1 to item34 are in total 34 columns; they all have to be kept as-is because they represent different product lines. If the product is bought, it's being shown as '1', if not '0', any other values represent messy data, which I am trying to weed out. – ads Jun 05 '16 at 03:15

2 Answers


Start by selecting all the columns after age

df[df.columns[2:]]

   item1  item2  item3  item4  item5  item6  item7  item34
0      1      0      1      1      0      0      1       0
1     10    111     11      1      0      1      9       8
2      1      0      1      1      0      1      0       1

check if their values are 0 or 1

df[df.columns[2:]].isin((0, 1))

   item1  item2  item3 item4 item5 item6  item7 item34
0   True   True   True  True  True  True   True   True
1  False  False  False  True  True  True  False  False
2   True   True   True  True  True  True   True   True

check if all values in the row are True

df[df.columns[2:]].isin((0, 1)).all(axis=1)

0    True
1    False
2    True
dtype: bool

select only these rows

df[df[df.columns[2:]].isin((0, 1)).all(axis=1)]

     zip  age  item1  item2  item3  item4  item5  item6  item7  item34
0  12345   10      1      0      1      1      0      0      1       0
2  45678   60      1      0      1      1      0      1      0       1

EDIT

Breaking this out a bit more clearly, we have

relevant_columns = df[df.columns[2:]]                                     # everything after 'zip' and 'age'
values_as_ints = relevant_columns.convert_objects(convert_numeric=True)   # coerce strings to numbers (deprecated in later pandas)
values_valid = values_as_ints.isin((0, 1))                                # element-wise check for 0/1
row_valid = values_valid.all(axis=1)                                      # True only when every item column is 0 or 1
good_rows = df[row_valid]
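Since `convert_objects` is deprecated in later pandas versions (see the comment thread below), an equivalent sketch, assuming the same `df` as above, could use `pd.to_numeric` with `errors='coerce'` so that unparseable entries such as ' ' become NaN and fail the `isin` test:

relevant_columns = df[df.columns[2:]]                                     # item1 .. item34
values_as_nums = relevant_columns.apply(pd.to_numeric, errors='coerce')   # ' ' and other junk -> NaN
values_valid = values_as_nums.isin((0, 1))                                # NaN is never in (0, 1)
good_rows = df[values_valid.all(axis=1)]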
John Karasinski
  • Are the values in your DataFrame `string`s, or `int`s? – John Karasinski Jun 05 '16 at 03:09
  • they're of type float, integer, object. I should have clarified that. – ads Jun 05 '16 at 03:19
  • You could instead try `df[df[df.columns[2:]].astype(int).isin((0, 1)).all(axis=1)]`. – John Karasinski Jun 05 '16 at 03:20
  • @karasinski Getting this error message: invalid literal for long() with base 10: ' ' – ads Jun 05 '16 at 03:28
  • what does **astype**(int) do? – ads Jun 05 '16 at 03:31
  • `astype(int)` attempts to convert your values from `str` to `int`. Since this seems to be throwing errors for you, you can instead do `df[df[df.columns[2:]].isin(('0', '1')).all(axis=1)]` which will check for `str` values instead. – John Karasinski Jun 05 '16 at 03:32
  • Your suggestion of checking for str values returned 0 rows × 34 cols. – ads Jun 05 '16 at 03:36
  • It's difficult to solve your issue, as none of these problems are showing up in the example data you posted above. I edited my answer above, could you try that? – John Karasinski Jun 05 '16 at 03:40
  • That last edit of yours worked! However, it gave a warning of this nature: "FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric." – ads Jun 05 '16 at 03:55
  • I slightly refined your code and now it works with the original that you had posted earlier: gooddata = gooddata.convert_objects(convert_numeric=True); gooddata_final = gooddata[gooddata[gooddata.columns[2:]].isin((0, 1)).all(axis=1)] – ads Jun 05 '16 at 05:23

Try this:

print df

  zip   age  item1  item2  item3  item4  item5  item6  item7  item34
12345    10      1      0      1      1      0      0      1       0
23456    20     10    111     11      1      0      1      9       8
45678    60      1      0      1      1      0      1      0       1

dfSlice = df[df.columns[2:]]          # just the item columns

def mapZeroOne(x):
    if x == 0 or x == 1:
        return x                      # anything else falls through to None -> NaN

dfNa = dfSlice.applymap(mapZeroOne)
print dfNa

      item1  item2  item3  item4  item5  item6  item7  item34
12345    1.0    0.0    1.0      1      0      0    1.0     0.0
23456    NaN    NaN    NaN      1      0      1    NaN     NaN
45678    1.0    0.0    1.0      1      0      1    0.0     1.0

dfAge = df[['zip', 'age']]
print dfAge

zip     age
12345   10
23456   20
45678   60


df_new = pd.concat([dfAge, dfNa], axis=1)
print df_new

zip     age  item1  item2  item3  item4  item5  item6  item7  item34
12345   10    1.0    0.0    1.0      1      0      0    1.0     0.0
23456   20    NaN    NaN    NaN      1      0      1    NaN     NaN
45678   60    1.0    0.0    1.0      1      0      1    0.0     1.0

print df_new.dropna()

zip    age  item1  item2  item3  item4  item5  item6  item7  item34
12345   10    1.0    0.0    1.0      1      0      0    1.0     0.0
45678   60    1.0    0.0    1.0      1      0      1    0.0     1.0

You may need to adjust 0 to "0" and 1 to "1" if the item values are stored as strings rather than numbers.
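For instance, if the item columns do arrive as strings, a variant of the mapping function (my adaptation, not part of the original answer) might accept both forms:

def mapZeroOne(x):
    # keep numeric or string 0/1; everything else falls through to None,
    # which applymap records as NaN so dropna() can remove the row later
    if x in (0, 1, "0", "1"):
        return x

dfNa = dfSlice.applymap(mapZeroOne)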

Merlin
  • Why would I only use specific 'item34' for the bitwise operation? Other items could also have invalid data values as represented in row1. Trying to understand the usage of bitwise operation and specific use of only item34 – ads Jun 05 '16 at 03:24