I want to filter a Pandas Series to remove certain values. This seems like such a simple task, but the preferred answer to the same question doesn't work for me.

Here's my reproducible example:

import numpy as np
import pandas as pd

data = np.array([['','Col1','Col2'],
                ['Row1',1,2],
                ['Row2',3,4]])

myDF = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])

mySeries = myDF.loc[:, "Col1"]
mySeries[mySeries != 1]

I expect the final line to output a single row, containing the value 3, but instead I get:

Row1    1
Row2    3
Name: Col1, dtype: object

What am I doing wrong?

jpp
Tom Wagstaff

4 Answers

Your Series contains strings.

>>> mySeries.tolist()
['1', '3']

You can use

>>> mySeries[mySeries != '1']
Row2    3
Name: Col1, dtype: object

This happens because a NumPy array holds a single data type, so the integers are cast to strings when you create data.
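As a minimal sketch of that coercion (reusing the array from the question), you can inspect the array's dtype and one of its elements directly:

```python
import numpy as np

# Mixing strings and integers makes NumPy choose one common dtype,
# which here is a Unicode string type.
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

# dtype.kind is 'U' for Unicode strings.
print(data.dtype.kind)

# The value that was written as the integer 1 is now stored as a string.
element = data[1, 1]
print(repr(element), isinstance(element, str))
```

Because the stored value is the string '1', comparing the Series against the integer 1 never matches, which is why nothing is filtered out.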

If you want the integers, you can use

>>> mySeries = mySeries.astype(int)
>>> mySeries
Row1    1
Row2    3
Name: Col1, dtype: int64

and your original code will work just fine.

timgeb
mySeries = mySeries.astype(int)
mySeries.loc[mySeries != 1]
Naga kiran

Consider the dtype of the NumPy array you are creating:

data = np.array([['','Col1','Col2'],
                 ['Row1',1,2],
                 ['Row2',3,4]])

print(repr(data))

array([['', 'Col1', 'Col2'],
       ['Row1', '1', '2'],
       ['Row2', '3', '4']],
      dtype='<U4')

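Combining strings and integers in a nested list before passing it to np.array creates an array of strings. This is shown by the dtype '<U4': U means Unicode string, and the number is the maximum number of characters per element (the width reported may differ between NumPy versions).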

If you pass nested lists to pd.DataFrame instead, you avoid this problem, because pandas infers an appropriate dtype for each column rather than forcing every value to a single type:

data = [['','Col1','Col2'],
        ['Row1',1,2],
        ['Row2',3,4]]

myDF = pd.DataFrame(data=[i[1:] for i in data[1:]],
                    index=[i[0] for i in data[1:]],
                    columns=data[0][1:])
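As a quick check, and only a sketch that reuses the names from the question, the list-based construction can be run end to end to confirm that the column is numeric and that the original filter now behaves as expected:

```python
import pandas as pd

data = [['', 'Col1', 'Col2'],
        ['Row1', 1, 2],
        ['Row2', 3, 4]]

# Build the frame from plain Python lists so the integers stay integers.
myDF = pd.DataFrame(data=[row[1:] for row in data[1:]],
                    index=[row[0] for row in data[1:]],
                    columns=data[0][1:])

# Both columns should now report an integer dtype.
print(myDF.dtypes)

# The comparison against the integer 1 now filters out Row1.
mySeries = myDF.loc[:, "Col1"]
filtered = mySeries[mySeries != 1]
print(filtered)
```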
jpp
mySeries = pd.to_numeric(mySeries)

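That will fix it: pd.to_numeric converts the string values to a numeric dtype, so the comparison with 1 works as expected.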
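A minimal sketch of this approach, assuming a Series of strings equivalent to the one in the question (it is built by hand here so the snippet is self-contained):

```python
import pandas as pd

# Stand-in for the question's Series: the values are strings, not integers.
mySeries = pd.Series(['1', '3'], index=['Row1', 'Row2'], name='Col1')

# Convert the strings to a numeric dtype.
numeric = pd.to_numeric(mySeries)
print(numeric.dtype)

# The integer comparison now removes Row1.
filtered = numeric[numeric != 1]
print(filtered)
```

By default pd.to_numeric raises an error on values it cannot parse; passing errors='coerce' turns them into NaN instead, which may be preferable for messy data.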

cardamom