
I have a pandas df with two variables:

id    name
011    Peter Parker
022    Warners Brother
101    Bruce Wayne

Currently both of them are of object type.

Say I want to create smaller dataframes by filtering on some conditions:

df_small = df.loc[df['id']=='011']
df_small2 = df.loc[df['name']=='Peter Parker']

I have thought about, and seen other people, converting the object-type columns into more specific data types. My question is: do I need to do that at all if I can already filter them based on string comparison (as above)? What are the benefits of converting them to a specific string or int/float type?
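For concreteness, the kind of conversion I have in mind would look something like this (just a sketch; `id_int` is an arbitrary new column name):

import pandas as pd

df = pd.DataFrame({'id': ['011', '022', '101'],
                   'name': ['Peter Parker', 'Warners Brother', 'Bruce Wayne']})

# Convert the object column to a numeric dtype; note that the leading
# zero in '011' is lost, it becomes the integer 11.
df['id_int'] = pd.to_numeric(df['id'])

df_small = df.loc[df['id_int'] == 11]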

KubiK888
  • For your case there is no need to convert. – BENY Nov 06 '18 at 16:48
  • One of the costs of converting to a numeric type is that `'011'` will be converted to `11`. It can be problematic for cases where `'0011'` is not the same as `'011'` (illustrated in the sketch below the comments). – ALollz Nov 06 '18 at 16:49
  • It depends on what you want to do with the df afterwards. If you are going to do many different int comparisons, for instance, it may be beneficial to do the conversion to ints only *once* instead of pandas having to do the internal casting on every function call. – Bram Vanroy Nov 06 '18 at 16:49
  • Agreed, but in what situations would I need to convert? Like more sophisticated searches using regex, etc.? – KubiK888 Nov 06 '18 at 16:50
  • There's no such thing as a 'string-type' pandas column. That's just an `object` column. – PMende Nov 06 '18 at 16:51
  • So does that mean "object" is essentially the same as the "string" type? The following syntax kind of confuses me: `df_sm = df.loc[df['name'].str.contains('Peter')]`, as it seems to suggest you need to convert that 'name' column into a string before invoking the 'contains' function. – KubiK888 Nov 06 '18 at 17:02
  • Those methods are only available to strings. `x.str.contains(pat)` is basically just `pat in x` (row-wise). For instance, `'e' in 'hello'` will work, while `'e' in 4` will throw a `TypeError`, because `in` is not valid for numeric types (see the sketch below). – ALollz Nov 06 '18 at 17:10
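A minimal sketch illustrating the two points from the comments above (the values are made up for illustration):

import pandas as pd

# 1) Numeric conversion collapses distinct strings: both rows below
#    become the integer 11, so '011' and '0011' can no longer be told apart.
pd.to_numeric(pd.Series(['011', '0011']))

# 2) The .str accessor applies Python string methods row-wise, so it only
#    makes sense for object/string data, much like `'e' in 'hello'` working
#    while `'e' in 4` raises a TypeError.
names = pd.Series(['Peter Parker', 'Bruce Wayne'])
names.str.contains('Peter')    # True, False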

1 Answer


You asked about the benefits of converting from string or object dtypes. There are at least two I can think of right off the bat. Take the following dataframe as an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'int_col': np.random.randint(0, 10, 10000),
                   'str_col': np.random.choice(list('1234567980'), 10000)})

>>> df.head()
   int_col str_col
0        7       0
1        0       1
2        1       8
3        6       1
4        6       0

This dataframe comprises 10,000 rows and has one int column and one object (i.e. string) column for demonstration.

Memory advantage:

The integer column takes a lot less memory than the object column:

>>> import sys
>>> sys.getsizeof(df['int_col'])
80104
>>> sys.getsizeof(df['str_col'])
660104
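As a sketch of an alternative measurement (exact byte counts vary with the pandas and Python versions), pandas' own `memory_usage(deep=True)` tells the same story, since `deep=True` also counts the per-row Python string objects:

mem = df.memory_usage(deep=True)
# The int64 column needs 8 bytes per row; the object column additionally
# pays for a full Python str object per row.
print(mem['int_col'], mem['str_col'])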

Speed advantage:

Since your example is about filtering, take a look at the speed difference when filtering on integers instead of strings:

import timeit

def filter_int(df=df):
    return df.loc[df.int_col == 1]


def filter_str(df=df):
    return df.loc[df.str_col == '1']

>>> timeit.timeit(filter_int, number=100) / 100
0.0006298311000864488
>>> timeit.timeit(filter_str, number=100) / 100
0.0016585511100129225

This kind of speed difference can make your code significantly faster in some cases, especially if you filter many times.
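Following the comment above about converting only once, a rough sketch of how that might look (the new column name `str_col_as_int` is just illustrative, and it assumes the leading-zero caveat from the comments doesn't apply to your data):

# Pay the conversion cost a single time...
df['str_col_as_int'] = df['str_col'].astype(int)

# ...then every later filter gets the faster numeric comparison.
df_small = df.loc[df['str_col_as_int'] == 1]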

sacuL
  • I'd mention you can, at a small cost, get the best of both worlds via categorical data, e.g. differentiate between `'011'` and `'11'` *and* enable vectorised operations (see the sketch below). – jpp Nov 06 '18 at 17:44
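A minimal sketch of the categorical approach mentioned above (the values are made up; note the distinct strings stay distinct, unlike with a numeric conversion):

import pandas as pd

s = pd.Series(['011', '11', '011']).astype('category')

# Equality comparisons are still vectorised and respect the original strings.
s == '011'          # True, False, True

# Under the hood the values are stored once as categories plus integer codes,
# which is what keeps memory usage low.
s.cat.categories    # Index(['011', '11'], dtype='object')
s.cat.codes         # 0, 1, 0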