3

I have a dataframe df and one of the columns count is contains strings. These strings are mostly convertable to integers (e.g. 0006) which is what I will do with them. However some of the entries in count are blank strings of spaces. How can I

  • Drop all the rows where the count value is a blank string.
  • Substitute all the blank values in that column with some numeric value of my choice.

The dataframe is very large if there are particularly efficient ways of doing this.

Simd
  • 19,447
  • 42
  • 136
  • 271
  • 1
    Possible duplicate of [Python Pandas DataFrame remove Empty Cells](http://stackoverflow.com/questions/29314033/python-pandas-dataframe-remove-empty-cells) – David Zemens Aug 31 '16 at 14:42
  • 4
    So you want to do two incompatible things, right? Once you have dropped blank values then there is no point substituting blank values... – IanS Aug 31 '16 at 15:13
  • 1
    Yes those were meant to be two different options. – Simd Aug 31 '16 at 16:29

2 Answers2

3

It seems that you want two different things. But first, convert column to numeric and coerce errors:

df['count'] = pd.to_numeric(df['count'], errors='coerce')

To drop rows (use subset to avoid dropping NaN from other columns):

df.dropna(subset=['count'])

To replace with default value:

df['count'] = df['count'].fillna(default_value)
IanS
  • 15,771
  • 9
  • 60
  • 84
2

Use dropna or fillna after pd.to_numeric(errosr='coerce')

consider a pandas series s

s = pd.Series(np.random.choice(('0001', ''), 1000000), name='intish')

drop method 1 (less robust)

s[s != ''].astype(int)

drop method 2 (more robust)

pd.to_numeric(s, 'coerce').dropna().astype(int)

drop timing

enter image description here

More robust method is faster


fill method 1

pd.to_numeric(s, 'coerce').fillna(0).astype(int)

fill method 2

s.where(s.astype(bool), 0).astype(int)

fill timing

enter image description here

Takes same amount of time as dropping

piRSquared
  • 285,575
  • 57
  • 475
  • 624