0

I am using df= df.replace('No data', np.nan) on a csv file containing ‘No data’ instead of blank/null entries where there is no numeric data. Using the head() method I see that the replace method does replace the ‘No data’ entries with NaN. When I use df.info() it says that the datatypes for each of the series is an object.

When I open the csv file in Excel and manually edit the data using find and replace to change ‘No data’ to blank/null entries, although the dataframes look exactly the same when I use df.head(), when I use df.info() it says that the datatypes are floats.

I was wondering why this was and how can I make it so that the datatypes for my series are floats, without having to manually edit the csv files.

Joe T. Boka
  • 6,554
  • 6
  • 29
  • 48
pjdavis
  • 325
  • 4
  • 25
  • 1
    Reason being is columns were imported as object because 'No data' is a string literal. Without it, columns import as floats. Replace does not convert data types. So convert with `astype` after replace. – Parfait Dec 17 '16 at 15:16

3 Answers3

2

If the rest of the data in your columns is numeric then you should use pd.to_numeric(df, errors='coerce')

Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
0
import pandas as pd
import numpy as np

# Create data for columns with strings in it
column_data = [1,2,3] + ['no data']
# Costruct data frame with two columns
df = pd.DataFrame({'col1':column_data, 'col2':column_data[::-1]})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1    4 non-null object
col2    4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes

# Replace 'no data' with Nan
df_nan = df.replace('no data', np.nan)
# Set type of all columns to float
df_result = df_nan.as_type({c:float for c in df_nan.columns})
df_result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1    3 non-null float64
col2    3 non-null float64
dtypes: float64(2)
memory usage: 144.0 bytes
cast42
  • 1,999
  • 1
  • 16
  • 10
0

Consider using the converters arg in pandas.read_csv() where you pass a dictionary of column numbers referencing a conversion function. Below checks the No data string and conditionally replaces value with np.nan otherwise leave as is:

import numpy as np
import pandas as pd

c_fct = lambda x : float(x if 'No data' not in x else np.nan)
convertdict = {1:c_fct, 2:c_fct, 3:c_fct, 4:c_fct, 5:c_fct}

df = pd.read_csv('Input.csv', converters=convertdict)

Input CSV

ID      Col1            Col2            Col3            Col4            Col5
TGG     0.634516647     0.900464347     0.998505978     0.170422713     0.893340128
GRI     No data         0.045915333     0.718398939     0.924813864     No data
NLB     0.921127268     0.614460813     0.677857676     0.343612947     0.559437744
SEI     0.081852313     No data         0.890816385     0.943313021     0.874857844
LOY     0.632556715     0.362855866     0.038702448     0.253762859     No data
OPL     0.375088582     0.268283238     0.761552111     0.589547625     0.192223208
CTK     0.349464541     0.844718987     No data         0.841439909     0.898093646
EUE     0.629784261     0.982589843     0.315670377     0.832419474     0.950044814
JLP     0.543942659     0.988380305     0.417191823     0.823857176     0.542514099
RHK     0.728053447     0.521816539     0.402523435     No data         0.558226706
AEM     0.005495116     0.715363776     0.075508356     0.959119268     0.844730368
VLQ     0.21146319      0.558208766     0.501769554     0.226539046     0.795861461
MDB     0.230514689     0.223163664     No data         0.324636384     0.700716246
LPH     0.853433224     0.582678173     0.633109347     0.432191426     No data
PEP     0.41096305      No data         .627776178      0.482359278     0.179863537
UQK     0.252598809     0.497517585     0.276060768     No data         0.087985623
KGJ     0.033985585     0.033702088     anNo data       0.286682709     0.543349787
JUQ     0.25971543      0.142067155     0.597985191     0.219841249     0.699822866
NYW     No data         0.17187907      0.157413049     0.209011772     0.592824483

Output

print(df)

#      ID      Col1      Col2      Col3      Col4      Col5
# 0   TGG  0.634517  0.900464  0.998506  0.170423  0.893340
# 1   GRI       NaN  0.045915  0.718399  0.924814       NaN
# 2   NLB  0.921127  0.614461  0.677858  0.343613  0.559438
# 3   SEI  0.081852       NaN  0.890816  0.943313  0.874858
# 4   LOY  0.632557  0.362856  0.038702  0.253763       NaN
# 5   OPL  0.375089  0.268283  0.761552  0.589548  0.192223
# 6   CTK  0.349465  0.844719       NaN  0.841440  0.898094
# 7   EUE  0.629784  0.982590  0.315670  0.832419  0.950045
# 8   JLP  0.543943  0.988380  0.417192  0.823857  0.542514
# 9   RHK  0.728053  0.521817  0.402523       NaN  0.558227
# 10  AEM  0.005495  0.715364  0.075508  0.959119  0.844730
# 11  VLQ  0.211463  0.558209  0.501770  0.226539  0.795861
# 12  MDB  0.230515  0.223164       NaN  0.324636  0.700716
# 13  LPH  0.853433  0.582678  0.633109  0.432191       NaN
# 14  PEP  0.410963       NaN  0.627776  0.482359  0.179864
# 15  UQK  0.252599  0.497518  0.276061       NaN  0.087986
# 16  KGJ  0.033986  0.033702       NaN  0.286683  0.543350
# 17  JUQ  0.259715  0.142067  0.597985  0.219841  0.699823
# 18  NYW       NaN  0.171879  0.157413  0.209012  0.592824

print(df.types)

# ID       object
# Col1    float64
# Col2    float64
# Col3    float64
# Col4    float64
# Col5    float64
# dtype: object
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    Good point. OP should really be checking for *No data* and can do so with `in` operator, proven to be [faster](http://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find/4901653#4901653). I edited accordingly. – Parfait Dec 19 '16 at 15:27