I have two CSV files with the same columns but different data. After importing them into pandas, the dtype of the column cost is float in one of them but object in the other.

I found a similar question; there, according to Andy Hayden, the answer was "this was a bug in <=0.12 (but is fixed in 0.13)".

But my question is: both of my CSV files have the same minimum value, 1.000000e-02, and neither contains blank values, so why do the dtypes differ?

(I'm using Python 3.7 and pandas 0.23.4 in PyCharm 2018.2.)

# csv 1: before pd.to_numeric
count     174526
unique     84873
top         0.41
freq         505
Name: cost, dtype: object

# csv 1: after pd.to_numeric
count    1.745260e+05
mean     3.608746e+04
std      4.690326e+05
min      1.000000e-02
25%      1.040000e+01
50%      1.190400e+02
75%      1.433350e+03
max      5.400000e+07
Name: cost, dtype: float64

# csv 2: 
count    2.578860e+05
mean     1.588632e+04
std      3.295925e+05
min      1.000000e-02
25%      2.820000e+00
50%      2.109000e+01
75%      2.426200e+02
max      6.030000e+07
Name: cost, dtype: float64

Looking at it another way: if I break my code into two parts, everything is fine for csv2:

import pandas as pd

# Attempt 1: filter and divide in one pass. The division fails.
df = pd.read_csv('file_name.csv', low_memory=False)
df = df[df.Cloumn1 != 'Value1']
df['cost_T'] = df['cost'] / 1000
df.to_csv('new_file_name.csv', index=False)
"""
TypeError: unsupported operand type(s) for /: 'str' and 'int'
"""

# Attempt 2: filter and save first, then reload the saved file and divide.
df = pd.read_csv('file_name.csv', low_memory=False)
df = df[df.Cloumn1 != 'Value1']
df.to_csv('new_file_name.csv', index=False)

df = pd.read_csv('new_file_name.csv', low_memory=False)
df['cost_T'] = df['cost'] / 1000
df.to_csv('final_file_name.csv', index=False)
"""
everything is fine.
"""
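
One possible explanation for the difference between the two attempts, offered only as an assumption because the CSV contents are not shown: if a row that the filter removes holds a non-numeric cost, read_csv parses the whole column as strings, and filtering afterwards does not convert the remaining values. Writing the filtered frame out and reading it back lets read_csv infer the dtype again from clean data. The sketch below reproduces that behaviour on made-up data; the column name Column1 and the value 'unknown' are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data: the only non-numeric cost sits in a row the filter removes.
raw = "Column1,cost\nA,0.41\nValue1,unknown\nB,10.4\n"

df = pd.read_csv(io.StringIO(raw), low_memory=False)
# One unparseable value makes the whole column non-numeric (strings).
print(df["cost"].dtype)

filtered = df[df["Column1"] != "Value1"]
# Dropping the bad row does not re-infer the dtype; the values are still strings.
print(filtered["cost"].dtype)

try:
    filtered["cost"] / 1000
except TypeError as exc:
    print("TypeError:", exc)

# Round trip through CSV: read_csv now sees only numeric text in 'cost'.
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
buffer.seek(0)
reread = pd.read_csv(buffer, low_memory=False)
print(reread["cost"].dtype)

# The same result without the round trip: convert the column explicitly.
converted = pd.to_numeric(filtered["cost"])
print(converted.dtype)
```

If this is what is happening, calling pd.to_numeric on the filtered column avoids the intermediate file.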

If someone has any idea, please let me know.

Sean.H
  • I'm a bit confused. You say you have two CSV files, but I don't see the contents of the CSV files, just the resulting series after you imported them. So it's a bit hard to figure out what's happening. Is there a blank field, a nan, an #undef, or some other character string in one of the rows? This might help you find the problem row: https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas. – Tom Johnson Feb 13 '19 at 19:43
  • Thanks, Tom. There are no nan, blank, or other character strings. Sorry, I can't upload the CSV file because the data is sensitive, but I've added more info. I hope it helps. Thanks again. – Sean.H Feb 14 '19 at 07:04
  • In your first code snippet, where you get the unsupported operand error, the error implies that the 'cost' column has a string in one of the rows. In the second snippet you write out the intermediate dataframe, and when it is read back in, the 'cost' column no longer contains a string. So I'm back to trying to figure out where the 'bad' data is. Worst case, you can cut the file in half and try each half to see whether it works. Continue binary searching like this until you isolate the rows that are causing the problem. – Tom Johnson Feb 14 '19 at 14:53
  • Tom :), thanks for your valuable idea. – Sean.H Feb 15 '19 at 09:49
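
As an alternative to bisecting the file by hand, the problem rows discussed in the comments can be listed directly. This is a minimal sketch, not taken from the thread: it assumes the column was read as strings, find_non_numeric is a hypothetical helper name, and the sample values are made up. It relies on pd.to_numeric with errors="coerce", which turns unparseable entries into NaN.

```python
import pandas as pd


def find_non_numeric(series):
    """Return the entries of `series` that pd.to_numeric cannot parse."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Entries that were present but became NaN are the unparseable ones.
    bad_mask = coerced.isna() & series.notna()
    return series[bad_mask]


# Made-up sample: a thousands separator and a stray word hide among numbers.
costs = pd.Series(["0.41", "10.4", "1,433.35", "abc"], name="cost")
bad = find_non_numeric(costs)
print(bad)
```

On a real frame this would be called as find_non_numeric(df["cost"]) right after read_csv and before any filtering, so the index of each returned entry points at the offending row.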

0 Answers