
My DataFrame has about 20 columns of mixed types; one of them is a 15- to 18-digit ID number. Some rows don't have an ID number (there are NaNs in the column). Because the column is parsed as float when reading the .csv, the ID number gets written back in scientific notation, losing the benefit of an ID number...

I am trying to find a way to save the DataFrame as a csv (using .to_csv), while keeping this ID number in full int form.

The closest thing I found was Format / Suppress Scientific Notation from Python Pandas Aggregation Results, but it changes all the columns, whereas I would like to change only that one.

Thanks for your help!

Flo
  • Sorry, `NaN` cannot be represented by `int`, so you need to decide what to do with these: either drop them or convert the column to `str` – EdChum Jan 23 '17 at 11:22
  • I'm afraid the only way to achieve that is to use a placeholder for the `NaN`s, like a special negative number: `-99999` – MaxU - stand with Ukraine Jan 23 '17 at 11:22
  • I think the best is to convert the `ID` column to string in `read_csv`, like `read_csv(filename, dtype={'ID': str})` – jezrael Jan 23 '17 at 11:24
  • I am happy that the problem I am encountering is not trivial, but I am sad the solution seems to be to use placeholders... I will use your suggestion, MaxU; it seems the easiest and most accurate to implement. Thank you. I will let this question sit for a little while, in case some genius has a miracle solution, and will mark it as solved in a couple of days. – Flo Jan 23 '17 at 12:27
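
For reference, jezrael's `dtype` suggestion can be sketched like this (the file contents and column name are illustrative):

```python
import io
import pandas as pd

# Illustrative CSV: a long ID plus a row with a missing ID
csv_data = "ID,name\n123456789012345678,a\n,b\n"

# Forcing the ID column to str preserves every digit;
# missing values still come through as NaN
df = pd.read_csv(io.StringIO(csv_data), dtype={'ID': str})
print(df['ID'].tolist())
```

Since the IDs are then plain strings, writing them back out with `to_csv` round-trips them unchanged.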

3 Answers


You can use `float_format` when calling `to_csv()`:

df.to_csv(filepath, index=False, sep='\t', float_format='%.6f')

Full answer here: convert scientific notation to decimal pandas python

In your case with IDs you can try changing the `6` to a `0`.

Christian Safka
  • I realize this wouldn't work for a single column. Perhaps you can apply a function to that column that will try to return the value as int. – Christian Safka Jan 23 '17 at 11:35
  • Yeah I tried this, but then the problem with NaNs comes in... =/ – Flo Jan 23 '17 at 12:19
  • You don't want to replace the NaNs with -999 or some other number? – Christian Safka Jan 23 '17 at 13:03
  • I wanted to see if someone knew of an elegant solution avoiding changing any of the data. I ended up doing just that, though, as suggested by MaxU. – Flo Jan 23 '17 at 13:39

As MaxU said in the comments, the best way is likely to use a placeholder for the NaNs.

I used `.fillna(-9999)` on my column to replace the NaNs; then it's easy to express the ID as `int` (using `.astype(int)` or `dtype`).

Problem solved.
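
A minimal sketch of that approach (the `-9999` sentinel and sample values are illustrative; pick a placeholder that cannot collide with real IDs):

```python
import numpy as np
import pandas as pd

# Toy float ID column with a missing value
df = pd.DataFrame({'ID': [1.5e18, np.nan]})

# Replace NaNs with a sentinel, then the column can become plain int64
df['ID'] = df['ID'].fillna(-9999).astype('int64')
print(df['ID'].tolist())
```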

Flo

As of pandas 0.24 (January 2019), you can represent your data as an `arrays.IntegerArray`, a nullable integer type, which lets you achieve what you want while sticking to idiomatic pandas.

For example, suppose the following is what you would get with floats:

In [99]: df.Id
Out[99]:
0    1.000000e+18
1    2.000000e+18
2    3.000000e+18
3             NaN
4    4.000000e+18
Name: Id, dtype: float64

In [100]: df.Id.to_csv('output.csv')

In [101]: !cat output.csv
0,1e+18
1,2e+18
2,3e+18
3,
4,4e+18

Then, using the nullable dtype `'Int64'` (note the capital "I", as opposed to NumPy's `'int64'`), you get the following:

In [102]: df.Id.astype('Int64')
Out[102]:
0    1000000000000000000
1    2000000000000000000
2    3000000000000000000
3                    NaN
4    4000000000000000000
Name: Id, dtype: Int64

In [103]: df.Id.astype('Int64').to_csv('output.csv')

In [104]: !cat output.csv
0,1000000000000000000
1,2000000000000000000
2,3000000000000000000
3,
4,4000000000000000000
fuglede