
I'm trying to read an old DBase file exported to CSV, and some columns are just empty. First I had a problem with integer columns being converted to float, but thanks to @Nathan's answer to Pandas read_csv dtype read all columns but few as string that problem was resolved. After that I had the right column types, using the code below:

import numpy as np
import pandas as pd


def read_csv(file_name):
    # Each header cell carries the original DBase field descriptor,
    # e.g. "CODE_CURR,N,2,0" -> name, type, width, decimal places.
    inferred_types = {}
    columns = pd.read_csv(file_name + '.csv', header=0, index_col=1, squeeze=True, keep_default_na=False, nrows=0).columns
    for col in columns:
        col_type = col.split(',')
        try:
            if len(col_type) < 2:
                inferred_types[col] = str
            elif col_type[1] == 'C':  # character field
                inferred_types[col] = str
            elif col_type[1] == 'N':  # numeric field
                if 'EGN' in col:  # special case: always keep as text
                    inferred_types[col] = str
                elif col_type[3] == '0':  # zero decimal places -> integer
                    inferred_types[col] = np.int64
                else:
                    inferred_types[col] = np.float64
            else:
                inferred_types[col] = str
        except Exception as e:
            print(f'{file_name} {col} -> {e}')

    df = pd.read_csv(
        file_name + '.csv', header=0, index_col=1, squeeze=True, keep_default_na=False, dtype=inferred_types
    )
    return df

I got a ValueError when pandas hit an empty cell in a column.

What am I missing here, please?

EDIT: Here are the first few rows of one of the problematic files:

,"CODE_CURR,N,2,0","CURRENCIES,C,20","CUOZN,C,3","FOR_WHAT,N,5,0","CURS_DT,N,13,7","DATE_VAL,C,8","DATE_ACT,C,8","TIME_ACT,C,8","ID_NUMBER,C,2","SUBS_CODE,C,7","USER_CODE,C,7"
0,1,австралийски долари,AUD,,46.665,,,,,,
1,2,австрийски шилинги,ATS,,5.758,,,,,,
2,3,белгийски франкове,BEF,,1.969,,,,,,

The problem is that the column "FOR_WHAT,N,5,0" (which should be integer) is entirely empty, so pandas raises ValueError: invalid literal for int() with base 10: ''
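For reference, the error can be reproduced with a tiny hypothetical CSV of the same shape, i.e. an entirely empty column that the dtype mapping declares as integer:

import io

import pandas as pd

# column B is entirely empty; keep_default_na=False keeps the cells as ''
csv = io.StringIO('A,B\n1,\n2,\n')
pd.read_csv(csv, keep_default_na=False, dtype={'A': 'int64', 'B': 'int64'})
# ValueError: invalid literal for int() with base 10: ''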

EDIT2: I would highly appreciate any workaround! The data set is not that big, so performance is not an issue here.

AssenD
  • I don't think that solution is very good; it would be better if you provided a small sample of your dataset and what you're trying to achieve. – Umar.H Mar 17 '21 at 14:34
  • @Manakin Edited the post with some sample data from the CSV – AssenD Mar 17 '21 at 14:40
  • It appears that this is just a bug in pandas, as can be seen here: https://github.com/pandas-dev/pandas/issues/2631 (in the comment from StefRe) and possibly here: https://github.com/pandas-dev/pandas/issues/32453 – AssenD Mar 18 '21 at 07:50
  • What value would you like for empty cells? `None` requires the column to have an `object` dtype, and NaN (`np.nan`) requires the column to have a floating point dtype. Pandas does offer a nullable integer type, but it is still declared *experimental*, meaning *use at own risk*. – Serge Ballesta Mar 18 '21 at 08:02
  • @Serge Ballesta: I'm considering processing the CSV as a two-step process: 1. just let pandas read the files without hints (I already did that, with some results that look strange to me, but pandas read them without errors anyway); 2. half-manually change the columns to the desired types. I'll post a solution when I find it... :) – AssenD Mar 18 '21 at 08:28
  • @Serge Ballesta: I would like to store the data set in some database server - DB2, PostgreSQL or similar - with minimal changes to the data. I expected that null values for integers are accepted by pandas, as in all contemporary databases... – AssenD Mar 18 '21 at 08:50

1 Answer


You have two ways to use NULL values (in the database sense) in a Pandas column containing integer values; both are sketched in the code example after the list below.

  1. the still official way: convert the column to float64 and use NaN for NULL values.

    The nice thing is that np.nan support is good in most database adapters, so all NaN values should be automatically converted to NULL database values if you insert (or update) them in a database. The downside is that float64 cannot hold exactly integer values greater than 2**53 (an IEEE 754 double has only a 53-bit significand).

  2. the experimental way: use the new pd.Int64Dtype

    This new type can hold any 64-bit integer value and a special pd.NA value. So it provides exactly what you want. The downside here is that the documentation explicitly says:

    IntegerArray is currently experimental. Its API or implementation may change without warning.

    Long story short, it may or may not work for your use case (support in the database adapter), and you may have to adapt your code if something changes in a later version.
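
A minimal sketch of both options, assuming a small in-memory CSV with an entirely empty integer column. Note that the keep_default_na=False from the question would keep empty cells as the string '' and defeat both conversions, so it is not used here:

import io

import numpy as np
import pandas as pd

csv = io.StringIO('A,B\n1,\n2,\n')  # column B is entirely empty

# Option 1: read the integer-like column as float64; empty cells become NaN
df1 = pd.read_csv(csv, dtype={'A': np.int64, 'B': np.float64})
print(df1['B'].dtype)  # float64, empty cells are NaN

# Option 2: read it with the experimental nullable integer dtype
csv.seek(0)
df2 = pd.read_csv(csv, dtype={'A': np.int64, 'B': pd.Int64Dtype()})
print(df2['B'].dtype)  # Int64, empty cells are pd.NA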

Serge Ballesta
  • Thank you, Serge! I'll take the second suggestion, as this will be a one-time action and the code is (will be) really very simple. – AssenD Mar 18 '21 at 09:26
  • I changed `inferred_types[col] = np.int64` to `inferred_types[col] = pd.Int64Dtype()` and, as a proof of concept, exported the dataframe to SQLite and it worked like magic :) @Serge thanks again! – AssenD Mar 18 '21 at 09:44
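
For completeness, a minimal sketch of the fix described in the comment above; the table and database file names are hypothetical:

import io
import sqlite3

import pandas as pd

csv = io.StringIO('A,B\n1,\n2,\n')  # column B is entirely empty
df = pd.read_csv(csv, dtype={'A': 'int64', 'B': pd.Int64Dtype()})

# pd.NA values end up as SQL NULL in the table
with sqlite3.connect('test.db') as conn:  # hypothetical database file
    df.to_sql('currencies', conn, if_exists='replace', index=False)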