I'm trying to read an old DBase file exported to CSV and some columns are just empty. First I had a problem with converting integer columns to float but tanks to @Nathan's answer here Pandas read_csv dtype read all columns but few as string the problem was resolved. After I had the right column types - using code bellow:
def read_csv(file_name):
# todo set correct data types for the columns
inferred_types = {}
columns = pd.read_csv(file_name + '.csv', header=0, index_col=1, squeeze=True, keep_default_na=False, nrows=0).columns
for col in columns:
col_type = col.split(',')
try:
if len(col_type) < 2:
inferred_types[col] = str
elif col_type[1] == 'C':
inferred_types[col] = str
elif col_type[1] == 'N':
if 'EGN' in col: # special case
inferred_types[col] = str
else:
if col_type[3] == '0':
inferred_types[col] = np.int64
else:
inferred_types[col] = np.float64
else:
inferred_types[col] = str
except Exception as e:
print(f'{file_name} {col} -> {e}')
df = pd.read_csv(
file_name + '.csv', header=0, index_col=1, squeeze=True, keep_default_na=False, dtype=inferred_types
)
return df
I got a ValueError when pandas hit an empty cell in a column.
What I'm missing here, please?
EDIT: Here are the first few rows of one of the problematic files:
,"CODE_CURR,N,2,0","CURRENCIES,C,20","CUOZN,C,3","FOR_WHAT,N,5,0","CURS_DT,N,13,7","DATE_VAL,C,8","DATE_ACT,C,8","TIME_ACT,C,8","ID_NUMBER,C,2","SUBS_CODE,C,7","USER_CODE,C,7"
0,1,австралийски долари,AUD,,46.665,,,,,,
1,2,австрийски шилинги,ATS,,5.758,,,,,,
2,3,белгийски франкове,BEF,,1.969,,,,,,
The problem is that the column "FOR_WHAT,N,5,0" (which should be integer) is entirely empty, so the ValueError:
ValueError: invalid literal for int() with base 10: ''
EDIT2: I would highly appreciate any workaround! Data set is not so big, so performance is not an issue here.