I need to read CSV from files (1 at a time) that can have different number of columns, where newer files have extra columns that old files don't have.
date|time|name|math
20101230|1345|mickey|0.5|
date|time|name|math|literature|physics
20101230|1345|mickey|0.5|3.5|9
date|time|name|math|literature|physics|chemistry|art
20101230|1345|mickey|0.5|3.5|9|6|7.4
I need to write code that can both old and new formats. The output dataframe will always use latest format. When the code read a file with old format, each unavailable column will be initialized with 1 default value. So in the above example, the output will always contain 8 columns, even if the file only contains 4.
The simplest solution is:
df = pandas.read_csv('input.txt',
dtype = {'date': int, 'time': int, 'name': str, 'math': float,
'literature': float, 'physics': float, 'chemistry': float, 'art': float})
n_cols = len(df.columnns)
if n_cols == 4:
df['literature'] = 0.0
df['physics'] = 0.0
df['chemistry'] = 0.0
df['art'] = 0.0
elif n_cols == 6:
df['chemistry'] = 0.0
df['art'] = 0.0
elif ...
return df
However, this solution doesn't look good, since you have to change a lot of old code everytime there's a new format.
How should I handle this problem?
Edit: the question was closed because it's "similar" to this. But it's very clearly different question. I need to load 1 file that might have missing columns (compared to latest format), not loading multiple files then concatenate them.