Background:
I have a dataframe with a column that looks like this:
>>> merge_df['AAChange']
0 STK11:NM_000455:exon1:c.148_149TG
Name: AAChange, dtype: object
I need to split it into separate columns on the ':' character, like this:
>>> new_cols = merge_df['AAChange'].str.split(':').apply(pd.Series,1)
>>> new_cols
       0          1      2            3
0  STK11  NM_000455  exon1  c.148_149TG
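As an aside, the same split can be written without `apply`. This is a minimal sketch, assuming a pandas version where `Series.str.split` accepts `expand=True`; the one-row `merge_df` is rebuilt here from the sample above purely for illustration:

```python
import pandas as pd

# Illustrative one-row frame reconstructed from the sample shown above
merge_df = pd.DataFrame({'AAChange': ['STK11:NM_000455:exon1:c.148_149TG']})

# expand=True returns a DataFrame with one column per field,
# labelled with integers starting at 0
new_cols = merge_df['AAChange'].str.split(':', expand=True)
print(new_cols)
```

Rows with fewer fields than the longest row are padded on the right with missing values, so the number of columns follows the longest entry in the data.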
Then I need to rename the columns, so I store the new names in a list:
>>> new_colnames = ['Gene.AA', 'Transcript', 'Exon', 'Coding', 'Amino Acid Change']
However, there is a problem: all 5 of these columns must exist in the output, but this entry was missing a field in the source data, so the split produced only 4 columns. As a result, renaming the columns fails:
>>> new_cols.columns = new_colnames
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:45002)
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/generic.py", line 425, in _set_axis
    self._data.set_axis(axis, labels)
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/internals.py", line 2572, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 5 elements
So I want to add an empty column for every missing field and rename the columns at the same time. This answer seemed to have a good solution: reindex using the new list of column names. However, it does not give the desired result:
>>> new_cols.reindex(columns = new_colnames)
   Gene.AA  Transcript  Exon  Coding  Amino Acid Change
0      NaN         NaN   NaN     NaN                NaN
Now I have all 5 columns, but the original data has been lost. Is there a better solution that will let me rename the existing columns and add any missing ones?
The desired output would look like this:
>>> new_cols.reindex(columns = new_colnames)
   Gene.AA  Transcript   Exon       Coding  Amino Acid Change
0    STK11   NM_000455  exon1  c.148_149TG                NaN
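For context, the all-NaN result is what `reindex` is expected to do: it selects columns by their existing labels (here the integers 0 to 3), and none of those match the new names. One possible approach is to name the existing columns first and only then reindex. This is a sketch only, and it assumes that a missing field is always the trailing one, so the first N names line up with the N columns that exist; `result` is a hypothetical variable name, and the one-row `merge_df` is rebuilt from the sample above for illustration:

```python
import pandas as pd

# Illustrative one-row frame reconstructed from the sample shown above
merge_df = pd.DataFrame({'AAChange': ['STK11:NM_000455:exon1:c.148_149TG']})
new_colnames = ['Gene.AA', 'Transcript', 'Exon', 'Coding', 'Amino Acid Change']

new_cols = merge_df['AAChange'].str.split(':', expand=True)

# ASSUMPTION: missing fields are always at the end, so the existing
# columns correspond to the leading entries of new_colnames.
new_cols.columns = new_colnames[:new_cols.shape[1]]

# Now the labels match, so reindex keeps the data and only adds
# the columns that are still missing (filled with NaN).
result = new_cols.reindex(columns=new_colnames)
print(result)
```

If a row ever splits into more than 5 fields, the slice is shorter than the number of columns and the assignment raises the same length-mismatch error, so that case would need separate handling.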