1

Background:

I have a dataframe, with a column that looks like this:

>>> merge_df['AAChange']
0    STK11:NM_000455:exon1:c.148_149TG
Name: AAChange, dtype: object

I need to split it into separate columns on the ':' character, like this:

>>> new_cols = merge_df['AAChange'].str.split(':').apply(pd.Series,1)
>>> new_cols
       0          1      2            3
0  STK11  NM_000455  exon1  c.148_149TG

Then I need to rename the columns, so I store the new names in a list:

>>> new_colnames = ['Gene.AA', 'Transcript', 'Exon', 'Coding', 'Amino Acid Change']

However, there is a problem: All 5 of these columns must exist in the output, but in this data entry a field was missing in the source data, leaving only 4 fields. So, trying to rename the columns fails:

>>> new_cols.columns = new_colnames
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:45002)
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/generic.py", line 425, in _set_axis
    self._data.set_axis(axis, labels)
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/internals.py", line 2572, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 5 elements

So, I want to both add an empty column for every missing column, and change the column names simultaneously. This answer seemed to have a good solution; to reindex based on the new columns list. However, it does not give the desired results:

>>> new_cols.reindex(columns = new_colnames)
   Gene.AA  Transcript  Exon  Coding  Amino Acid Change
0      NaN         NaN   NaN     NaN                NaN

Now I've got all the missing columns, but the original data has been lost. Is there a better solution that will let me rename the existing columns and add all missing columns?

The desired output would look like this:

>>> new_cols.reindex(columns = new_colnames)
   Gene.AA  Transcript   Exon         Coding  Amino Acid Change
0    STK11   NM_000455  exon1   c.148_149TG                NaN
Community
  • 1
  • 1
user5359531
  • 3,217
  • 6
  • 30
  • 55
  • The output of your split contains 4 columns while you're trying to assign 5 names. I guess you're trying to name the index? – sirfz Dec 09 '16 at 18:22
  • I am trying to rename the columns based on the list in `new_colnames`. In 99% of cases, the output of the split gives all 5 desired fields and `new_cols.columns = new_colnames` works correctly, but occasionally I hit a data entry like this which is missing a field, thus only giving 4 columns after splitting. – user5359531 Dec 09 '16 at 18:28
  • Alright so you need to re-write your question, your problem is actually handling the cases where the result yields less than the desired number of fields. – sirfz Dec 09 '16 at 18:29
  • You could just check the length of `new_cols` and simply add the missing column (assuming it's always the last one) `new_cols[4] = None` and then assign the column names. – sirfz Dec 09 '16 at 18:34
  • How would you know which columns are missing if less than 5 are returned? – Ted Petrou Dec 09 '16 at 18:36
  • I cannot assume its always `new_cols[4]` that will be missing. I had considered comparing lengths of new & old colnames and filling in empty columns accordingly, but that seemed overly-complicated for something that might be built-in by default, similar to the [reindex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) method. – user5359531 Dec 09 '16 at 18:37
  • @TedPetrou that is a another issue that I'll have to figure out separately :( – user5359531 Dec 09 '16 at 18:39
  • @user5359531 you could consider writing a custom function for the `apply` method which always return the correct number of rows after checking the result of the split. – sirfz Dec 09 '16 at 19:20

1 Answers1

0

You can rename the original column names with the leading desired ones.

new_cols.columns = new_colnames[:-1]

# new_cols
  Gene.AA Transcript   Exon       Coding
0   STK11  NM_000455  exon1  c.148_149TG

Then, insert the extra one by the below command. It inserts the new column as the #4 column and fills it with nan values.

new_cols.insert(4, new_colnames[-1], [np.nan]*len(new_cols.index))

# new_cols
  Gene.AA Transcript   Exon       Coding  Amino Acid Change
0   STK11  NM_000455  exon1  c.148_149TG                NaN
amin
  • 1,413
  • 14
  • 24