
I have a dataset with a very large number of columns whose names are formatted like a.b.c.d.e. My goal is to do two things. I would like to change each column name to 'c'. I would also like to generate a dictionary for later use where 'c' maps to 'b.c', so I can change the names back at a later point. The full function I am using is below.

def trim_col_names(df):

    cols = []
    string_matches = {}
    for col in df.columns[3:]:
        tokens = col.split('.')
        trimmed = tokens[2]
        cols.append(trimmed)
        colname = '.'.join(tokens[1:3])
        string_matches[trimmed] = colname
    df.columns = list(df.columns)[:3] + cols
    return df, string_matches

df_p, string_matches = trim_col_names(df_p)

Printing tokens gives the expected ['a', 'b', 'c', 'd', 'e']; however, I am getting the following error on the line trimmed = tokens[2]: IndexError: list index out of range

Interestingly, when I switched the order of the lines trimmed = tokens[2] and colname = '.'.join(tokens[1:3]) so that colname was assigned first, the error still appeared on trimmed, which makes me think the problem is isolated to this one line. I also use very similar lines in other functions within this code with no issue. What am I missing?

Here is a sample dataset. The full dataset has thousands of columns, so I have only given a very small subset of the data. If this is not sufficient, I can provide a larger sample.

X  Y  Z  tpm.293SLAM_rinderpest_infection_00hr.CH123.bhg.gh   tpm.293SLAM_rinderpest_infection_01hr.CH124.byl.gw  tpm.293SLAM_rinderpest_infection_02hr.CH125.lmg.ge

x  y  z                          2                                                2                                               4

x1 y1 z1                         3                                                8                                               2

x2 y2 z2                         4                                                5                                               7

I am trying to keep CH123 as the column name and 293SLAM_rinderpest_infection_00hr.CH123 as the value it maps to in the dictionary.
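
A minimal sketch of that mapping, written against a plain list of column names so it runs without pandas. The helper name `build_rename_map` and its `n_keep` parameter are hypothetical, and leaving names with fewer than three dot-separated tokens unchanged is an assumption about how such columns should be handled:

```python
def build_rename_map(columns, n_keep=3):
    """Return (new_columns, short_to_long) for dot-separated column names.

    Hypothetical helper: the first `n_keep` names are kept as they are.
    """
    new_columns = list(columns[:n_keep])
    short_to_long = {}
    for col in columns[n_keep:]:
        tokens = col.split('.')
        if len(tokens) < 3:
            # Assumption: leave malformed names untouched instead of raising.
            new_columns.append(col)
            continue
        short = tokens[2]
        short_to_long[short] = '.'.join(tokens[1:3])
        new_columns.append(short)
    return new_columns, short_to_long


columns = [
    'X', 'Y', 'Z',
    'tpm.293SLAM_rinderpest_infection_00hr.CH123.bhg.gh',
    'tpm.293SLAM_rinderpest_infection_01hr.CH124.byl.gw',
]
new_columns, short_to_long = build_rename_map(columns)
print(new_columns)             # ['X', 'Y', 'Z', 'CH123', 'CH124']
print(short_to_long['CH123'])  # 293SLAM_rinderpest_infection_00hr.CH123
```

With a DataFrame, the result could be applied with `df.columns = new_columns`, and the longer names restored later with `df.rename(columns=short_to_long)`. If two columns share the same third token, the later one overwrites the earlier entry in the dictionary, so this only works when the short names are unique.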

Trenton McKinney
keenan
  • Please [create a reproducible copy of the DataFrame with `df.head(10).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246/how-to-provide-a-copy-of-your-dataframe-with-to-clipboard), [edit] the question, and paste the clipboard into a code block. The question should [Provide a Minimal, Reproducible Example (e.g. code, data, errors) as text](https://stackoverflow.com/help/minimal-reproducible-example) – Trenton McKinney Jul 03 '20 at 00:39
  • The dataset is thousands of columns so I have recreated a smaller version and included it in the question. – keenan Jul 03 '20 at 00:54
  • Voted to close because I suspect your error is with bad data somewhere inside your dataset. Perhaps putting `assert len(tokens) >= 3, tokens` will help catch the issue for you. – David Jul 03 '20 at 05:56
  • @David If the error is just with bad data, why does `colname = '.'.join(tokens[1:3])` not raise an error? That line joins elements 1 and 2, whereas trimmed just selects element 2. – keenan Jul 03 '20 at 17:09
  • @keenan That's a slice, so given `x = []`, `x[1:3]` produces `[]` instead of raising an error. – David Jul 04 '20 at 19:13
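
The difference described in the last comment can be checked directly: slicing past the end of a list returns a shorter (possibly empty) list, while indexing past the end raises `IndexError`. The sketch below also lists the names that would trigger the error. The sample column names and the helper `find_short_names` are made up for illustration and are not taken from the asker's data:

```python
def find_short_names(columns, min_tokens=3):
    """Return the names that split into fewer than `min_tokens` parts."""
    return [c for c in columns if len(c.split('.')) < min_tokens]


tokens = 'a.b'.split('.')      # only two tokens: ['a', 'b']
print(tokens[1:3])             # the slice is clipped to the list: ['b']

try:
    tokens[2]                  # index 2 does not exist
except IndexError as exc:
    print('IndexError:', exc)

# Hypothetical column names; the last two have no third token.
columns = ['tpm.sample.CH123.bhg.gh', 'tpm.sample', 'plain']
print(find_short_names(columns))   # ['tpm.sample', 'plain']
```

Running `find_short_names(df.columns[3:])` on the real data would show which column names, if any, split into fewer than three tokens.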

0 Answers