1

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.

Sample dataframe header structure excerpt:

id    dob    gender    pro_1    pro_10   pro_11   pro_2 ... pro_9    pre_1   pre_10   ...

I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:

id    dob    gender    pro_1    pre_1    que_1    fre_1    gen_1    pro2    pre_2    que_2    fre_2    ...

(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)

Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!

EDIT:

Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:

id    dob    gender    pro_1    pro_2    pro_3    ...    pre_1    pre_2    pre_3   ...

EDIT 2.0:

Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.

Chrestomanci
  • 201
  • 1
  • 9
  • Is the number always the last character in the columns? And does it matter which columns go first (pro vs pre)? – Edeki Okoh May 02 '19 at 17:20
  • Yes, except for the columns that do not repeat--those have no numbers at the end of them. If possible, they should remain the first columns in the dataframe and should not move. It doesn't matter which columns go first, nope, although it would be pretty cool if the order could be maintained -- but not necessary if it's not practical. The only order that is important is to maintain consistency between groups, so if group 1 has pre before pro, then group 2 should have the same, etc. – Chrestomanci May 02 '19 at 17:22
  • For the alternative where instead of grouping by numbers, the numbers embedded within the column names are ordered properly, the order IS important, so pro should go before pre, etc, if that makes sense! – Chrestomanci May 02 '19 at 17:28
  • But it is always growing how will you know what goes before what? Or is the prefix of the columns always pro and pre? – Edeki Okoh May 02 '19 at 17:28
  • There are multiple columns, not just pro and pre, I just provided those as examples because there are actually around 60 distinct columns that can repeat in total. When a new encounter is added, the order is maintained, but a new column is generated by the system for each distinct column (so there would be a new pro, pre, gen, que, etc. with an underscore +1 the previous maximum). Does that make more sense? Please let me know if I can explain this better! – Chrestomanci May 02 '19 at 17:42
  • It makes sense. I can sort on the first condition making the id dob and gender the first column but the sort is alphabetical so the ordering of pro, gen, que will be determined by which letter comes first. Is that ok? – Edeki Okoh May 02 '19 at 17:56
  • That works! Thank you! – Chrestomanci May 02 '19 at 17:57

4 Answers4

1

Luckily there is a one liner in python that can fix this:

df = df.reindex(sorted(df.columns), axis=1)

For Example lets say you had this dataframe:

import pandas as pd import numpy as np

df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]}
                   )

print(df)

Output:

   Name  ID  Prod3  Prod1  Prod_1  Pre7  Pre2  Pre_2  Pre_9
0     2   2     10      2       2     2    10     10     10
1     4   0      2      4       4     0     2      2      2
2     8   0      1      8       8     0     1      1      1
3     0   0      8      0       0     0     8      8      8

Then used

df = df.reindex(sorted(df.columns), axis=1)

Then the dataframe will then look like:

   ID  Name  Pre2  Pre7  Pre_2  Pre_9  Prod1  Prod3  Prod_1
0   2     2    10     2     10     10      2     10       2
1   0     4     2     0      2      2      4      2       4
2   0     8     1     0      1      1      8      1       8
3   0     0     8     0      8      8      0      8       0

As you can see, the columns without underscore will come first, followed by an ordering based on the number after the underscore. However this also sorts of the column names, so the column names that come first in the alphabet will be first.

Edeki Okoh
  • 1,786
  • 15
  • 27
  • Unfortunately this is not working; it is still ordering columns based on an interpretation of the number as a string (e.g., 1, 10, 11, 12, 13, 14, 2, 3, 4, 5) -- since 1 is < 2 in a string. – Chrestomanci May 02 '19 at 18:03
1

You need to split you column on '_' then convert to int:

c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']

df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)

Output:

   A_1  B_1  A_2  B_2  A_3  B_3  A_10  B_10
0   68   11   59   69   37   68    76    17
1   19   37   52   54   23   93    85     3

Next case, you need human sorting:

import re
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]



df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)

Output:

   A_1  A_2  A_3  A_10  B_1  B_2  B_3  B_10
0   68   59   37    76   11   69   68    17
1   19   52   23    85   37   54   93     3
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • tdf.reindex(sorted(tdf.columns, key = lambda x: int(x.split('_')[1])),axis=1) gave this error: ValueError: cannot reindex from a duplicate axis – Chrestomanci May 02 '19 at 19:28
  • Do you have to columns with the same name? – Scott Boston May 02 '19 at 19:29
  • Oh yes, turns out there was a group of columns that had the same name; I renamed that group and the code worked. However, the output was weird -- it returned 3 rows: first row was all of the column names, and the other 2 rows had seemingly random numbers. – Chrestomanci May 02 '19 at 19:49
1

Here is one way you can try:

# column names copied from your example
example_cols = 'id    dob    gender    pro_1    pro_10   pro_11   pro_2  pro_9    pre_1   pre_10'.split()

# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
#   id  dob  gender  pro_1  pro_10  pro_11  pro_2  pro_9  pre_1  pre_10
#0   0    1       2      3       4       5      6      7      8       9

# number of columns excluded from sorting
N = 3

# get a list of columns from the dataframe
cols = df.columns.tolist()

# split, create an tuple of (column_name, prefix, number) and sorted based on the 2nd and 3rd item of the tuple, then retrieved the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]

# get the new dataframe based on the cols_new
df_new = df[cols_new]
#   id  dob  gender  pre_1  pre_10  pro_1  pro_2  pro_9  pro_10  pro_11
#0   0    1       2      8       9      3      6      7       4       5
jxc
  • 13,553
  • 4
  • 16
  • 34
0

Try this.

To re-order the columns based on the number after the column name

cols_fixed = df.columns[:3]  # change index no based on your df
cols_variable = df.columns[3:]  # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1]))  # split based on the number after '_'
cols_new = cols_fixed + cols_variable 
new_df = pd.DataFrame(df[cols_new])

To re-arrange column names based on the string part AND number part of the column names

cols_fixed = df.columns[:3]  # change index no based on your df
cols_variable = df.columns[3:]  # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable 
new_df = pd.DataFrame(df[cols_new])
Supratim Haldar
  • 2,376
  • 3
  • 16
  • 26
  • cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1])) gave me "IndexError: list index out of range" for the first one, and for the second one gave me "ValueError: operands could not be broadcast together with shapes (0,) (1107,) " – Chrestomanci May 02 '19 at 20:02
  • When you execute `cols_variable = df.columns[3:]`, are names of all columns from index 3 to end in the format "_"? – Supratim Haldar May 03 '19 at 04:58