3

My Python program generates a data file in pandas format

        Source    LogP    MolWt  HBA  HBD
0        cne_1  1.1732  263.405    3    1
1       cne_10  2.6639  197.237    2    0
2      cne_100 -0.2886  170.193    4    2
3     cne_1000  1.9644  304.709    5    1
4     cne_1001  1.4986  162.144    3    1
...        ...     ...      ...  ...  ...
1031   cne_995  3.0179  347.219    4    2
1032   cne_996  4.8419  407.495    6    2
1033   cne_997  3.3560  354.524    3    1
1034   cne_998  7.5465  635.316    4    2
1035   cne_999  3.3514  389.556    4    1

I need to sort the strings based on the second (Source) column according to the number, so the right order of the lines after sorting should be : cne_1, cne_2, cne_3, cne_4 et I've tried to use:

df_sorted = df.sort_values('Source', ascending=True)

but it did not produse any changings in the order of the lines.

3 Answers3

4

For last pandas versions is possible use parameter key with split values by _ and converting values to integers:

df_sorted = df.sort_values('Source', key=lambda x: x.str.split('_').str[1].astype(int)) 

Or is possible get positions of sorted values by Series.argsort and pass to DataFrame.iloc:

df_sorted = df.iloc[df['Source'].str.split('_').str[1].astype(int).argsort()]
print (df_sorted)
        Source    LogP    MolWt  HBA  HBD
0        cne_1  1.1732  263.405    3    1
1       cne_10  2.6639  197.237    2    0
2      cne_100 -0.2886  170.193    4    2
1031   cne_995  3.0179  347.219    4    2
1032   cne_996  4.8419  407.495    6    2
1033   cne_997  3.3560  354.524    3    1
1034   cne_998  7.5465  635.316    4    2
1035   cne_999  3.3514  389.556    4    1
3     cne_1000  1.9644  304.709    5    1
4     cne_1001  1.4986  162.144    3    1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

Get the integer value in the column and then sort using that.

df['sortIndex'] = df.Source.str.replace('cne_', '', regex=False).astype(int)
df_sorted = df.sort_values('sortIndex', ascending=True)
Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
1

Extract the digits, convert to int and sort accordingly. (.sort_values(0) because the unnamed column is named 0 automatically)

df_sorted = df.loc[df["Source"].str.extract(r"_(\d+)").astype(int).sort_values(0).index]

Result

print(df_sorted)
        Source    LogP    MolWt  HBA  HBD
0        cne_1  1.1732  263.405    3    1
1       cne_10  2.6639  197.237    2    0
2      cne_100 -0.2886  170.193    4    2
1031   cne_995  3.0179  347.219    4    2
1032   cne_996  4.8419  407.495    6    2
1033   cne_997  3.3560  354.524    3    1
1034   cne_998  7.5465  635.316    4    2
1035   cne_999  3.3514  389.556    4    1
3     cne_1000  1.9644  304.709    5    1
4     cne_1001  1.4986  162.144    3    1
Bill Huang
  • 4,491
  • 2
  • 13
  • 31