1

I have a TSV file which I want to read, sort by a specific column and write it back.
Two problems I ran into are:

  • providing a custom key results an error (which I will show the backtrace at the end of the post)
  • without providing a custom key, the sorting is done. but when I write the dataframe back to a TSV file, there is another column which anotates the row numbering which I want to remove.

I have tried to use pandas module in the following manner:

import re
import pandas as pd

def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
    return [int(text) if text.isdigit() else text.lower() for text in _nsre.split(s)]

def main(path):
    with open(path, 'r') as f:
        df = pd.read_csv(path, delimiter='\t')

        a = df.sort_values('#mm10.kgXref.geneSymbol', key=natural_sort_key, na_position='first')
        a.to_csv('mouse_conversion_by_gene_symbol', sep='\t')

if __name__ == '__main__':
    main('mouse_conversion')

The backtrace I am getting after providing a custom key is:

Traceback (most recent call last):
  File "sortTables.py", line 22, in <module>
    main('mouse_conversion')
  File "sortTables.py", line 12, in main
    a = df.sort_values('#mm10.kgXref.geneSymbol', key=natural_sort_key, na_position='first')
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/frame.py", line 5297, in sort_values
    k, kind=kind, ascending=ascending, na_position=na_position, key=key
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/sorting.py", line 287, in nargsort
    items = ensure_key_mapped(items, key)
  File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/sorting.py", line 420, in ensure_key_mapped
    result = key(values.copy())
  File "sortTables.py", line 6, in natural_sort_key
    return [int(text) if text.isdigit() else text.lower() for text in _nsre.split(s)]
TypeError: expected string or bytes-like object

As for the second problem, here's an example:
for this input:

#mm10.kgXref.geneSymbol mm10.kgXref.refseq  mm10.knownToEnsembl.name
Rp1 NM_011283   ENSMUST00000027032.5
Gm37483     ENSMUST00000194382.1
Sox17   NM_011441   ENSMUST00000027035.9

I am getting this output:

#mm10.kgXref.geneSymbol mm10.kgXref.refseq  mm10.knownToEnsembl.name
19    Rp1   NM_011283   ENSMUST00000027032.5
21    Gm37483       ENSMUST00000194382.1
29    Sox17    NM_011441    ENSMUST00000027035.9

And I would like to delete the column with the row enumeration.
Would appreciate some insight on both problems.

EDIT: found an answer in the docs about the enumeration problem. if that's relevant for somebody else, simply use this:

a.to_csv('mouse_conversion_by_gene_symbol', sep='\t', index=False)

instead of the original line

EDIT 2 : after implementing the solution suggested I was able to sort the dataframe by the first and last column.
When I try to sort the dataframe by the second column I get the exact same backtrace from above.
The only logical difference I see is that the second column includes NaN values and the other columns don't.
How can I modify the code to solve this problem?

Eliran Turgeman
  • 1,526
  • 2
  • 16
  • 34
  • do you want to sort by the digit in the string? your current returns a list – Ezer K Aug 10 '20 at 15:17
  • Not just the digit, it's a natural sort. You can look at what problem I will run into if I don't use this kind of sorting here https://stackoverflow.com/questions/4836710/is-there-a-built-in-function-for-string-natural-sort – Eliran Turgeman Aug 10 '20 at 15:26

2 Answers2

1

according to docs key func should get and give a Series (BTW, pd.read_csv does not need with open), so try this:

import re
import pandas as pd

def natural_sort_key(S, _nsre=re.compile('([0-9]+)')):
    return pd.Series([[int(text) if text.isdigit() else text.lower() for text in _nsre.split(s)] for s in S.values])

def main(path):
    df = pd.read_csv(path, delimiter='\t')

    a = df.sort_values('#mm10.kgXref.geneSymbol', key=natural_sort_key, na_position='first')
    a.to_csv('mouse_conversion_by_gene_symbol', sep='\t')

if __name__ == '__main__':
    main('mouse_conversion')
Ezer K
  • 3,637
  • 3
  • 18
  • 34
  • I actually tried to create 3 different files where every one of them is sorted by a different column. for the first and last column this code worked, but for the middle one it didn't (same traceback as in the original code), perhaps you have an idea why? – Eliran Turgeman Aug 10 '20 at 15:56
  • no idea, try to find a sample of the data that reproduces that error and post it – Ezer K Aug 10 '20 at 15:58
  • As opposed to the other columns, there are some rows which have a blank second column so I think that could be the problem – Eliran Turgeman Aug 10 '20 at 16:02
0
with open("test.tsv","r") as f:
    table=[[b,a,c] for [a,b,c] in [r.split('\t') for r in f.read().split('\n')] ]
table.sort()
with open("output.tsv","w") as f:
    f.write('\n'.join(['\t'.join([a,b,c]) for [b,a,c] in table]))

Without pandas, You can use above python code to sort the data in test.tsv and write to output.tst
This program sorts based on 2nd column. to sort based on other columns, change lines 2 and 5 accordingly. like if you want to sort based on 3rd column change [b,a,c] into [c,a,b] in both the lines to make the 'column to be sorted' as the first in the list.

Liju
  • 2,273
  • 3
  • 6
  • 21