I have a TSV file which I want to read, sort by a specific column and write it back.
Two problems I ran into are:
- providing a custom key results an error (which I will show the backtrace at the end of the post)
- without providing a custom key, the sorting is done. but when I write the dataframe back to a TSV file, there is another column which anotates the row numbering which I want to remove.
I have tried to use pandas
module in the following manner:
import re
import pandas as pd
def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
return [int(text) if text.isdigit() else text.lower() for text in _nsre.split(s)]
def main(path):
with open(path, 'r') as f:
df = pd.read_csv(path, delimiter='\t')
a = df.sort_values('#mm10.kgXref.geneSymbol', key=natural_sort_key, na_position='first')
a.to_csv('mouse_conversion_by_gene_symbol', sep='\t')
if __name__ == '__main__':
main('mouse_conversion')
The backtrace I am getting after providing a custom key is:
Traceback (most recent call last):
File "sortTables.py", line 22, in <module>
main('mouse_conversion')
File "sortTables.py", line 12, in main
a = df.sort_values('#mm10.kgXref.geneSymbol', key=natural_sort_key, na_position='first')
File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/frame.py", line 5297, in sort_values
k, kind=kind, ascending=ascending, na_position=na_position, key=key
File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/sorting.py", line 287, in nargsort
items = ensure_key_mapped(items, key)
File "/home/eliran/miniconda/envs/newenv/lib/python3.7/site-packages/pandas/core/sorting.py", line 420, in ensure_key_mapped
result = key(values.copy())
File "sortTables.py", line 6, in natural_sort_key
return [int(text) if text.isdigit() else text.lower() for text in _nsre.split(s)]
TypeError: expected string or bytes-like object
As for the second problem, here's an example:
for this input:
#mm10.kgXref.geneSymbol mm10.kgXref.refseq mm10.knownToEnsembl.name
Rp1 NM_011283 ENSMUST00000027032.5
Gm37483 ENSMUST00000194382.1
Sox17 NM_011441 ENSMUST00000027035.9
I am getting this output:
#mm10.kgXref.geneSymbol mm10.kgXref.refseq mm10.knownToEnsembl.name
19 Rp1 NM_011283 ENSMUST00000027032.5
21 Gm37483 ENSMUST00000194382.1
29 Sox17 NM_011441 ENSMUST00000027035.9
And I would like to delete the column with the row enumeration.
Would appreciate some insight on both problems.
EDIT: found an answer in the docs about the enumeration problem. if that's relevant for somebody else, simply use this:
a.to_csv('mouse_conversion_by_gene_symbol', sep='\t', index=False)
instead of the original line
EDIT 2 : after implementing the solution suggested I was able to sort the dataframe by the first and last column.
When I try to sort the dataframe by the second column I get the exact same backtrace from above.
The only logical difference I see is that the second column includes NaN
values and the other columns don't.
How can I modify the code to solve this problem?