How to sort input without fixed-width field organization?

Question

I have a .txt file full of such lines as:

Name | Email@example.com | Score
Name2 | Email2@madeupsite.com | Score

where Score is an integer from 0 to 1 billion.

And I want to sort this file by score from big to small. My issue is that because names and emails are different lengths, the score isn't in a consistent spot every time that I can access it. How would I overcome this problem?

(I'm not too sure how to word the title so I hope this body can explain it better; please let me know if the question is not clear)

Your input data is really **PSV (Pipe-Separated Value)**. You could either read it with [`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) with `sep='|'`. Or just do a `split('|')` on each line. Then sort, by column 3. — smci, Jun 29 '19 at 01:57
When you say "txt input without consistent organization", you really only mean "without fixed-width fields". But it still has separators ('|') which you can split on, so it does have consistent organization. — smci, Jun 29 '19 at 01:59
Related: [How to read file when the words are separated by “|” (PSV)?](https://stackoverflow.com/questions/55997528/how-to-read-file-when-the-words-are-separated-by-psv) — smci, Jun 29 '19 at 03:04
Related: [How to sort pandas dataframe from one column, specified by number](https://stackoverflow.com/questions/37787698/how-to-sort-pandas-dataframe-from-one-column) — smci, Jun 29 '19 at 03:14

sin tribu · Answer 1 · 2019-06-29T01:44:39.677


# a list to store your data; open the file to retrieve the data
data = []
with open( 'fname.txt' ) as f:
    for line in f:
        # line.split( '|' ) splits the string into a list of fields separated by '|'
        data.append( line.strip().split('|') )

# convert the scores into integers (int() ignores the surrounding spaces)
for d in data:
    d[2] = int( d[2] )

# sort the data using the 3rd element of each row (index 2), from big to small
sorted_data = sorted( data, key=lambda x: x[2], reverse=True )
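
A minimal end-to-end sketch of the same split-and-sort approach, using an in-memory `io.StringIO` in place of `fname.txt` so it runs without a file on disk. The helper name `sort_rows_by_score` and the sample rows are hypothetical. It also strips the padding around each field and skips blank lines, which is an assumption about how you want the input cleaned up.

```python
import io

def sort_rows_by_score(f):
    """Read pipe-separated lines from a file-like object, largest score first."""
    rows = []
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        # Split into the three fields and drop the spaces around each one
        name, email, score = (field.strip() for field in line.split('|'))
        rows.append((name, email, int(score)))
    # Sort by the score (index 2), from big to small
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical sample data standing in for the real file
sample = io.StringIO(
    "Name | Email@example.com | 1000\n"
    "Name2 | Email2@madeupsite.com | 10\n"
    "Name3 | Email3@madeupsite.com | 100\n"
)
rows = sort_rows_by_score(sample)

# Rebuild the lines in the original "a | b | c" layout
for name, email, score in rows:
    print(f"{name} | {email} | {score}")
```

To use it on a real file, pass the handle from `open('fname.txt')` instead of `sample`.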

Stephen B · Answer 2 · 2019-06-29T01:51:34.440

First, we read the lines of the file. Next, we use a list comprehension to split each line on the separator "|", take the last field, and convert it to an integer for sorting. We then sort the line indices in reverse order, using each line's score as the key, and build lines_sorted by picking the lines in that index order.

with open("file.txt", "r") as f:
    lines = f.read().splitlines()  # splitlines() drops the trailing newline from each line
    scores = [int(l.split("|")[-1]) for l in lines]
    sorted_idx = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    lines_sorted = [lines[i] for i in sorted_idx]

See this question for more suggestions on sorting and returning an index.

Example: with "file.txt" containing the following:

Name | Email@example.com | 1000
Name2 | Email2@madeupsite.com | 10
Name3 | Email3@madeupsite.com | 100

lines_sorted will contain:

["Name | Email@example.com | 1000",
 "Name3 | Email3@madeupsite.com | 100", 
 "Name2 | Email2@madeupsite.com | 10"]

score 0 · Answer 3 · answered Jun 29 '19 at 01:47

Once you have your lines in a list, you can use sort or sorted to sort it. The trick is passing a key that pulls out that integer. One option is to take a slice from the last | to the end of the line and make an integer from that string. rfind() is helpful for that:

lines = ['Name | Email@example.com | 1001',
         'Name2 | Email2@madeupsite.com | 2',
         'Name2 | Email2@madeupsite.com | 200'
]

s = sorted(lines, key = lambda s: int(s[s.rfind('|')+1:]))
list(s)

result:

['Name2 | Email2@madeupsite.com | 2',
 'Name2 | Email2@madeupsite.com | 200',
 'Name | Email@example.com | 1001']
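
The question asks for the largest score first, while the result above is in ascending order. A small sketch of the same `rfind()` key with `reverse=True` added; the helper name `score` is hypothetical, and the sample lines are the ones from this answer.

```python
lines = ['Name | Email@example.com | 1001',
         'Name2 | Email2@madeupsite.com | 2',
         'Name2 | Email2@madeupsite.com | 200']

def score(line):
    # Everything after the last '|' is the score; int() ignores the leading space
    return int(line[line.rfind('|') + 1:])

# reverse=True sorts from big to small
descending = sorted(lines, key=score, reverse=True)
print(descending)
```

Note that a line with no `|` at all makes `rfind()` return -1, so the slice covers the whole line and `int()` will most likely raise a `ValueError`.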

score 0 · Answer 4 · answered Jun 29 '19 at 02:05

Use a custom sort key function that calls rpartition on each string and converts the last piece to an integer

Input:

lines = ['Name | Email@example.com | 50',
         'Name2 | Email2@madeupsite.com | 400',
         'Name3 | Email2@madeupsite.com | 15']

Output:

sorted(lines, key=lambda x: int(x.rpartition('|')[-1]))

Out[1128]:
['Name3 | Email2@madeupsite.com | 15',
 'Name | Email@example.com | 50',
 'Name2 | Email2@madeupsite.com | 400']
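
To see why the key above works, here is a short sketch of what `str.rpartition` returns for one of the input lines. The variable names `head`, `sep`, and `tail` are only illustrative.

```python
line = 'Name2 | Email2@madeupsite.com | 400'

# rpartition splits on the LAST occurrence of the separator and
# returns a 3-tuple: (text before, the separator, text after)
head, sep, tail = line.rpartition('|')

# tail still has its leading space, which int() ignores
score = int(tail)
print(repr(head), repr(sep), repr(tail), score)
```

Indexing the tuple with `[-1]`, as the key function does, picks out the text after the last `|`.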

score 0 · Answer 5 · answered Jun 29 '19 at 02:21

Your input data is PSV (Pipe-Separated Value). You can read it with pandas.read_csv with sep='|':

dat = """
Name1 | Email@example.com | 456
Name2 | Email2@madeupsite.com | 123 
Name44 | jimmy@yahoo.co.ar | 79
"""

from io import StringIO  # pd.compat.StringIO is not available in newer pandas versions
import pandas as pd
df = pd.read_csv(StringIO(dat), sep='|', header=None)

df.sort_values(2, ascending=True)

         0                        1    2
2  Name44        jimmy@yahoo.co.ar    79
1   Name2    Email2@madeupsite.com   123
0   Name1        Email@example.com   456
