
I have dumped the output of a command into a text file, which consists of multiple rows with multiple columns. The first column contains the Equipment ID and the second column contains the time (in UTC).

I want to sort the rows in order of increasing time (Setup Time). How can I do that?

Here is my command output (dumped to a text file):

Equipment ID  | Setup Time - GPS (UTC)                      | End Time - GPS (UTC)
            3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
            2 | 2068506579500 (2019-08-30 20:42:41.500 UTC) | 2068507041300 (2019-08-30 20:50:23.300 UTC)
            2 | 2068513133500 (2019-08-30 22:31:55.500 UTC) | 2068513614300 (2019-08-30 22:39:56.300 UTC)
            3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)
            1 | 2068512827500 (2019-08-30 22:26:49.500 UTC) | 2068512852300 (2019-08-30 22:27:14.300 UTC)
James Z
  • read text from file to memory, split text into lines (`split("\n")`), split every row into cells (`split("|")`), sort all data using the second column (`sort()`, `sorted()`), write it back to the file. – furas Sep 02 '19 at 12:09
  • you could use module `pandas` and read it into a DataFrame using `read_csv()` with `|` as separator. Then you can sort it in the `DataFrame` and write it back with `to_csv()` – furas Sep 02 '19 at 12:12
  • In your scenario, what about piping the output to a .csv file instead? Use the '|' char as its delimiter and read the file using Python's csv module. – Charlie Sep 02 '19 at 12:13
  • on Linux in Bash you could do `cat example.txt | sort -k 2` but it moves header to the end. You would have to use `head`, `tail` to keep header at the top: [Is there a way to ignore header lines in a UNIX sort?](https://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort) – furas Sep 02 '19 at 12:17
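The csv-module suggestion above isn't demonstrated in any of the answers, so here is a minimal sketch of it (using `io.StringIO` in place of a real file, and sorting numerically on the GPS tick value in the second column):

```python
import csv
import io

# Same data as in the question; a real file object would work identically.
data = '''Equipment ID  | Setup Time - GPS (UTC)                      | End Time - GPS (UTC)
            3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
            2 | 2068506579500 (2019-08-30 20:42:41.500 UTC) | 2068507041300 (2019-08-30 20:50:23.300 UTC)
            2 | 2068513133500 (2019-08-30 22:31:55.500 UTC) | 2068513614300 (2019-08-30 22:39:56.300 UTC)
            3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)
            1 | 2068512827500 (2019-08-30 22:26:49.500 UTC) | 2068512852300 (2019-08-30 22:27:14.300 UTC)'''

rows = list(csv.reader(io.StringIO(data), delimiter='|'))
header, body = rows[0], rows[1:]

# Column 1 is " <gps-ticks> (<date> UTC) "; sort by the numeric tick value.
body.sort(key=lambda row: int(row[1].split()[0]))

out = io.StringIO()
writer = csv.writer(out, delimiter='|')
writer.writerow(header)
writer.writerows(body)
print(out.getvalue())
```

Because the fields keep their original padding, writing back with the same `|` delimiter reproduces the original layout.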

3 Answers


The following loads the data from the file output.txt and prints the individual lines in order of Setup Time:

import re

def collect_data(file_path):
    # RegExp with capture groups (0: the whole line, 1: Equipment ID, 2: Setup Time)
    pattern = re.compile(
        r'((\d+) \| \d+ \((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC\) '
        r'\| \d+ \(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} UTC\))'
    )
    with open(file_path, 'r') as f:  # 'with' closes the file automatically
        data = f.read()
    collection = []
    for row in pattern.findall(data):
        collection.append(
            {
                'EquipmentId': row[1],
                'SetupTime': row[2],
                'Line': row[0]
            }
        )
    return collection

if __name__ == '__main__':
    collection = collect_data('output.txt')
    newlist = sorted(collection, key=lambda k: k['SetupTime'])
    for item in newlist:
        print(item['Line'])

Output:

3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)
2 | 2068506579500 (2019-08-30 20:42:41.500 UTC) | 2068507041300 (2019-08-30 20:50:23.300 UTC)
3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
1 | 2068512827500 (2019-08-30 22:26:49.500 UTC) | 2068512852300 (2019-08-30 22:27:14.300 UTC)
2 | 2068513133500 (2019-08-30 22:31:55.500 UTC) | 2068513614300 (2019-08-30 22:39:56.300 UTC)
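The captured Setup Time strings happen to sort chronologically as plain text because of their fixed `YYYY-MM-DD HH:MM:SS` layout. If you ever need a genuine chronological key (for example, with mixed formats), you could parse them with `datetime` instead — a small sketch, not part of the answer above:

```python
from datetime import datetime

# Setup Time strings in the format the regex above captures
times = ['2019-08-30 22:22:26.500', '2019-08-30 20:42:41.500']
times.sort(key=lambda t: datetime.strptime(t, '%Y-%m-%d %H:%M:%S.%f'))
print(times)
```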
Joshua Loader

You can use a DataFrame from the pandas module and read the file as CSV with separator `|`:

import pandas as pd

df = pd.read_csv("data.txt", sep='|', dtype=str)
df = df.sort_values(' Setup Time - GPS (UTC)                      ')
df.to_csv('output.csv', sep='|', index=False)

Because read_csv() tries to convert values to numbers, I use dtype=str to keep everything as strings.

I keep all the spaces in the values and column names so I can later write the data back formatted as before. But that means I also have to use the spaces in the name ' Setup Time - GPS (UTC)                      ' to sort by it.

Because DataFrame adds an index to every row, I use index=False to skip the indexes in to_csv().
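If the exact output spacing does not matter, a variation (my own sketch, not part of the original answer) is to strip the whitespace from the column names right after reading, so the sort key is easier to type:

```python
import pandas as pd
import io

# Two sample rows from the question, for a self-contained demo
data = '''Equipment ID  | Setup Time - GPS (UTC)                      | End Time - GPS (UTC)
            3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
            3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)'''

df = pd.read_csv(io.StringIO(data), sep='|', dtype=str)
df.columns = df.columns.str.strip()          # drop the padding around header names
df = df.sort_values('Setup Time - GPS (UTC)')
print(df)
```

The values themselves still keep their padding, so they continue to sort correctly as strings.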


EDIT: Example using io.StringIO to read the data from memory instead of a file, so everyone can easily test it without saving the data to a file first.

import pandas as pd
import io

data ='''Equipment ID  | Setup Time - GPS (UTC)                      | End Time - GPS (UTC)
            3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
            2 | 2068506579500 (2019-08-30 20:42:41.500 UTC) | 2068507041300 (2019-08-30 20:50:23.300 UTC)
            2 | 2068513133500 (2019-08-30 22:31:55.500 UTC) | 2068513614300 (2019-08-30 22:39:56.300 UTC)
            3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)
            1 | 2068512827500 (2019-08-30 22:26:49.500 UTC) | 2068512852300 (2019-08-30 22:27:14.300 UTC)'''

#file_ = "data.txt"
file_ = io.StringIO(data)

df = pd.read_csv(file_, sep='|', dtype=str)
df = df.sort_values(' Setup Time - GPS (UTC)                      ')
#df.to_csv('output.csv', sep='|', index=False)

pd.options.display.width = 150
pd.options.display.max_columns = 5
print(df)

Resulting DataFrame:

   Equipment ID     Setup Time - GPS (UTC)                                                End Time - GPS (UTC)
3              3    2068506038500 (2019-08-30 20:33:40.500 UTC)    2068506399300 (2019-08-30 20:39:41.300 UTC)
1              2    2068506579500 (2019-08-30 20:42:41.500 UTC)    2068507041300 (2019-08-30 20:50:23.300 UTC)
0              3    2068512564500 (2019-08-30 22:22:26.500 UTC)    2068513054300 (2019-08-30 22:30:36.300 UTC)
4              1    2068512827500 (2019-08-30 22:26:49.500 UTC)    2068512852300 (2019-08-30 22:27:14.300 UTC)
2              2    2068513133500 (2019-08-30 22:31:55.500 UTC)    2068513614300 (2019-08-30 22:39:56.300 UTC)
furas

If your file is not too long, I would simply use split(), sorted(), and nothing else.

txt = '''
Equipment ID  | Setup Time - GPS (UTC)                      | End Time - GPS (UTC)
            3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
            2 | 2068506579500 (2019-08-30 20:42:41.500 UTC) | 2068507041300 (2019-08-30 20:50:23.300 UTC)
            2 | 2068513133500 (2019-08-30 22:31:55.500 UTC) | 2068513614300 (2019-08-30 22:39:56.300 UTC)
            3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)
            1 | 2068512827500 (2019-08-30 22:26:49.500 UTC) | 2068512852300 (2019-08-30 22:27:14.300 UTC)
'''

lines = []
for line in txt.split('\n'):
    if len(line):
        lines.append(line.split('('))

#Separate the header from the rest,
#and sort the entries based on the Setup Time
header, lines = lines[0], sorted(lines[1:], key=lambda x: x[1])

#write back the sorted text
sorted_txt = '('.join(header)
for line in lines:
    sorted_txt += '\n' + '('.join(line)

The output of print(sorted_txt):

Equipment ID  | Setup Time - GPS (UTC)                      | End Time - GPS (UTC)
            3 | 2068506038500 (2019-08-30 20:33:40.500 UTC) | 2068506399300 (2019-08-30 20:39:41.300 UTC)
            2 | 2068506579500 (2019-08-30 20:42:41.500 UTC) | 2068507041300 (2019-08-30 20:50:23.300 UTC)
            3 | 2068512564500 (2019-08-30 22:22:26.500 UTC) | 2068513054300 (2019-08-30 22:30:36.300 UTC)
            1 | 2068512827500 (2019-08-30 22:26:49.500 UTC) | 2068512852300 (2019-08-30 22:27:14.300 UTC)
            2 | 2068513133500 (2019-08-30 22:31:55.500 UTC) | 2068513614300 (2019-08-30 22:39:56.300 UTC)

You can write the output to a file, of course.

If your file is very long or if you need to repeat this computation over and over in production, then go for Pandas as suggested by @furas.

alec_djinn