0

I need to read lines from multiple files. In each line, the first value is the runtime, the third is the job ID, and the fourth is the status. I have created lists to store each of these values, but I don't understand how to connect the lists and sort them so that I get the lines with the top 20 fastest runtimes. Does anybody have a suggestion for how I can do that? Thank you!

for filePath in glob.glob(os.path.join(path1, '*.gz')):
    with gzip.open(filePath, 'rt', newline="") as file:
        reader = csv.reader(file)
        for line in file:
            for row in reader:
                runTime = row[0]
                ID = row[2]
                eventType = row[3]
                jobList.append(ID)
                timeList.append(runTime)
                eventList.append(eventType)

    jobList = sorted(set(jobList))
    counter = len(jobList)
    print ("There are %s unique jobs." % (counter))
    i = 1
    while i < 21:
        print("#%s\t%s\t%s\t%s" % (i, timeList[i], jobList[i], eventList[i]))
        i = i + 1
mkrieger1
Liz
    Just a style note - it's more pythonic to use names like `run_time` and `event_type` instead of `runTime` and `eventType`. – dmlicht Mar 03 '17 at 16:09

3 Answers

1

Instead of using three different lists, you can use a single list and append tuples to it, like so:

combinedList.append((runTime, ID, eventType))

You can then sort the combinedList of tuples as shown here: How to sort (list/tuple) of lists/tuples?
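As a minimal sketch of that sorting step (the sample rows and the `records` and `fastest` names are made up for illustration, and it assumes the runtime column holds plain numbers), the tuples can be converted and sorted on their first element:

```python
from operator import itemgetter

# Hypothetical sample rows in the shape appended above: (runTime, ID, eventType).
combinedList = [
    ("12.5", "job-b", "FINISH"),
    ("3.0", "job-a", "FINISH"),
    ("101.2", "job-c", "FAIL"),
]

# Values read from a CSV file are strings; convert the runtime to float first,
# otherwise "101.2" would sort before "12.5".
records = [(float(runTime), ID, eventType) for runTime, ID, eventType in combinedList]

# Sort ascending by the first tuple element (the runtime) and keep at most 20.
fastest = sorted(records, key=itemgetter(0))[:20]

for rank, (runTime, ID, eventType) in enumerate(fastest, start=1):
    print("#%s\t%s\t%s\t%s" % (rank, runTime, ID, eventType))
```

Passing `reverse=True` to `sorted` would give the longest runtimes first instead.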

You can make further improvements, such as using namedtuples. Look them up on SO or Google.

Note: there may be more efficient ways to do this. For example, you can use Python's heapq library and keep a heap of size 20 to find the top 20 run times without sorting the whole list. You can learn more about it on Python's website or Stack Overflow, but you may need some more algorithmic background.
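One way the heapq idea might look is sketched below. The records are generated sample data, not the asker's files, and the runtimes are assumed to be numeric already. `heapq.nsmallest` and `heapq.nlargest` manage the bounded heap internally, so no manual heap bookkeeping is needed:

```python
import heapq

# Hypothetical (runTime, ID, eventType) records with distinct numeric runtimes.
records = [(float(i % 7) + i / 100.0, "job-%d" % i, "FINISH") for i in range(50)]

# The 20 records with the smallest runtimes, returned in ascending order.
fastest = heapq.nsmallest(20, records, key=lambda rec: rec[0])

# The 20 records with the largest runtimes, returned in descending order.
slowest = heapq.nlargest(20, records, key=lambda rec: rec[0])
```

For a few thousand rows a plain `sorted(...)[:20]` is likely just as practical; the heap version mainly pays off when the input is much larger than the number of rows kept.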

labheshr
  • Okay, I understand that, but if I were to create a dictionary for this, how would I sort them so that I could print only the top 20 longest run time jobs? I guess what I'm asking is how to sort the dictionary by that one value – Liz Mar 03 '17 at 17:44
  • Basically, you store the ID as the key and (runTime, eventType) as the value, then sort by longest runtime as shown here: http://stackoverflow.com/questions/7349646/sorting-a-dictionary-of-tuples-in-python – labheshr Mar 03 '17 at 19:19
0

Instead of maintaining three lists jobList, timeList, eventList, you can store (runTime, eventType) tuples in a dictionary, using ID as key, by replacing

jobList = []
timeList = []
eventList = []
…
jobList.append(ID)
timeList.append(runTime)
eventList.append(eventType)

by

jobs = {}  # an empty dictionary
…
jobs[ID] = (runTime, eventType)

To loop over that dictionary sorted by increasing runTime values:

for ID, (runTime, eventType) in sorted(jobs.items(), key=lambda item: item[1][0]):
    # do something with it
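
For the 20 longest runtimes rather than increasing order, a possible variation is sketched here. The dictionary contents are invented sample data, and the runtimes are assumed to have been converted to numbers when they were stored:

```python
# Hypothetical sample data in the shape used above: ID -> (runTime, eventType).
jobs = {
    "job-a": (3.0, "FINISH"),
    "job-b": (12.5, "FINISH"),
    "job-c": (101.2, "FAIL"),
}

# reverse=True puts the longest runtimes first; the slice keeps at most 20 entries.
longest = sorted(jobs.items(), key=lambda item: item[1][0], reverse=True)[:20]

for rank, (ID, (runTime, eventType)) in enumerate(longest, start=1):
    print("#%s\t%s\t%s\t%s" % (rank, runTime, ID, eventType))
```

Keep in mind that a dictionary keyed by ID holds only the last row seen for each ID, so earlier rows with the same ID are overwritten.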
mkrieger1
0

Python's built-in sorted would work better for you if you kept runTime, ID, and eventType together in one data structure. I would recommend a namedtuple, as it lets you be clear about what you're doing. You can do the following:

from collections import namedtuple
Job = namedtuple("Job", "runtime id event_type")

Then your code could change to:

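import csv
import glob
import gzip
import os

jobs = []  # collect Job records across all files (path1 is defined as in the question)
for filePath in glob.glob(os.path.join(path1, '*.gz')):
    with gzip.open(filePath, 'rt', newline="") as file:
        reader = csv.reader(file)
        # iterate over the reader only; also looping over `file` would skip lines
        for row in reader:
            # float() makes the sort numeric (assumes the runtime is a plain number)
            runTime = float(row[0])
            ID = row[2]
            eventType = row[3]
            job = Job(runTime, ID, eventType)
            jobs.append(job)

# sort once, after all files have been read
jobs = sorted(jobs)
n_jobs = len(jobs)
print("There are %s job records." % (n_jobs))
for i, job in enumerate(jobs[:20], start=1):
    print("#%s\t%s\t%s\t%s" % (i, job.runtime, job.id, job.event_type))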

It's worth noting that this sorting works because tuples are compared element by element: by default they are ordered by their first element, and if there is a tie, the comparison moves on to the next elements of the tuple.
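
A small sketch of that comparison behaviour, using made-up jobs where two runtimes tie, plus a caveat that applies when the runtimes are still strings as read from the CSV file:

```python
from collections import namedtuple

Job = namedtuple("Job", "runtime id event_type")

# Hypothetical jobs; two of them tie on runtime.
jobs = [
    Job(5.0, "job-b", "FINISH"),
    Job(5.0, "job-a", "FINISH"),
    Job(2.0, "job-c", "FAIL"),
]

ordered = sorted(jobs)
# job-c comes first (smallest runtime); the tie at 5.0 is broken by id,
# so job-a precedes job-b.
ordered_ids = [job.id for job in ordered]

# Caveat: strings compare character by character, not by numeric value.
as_strings = sorted(["9.5", "10.2", "100.0"])
as_numbers = sorted(float(s) for s in ["9.5", "10.2", "100.0"])
```

Here `as_strings` ends with "9.5" while `as_numbers` starts with 9.5, which is why converting the runtime before sorting matters.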

dmlicht