0

I have a file with the following input:

    ID         time count
    100000458  18   1
    100000458  18   1
    100000458  18   1
    100000458  18   1
    100000458  18   1
    100000458  17   1
    100000458  17   1
    100000458  17   1
    100000458  17   1
    100005361  00   1
    100005361  10   1
    100005361  10   1
    100005361  10   1

What I want to achieve is an output that prints the most frequently occurring time for each ID, along with its frequency, e.g.

[100000458 18 5]
[100005361 10 3]

and so on. If there is a tie, print both times along with the frequency.

I believe using a dictionary in Python will be the best way to go, but I have been unable to implement a nested dictionary. The other option is to use a list, but I am not sure how well that will scale for large datasets. Any help will be much appreciated.

ajaanbaahu
  • are you summing up the counts? – Sharif Mamun Mar 03 '14 at 00:57
  • @S.M.AlMamun yes the counts have to be summed – ajaanbaahu Mar 03 '14 at 01:03
  • Is the file sorted by ID? – mackworth Mar 03 '14 at 01:40
  • @mackworth yes it is; basically it is the sorted and shuffled output from the mapper, and this result will be fed to the reducer. Hence, after each unique id has been processed fully (taking all the different times associated with it), I need to print the max output and then re-initialize the key and value holder to take in the new id. Hope this explains a bit more. – ajaanbaahu Mar 03 '14 at 01:43

4 Answers

3

If the input is already grouped by id and time, as in the example in your question, then you can use itertools.groupby() to compute the statistics on the fly:

#!/usr/bin/env python
import sys
from itertools import groupby

file = sys.stdin
next(file) # skip header line

lines = (line.split() for line in file if line.strip())
for id, same_id in groupby(lines, key=lambda x: x[0]): # by id
    max_time, max_count = None, 0
    for time, same_time in groupby(same_id, key=lambda x: x[1]): # by time
        count = sum(int(c) for _, _, c in same_time)
        if count > max_count:
            max_time, max_count = time, count
    print("{} {} {}".format(id, max_time, max_count))

Output

100000458 18 5
100005361 10 3
jfs
  • @ J.F. Sebastian Thanks for the solution. How would I resolve ties e.g. if the case is [100000458 18 5] and [100000458 17 5]? I want to print both the entries. – ajaanbaahu Mar 03 '14 at 03:19
  • @user3197086: make `max_time` a list and add `if count == max_count: max_time.append(time)`. Adjust `elif count > max_count: max_time, max_count = [time], count` – jfs Mar 03 '14 at 03:40
  • @ J.F. Sebastian I added that and it works well! Just curious is there any way to make the calculations via a single for loop? – ajaanbaahu Mar 04 '14 at 05:26
  • @user3197086: yes. The algorithm as a whole is `O(n)` (even more: it is single-pass). Though [inlining `groupby()` won't be pretty](http://docs.python.org/2/library/itertools.html#itertools.groupby). – jfs Mar 04 '14 at 09:18
  • @ J.F. Sebastian If it is not too much of a hassle, can you explain the code a bit or comment on the lines, so that in the future I can have a better understanding of how to use the groupby method? I am fairly new to Python, and the lambda, groupby, and some other lines in the code are new to me. – ajaanbaahu Mar 08 '14 at 21:29
  • 1
    @user3197086: Read [How do I use Python's itertools.groupby()?](http://stackoverflow.com/q/773/4279). Ask if you have any more question. – jfs Mar 08 '14 at 21:44
  • @ J.F.Sebastian Great! I followed the explanation. For a large dataset, though, if there are missing values or more than 3 values per line, how and where should I check that the input to groupby and sum has exactly 3 fields? I tried putting a 'continue' statement at various steps, but I run into a 'need more than 2 values to unpack' error when I test on a dataset with missing values – ajaanbaahu Mar 09 '14 at 02:47
  • @user3197086: `groupby()` doesn't care how many values in the line there are as long as the `key` function returns something e.g., `by_time = lambda x: x[1] if len(x) > 1 else None`. The same fix if you want to avoid the unpack error: `count = sum(int(x[2]) if len(x) > 2 else 0 for x in same_time)` – jfs Mar 09 '14 at 03:05
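Putting the comment suggestions together, the tie-handling variant can be sketched as a function over an iterable of lines (rather than stdin), so it is easy to try; `max_times` is a hypothetical name:

```python
from itertools import groupby

def max_times(lines):
    """Yield (id, times, count): the highest summed count for each id
    and every time that reaches it (so ties produce several times)."""
    rows = (line.split() for line in lines if line.strip())
    for id, same_id in groupby(rows, key=lambda x: x[0]):             # group by id
        best_times, best_count = [], 0
        for time, same_time in groupby(same_id, key=lambda x: x[1]):  # then by time
            count = sum(int(c) for _, _, c in same_time)
            if count == best_count:
                best_times.append(time)        # tie: keep this time as well
            elif count > best_count:
                best_times, best_count = [time], count
        yield id, best_times, best_count

data = """\
100000458 18 1
100000458 18 1
100000458 17 1
100000458 17 1
100005361 10 1
""".splitlines()

for id, times, count in max_times(data):
    print(id, times, count)
```

The sample deliberately contains a tie (18 and 17 both sum to 2 for the first id), so both times are reported with the shared count.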
0

This could be a really simple solution. Let's say the input string is in the variable inpStr.

result = dict()
for line in inpStr.splitlines():
    id, time, count = line.split()
    # If it is the first time that I see id
    if id not in result:
        result[id] = dict()
    # this is the key line: I create a dictionary of dictionaries
    result[id][time] = result[id].get(time, 0) + int(count)

# Once I have finished looping through the list I need to find the
# maximum occurring time of a particular id
for id in result:
    max_count = max(result[id].values())  # compute the maximum once per id
    for time in result[id]:
        if result[id][time] == max_count:
            print id, time, result[id][time]
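The same nested-dictionary accumulation can be written with `collections.defaultdict`, which removes the membership test; a sketch over a small inline sample (the sample string is an assumption standing in for `inpStr`):

```python
from collections import defaultdict

inpStr = """\
100000458 18 1
100000458 17 1
100000458 18 1
100005361 10 1
"""

result = defaultdict(lambda: defaultdict(int))  # id -> time -> summed count
for line in inpStr.splitlines():
    id, time, count = line.split()
    result[id][time] += int(count)              # no membership test needed

for id, times in result.items():
    best = max(times.values())                  # compute the maximum once
    for time, count in times.items():
        if count == best:                       # prints every tied time
            print(id, time, count)
```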
Javier
0

Another solution, based on collections.Counter():

#!/usr/bin/env python
import sys
from collections import Counter, defaultdict

file = sys.stdin
next(file) # skip header line

# collect counts
counts = defaultdict(Counter) # id -> Counter mapping time -> summed count
for id, time, count in (line.split() for line in file if line.strip()):
    counts[id] += Counter({time: int(count)})

# print most common timestamps
for id, c in counts.items():
    time, count = c.most_common(1)[0]
    print("{id} {time} {count}".format(**vars()))

Output

100005361 10 3
100000458 18 5
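If ties matter here too, `most_common()` with no argument returns every entry in decreasing count order, so everything matching the top count can be kept; a minimal sketch over a hand-built counter:

```python
from collections import Counter

c = Counter({'18': 5, '17': 5, '16': 2})
ranked = c.most_common()       # all (time, count) pairs, highest count first
best = ranked[0][1]            # the top count
tied = [time for time, count in ranked if count == best]
print(tied, best)
```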
jfs
0

A version using pandas:

import pandas

d = pandas.read_table('test.txt', sep=r'\s+')
perid = d.groupby('ID').agg({'time': max, 'count': sum})
print perid

If you want the output to look exactly like you said, you need a little more work:

for (ID, i) in perid.iterrows():
    print [ID, i['time'], i['count']]
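Note that `agg({'time': max})` takes the largest time value, not the most frequent one. A sketch that instead sums the counts per (ID, time) pair and then keeps the top count per ID, so ties survive (inline data via `StringIO` is an assumption, standing in for `test.txt`):

```python
import pandas as pd
from io import StringIO

data = StringIO("""ID time count
100000458 18 1
100000458 18 1
100000458 17 1
100005361 10 1
""")

d = pd.read_csv(data, sep=r'\s+')
# sum counts per (ID, time) pair
sums = d.groupby(['ID', 'time'])['count'].sum().reset_index()
# keep every row whose count equals its ID's maximum (ties survive)
top = sums[sums['count'] == sums.groupby('ID')['count'].transform('max')]
print(top)
```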
chthonicdaemon