0

For example, I have the following data as a list:

l = [['A', 'aa', '1', '300'],
     ['A', 'ab', '2', '30'],
     ['A', 'ac', '3', '60'],
     ['B', 'ba', '5', '50'],
     ['B', 'bb', '4', '10'],
     ['C', 'ca', '6', '50']]

Now for 'A', 'B', and 'C', I want to get the last occurrence of each, i.e.:

[['A', 'ac', '3', '60'],
 ['B', 'bb', '4', '10'],
 ['C', 'ca', '6', '50']]

or further, the third column in these occurrences, i.e.:

['3', '4', '6']

Currently, the way I deal with this is:

import pandas as pd
df = pd.DataFrame(l, columns=['u', 'w', 'y', 'z'])
df.set_index('u', inplace=True)
ll = []
for letter in df.index.unique():
    # .ix was removed from pandas; .loc with a list always returns a Series
    ll.append(df.loc[[letter], 'y'].iloc[-1])

Timing it with %timeit shows:

>> The slowest run took 27.86 times longer than the fastest. 
>> This could mean that an intermediate result is being cached.
>> 1000000 loops, best of 3: 887 ns per loop

Is there a way to do this in less time than my code takes? Thanks!

Tonechas
Map
  • What's the inefficient way you currently have? – jonrsharpe Jun 22 '16 at 15:30
  • Why is the last occurrence for `A` the second, not the third array? – Nils Gudat Jun 22 '16 at 15:32
  • Use reversed on your list and then - Possible duplicate of [What is the best way to get the first item from an iterable matching a condition?](http://stackoverflow.com/questions/2361426/what-is-the-best-way-to-get-the-first-item-from-an-iterable-matching-a-condition) – James Elderfield Jun 22 '16 at 15:42
  • @jonrsharpe I first converted this list to a pandas data frame, setting the first column as the index, and then iterated over the unique index values to extract the last occurrence for each. I don't think that is efficient, so I am looking for a better way to do this. – Map Jun 22 '16 at 16:04
  • *"Better"* is hard to judge without: 1. what we're trying to get better than; and 2. how exactly you measure better. – jonrsharpe Jun 22 '16 at 16:04
  • @NilsGudat Yes, it should be the third array. Just fixed it. Thanks! – Map Jun 22 '16 at 16:05
  • @jonrsharpe By "better", I mean I am trying to find a way to solve the problem in less time than the way I am currently using. – Map Jun 22 '16 at 16:07
  • Please [edit] to include **both** of those pieces of information (don't just ignore #1, *where is your code?*) What performance analysis have you done? – jonrsharpe Jun 22 '16 at 16:08
  • @jonrsharpe Thanks so much for this advice. I just added them. It's my first time asking a question here and, indeed I learned a great deal from you and this community;) – Map Jun 22 '16 at 17:12
  • @JamesElderfield I think I am probably asking a different question. Yes, the method described in the other post works, but I wanted to find a faster way here. – Map Jun 23 '16 at 03:28
  • Is there some particular problem with <1us timings? What are you aiming for? Have you profiled the code, are there any bottlenecks? – jonrsharpe Jun 23 '16 at 08:59

5 Answers

2
l =  [['A', 'aa', '1', '300'],
  ['A', 'ab', '2', '30'],
  ['A', 'ac', '3', '60'],
  ['B', 'ba', '5', '50'],
  ['B', 'bb', '4', '10'],
  ['C', 'ca', '6', '50']]

import itertools
for key, group in itertools.groupby(l, lambda x: x[0]):
    print(key, list(group)[-1])

No comment on "efficiency", because you haven't explained your conditions at all. This assumes the list is already sorted by the first element of each sublist.
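As a rough Python 3 sketch of the same groupby idea, here is a version that collects only the third column. The helper name `last_values_by_group` and its `presorted` flag are made up for illustration; the optional `sorted` call is an assumption for input whose keys are not already grouped together.

```python
from itertools import groupby
from operator import itemgetter


def last_values_by_group(rows, key_col=0, value_col=2, presorted=True):
    """Return the value_col entry of the last row in each run of equal keys."""
    get_key = itemgetter(key_col)
    if not presorted:
        # sorted() is stable, so rows sharing a key keep their relative order.
        rows = sorted(rows, key=get_key)
    return [list(group)[-1][value_col] for _, group in groupby(rows, key=get_key)]


rows = [['A', 'aa', '1', '300'],
        ['A', 'ab', '2', '30'],
        ['A', 'ac', '3', '60'],
        ['B', 'ba', '5', '50'],
        ['B', 'bb', '4', '10'],
        ['C', 'ca', '6', '50']]

print(last_values_by_group(rows))  # ['3', '4', '6']
```

Note that `groupby` only merges consecutive rows with equal keys, which is why unsorted input needs the extra sorting step.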

If the list is sorted, one run through should be enough:

def tidy(l):
    tmp = []
    prev_row = l[0]

    for row in l:
        if row[0] != prev_row[0]:
            tmp.append(prev_row)
        prev_row = row
    tmp.append(prev_row)
    return tmp

and this is ~5x faster than itertools.groupby in a timeit test. Demonstration: https://repl.it/C5Af/0
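For anyone who wants to check the comparison locally, a small sketch with the standard `timeit` module might look like this. The wrapper `with_groupby` and the loop count are assumptions for illustration, and the measured ratio will vary with machine, Python version, and input size.

```python
import timeit
from itertools import groupby

rows = [['A', 'aa', '1', '300'],
        ['A', 'ab', '2', '30'],
        ['A', 'ac', '3', '60'],
        ['B', 'ba', '5', '50'],
        ['B', 'bb', '4', '10'],
        ['C', 'ca', '6', '50']]


def tidy(rows):
    """Single pass: keep the previous row whenever the key changes.

    Like the version above, this assumes a non-empty, key-sorted list.
    """
    result = []
    prev_row = rows[0]
    for row in rows:
        if row[0] != prev_row[0]:
            result.append(prev_row)
        prev_row = row
    result.append(prev_row)
    return result


def with_groupby(rows):
    """Same result via itertools.groupby, for comparison."""
    return [list(group)[-1] for _, group in groupby(rows, key=lambda r: r[0])]


# Both approaches should agree before their timings are compared.
assert tidy(rows) == with_groupby(rows)

loops = 100_000
for func in (tidy, with_groupby):
    seconds = timeit.timeit(lambda: func(rows), number=loops)
    print(f"{func.__name__}: {seconds / loops * 1e6:.2f} µs per call")
```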

[Edit: OP has updated their question to say they're already using Pandas to groupby, which is possibly way faster already]

TessellatingHeckler
  • Sorry, edited this by mistake and now can't seem to remove it! Feel free to remove if you can, have added this to my answer now! – Nils Gudat Jun 22 '16 at 16:35
  • @NilsGudat it's ok, I rejected the edit. I expect the `itertools.groupby` approach to be slower because it's building GroupInfo objects, and new lists. It's quite possible to do this with one run through the list, assuming the list is sorted, I think it's quite Pythonic and more clearly expresses what it's doing. – TessellatingHeckler Jun 22 '16 at 16:57
1

Even though I'm not sure I understood your question, here's what you could do:

li = [l[i][0] for i in range(len(l))]
[l[j][2] for j in [''.join(li).rfind(i) for i in set(li)]]

Note that the output is ['3', '4', '6'], as the last occurrence of A seems to be the third, not the second array.
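The `rfind` lookup above depends on every key being a single character, and iterating over a `set` does not guarantee the order of the output. A possible sketch that avoids both assumptions records the index of each key's last row instead (the names `last_index` and `third_column` are illustrative only):

```python
rows = [['A', 'aa', '1', '300'],
        ['A', 'ab', '2', '30'],
        ['A', 'ac', '3', '60'],
        ['B', 'ba', '5', '50'],
        ['B', 'bb', '4', '10'],
        ['C', 'ca', '6', '50']]

# Map each key to the index of its last row; later rows overwrite earlier ones.
last_index = {row[0]: i for i, row in enumerate(rows)}

# Sort the indices so the result follows the order of the rows themselves.
third_column = [rows[i][2] for i in sorted(last_index.values())]
print(third_column)  # ['3', '4', '6']
```

This works for keys of any length and does not need the list to be sorted.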

Edit as you seem very concerned about performance (although you don't say what you've tried and what qualifies as "good"):

%timeit li = [l[i][0] for i in range(len(l))]
%timeit [l[j][2] for j in [''.join(li).rfind(i) for i in set(li)]]
>> 1000000 loops, best of 3: 1.19 µs per loop
>> 100000 loops, best of 3: 2.57 µs per loop

%timeit [list(group)[-1][2] for key, group in itertools.groupby(l, lambda x: x[0])]
>> 100000 loops, best of 3: 5.11 µs per loop

So it seems the list comprehension is marginally faster than itertools (although I'm not an expert on benchmarks and there might be a better way to run the itertools one).

Nils Gudat
1

{l[0]: l[2] for l in vals} will get you a mapping of 'A', 'B', and 'C' to their last values
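To get a list rather than a dictionary, one option is to take the dictionary's values. This sketch assumes Python 3.7 or later, where dictionaries keep insertion order, so the values come out in the order each key was first seen; `last_by_key` is just an illustrative name.

```python
vals = [['A', 'aa', '1', '300'],
        ['A', 'ab', '2', '30'],
        ['A', 'ac', '3', '60'],
        ['B', 'ba', '5', '50'],
        ['B', 'bb', '4', '10'],
        ['C', 'ca', '6', '50']]

# Later rows overwrite earlier ones, so each key ends up with its last value.
last_by_key = {row[0]: row[2] for row in vals}
print(last_by_key)                 # {'A': '3', 'B': '4', 'C': '6'}
print(list(last_by_key.values()))  # ['3', '4', '6']
```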

acushner
  • Hi, would you mind explaining your code a bit? I don't quite understand how to use it to get the result. By the way, what is 'vals'? Thanks! – Map Jun 23 '16 at 03:17
  • `vals` is your input (your list of lists). as for the code itself, read up on dict comprehensions and you'll see how it works. – acushner Jun 23 '16 at 14:01
  • Would it be possible to have it return a list like `['3', '4', '6']` instead of dictionary? – Map Jun 24 '16 at 02:28
0

A not-very-pythonic approach: (note that Nils' solution is the most pythonic - using list comprehension)

def get_last_row(xs,q):
    for i in range(len(xs)-1,-1,-1):
        if xs[i][0] == q:
            return xs[i][2]

def get_third_cols(xs):
    third_cols = []
    for q in ["A","B","C"]:
        third_cols.append(get_last_row(xs,q))
    return third_cols

# xs is the list of lists from the question
print(get_third_cols(xs))

This prints ['3', '4', '6'] if that's what you meant by last occurrence.
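The same backwards scan can also be written with `reversed` and `next`, as suggested in the comments on the question. This is only a sketch; the function name `last_value` and its `default` parameter are assumptions added for illustration.

```python
def last_value(rows, key, key_col=0, value_col=2, default=None):
    """Scan from the end and return value_col of the first row matching key."""
    matches = (row[value_col] for row in reversed(rows) if row[key_col] == key)
    # next() stops at the first match; default is returned if there is none.
    return next(matches, default)


xs = [['A', 'aa', '1', '300'],
      ['A', 'ab', '2', '30'],
      ['A', 'ac', '3', '60'],
      ['B', 'ba', '5', '50'],
      ['B', 'bb', '4', '10'],
      ['C', 'ca', '6', '50']]

print([last_value(xs, q) for q in ["A", "B", "C"]])  # ['3', '4', '6']
print(last_value(xs, "Z"))                           # None
```

Like the loop version, this rescans the list once per key, so it is best suited to a small number of keys.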

Thom
0

This will generalize to any key / value location. Note that the output will be in the order in which each key was first observed. It wouldn't be hard to adjust this so that the output follows the order in which the output values were observed.

import operator

l = [['A', 'aa', '1', '300'],
  ['A', 'ab', '2', '30'],
  ['A', 'ac', '3', '60'],
  ['B', 'ba', '5', '50'],
  ['B', 'bb', '4', '10'],
  ['C', 'ca', '6', '50']]

def getLast(data, key, value):
    f = operator.itemgetter(key,value)
    store = dict()
    keys = []
    for row in data:
        key, value = f(row)
        if key not in store:
            keys.append(key)
        store[key] = value
    return [store[k] for k in keys]

Now timing it,

%timeit getLast(l,0,2)

Gives:

The slowest run took 9.44 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 2.85 µs per loop

And the function outputs:

['3', '4', '6']
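One way the ordering adjustment mentioned at the top of this answer might look is sketched below. It assumes Python 3.7 or later (insertion-ordered dictionaries); the function name `get_last_in_value_order` and the sample data with interleaved keys are made up for illustration.

```python
from operator import itemgetter


def get_last_in_value_order(data, key_col, value_col):
    """Last value per key, ordered by where that last value appeared."""
    get_pair = itemgetter(key_col, value_col)
    store = {}
    for row in data:
        key, value = get_pair(row)
        # Remove any earlier entry so re-inserting moves the key to the end.
        store.pop(key, None)
        store[key] = value
    return list(store.values())


# Hypothetical input where the keys are interleaved rather than sorted.
rows = [['A', 'aa', '1', '300'],
        ['B', 'ba', '5', '50'],
        ['A', 'ac', '3', '60'],
        ['C', 'ca', '6', '50']]

print(get_last_in_value_order(rows, 0, 2))  # ['5', '3', '6']
```

Here 'B' comes first in the output because its last row appears before the last row of 'A'.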
michael_j_ward