
I have a bit of code that runs many thousands of times in my project:

def resample(freq, data):
    output = []
    for i, elem in enumerate(freq):
        for _ in range(elem):
            output.append(data[i])
    return output

e.g. resample([1, 2, 3], ['a', 'b', 'c']) => ['a', 'b', 'b', 'c', 'c', 'c']

I want to speed this up as much as possible. It seems like a list comprehension could be faster. I have tried:

def resample(freq, data):
    return [item for sublist in [[data[i]] * elem for i, elem in enumerate(freq)] for item in sublist]

This is hideous, and also slow because it builds a list of lists and then flattens it. Is there a fast way to do this with a one-line list comprehension? Or maybe something with numpy?

Thanks in advance!

Edit: The answer does not necessarily need to eliminate the nested loops; the fastest code is the best.

Luke Eller
  • List comprehensions are not faster than the equivalent for loops, because they do exactly the same operations. – Daniel Roseman Jun 29 '18 at 16:19
  • Just use `[e for i, e in enumerate(y) for j in range(x[i])]` – user3483203 Jun 29 '18 at 16:21
  • What sort of inputs are you talking about? If the numbers in `freq` are large then perhaps using `extend` in a single loop might be better than `append` – John Coleman Jun 29 '18 at 16:21
  • I don't agree with the closing @jonrsharpe. It is not a duplicate of that question. – Bharel Jun 29 '18 at 16:22
  • Yeah, I don't agree with the closing either. – nosklo Jun 29 '18 at 16:23
  • I understood the OP to be asking how to write a list comprehension equivalent of nested for loops, which the duplicate does cover. If not, could they please [edit] to clarify. – jonrsharpe Jun 29 '18 at 16:24
  • @jonrsharpe he is not asking that. He is asking how to make the `resample` function which repeats a char based on a list of numbers. My implementation has no nested loops – nosklo Jun 29 '18 at 16:25
  • In case you are trying to frequency weight data, note that `numpy` and `pandas` are able to deal with weights directly, e.g. to take an average https://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html – Stuart Jun 29 '18 at 16:28
  • @Stuart, `np.average` doesn't work with flexible types like this – user3483203 Jun 29 '18 at 16:46

3 Answers


I highly suggest using lazy iterators from itertools, like so:

from itertools import repeat, chain
def resample(freq, data):
    return chain.from_iterable(map(repeat, data, freq))

This is likely to be among the fastest pure-Python approaches: map(), repeat() and chain.from_iterable() are all implemented in C in CPython, so the per-item work happens without running Python-level loop code.

As for a small explanation:

repeat(i, n) returns an iterator that repeats an item i, n times.

map(repeat, data, freq) returns an iterator that calls repeat every time on an element of data and an element of freq. Basically an iterator that returns repeat() iterators.

chain.from_iterable() flattens the iterator of iterators to return the end items.
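The three steps above can be traced one at a time. This is only an illustrative sketch; the variable names `pieces` and `flattened` are made up here and are not part of the answer's code.

```python
from itertools import chain, repeat

freq = [1, 2, 3]
data = ['a', 'b', 'c']

# Step 1: repeat(item, n) yields item exactly n times.
print(list(repeat('b', 2)))  # ['b', 'b']

# Step 2: map with two iterables calls repeat(data[k], freq[k]) pairwise,
# producing one repeat() iterator per element. Nothing is computed yet.
pieces = map(repeat, data, freq)

# Step 3: chain.from_iterable walks each inner iterator in turn.
flattened = chain.from_iterable(pieces)

# Only here is the whole pipeline actually consumed.
result = list(flattened)
print(result)  # ['a', 'b', 'b', 'c', 'c', 'c']
```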

No intermediate list is created along the way, so there is no extra memory overhead, and as an added benefit you can use any type of data, not just one-character strings.

While I don't suggest it, you can convert the result into a list like so:

result = list(resample([1,2,3], ['a','b','c']))
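If only part of the output is needed, the iterator can be consumed lazily instead of being turned into a full list. A minimal sketch, assuming the `resample` defined above; `first_four` and `rest` are illustrative names:

```python
from itertools import chain, islice, repeat

def resample(freq, data):
    return chain.from_iterable(map(repeat, data, freq))

stream = resample([1, 2, 3], ['a', 'b', 'c'])

# Take only the first four items; the remaining items are not produced yet.
first_four = list(islice(stream, 4))
print(first_four)  # ['a', 'b', 'b', 'c']

# The iterator remembers its position, so this picks up where islice stopped.
rest = list(stream)
print(rest)  # ['c', 'c']
```

Note that the iterator is single-use: once exhausted, it yields nothing more, so call `resample` again if a second pass is needed.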
Bharel
import itertools
def resample(freq, data):
    return itertools.chain.from_iterable([el]*n for el, n in zip(data, freq))

Besides being faster, this also has the advantage of being lazy: it returns an iterator, and the elements are produced step by step as you consume it.

nosklo

There is no need to create intermediate lists at all; just use a nested loop inside a single comprehension:

[e for i, e in enumerate(data) for j in range(freq[i])]

# ['a', 'b', 'b', 'c', 'c', 'c']

You can just as easily make this lazy by swapping the square brackets for parentheses, which turns it into a generator expression:

(e for i, e in enumerate(data) for j in range(freq[i]))
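Since the question is about speed, it is worth measuring rather than guessing. Below is a rough timing sketch using `timeit` that compares the approaches from this page. The function names and the test input are made up for illustration, and results will vary with the Python version and with the size of the values in `freq`.

```python
import timeit
from itertools import chain, repeat

def resample_loop(freq, data):
    # The original nested loop from the question.
    output = []
    for i, elem in enumerate(freq):
        for _ in range(elem):
            output.append(data[i])
    return output

def resample_extend(freq, data):
    # Single loop with extend, as suggested in the comments.
    output = []
    for item, n in zip(data, freq):
        output.extend([item] * n)
    return output

def resample_chain(freq, data):
    # itertools pipeline, materialized so the comparison is fair.
    return list(chain.from_iterable(map(repeat, data, freq)))

def resample_comprehension(freq, data):
    # Nested comprehension without intermediate lists.
    return [e for i, e in enumerate(data) for _ in range(freq[i])]

# Hypothetical input: 1000 items, each repeated 0 to 4 times.
freq = [i % 5 for i in range(1000)]
data = list(range(1000))

candidates = [resample_loop, resample_extend, resample_chain, resample_comprehension]
expected = resample_loop(freq, data)

for fn in candidates:
    # Check correctness before trusting any timing.
    assert fn(freq, data) == expected
    seconds = timeit.timeit(lambda: fn(freq, data), number=50)
    print(f"{fn.__name__}: {seconds:.4f}s for 50 runs")
```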
user3483203