0

I needed a Python script to condense a list of values and wasn't sure of the best method to accomplish this. The functions below are a collection of the different approaches that were recommended to me and their associated runtimes. For reference, my source data for these benchmarks used 100 lists of approx 500,000 entries each

Thanks to everyone who pointed me in the right direction!

Sample Input:

values = [3, 4, 4, 5, 3, 2, 7, 8, 9, 9, 9, 9, 9, 4, 4, 2, 1, 1]

Desired output:

col_counts = [1, 2, 1, 1, 1, 1, 1, 5, 2, 1, 2]
col_values = [3, 4, 5, 3, 2, 7, 8, 9, 4, 2, 1]
Output String = 1*3 2*4 1*5 1*3 1*2 1*7 1*8 5*9 2*4 1*2 2*1

Summary Findings:

  • Using Pandas was very similar in speed to manually iterating over the list
  • Using itertools with (k, len(list(g))) was actually 2x faster than (k, sum(1 for i in g)
  • Using itertools w/ len was by far the fastest method (5x faster than pandas or manual loop)
  • The fastest method was also the most compact; it required only 2 lines to build the output string (vs. 28 in my original)

Code for Fastest Method:

def condense_values_2liner(values):
    grouped_values = [(len(list(g)), k) for k,g in groupby(values)]
    output_str = " ".join("%s*%s" % tup for tup in grouped_values)
    return output_str

Complete Code for other Methods:

# BENCHMARKED PERFORMANCE (100 LISTS OF APPROX. 500,000 values
# > condense_values_loop = 21.7s
# > condense_values_pandas = 21.2s
# > condense_values_itertools_sum = 10.4s
# > condense_values_itertools_len = 5.1s
# > condense_values_2liner = 4.8s

# > condense_values_loop = 21.7s
def condense_values_loop(values):
    prev_value = None
    cnt = 0
    value_counts = []
    first = True
    for value in values:
        if first: # First value
            cnt = 1
            prev_value = value
            first = False
        else:
            if value != prev_value:
                value_counts.append([cnt,prev_value])
                cnt = 1
                prev_value = value
            else:
                cnt += 1

    # Write final line to array (check for no values)
    value_counts.append([cnt,value])

    # Build output strings
    output_pairs = []
    for value_count in value_counts:
        entry = str(value_count[0]) + "*" + str(value_count[1])
        output_pairs.append(entry)
    output_str = ' '.join(output_pairs)
    return output_str


# > condense_values_pandas = 21.2s
def condense_values_pandas(values):
    df = pd.DataFrame({'value': values})
    df2 = (
        df
        .groupby(df['value'].ne(df['value'].shift()).cumsum())
        .count()
        .reset_index(drop=True)
        .assign(x=df.loc[df['value'].ne(df['value'].shift()), 'value'].tolist())
    )
    df2.columns = ['counts', 'value']
    df2['output_str'] = df2['counts'].astype(str) + '*' + df2['value'].astype(str)
    col_counts = df2['counts'].tolist()
    col_values = df2['value'].tolist()
    output_string = ' '.join(df2['output_str'])
    return output_string


# condense_values_itertools_sum = 10.4s
def condense_values_itertools_sum(values):
    grouped_values = [(k, sum(1 for i in g)) for k,g in groupby(values)]
    paired_values = []
    for pair in grouped_values:
        paired_values.append(str(pair[1]) + "*" + str(pair[0]))
    output_str = ' '.join(paired_values)
    return output_str


# condense_values_itertools_len = 5.1s
def condense_values_itertools_len(values):
    grouped_values = [(k, len(list(g))) for k,g in groupby(values)]
    paired_values = []
    for pair in grouped_values:
        paired_values.append(str(pair[1]) + "*" + str(pair[0]))
    output_str = ' '.join(paired_values)
    return output_str

# > condense_values_2liner = 4.8s
# > Note: 'join()' call required switching k/g from functions above
def condense_values_2liner(values):
    grouped_values = [(len(list(g)), k) for k,g in groupby(values)]
    output_str = " ".join("%s*%s" % tup for tup in grouped_values)
    return output_str
KKB
  • 23
  • 4
  • Can you use pandas? You didn't tag it, but one never knows... If speed matters, I would highly recommend it. – Alexander Jan 06 '20 at 21:54
  • do you have to keep all the content in memory? Generators might make a nicer solution – Marat Jan 06 '20 at 21:55
  • Note, don't use the accepted solution in the linked duplicate, it is unecessarily ineffecient (in a really silly way). Use [this solution](https://stackoverflow.com/a/7903001/5014455) – juanpa.arrivillaga Jan 06 '20 at 21:55
  • @juanpa.arrivillaga Indeed, though that answer unnecessarily constructs lists to just count items. I've updated the linked duplicate to a better answer then. – blhsing Jan 06 '20 at 22:00
  • @blhsing I bet it is still the fastest way. You'd be amazed how fast constructing a list and taking it's length is compared to something like `sum(1 for _ in g)` – juanpa.arrivillaga Jan 06 '20 at 22:01
  • @juanpa.arrivillaga True. Sometimes I feel that the built-in `len` function should support an iterable as an argument as long as the user is aware that it will be consumed after the call, so that we can make counting items in an iterable both efficient and readable. – blhsing Jan 06 '20 at 22:09
  • 1
    @blhsing yeah I wish at least `itertools._grouper` objects did, nothing preventing it from implementing `__len__` – juanpa.arrivillaga Jan 06 '20 at 22:12
  • Thanks all for your recommendations. I tested them all and updated the post to reflect the fastest method – KKB Jan 09 '20 at 17:45

0 Answers0