I needed a Python script to condense a list of values and wasn't sure of the best method to accomplish this. The functions below are a collection of the different approaches that were recommended to me and their associated runtimes. For reference, my source data for these benchmarks used 100 lists of approx 500,000 entries each
Thanks to everyone who pointed me in the right direction!
Sample Input:
values = [3, 4, 4, 5, 3, 2, 7, 8, 9, 9, 9, 9, 9, 4, 4, 2, 1, 1]
Desired output:
col_counts = [1, 2, 1, 1, 1, 1, 1, 5, 2, 1, 2]
col_values = [3, 4, 5, 3, 2, 7, 8, 9, 4, 2, 1]
Output String = 1*3 2*4 1*5 1*3 1*2 1*7 1*8 5*9 2*4 1*2 2*1
Summary Findings:
- Using Pandas was very similar in speed to manually iterating over the list
- Using itertools with (k, len(list(g))) was actually 2x faster than (k, sum(1 for i in g)
- Using itertools w/ len was by far the fastest method (5x faster than pandas or manual loop)
- The fastest method was also the most compact; it required only 2 lines to build the output string (vs. 28 in my original)
Code for Fastest Method:
def condense_values_2liner(values):
grouped_values = [(len(list(g)), k) for k,g in groupby(values)]
output_str = " ".join("%s*%s" % tup for tup in grouped_values)
return output_str
Complete Code for other Methods:
# BENCHMARKED PERFORMANCE (100 LISTS OF APPROX. 500,000 values
# > condense_values_loop = 21.7s
# > condense_values_pandas = 21.2s
# > condense_values_itertools_sum = 10.4s
# > condense_values_itertools_len = 5.1s
# > condense_values_2liner = 4.8s
# > condense_values_loop = 21.7s
def condense_values_loop(values):
prev_value = None
cnt = 0
value_counts = []
first = True
for value in values:
if first: # First value
cnt = 1
prev_value = value
first = False
else:
if value != prev_value:
value_counts.append([cnt,prev_value])
cnt = 1
prev_value = value
else:
cnt += 1
# Write final line to array (check for no values)
value_counts.append([cnt,value])
# Build output strings
output_pairs = []
for value_count in value_counts:
entry = str(value_count[0]) + "*" + str(value_count[1])
output_pairs.append(entry)
output_str = ' '.join(output_pairs)
return output_str
# > condense_values_pandas = 21.2s
def condense_values_pandas(values):
df = pd.DataFrame({'value': values})
df2 = (
df
.groupby(df['value'].ne(df['value'].shift()).cumsum())
.count()
.reset_index(drop=True)
.assign(x=df.loc[df['value'].ne(df['value'].shift()), 'value'].tolist())
)
df2.columns = ['counts', 'value']
df2['output_str'] = df2['counts'].astype(str) + '*' + df2['value'].astype(str)
col_counts = df2['counts'].tolist()
col_values = df2['value'].tolist()
output_string = ' '.join(df2['output_str'])
return output_string
# condense_values_itertools_sum = 10.4s
def condense_values_itertools_sum(values):
grouped_values = [(k, sum(1 for i in g)) for k,g in groupby(values)]
paired_values = []
for pair in grouped_values:
paired_values.append(str(pair[1]) + "*" + str(pair[0]))
output_str = ' '.join(paired_values)
return output_str
# condense_values_itertools_len = 5.1s
def condense_values_itertools_len(values):
grouped_values = [(k, len(list(g))) for k,g in groupby(values)]
paired_values = []
for pair in grouped_values:
paired_values.append(str(pair[1]) + "*" + str(pair[0]))
output_str = ' '.join(paired_values)
return output_str
# > condense_values_2liner = 4.8s
# > Note: 'join()' call required switching k/g from functions above
def condense_values_2liner(values):
grouped_values = [(len(list(g)), k) for k,g in groupby(values)]
output_str = " ".join("%s*%s" % tup for tup in grouped_values)
return output_str