- The answer from Cameron Riddell is the fastest tested, at 337 ms for 400k rows.
- My solution using a list comprehension with `.map(tuple)` is the second fastest, at 391 ms for 400k rows.
Sample data
import re  # needed for the ifly6 timing further down
import pandas as pd
# test data
df_1 = pd.DataFrame({'x': ['a', 'b', 'c', 'd'], 'y': ['e', 'f', 'g', 'h']})
- These two list-comprehension options are faster than using `.to_string()`:
','.join([f'{v}' for v in (df_1.x + df_1.y).map(tuple).values]) + ';'
','.join([f'{v}' for v in (df_1.sum(axis=1)).map(tuple).values]) + ';'
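As a quick check of the list-comprehension option on the four-row sample, here is a minimal sketch. The helper name `join_tuples` is hypothetical and only wraps the one-liner above so its result can be inspected.

```python
import pandas as pd

# same four-row sample as above
df_1 = pd.DataFrame({'x': ['a', 'b', 'c', 'd'], 'y': ['e', 'f', 'g', 'h']})


def join_tuples(df):
    """Hypothetical wrapper around the list-comprehension option."""
    # concatenate the two string columns row-wise ('a' + 'e' -> 'ae'),
    # then turn each concatenated string into a tuple of its characters
    pairs = (df.x + df.y).map(tuple)
    # f'{v}' formats each tuple with its normal Python repr
    return ','.join([f'{v}' for v in pairs.values]) + ';'


result = join_tuples(df_1)
print(result)
# ('a', 'e'),('b', 'f'),('c', 'g'),('d', 'h');
```

Two things to be aware of: the f-string keeps the quotes around each element, and `.map(tuple)` splits the concatenated string into single characters, so this only produces pairs when every cell holds one character, as in the sample data.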
- My original assumption was that the two `.to_string()` options below would be fastest, because they don't use a loop or list comprehension, but apparently `.to_string()` is relatively slow.
- Either with the entire dataframe, or using `.loc` to specify columns, use `.sum(axis=1)`, map the sum to a `tuple`, and output to a `str` with `.to_string(index=False)`.
- This results in `'(a, e)\n(b, f)\n(c, g)\n(d, h)'`, so `\n` is replaced with `,`.
# use .loc to specify specific columns
df_1.loc[:, ['x', 'y']].sum(axis=1).map(tuple).to_string(index=False).replace('\n', ',') + ';'
# use this option to sum all columns
df_1.sum(axis=1).map(tuple).to_string(index=False).replace('\n', ',') + ';'
# resulting output of each
'(a, e),(b, f),(c, g),(d, h);'
%%timeit
# sample data with 400k rows
df_1 = pd.DataFrame({'x': ['a', 'b', 'c', 'd'], 'y': ['e', 'f', 'g', 'h']})
df = pd.concat([df_1] * 100000).reset_index(drop=True)
# Cameron
%%timeit -r1 -n1 -q -o
cameron(df)
[out]:
<TimeitResult : 337 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
# Trenton (list comprehension)
%%timeit -r1 -n1 -q -o
','.join([f'{v}' for v in (df.sum(axis=1)).map(tuple).values]) + ';'
[out]:
<TimeitResult : 391 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
# xxdil
%%timeit -r1 -n1 -q -o
xxdil(df)
[out]:
<TimeitResult : 5.36 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
# ifly6
%%timeit -r1 -n1 -q -o
re.sub(r'[\[\] ]', '', ''.join(str([tuple(t) for _, t in df.iterrows()])) + ';')
[out]:
<TimeitResult : 34.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
# Trenton (.to_string())
%%timeit -r1 -n1 -q -o
df.sum(axis=1).map(tuple).to_string(index=False).replace('\n', ',') + ';'
[out]:
<TimeitResult : 49.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>
Functions
def cameron(df_1):
    all_pairs = []
    for pair in zip(df_1["x"], df_1["y"]):
        pair_str = "({})".format(",".join(pair))
        all_pairs.append(pair_str)
    return ",".join(all_pairs) + ";"


def xxdil(df_1):
    ans = ""
    for i in range(df_1.shape[0]):
        ans += '(' + df_1['x'][i] + ',' + df_1['y'][i] + '),'
    return ans[:-1] + ';'
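The timings above come from the IPython `%%timeit` magic with a single run each. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. This is only an illustration: the function name `list_comp`, the reduced row count, and the choice of three repeats are assumptions, and absolute numbers will vary with the machine and pandas version.

```python
import timeit

import pandas as pd


def cameron(df):
    # zip-based answer, as in the Functions section
    all_pairs = []
    for pair in zip(df["x"], df["y"]):
        all_pairs.append("({})".format(",".join(pair)))
    return ",".join(all_pairs) + ";"


def list_comp(df):
    # list-comprehension option (column-addition variant);
    # the function name is hypothetical
    return ','.join([f'{v}' for v in (df.x + df.y).map(tuple).values]) + ';'


df_1 = pd.DataFrame({'x': ['a', 'b', 'c', 'd'], 'y': ['e', 'f', 'g', 'h']})
# 4k rows keeps the sketch quick; raise the multiplier to 100000 for 400k rows
df = pd.concat([df_1] * 1000).reset_index(drop=True)

timings = {}
for func in (cameron, list_comp):
    # three runs of one call each; keep the best run in seconds
    runs = timeit.repeat(lambda: func(df), number=1, repeat=3)
    timings[func.__name__] = min(runs)

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1000:.2f} ms')
```

Taking the minimum of several runs reduces the effect of one-off noise, which a single `-r1 -n1` measurement cannot do.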