Make a list of python pairs of two columns from dataframe in python

Question

I have pandas dataframe

df_1 = pd.DataFrame({'x' : [a, b, c, d], 'y' : [e, f, g, h]})

I need to get from string like this:

(first_element_from_first_row,first_element_from_second_row),
(second_element_from_first_row,second_element_from_second_row),
................................................................
(last_element_from_first_row,last_element_from_second_row);

at the end should be semicolon.

in my case the answer should be:

(a,e),(b,f),(c,g),(d,h);

How should I solve my problem?

score 2 · Answer 1 · answered Jan 08 '21 at 18:53

If I understand the question correctly- you want to apply the following transformation:

you can use zip to iterate over each element of column "x" and column "y" at the same time as a tuple of elements. You can join those elements so that they are a string and wrap that in parentheses to get the desired row-wise output. Then you store all of those in a larger list and turn that larger list into a string separated by commas, and add a semicolon at the end.

all_pairs = []

for pair in zip(df_1["x"], df_1["y"]):
    pair_str = "({})".format(",".join(pair))
    all_pairs.append(pair_str)

final_str = ",".join(all_pairs) + ";"

print(final_str)
'(a,e),(b,f),(c,g),(d,h);'

Yours is the fastest implementation tested. – Trenton McKinney Jan 08 '21 at 21:03 — Trenton McKinney, Jan 08 '21 at 21:03

score 1 · Answer 2 · answered Jan 08 '21 at 18:49

Try this:

import pandas as pd

df_1 = pd.DataFrame({'x' : ['a', 'b', 'c', 'd'], 'y' : ['e', 'f', 'g', 'h']})

ans = ""
for i in range(df_1.shape[0]):
    ans += '(' + df_1['x'][i] + ',' + df_1['y'][i] + '),'
    
ans = ans[:-1] + ';'

ans
'''
'(a,e),(b,f),(c,g),(d,h);'
'''

This is a pretty crude way but hey it works :)

score 1 · Answer 3 · answered Jan 08 '21 at 18:59

1

Convert to tuples.

s = ''.join(str([tuple(t) for _, t in df_1.iterrows()])) + ';'

If you want to strip the brackets and the spaces:

import re
s_new = re.sub(r'[\[\] ]', '', s)

answered Jan 08 '21 at 18:59

ifly6

5,003
2
24
47

Trenton McKinney · Answer 4 · 2021-01-08T21:06:49.853

The answer from Cameron Riddell is the fastest tested, at 337 ms for 400k rows.
My solution using a list-comprehension with .map(tuple) is the second fastest, at 391 ms for 400k rows

sample data

import pandas as pd

# test data
df_1 = pd.DataFrame({'x': ['a', 'b', 'c', 'd'], 'y': ['e', 'f', 'g', 'h']})

These two options are faster than using .to_string()

','.join([f'{v}' for v in (df_1.x + df_1.y).map(tuple).values]) + ';'

','.join([f'{v}' for v in (df_1.sum(axis=1)).map(tuple).values]) + ';'

My original assumption was these two options would be fastest, because they don't use a loop or list comprehension, but apparently, .to_string() is relatively slow.
Either with the entire dataframe, or using .loc to specify columns, use .sum(axis=1), map the sum to a tuple, and output to a str with .to_string(index=False).
- This results in '(a, e)\n(b, f)\n(c, g)\n(d, h)' so \n is replaced with ,.

# use .loc to specify specific columns
df_1.loc[:, ['x', 'y']].sum(axis=1).map(tuple).to_string(index=False).replace('\n', ',') + ';'

# use this option to sum all columns
df_1.sum(axis=1).map(tuple).to_string(index=False).replace('\n', ',') + ';'

# resulting output of each
'(a, e),(b, f),(c, g),(d, h);'

`%%timeit`

# sample data with 400k rows
df_1 = pd.DataFrame({'x': ['a', 'b', 'c', 'd'], 'y': ['e', 'f', 'g', 'h']})
df = pd.concat([df_1] * 100000).reset_index(drop=True)

# Cameron
%%timeit -r1 -n1 -q -o
cameron(df)
[out]:
<TimeitResult : 337 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

# Trenton
%%timeit -r1 -n1 -q -o
','.join([f'{v}' for v in (df.sum(axis=1)).map(tuple).values]) + ';'
[out]:
<TimeitResult : 391 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

# xxdil
%%timeit -r1 -n1 -q -o
xxdil(df)
[out]:
<TimeitResult : 5.36 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

# ifly6
%%timeit -r1 -n1 -q -o
re.sub(r'[\[\] ]', '', ''.join(str([tuple(t) for _, t in df.iterrows()])) + ';')
[out]:
<TimeitResult : 34.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

# Trenton
%%timeit -r1 -n1 -q -o
df.sum(axis=1).map(tuple).to_string(index=False).replace('\n', ',') + ';'
[out]:
<TimeitResult : 49.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>

Functions

def cameron(df_1):
    
    all_pairs = []

    for pair in zip(df_1["x"], df_1["y"]):
        pair_str = "({})".format(",".join(pair))
        all_pairs.append(pair_str)

    return ",".join(all_pairs) + ";"


def xxdil(df_1):
    ans = ""
    for i in range(df_1.shape[0]):
        ans += '(' + df_1['x'][i] + ',' + df_1['y'][i] + '),'

    return ans[:-1] + ';'

Make a list of python pairs of two columns from dataframe in python

4 Answers4

sample data

%%timeit

Functions

`%%timeit`