How to form tuple column from two columns in Pandas

Question

I've got a Pandas DataFrame and I want to combine the 'lat' and 'long' columns to form a tuple.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205482 entries, 0 to 209018
Data columns:
Month           205482  non-null values
Reported by     205482  non-null values
Falls within    205482  non-null values
Easting         205482  non-null values
Northing        205482  non-null values
Location        205482  non-null values
Crime type      205482  non-null values
long            205482  non-null values
lat             205482  non-null values
dtypes: float64(4), object(5)

The code I tried to use was:

def merge_two_cols(series): 
    return (series['lat'], series['long'])

sample['lat_long'] = sample.apply(merge_two_cols, axis=1)

However, this returned the following error:

---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
<ipython-input-261-e752e52a96e6> in <module>()
      2     return (series['lat'], series['long'])
      3 
----> 4 sample['lat_long'] = sample.apply(merge_two_cols, axis=1)
      5

...

AssertionError: Block shape incompatible with manager

How can I solve this problem?

Dale · Accepted Answer · 2016-11-07T16:58:03.140

368

Get comfortable with zip. It comes in handy when dealing with column data.

df['new_col'] = list(zip(df.lat, df.long))

It's less complicated and faster than using apply or map. Something like np.dstack is twice as fast as zip, but wouldn't give you tuples.

edited Nov 07 '16 at 16:58

answered Apr 17 '13 at 19:24

Dale

4,480
1
19
13

7

in python3, you have to use `list`. This should work: `df['new_col'] = list(zip(df.lat, df.long))` – paulwasit Nov 02 '16 at 08:06
@paulwasit ah yes, my love hate relationship with python 3's lazy behavior. thanks. – Dale Nov 07 '16 at 17:00
15

This method ```list(zip(df.lat, df.long))``` in 124ms is much more efficient than ```df[['lat', 'long']].apply(tuple, axis=1)``` in 14.2 s for 900k rows. The ratio is more than 100. – Pengju Zhao Aug 02 '17 at 04:21
1

I am trying to use this with a longer list of columns `df['new_col'] = list(zip(df[cols_to_keep]))` but keep getting an error: `Length of values does not match length of index` any advice? – seeiespi Apr 16 '18 at 22:53
`zip(df[cols_to_keep])` will iterate over the DataFrame, creating a list of columns instead of a list of Series. you need `zip( [df[c] for c in cols_to_keep])` – Peter Hansen Oct 14 '19 at 15:43
5

@PeterHansen's answer helped me but think it may have been missing an * to unpack the list first - i.e. `df['new_col'] = list(zip(*[df[c] for c in cols_to_keep])` – jedge Feb 12 '20 at 16:33
This one fails for me with: `TypeError: only integer scalar arrays can be converted to a scalar index`. – Zizzipupp Jul 14 '20 at 15:10

Wouter Overmeire · Answer 2 · 2013-04-18T20:42:06.173

101

In [10]: df
Out[10]:
          A         B       lat      long
0  1.428987  0.614405  0.484370 -0.628298
1 -0.485747  0.275096  0.497116  1.047605
2  0.822527  0.340689  2.120676 -2.436831
3  0.384719 -0.042070  1.426703 -0.634355
4 -0.937442  2.520756 -1.662615 -1.377490
5 -0.154816  0.617671 -0.090484 -0.191906
6 -0.705177 -1.086138 -0.629708  1.332853
7  0.637496 -0.643773 -0.492668 -0.777344
8  1.109497 -0.610165  0.260325  2.533383
9 -1.224584  0.117668  1.304369 -0.152561

In [11]: df['lat_long'] = df[['lat', 'long']].apply(tuple, axis=1)

In [12]: df
Out[12]:
          A         B       lat      long                             lat_long
0  1.428987  0.614405  0.484370 -0.628298      (0.484370195967, -0.6282975278)
1 -0.485747  0.275096  0.497116  1.047605      (0.497115615839, 1.04760475074)
2  0.822527  0.340689  2.120676 -2.436831      (2.12067574274, -2.43683074367)
3  0.384719 -0.042070  1.426703 -0.634355      (1.42670326172, -0.63435462504)
4 -0.937442  2.520756 -1.662615 -1.377490     (-1.66261469102, -1.37749004179)
5 -0.154816  0.617671 -0.090484 -0.191906  (-0.0904840623396, -0.191905582481)
6 -0.705177 -1.086138 -0.629708  1.332853     (-0.629707821728, 1.33285348929)
7  0.637496 -0.643773 -0.492668 -0.777344   (-0.492667604075, -0.777344111021)
8  1.109497 -0.610165  0.260325  2.533383        (0.26032456699, 2.5333825651)
9 -1.224584  0.117668  1.304369 -0.152561     (1.30436900612, -0.152560909725)

edited Apr 18 '13 at 20:42

answered Apr 16 '13 at 09:13

Wouter Overmeire

65,766
10
63
43

That's brilliant. Thank you. Clearly need to get my head around lambda functions. – elksie5000 Apr 16 '13 at 09:34
Did this work on your data? If so, can you share your pandas version and the data? I wonder why your code did not work, it should. – Wouter Overmeire Apr 16 '13 at 12:25
The version is 0.10.1_20130131. Excuse my ignorance, but what's the best way of uploading a section of the data for you? (Still a relative newbie). – elksie5000 Apr 16 '13 at 15:56
I failed to reproduce on 0.10.1. Best way of uploading? You can either create code that generates a frame holding random data, that has the same issue and share that code or pickle the frame above (sample) and transfer it via a free big file transfer service. How to pickle (in two lines, without ","): import pickle, with open('sample.pickle', 'w') as file: pickle.dump(sample, file) – Wouter Overmeire Apr 17 '13 at 06:21
I need exactly this, but in the opposite direction. I have a column with the lat_long tuple and need two columns with lat and long. How to unpack the tuple? – Balzer82 Oct 23 '14 at 18:48
There are several ways, see e.g http://stackoverflow.com/questions/22799300/how-to-unpack-a-series-of-tuples-in-pandas – Wouter Overmeire Oct 24 '14 at 06:55
I have ```df[["year", "month", "day"]].apply(tuple, axis=1)``` where "year", "month", "day" are just integers, and this fails to do anything. EDIT: This works for floats only, what a ***y language. – imrek Mar 25 '16 at 17:22
2

I have upvoted this as I need to zip 10 columns and don't want to give dataframe name 10 times. Just want to give Column names. – rishi jain Aug 16 '19 at 07:26

score 29 · Answer 3 · edited Mar 29 '19 at 10:40

29

Pandas has the itertuples method to do exactly this:

list(df[['lat', 'long']].itertuples(index=False, name=None))

edited Mar 29 '19 at 10:40

Calimo

7,510
4
39
61

answered Nov 06 '17 at 21:49

Ted Petrou

59,042
19
131
136

Yasser M · Answer 4 · 2021-07-31T17:17:02.873

10

You should try using pd.to_records(index=False):

import pandas as pd
df = pd.DataFrame({'language': ['en', 'ar', 'es'], 'greeting': ['Hi', 'اهلا', 'Hola']})
df

   language  greeting
0       en    Hi
1       ar    اهلا
2       es   Hola

df['list_of_tuples'] = list(df[['language', 'greeting']].to_records(index=False))
df['list_of_tuples']

0    [en, Hi]
1    [ar, اهلا]
2    [es, Hola]

enjoy!

edited Jul 31 '21 at 17:17

answered Jul 29 '21 at 09:10

Yasser M

654
7
9

user3820991 · Answer 5 · 2019-10-16T10:57:54.163

7

I'd like to add df.values.tolist(). (as long as you don't mind to get a column of lists rather than tuples)

import pandas as pd
import numpy as np

size = int(1e+07)
df = pd.DataFrame({'a': np.random.rand(size), 'b': np.random.rand(size)}) 

%timeit df.values.tolist()
1.47 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list(zip(df.a,df.b))
1.92 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

edited Oct 16 '19 at 10:57

answered Oct 16 '19 at 10:46

user3820991

2,310
5
23
32

1

When you have more than just these two columns: `%timeit df[['a', 'b']].values.tolist()`. It's still much faster. – ChaimG Jan 03 '20 at 15:24
It's faster to create it, but any operations on that column will be faster in tuple form. For example, try calling `.value_counts()` on a column of lists vs a column of tuples. – ThatNewGuy Apr 22 '21 at 20:31
Now `df.to_numpy()` is recommended rather than `df.values`(see [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html)), so it‘s better to use `df.to_numpy().tolist()` instead. – user343233 Aug 21 '23 at 15:21

score 1 · Answer 6 · answered Apr 27 '23 at 10:04

Suppose you have 2 columns 'A' and 'B':

import pandas as pd
df = pd.DataFrame({'A': ['one', 'two', 'three'], 'B': [1, 2, 3]})

print(df)

    A   B
0   x   1
1   y   2
2   z   3

Now you want to combine column A and B together you can do:

print(df[['A', 'B']].apply(list, axis=1))

0      [one, 1]
1      [two, 2]
2    [three, 3]
dtype: object

or if you want nested list then:

print(df[['A', 'B']].apply(list, axis=1).tolist())

#[['one', 1], ['two', 2], ['three', 3]]

How to form tuple column from two columns in Pandas

6 Answers6

Linked

Related