For object data, I can map two columns into a third (object) column of tuples:

>>> import pandas as pd
>>> df = pd.DataFrame([["A", "b"], ["A", "a"], ["B", "b"]])
>>> df
   0  1
0  A  b
1  A  a
2  B  b

>>> df.apply(lambda row: (row[0], row[1]), axis=1)
0    (A, b)
1    (A, a)
2    (B, b)
dtype: object

(see also Pandas: How to use apply function to multiple columns).

However, when I try to do the same thing with numerical columns:

>>> df2 = pd.DataFrame([[10, 2], [10, 1], [20, 2]])
>>> df2.apply(lambda row: (row[0], row[1]), axis=1)
    0  1
0  10  2
1  10  1
2  20  2

Instead of a Series of pairs (i.e. [(10, 2), (10, 1), (20, 2)]), I get a DataFrame.

How can I force pandas to actually return a Series of pairs? (Preferably something nicer than converting to strings and then parsing them back.)

Piotr Migdal
  • The former behaviour appears to be a bug (and is fixed in the development branch, but not in 0.12). – Andy Hayden Aug 23 '13 at 00:53
  • Why do you need a `Series` of `tuple`s? Having it as two columns in a `DataFrame` is *much* more flexible IMHO. – Phillip Cloud Aug 23 '13 at 00:55
  • @PhillipCloud It is not for further storage - I just need a Series of pairs (so I can use series.value_counts() to compute statistics for pairs, e.g. to calculate mutual information). – Piotr Migdal Aug 23 '13 at 00:59

1 Answer

I don't recommend this, but you can force it:

In [11]: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)
Out[11]:
         0
0  (10, 2)
1  (10, 1)
2  (20, 2)

Please don't do this.

Two columns will give you much better performance, flexibility, and ease of later analysis.

Just to update with the OP's experience:

The goal was to count the occurrences of each [0, 1] pair.

With a Series they could use the value_counts method (on the column from the result above). However, the same result can be achieved using groupby, which the OP found to be 300 times faster:

df2.groupby([0, 1]).size()
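
For reference (output added here for illustration, not part of the original answer), on the df2 above this yields a Series counting each pair:

In [12]: df2.groupby([0, 1]).size()
Out[12]:
0   1
10  1    1
    2    1
20  2    1
dtype: int64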

It's worth emphasising (again) that the approach in [11] has to create a Series object and a tuple instance for every row, which is a huge overhead compared to groupby.
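
A minimal sketch of how such a comparison might be set up; the data shape and values here are illustrative, not from the original thread, and timings will vary:

import numpy as np
import pandas as pd

# Illustrative data: 1000 rows of small integers in two columns named 0 and 1.
df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 2)))

# Per-row approach: builds a tuple and a Series object for every row,
# then counts the resulting column of tuples.
pairs = df.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)[0]
slow_counts = pairs.value_counts()

# Vectorised approach: groups directly on the two columns.
fast_counts = df.groupby([0, 1]).size()

# In IPython, wrap each approach in %timeit; groupby avoids creating
# per-row Python objects and should be dramatically faster.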

Andy Hayden
  • @PhillipCloud I was wondering if the discouragement should be much much larger... – Andy Hayden Aug 23 '13 at 00:59
  • @AndyHayden Thanks. It is not for further analysis - I just want to calculate the distribution of pairs (e.g. to compute mutual information). Another option for me is to use `collections.Counter` and `map(lambda x, y: (x, y), df[0], df[1])`, since in this use case I don't need the index anymore; I was curious whether I could manage it within `pandas`. – Piotr Migdal Aug 23 '13 at 01:02
  • @PiotrMigdal With the column of tuples above, you can use `.value_counts()`. However, it'll be more efficient to use groupby on the original DataFrame: `df2.groupby([0, 1]).size()` – Andy Hayden Aug 23 '13 at 01:08
  • @AndyHayden Would it be more efficient (for my data there are typically ~100 other columns and I just want calculate mutual information between two columns)? – Piotr Migdal Aug 23 '13 at 01:15
  • @PiotrMigdal %timeit and see, my guess is yes, but would be interested to see either way! :) – Andy Hayden Aug 23 '13 at 01:17
  • @PiotrMigdal I make it around 40 times faster for a toy example of 1000 rows and 100 columns. – Andy Hayden Aug 23 '13 at 01:29
  • @AndyHayden For my data, shape `(3992, 77)`, `groupby` is 300x faster (sic!). Thanks! (It wasn't a bottleneck, but still great - now it takes only 2.7 ms.) – Piotr Migdal Aug 23 '13 at 01:47