For object data, I can map two columns into a third (object) column of tuples:

>>> import pandas as pd
>>> df = pd.DataFrame([["A", "b"], ["A", "a"], ["B", "b"]])
>>> df
   0  1
0  A  b
1  A  a
2  B  b

>>> df.apply(lambda row: (row[0], row[1]), axis=1)
0    (A, b)
1    (A, a)
2    (B, b)
dtype: object

(see also Pandas: How to use apply function to multiple columns).

However, when I try to do the same thing with numerical columns:

>>> df2 = pd.DataFrame([[10, 2], [10, 1], [20, 2]])
>>> df2.apply(lambda row: (row[0], row[1]), axis=1)
    0  1
0  10  2
1  10  1
2  20  2

Instead of a Series of pairs (i.e. [(10, 2), (10, 1), (20, 2)]), I get a DataFrame.

How can I force pandas to actually return a Series of pairs? (Preferably something nicer than converting to strings and then parsing them back.)

Piotr Migdal
  • The former behaviour appears to be a bug (and is fixed in the development branch, but not in 0.12). – Andy Hayden Aug 23 '13 at 00:53
  • Why do you need a `Series` of `tuple`s? Having it as two columns in a `DataFrame` is *much* more flexible IMHO. – Phillip Cloud Aug 23 '13 at 00:55
  • @PhillipCloud It is not for further storage - I just need a Series of pairs (so I can use series.value_counts() to compute statistics for pairs, e.g. to calculate mutual information). – Piotr Migdal Aug 23 '13 at 00:59

1 Answer

I don't recommend this, but you can force it:

In [11]: df2.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)
Out[11]:
         0
0  (10, 2)
1  (10, 1)
2  (20, 2)

Please don't do this.

Two columns will give you much better performance, flexibility, and ease of later analysis.

Just to update with the OP's experience:

The goal was to count the occurrences of each [0, 1] pair.

With a Series they could use the value_counts method (on the column from the result above). However, the same result can be achieved using groupby, which the OP found to be 300 times faster:

df2.groupby([0, 1]).size()
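
For reference (output added here for illustration, not part of the original answer), on the df2 above this yields a Series counting each pair:

In [12]: df2.groupby([0, 1]).size()
Out[12]:
0   1
10  1    1
    2    1
20  2    1
dtype: int64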

It's worth emphasising (again) that the approach in [11] has to create a Series object and a tuple instance for every row, which is a huge overhead compared to groupby.
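
A minimal sketch of how such a comparison might be set up; the data shape and values here are illustrative, not from the original thread, and timings will vary:

import numpy as np
import pandas as pd

# Illustrative data: 1000 rows of small integers in two columns named 0 and 1.
df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 2)))

# Per-row approach: builds a tuple and a Series object for every row,
# then counts the resulting column of tuples.
pairs = df.apply(lambda row: pd.Series([(row[0], row[1])]), axis=1)[0]
slow_counts = pairs.value_counts()

# Vectorised approach: groups directly on the two columns.
fast_counts = df.groupby([0, 1]).size()

# In IPython, wrap each approach in %timeit; groupby avoids creating
# per-row Python objects and should be dramatically faster.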

Andy Hayden
  • @PhillipCloud I was wondering if the discouragement should be much much larger... – Andy Hayden Aug 23 '13 at 00:59
  • @AndyHayden Thanks. It is not for further analysis - I just want to calculate the distribution of pairs (e.g. to compute mutual information). Another option for me is to use `collections.Counter` and `map(lambda x, y: (x, y), df[0], df[1])`, since in this use case I don't need the index anymore; I was curious whether I could manage it within `pandas`. – Piotr Migdal Aug 23 '13 at 01:02
  • @PiotrMigdal With the column of tuples above, you can use `.value_counts()`. However, it'll be more efficient to use groupby on the original DataFrame: `df2.groupby([0, 1]).size()` – Andy Hayden Aug 23 '13 at 01:08
  • @AndyHayden Would it be more efficient (for my data there are typically ~100 other columns and I just want calculate mutual information between two columns)? – Piotr Migdal Aug 23 '13 at 01:15
  • @PiotrMigdal %timeit and see, my guess is yes, but would be interested to see either way! :) – Andy Hayden Aug 23 '13 at 01:17
  • @PiotrMigdal I make it around 40 times faster for a toy example of 1000 rows and 100 columns. – Andy Hayden Aug 23 '13 at 01:29
  • @AndyHayden For my data, shape `(3992, 77)`, `groupby` is 300x faster (sic!). Thanks! (It wasn't a bottleneck, but still great - now it takes only 2.7 ms.) – Piotr Migdal Aug 23 '13 at 01:47