
I need to do something very similar to this question: Pandas convert dataframe to array of tuples

The difference is I need to get not only a single list of tuples for the entire DataFrame, but a list of lists of tuples, sliced based on some column value.

Supposing this is my data set:

         t_id     A     B
         ----   ---  ----
    0    AAAA     1   2.0
    1    AAAA     3   4.0
    2    AAAA     5   6.0
    3    BBBB     7   8.0
    4    BBBB     9  10.0
    ...

I want to produce as output:

        [[(1,2.0), (3,4.0), (5,6.0)],[(7,8.0), (9,10.0)]]

That is, one list for 'AAAA', another for 'BBBB' and so on.

I've tried with two nested for loops. It seems to work, but it is taking too long (actual data set has ~1M rows):

    result = []
    for t in df['t_id'].unique():
        tuple_list = []

        # Filter the rows for this t_id; iterrows() yields (index, row) pairs
        for x in df[df['t_id'] == t].iterrows():
            row = x[1][['A', 'B']]
            tuple_list.append(tuple(row))

        result.append(tuple_list)

Is there a faster way to do it?

2 Answers

2

You can group by column `t_id`, iterate through the groups, and convert each sub-DataFrame into a list of tuples:

    [g[['A', 'B']].to_records(index=False).tolist() for _, g in df.groupby('t_id')]
    # [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
Psidom
  • It actually works and it is way faster than my for loops, thanks. Can you please break down the `for _, g` part? What does `groupby` return? According to the DataFrame documentation it returns only a `DataFrameGroupBy` object. – thatOldITGuy Aug 15 '21 at 11:57
  • `DataFrameGroupBy` inherits from `BaseGroupBy`, which defines an `__iter__` method: https://github.com/pandas-dev/pandas/blob/126a19d038b65493729e21ca969fbb58dab9a408/pandas/core/groupby/groupby.py#L756. That means you can iterate over the groups and process each one separately. – Psidom Aug 15 '21 at 17:22
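
To illustrate the comment above, here is a minimal sketch of what iterating over `df.groupby('t_id')` yields: each step gives a `(key, sub-DataFrame)` pair, which is why the answer unpacks it as `for _, g`. The sample frame mirrors the question's table. Two things here are my own additions and not part of the answer: `sort=False`, which should keep groups in order of first appearance (like `unique()` in the question's loop), and `itertuples(index=False, name=None)` as an alternative way to get plain tuples.

```python
import pandas as pd

# Sample data mirroring the table in the question
df = pd.DataFrame({
    "t_id": ["AAAA", "AAAA", "AAAA", "BBBB", "BBBB"],
    "A": [1, 3, 5, 7, 9],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
})

keys = []
result = []

# Iterating a GroupBy object yields (group key, sub-DataFrame) pairs.
# sort=False (an assumption about what the asker wants) keeps first-appearance order.
for key, group in df.groupby("t_id", sort=False):
    keys.append(key)
    # name=None makes itertuples return plain tuples instead of namedtuples
    result.append(list(group[["A", "B"]].itertuples(index=False, name=None)))

print(keys)
print(result)
```

The answer's one-liner discards the key with `_` because only the per-group tuples are needed; keep it if you want to know which list belongs to which `t_id`.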
1

I think this should work too:

    import pandas as pd
    import itertools


    df = pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 2, 2, 2], "C": ["A", "B", "C", "B"]})

    tuples_in_df = sorted(tuple(df.to_records(index=False)), key=lambda x: x[0])
    output = [[tuple(x)[1:] for x in group] for _, group in itertools.groupby(tuples_in_df, lambda x: x[0])]
    print(output)

Out:

    [[(2, 'A'), (2, 'B')], [(2, 'B')], [(2, 'C')]]
osint_alex
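
A note on why the `sorted` call matters in the snippet above: `itertools.groupby` only groups consecutive items with equal keys, so rows must be ordered by the key first. The sketch below adapts the same idea to the question's columns (grouping on `t_id`). The interleaved sample data and the names `rows`, `unsorted_groups`, and `output` are made up for illustration.

```python
import itertools
import pandas as pd

# Hypothetical data: same values as the question, but with t_id interleaved
df = pd.DataFrame({
    "t_id": ["AAAA", "BBBB", "AAAA", "BBBB", "AAAA"],
    "A": [1, 7, 3, 9, 5],
    "B": [2.0, 8.0, 4.0, 10.0, 6.0],
})

# Plain (t_id, A, B) tuples, one per row
rows = list(df[["t_id", "A", "B"]].itertuples(index=False, name=None))

# Without sorting, every change of key starts a new group
unsorted_groups = [list(g) for _, g in itertools.groupby(rows, key=lambda r: r[0])]

# Sort by key first (Python's sort is stable, so row order within a key is kept)
rows.sort(key=lambda r: r[0])
output = [
    [r[1:] for r in group]  # drop the key, keep (A, B)
    for _, group in itertools.groupby(rows, key=lambda r: r[0])
]

print(len(unsorted_groups))
print(output)
```

On a large frame the sort adds an O(n log n) step in pure Python, so the pandas `groupby` approach is likely the better fit for ~1M rows.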