
I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame(
    {'tod':    [[('a',10),('b',6),('h',3),('p',2)], 
                [('x',11),('e',2),('l',2)], 
                [('r',5),('l',5)], 
                [('n',15)]]})

                                 tod
0  [(a, 10), (b, 6), (h, 3), (p, 2)]
1          [(x, 11), (e, 2), (l, 2)]
2                   [(r, 5), (l, 5)]
3                          [(n, 15)]

I want to expand the list of tuples into new columns to get:

                                 tod l1   n1  l2    n2 l3    n3  l4    n4
0  [(a, 10), (b, 6), (h, 3), (p, 2)]  a   10   b     6  h     3   p     2
1          [(x, 11), (e, 2), (l, 2)]  x   11   e     2  l     2   -   nan
2                   [(r, 5), (l, 5)]  r    5   l     5  -   nan   -   nan
3                          [(n, 15)]  n   15   -   nan  -   nan   -   nan

I want to end up with missing values in the corresponding columns if the tuples do not exist.

I'm having trouble as the length of each list (number of tuples) within each cell is different for each row, so I want to dynamically assign new column values as they appear. Also each cell contains a list of tuple pairs, not a single tuple.

I tried something like in this question, but that only allowed expansion of a single tuple into multiple columns (when you know the columns beforehand).

Then I looked at this and this, but again the number of columns is unknown, so I ended up with:

pd.DataFrame.from_records([{k: v for v, k in row} for row in df.tod])
Out[171]: 
    2    3    5    6    10   11   15
0    p    h  NaN    b    a  NaN  NaN
1    l  NaN  NaN  NaN  NaN    x  NaN
2  NaN  NaN    l  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN    n

Then I looked at splitting cells which contain tuples here and here, which looked at converting the tuples to a Series, but again this did not work, as those examples only deal with single tuples of known length, not lists of tuples.

How can I approach this problem?

Note: I realise I've not put up a lot of code for "What have you tried" - my console is a heap of garbage that has yielded continuous errors. In the interests of keeping this clean I've left this off.

Chuck

2 Answers


You can flatten the tuples, build the column names with a generator, and finally `join` the result to the original DataFrame:

#https://stackoverflow.com/a/45122198/2901002
def mygen(lst):
    # yield alternating column names: l1, n1, l2, n2, ...
    for item in lst:
        yield 'l{}'.format(item)
        yield 'n{}'.format(item)

# flatten each list of tuples into one flat row per record
df1 = pd.DataFrame([[b for a in row for b in a] for row in df.tod])
df1.columns = list(mygen(range(1, len(df1.columns) // 2 + 1)))
print(df1)
  l1  n1    l2   n2    l3   n3    l4   n4
0  a  10     b  6.0     h  3.0     p  2.0
1  x  11     e  2.0     l  2.0  None  NaN
2  r   5     l  5.0  None  NaN  None  NaN
3  n  15  None  NaN  None  NaN  None  NaN

df = df.join(df1)
print (df)
                                 tod l1  n1    l2   n2    l3   n3    l4   n4
0  [(a, 10), (b, 6), (h, 3), (p, 2)]  a  10     b  6.0     h  3.0     p  2.0
1          [(x, 11), (e, 2), (l, 2)]  x  11     e  2.0     l  2.0  None  NaN
2                   [(r, 5), (l, 5)]  r   5     l  5.0  None  NaN  None  NaN
3                          [(n, 15)]  n  15  None  NaN  None  NaN  None  NaN
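If you also want the `-` placeholders from the desired output instead of `None`, a small follow-up sketch (my addition, not part of the original answer) fills just the letter columns; the `n` columns keep `NaN`:

```python
import pandas as pd

df = pd.DataFrame(
    {'tod': [[('a', 10), ('b', 6), ('h', 3), ('p', 2)],
             [('x', 11), ('e', 2), ('l', 2)],
             [('r', 5), ('l', 5)],
             [('n', 15)]]})

def mygen(lst):
    # yield alternating column names: l1, n1, l2, n2, ...
    for item in lst:
        yield 'l{}'.format(item)
        yield 'n{}'.format(item)

df1 = pd.DataFrame([[b for a in row for b in a] for row in df.tod])
df1.columns = list(mygen(range(1, len(df1.columns) // 2 + 1)))

# the letter columns are object dtype, so fillna also replaces the None values
letter_cols = [c for c in df1.columns if c.startswith('l')]
df1[letter_cols] = df1[letter_cols].fillna('-')
```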
jezrael
  • Can you keep the original column? – Chuck Jul 04 '18 at 08:52
  • @Chuck - I think need `join` – jezrael Jul 04 '18 at 08:59
  • 1
    Works perfect - Thanks Jez – Chuck Jul 04 '18 at 09:01
  • Is there a way to auto cast the `n1,n2,n3...` to `int` while doing this? – Chuck Jul 04 '18 at 11:35
  • @Chuck - It is not possible, because `n` is string :( Only solution for prevent duplicated columns names is omit `df1.columns = list(mygen(range(1, len(df1.columns) // 2 + 1)))`. – jezrael Jul 04 '18 at 11:37
  • Oh sorry, I meant the values in the rows. At the moment, I end up with `'10'` `'6'` as strings in the column `'n1'`. Is there a way in the `b for a in row for b in a` to save them as `int`s? – Chuck Jul 04 '18 at 11:38
  • @Chuck - Unfortunately not :( And also problem with `NaN`s, possible solution for numeric `n` columns is `cols = df.filter(like='n').columns df[cols] = df[cols].astype(float)` – jezrael Jul 04 '18 at 11:42
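The workaround from the comment thread can be spelled out as a minimal sketch (the small example frame here is illustrative, not from the original answer): select the `n` columns with `filter(like='n')` and cast them to `float`, which tolerates the missing values:

```python
import pandas as pd

# toy frame mimicking the n-columns-as-strings situation from the comments
df1 = pd.DataFrame({'l1': ['a', 'x'], 'n1': ['10', '11'],
                    'l2': ['b', None], 'n2': ['6', None]})

# cast the number columns; float is used because NaN cannot live in a
# plain int column
cols = df1.filter(like='n').columns
df1[cols] = df1[cols].astype(float)
```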
import numpy as np

# the longest list determines how many (letter, number) column pairs are needed
n = max(len(row) for row in df.tod)
# flatten each list of tuples and pad short rows with ('-', NaN)
f = lambda l: sum(l, ()) + ('-', np.nan)*(n-len(l))
l = [list(f(row)) for row in df.tod]

ndf = pd.DataFrame(l, columns='l1 n1 l2 n2 l3 n3 l4 n4'.split())
#  l1  n1 l2   n2 l3   n3 l4   n4
#0  a  10  b  6.0  h  3.0  p  2.0
#1  x  11  e  2.0  l  2.0  -  NaN
#2  r   5  l  5.0  -  NaN  -  NaN
#3  n  15  -  NaN  -  NaN  -  NaN

df.join(ndf)
#
#                                 tod l1  n1 l2   n2 l3   n3 l4   n4
#0  [(a, 10), (b, 6), (h, 3), (p, 2)]  a  10  b  6.0  h  3.0  p  2.0
#1          [(x, 11), (e, 2), (l, 2)]  x  11  e  2.0  l  2.0  -  NaN
#2                   [(r, 5), (l, 5)]  r   5  l  5.0  -  NaN  -  NaN
#3                          [(n, 15)]  n  15  -  NaN  -  NaN  -  NaN
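A possible tweak (a sketch, not part of the original answer): build the column names from `n` instead of hardcoding them, so the code adapts to any number of tuples per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'tod': [[('a', 10), ('b', 6), ('h', 3), ('p', 2)],
             [('x', 11), ('e', 2), ('l', 2)],
             [('r', 5), ('l', 5)],
             [('n', 15)]]})

n = max(len(row) for row in df.tod)
f = lambda l: sum(l, ()) + ('-', np.nan) * (n - len(l))
rows = [list(f(row)) for row in df.tod]

# generate l1, n1, l2, n2, ... up to the widest row
cols = ['{}{}'.format(c, i) for i in range(1, n + 1) for c in ('l', 'n')]
ndf = pd.DataFrame(rows, columns=cols)
```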
Sunitha