1

I just joined my first kaggle competition. I have a table like this:

| Person_id | Children_id | Person_Photos                      | Children_Photos        |
|-----------|-------------|------------------------------------|------------------------|
| pid1      | cid1        | pid1_0.jpg, pid1_1.jpg, pid1_2.jpg | cid1_0.jpg, cid1_1.jpg |
| pid2      | cid1        | pid2_0.jpg, pid2_1.jpg, pid1_2.jpg | cid1_0.jpg, cid1_1.jpg |

I want to generate a row for each Person_Photos and Children_Photos match. E.g:

| Person_id | Children_id | Person_Photo | Children_Photo |
|-----------|-------------|--------------|----------------|
| pid1      | cid1        | pid1_0.jpg   | cid1_0.jpg     |
| pid1      | cid1        | pid1_0.jpg   | cid1_1.jpg     |
| pid1      | cid1        | pid1_1.jpg   | cid1_0.jpg     |
| pid1      | cid1        | pid1_1.jpg   | cid1_1.jpg     |
| pid1      | cid1        | pid1_2.jpg   | cid1_0.jpg     |
| pid1      | cid1        | pid1_2.jpg   | cid1_1.jpg     |
| pid2      | cid1        | pid2_0.jpg   | cid1_0.jpg     |
| pid2      | cid1        | pid2_0.jpg   | cid1_1.jpg     |
| pid2      | cid1        | pid2_1.jpg   | cid1_0.jpg     |
| pid2      | cid1        | pid2_1.jpg   | cid1_1.jpg     |
| pid2      | cid1        | pid2_2.jpg   | cid1_0.jpg     |
| pid2      | cid1        | pid2_2.jpg   | cid1_1.jpg     |

How can i achieve this using pandas? Thanks in advance for any input!

Benoni
  • 103
  • 10
  • 1
    https://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list – Simon Rogers May 15 '19 at 09:21
  • @SimonRogers That answer only covers a single column. I have 2 columns with multiple values here :/ – Benoni May 15 '19 at 09:32

2 Answers2

0
s = pd.DataFrame({'Person_id':['pid1','pid2'],
                  'Children_id':['cid1','cid1'],
                 'Person_Photos':["'pid1_0.jpg', 'pid1_1.jpg', 'pid1_2.jpg'","'pid2_0.jpg', 'pid2_1.jpg', 'pid1_2.jpg'"],
                 'Children_Photos':["'cid1_0.jpg', 'cid1_1.jpg'","'cid1_0.jpg', 'cid1_1.jpg'"]})


one=s.set_index('Person_id').Person_Photos.str.split(',', expand=True).stack().reset_index('Person_id')
one.columns=['Person_id','Person_Photos']
# print(one)
# print('*'*50)
two=s.set_index('Children_id').Children_Photos.str.split(',', expand=True).stack().reset_index('Children_id')
two.columns=['Children_id','Children_Photos']
# print(two)

one['tmp'] = 1
two['tmp'] = 1

pd.merge(one, two, on=['tmp'])

Output

   Person_idPerson_Photos  tmp  Children_id Children_Photos
0   pid1    'pid1_0.jpg'    1   cid1    'cid1_0.jpg'
1   pid1    'pid1_0.jpg'    1   cid1    'cid1_1.jpg'
2   pid1    'pid1_0.jpg'    1   cid1    'cid1_0.jpg'
3   pid1    'pid1_0.jpg'    1   cid1    'cid1_1.jpg'
4   pid1    'pid1_1.jpg'    1   cid1    'cid1_0.jpg'
5   pid1    'pid1_1.jpg'    1   cid1    'cid1_1.jpg'
6   pid1    'pid1_1.jpg'    1   cid1    'cid1_0.jpg'
7   pid1    'pid1_1.jpg'    1   cid1    'cid1_1.jpg'
8   pid1    'pid1_2.jpg'    1   cid1    'cid1_0.jpg'
9   pid1    'pid1_2.jpg'    1   cid1    'cid1_1.jpg'
10  pid1    'pid1_2.jpg'    1   cid1    'cid1_0.jpg'
11  pid1    'pid1_2.jpg'    1   cid1    'cid1_1.jpg'
12  pid2    'pid2_0.jpg'    1   cid1    'cid1_0.jpg'
13  pid2    'pid2_0.jpg'    1   cid1    'cid1_1.jpg'
14  pid2    'pid2_0.jpg'    1   cid1    'cid1_0.jpg'
15  pid2    'pid2_0.jpg'    1   cid1    'cid1_1.jpg'
16  pid2    'pid2_1.jpg'    1   cid1    'cid1_0.jpg'
17  pid2    'pid2_1.jpg'    1   cid1    'cid1_1.jpg'
18  pid2    'pid2_1.jpg'    1   cid1    'cid1_0.jpg'
19  pid2    'pid2_1.jpg'    1   cid1    'cid1_1.jpg'
20  pid2    'pid1_2.jpg'    1   cid1    'cid1_0.jpg'
21  pid2    'pid1_2.jpg'    1   cid1    'cid1_1.jpg'
22  pid2    'pid1_2.jpg'    1   cid1    'cid1_0.jpg'
23  pid2    'pid1_2.jpg'    1   cid1    'cid1_1.jpg'
vrana95
  • 511
  • 2
  • 10
  • Hey, thanks! Unfortunately i get MemoryError: on line pd.merge(one, two, on=['tmp']). I have 16 GB but guess i need more... Any way i can optimize this? – Benoni May 15 '19 at 09:49
  • can you share the total data size you have ? Number of columns and rows ? – vrana95 May 15 '19 at 10:03
  • 3362 rows and parents have up to 9 children, each children has up to 6 photos. it gets big real quick. I can share the csv if you'd like. – Benoni May 15 '19 at 10:05
0

You can apply Series twice:

one = df.apply(lambda x: pd.Series(x['Person_Photos']), axis=1).stack().reset_index(level=1, drop=True)
df = df.drop('Person_Photos', axis=1)
one.name = 'Person_Photos'
df = df.join(one)
two = df.apply(lambda x: pd.Series(x['Children_Photos']), axis=1).stack().reset_index(level=1, drop=True)
df = df.drop('Children_Photos', axis=1)
two.name = 'Children_Photos'
df = df.join(two)
mobelahcen
  • 414
  • 5
  • 22