I have a Pandas DataFrame with 1,475,957 rows × 2 columns that looks like this:

|          |       movies      |       actors       |
|:--------:|:-----------------:|:------------------:|
| 0        | Army of Darkness  | Bruce Campbell     |
| 1        | Army of Darkness  | Embeth Davidtz     |
| 2        | Army of Darkness  | Marcus Gilbert     |
| 3        | Army of Darkness  | Ian Abercrombie    |
| 4        | Army of Darkness  | Richard Grove      |
| …        | …                 | …                  |
| 45917207 | Sólo para Mujeres | Jose Ron           |
| 46057608 | Sólo para Mujeres | Paulina Goto       |
| 46198009 | Sólo para Mujeres | Darren Espanto     |
| 46338410 | Sólo para Mujeres | Juan Karlos Labajo |
| 46478811 | Sólo para Mujeres | Daniela Romo       |

I'd like to generate an edge list from it that links every pair of actors who appear in the same movie, like this:

|     |      source     |      target     |      edge label  |
|:---:|:---------------:|:---------------:|:----------------:|
| 0   | Bruce Campbell  | Embeth Davidtz  | Army of Darkness |
| 1   | Bruce Campbell  | Marcus Gilbert  | Army of Darkness |
| 2   | Bruce Campbell  | Ian Abercrombie | Army of Darkness |
| 3   | Bruce Campbell  | Richard Grove   | Army of Darkness |
| 4   | Embeth Davidtz  | Marcus Gilbert  | Army of Darkness |
| 5   | Embeth Davidtz  | Ian Abercrombie | Army of Darkness |
| 6   | Embeth Davidtz  | Richard Grove   | Army of Darkness |
| 7   | Marcus Gilbert  | Ian Abercrombie | Army of Darkness |
| 8   | Marcus Gilbert  | Richard Grove   | Army of Darkness |
| 9   | Ian Abercrombie | Richard Grove   | Army of Darkness |
| ... | ...             | ...             | ...              |
crocefisso
  • Your original data is 46M rows, so your target will be a **very very very big** dataframe. Are you sure that's what you want? Or rather, are you certain that you have enough RAM to store the result? – Quang Hoang Jun 29 '20 at 23:45
  • 1.5M rows actually (the index must be messy). I think my PC can handle the computation; if not, I'll use a server. – crocefisso Jun 29 '20 at 23:50
  • If you say so. This is essentially a self-merge, `df.merge(df, on='movies').query('actors_x < actors_y')`, and you can rename the columns as you wish. Remember that the result will have roughly `len(df)**2 / (2 * num_movies)` rows if actors are spread evenly across movies. – Quang Hoang Jun 29 '20 at 23:54
  • Wow, impressive! It seems to have worked well. Surprisingly, the computation was quick. The result is a DataFrame with 11,125,332 rows × 3 columns. Thanks! – crocefisso Jun 29 '20 at 23:59
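
For anyone landing here later, the self-merge suggested in the comments can be sketched as below. This is a minimal illustration on a toy frame, not a tuned solution: the column names `movies` and `actors` come from the question, while the names `pairs`, `edges`, `sizes` and `expected_rows` are only placeholders chosen for this sketch. It also assumes each actor is listed at most once per movie; if that may not hold, calling `df.drop_duplicates()` first would be a reasonable precaution.

```python
import pandas as pd

# Toy stand-in for the real 1.5M-row frame (same column names as in the question).
df = pd.DataFrame({
    "movies": ["Army of Darkness"] * 5,
    "actors": ["Bruce Campbell", "Embeth Davidtz", "Marcus Gilbert",
               "Ian Abercrombie", "Richard Grove"],
})

# Estimate the output size before merging: a movie with n actors yields n*(n-1)/2 pairs.
sizes = df.groupby("movies").size()
expected_rows = int((sizes * (sizes - 1) // 2).sum())

# Self-merge on the movie title: every actor is paired with every actor of the same movie.
# pandas adds the suffixes _x and _y to the overlapping "actors" column.
pairs = df.merge(df, on="movies")

# Keep each unordered pair once; the strict comparison also drops self-pairs.
edges = (
    pairs.query("actors_x < actors_y")
         .rename(columns={"actors_x": "source",
                          "actors_y": "target",
                          "movies": "edge label"})
         [["source", "target", "edge label"]]
         .reset_index(drop=True)
)

print(expected_rows, len(edges))
print(edges)
```

Note that the `actors_x < actors_y` filter orders each pair alphabetically, so `source` and `target` may be swapped relative to the table in the question (for example, Ian Abercrombie ends up as the source of the pair with Marcus Gilbert). For an undirected graph this should not matter.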
