
I have a pandas DataFrame `Network` with a network structure like this:

{'Sup': {0: 1002000157,
  1: 1002000157,
  2: 1002000157,
  3: 1002000157,
  4: 1002000157,
  5: 1002000157,
  6: 1002000157,
  7: 1002000157,
  8: 1002000157,
  9: 1002000157,
  10: 1002000157,
  11: 1002000157,
  12: 1002000157,
  13: 1002000382,
  14: 1002000382,
  15: 1002000382,
  16: 1002000382,
  17: 1002000382,
  18: 1002000382,
  19: 1002000382,
  20: 1002000382,
  21: 1002000382,
  22: 1002000382,
  23: 1002000382,
  24: 1002000382,
  25: 1002000382,
  26: 1002000382,
  27: 1002000382,
  28: 1002000382,
  29: 1002000382},
 'Cust': {0: 1002438313,
  1: 8039296054,
  2: 9003188096,
  3: 14900070991,
  4: 17005234747,
  5: 18006860724,
  6: 28000286091,
  7: 29009623382,
  8: 39000007702,
  9: 39004420023,
  10: 46000088397,
  11: 50000063751,
  12: 7000090017,
  13: 1900120936,
  14: 1900779883,
  15: 2000013994,
  16: 2001222824,
  17: 2003032125,
  18: 2900121723,
  19: 2900197555,
  20: 2902742641,
  21: 3000101113,
  22: 3000195031,
  23: 3000318054,
  24: 3900091301,
  25: 3911084436,
  26: 4900112325,
  27: 5900720933,
  28: 7000001703,
  29: 8000004881}}

I would like to reproduce this R command in Python (ideally without the kernel crashing):

NodesSharingSupplier <- inner_join(Network, Network,  by=c('Sup'='Sup'))

This is an SQL-style inner join (a self-join on Sup), so I fear it cannot be performed simply with an inner merge on Sup in Python.

How do I reproduce it in Python?


2 Answers


IIUC, you are looking for merge:

NodesSharingSupplier = Network.merge(Network, on='Sup', how='inner')
print(NodesSharingSupplier)

# Output
            Sup      Cust_x       Cust_y
0    1002000157  1002438313   1002438313
1    1002000157  1002438313   8039296054
2    1002000157  1002438313   9003188096
3    1002000157  1002438313  14900070991
4    1002000157  1002438313  17005234747
..          ...         ...          ...
453  1002000382  8000004881   3911084436
454  1002000382  8000004881   4900112325
455  1002000382  8000004881   5900720933
456  1002000382  8000004881   7000001703
457  1002000382  8000004881   8000004881

[458 rows x 3 columns]

You can remove cases where Cust_x == Cust_y by appending .query('Cust_x != Cust_y') after .merge(...).
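As a quick check, here is the self-join plus the `.query` filter on a small toy frame (the data here is illustrative, not the original Network):

```python
import pandas as pd

# Toy frame: two customers share supplier 1, one customer has supplier 2.
Network = pd.DataFrame({'Sup': [1, 1, 2], 'Cust': [10, 11, 20]})

# Self-join on 'Sup', then drop the trivial pairs where a customer
# is matched with itself.
pairs = (
    Network.merge(Network, on='Sup', how='inner')
           .query('Cust_x != Cust_y')
)
print(pairs)
```

Supplier 1 contributes the two ordered pairs (10, 11) and (11, 10); supplier 2 contributes only the self-pair (20, 20), which the filter removes.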

Input:

data = {'Sup': {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
        'Cust': {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
Network = pd.DataFrame(data)

More information: Pandas Merging 101

Update

The kernel dies after it. Consider that the Network database has 5 million observations

Use dask:

import dask.dataframe as dd

NetworkDD = dd.from_pandas(Network, npartitions=Network['Sup'].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='Sup').compute()
Corralien
  • @mozway the kernel dies after it. Consider that the Network database has 5 million observations – Lusian Apr 19 '23 at 07:30
  • @Lusian. Maybe you can use `groupby` to split the process. I will update my answer. – Corralien Apr 19 '23 at 07:32
  • @Lusian. You can actually test with `dask`. I updated my answer, can you check it please? If the kernel dies again, it means you have not enough memory on your computer. In this case, you have to store partial results after processing each group. – Corralien Apr 19 '23 at 07:49
  • it says no space left on device. Is there a way maybe to store the result directly on dropbox or avoid computations on the local machine? – Lusian Apr 19 '23 at 09:33
  • When does this message appear? During the merge? – Corralien Apr 19 '23 at 09:49
  • So the error is OsError: No Space left on Device – Lusian Apr 19 '23 at 09:56
  • @Corralin OSError: [Errno 28] No space left on device I think the error is due to the compute that forces non lazy computation. But still non lazy computations is what I need... – Lusian Apr 19 '23 at 09:59
  • 1
    You don't answer when does this message appear? – Corralien Apr 19 '23 at 12:03
  • when I perform the .compute(). So NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='NiuSup') works, but NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='NiuSup').compute() leads to the error message – Lusian Apr 19 '23 at 13:31
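Since .compute() materializes the whole result at once, one way to follow the suggestion in the comments (store partial results after processing each group) is a plain-pandas sketch that joins one supplier group at a time and appends each chunk to disk. The file name and toy data below are illustrative, not from the original question:

```python
import pandas as pd

# Toy stand-in for the 5-million-row Network frame.
Network = pd.DataFrame({'Sup': [1, 1, 2], 'Cust': [10, 11, 20]})

out_path = 'nodes_sharing_supplier.csv'  # hypothetical output file

# Self-join one supplier group at a time, appending each chunk to the
# CSV, so the full joined result never sits in memory at once.
first = True
for sup, grp in Network.groupby('Sup'):
    chunk = grp.merge(grp, on='Sup')
    chunk.to_csv(out_path, mode='w' if first else 'a',
                 header=first, index=False)
    first = False
```

If disk space on the local machine is the constraint (the "No space left on device" error), writing each chunk to an external or mounted drive instead of the working directory is the same pattern with a different out_path.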

Inner join: `merge(df1, df2)` will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify `merge(df1, df2, by = "CustomerId")` to make sure you are matching on only the fields you want. You can also use the `by.x` and `by.y` parameters if the matching variables have different names in the two data frames.
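Since the question asks for Python, the pandas equivalents of those R options are `on` (same key name on both sides) and `left_on`/`right_on` (different key names, like `by.x`/`by.y`). A minimal sketch with made-up column names:

```python
import pandas as pd

# Illustrative frames; the column names are assumptions, not from the question.
df1 = pd.DataFrame({'CustomerId': [1, 2, 3], 'Name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'CustId': [2, 3, 4], 'Order': [10, 20, 30]})

# Like R's merge(df1, df2, by = "CustomerId") when the key name matches:
#   df1.merge(df2, on='CustomerId')

# Like R's by.x / by.y when the key columns have different names:
result = df1.merge(df2, left_on='CustomerId', right_on='CustId', how='inner')
print(result)
```

Only CustomerId values 2 and 3 appear in both frames, so the inner join keeps those two rows.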