0

I have a dataframe df with column Items which contain item names in alphabetical order.

Items
-----
Apple
Ball
Bar
Cat

I want to join the data frame with itself to get two columns Item_x and Item_y such that in each row, the pairs of items (x,y) is unique in the sense that order in irrelevant i.e. the pair (Apple, Ball) will be considered a duplicate of the pair (Ball, Apple). So I only need to retain (Apple, Ball) because here the items are in alphabetical order and (Ball, Apple) is unwanted and must be deleted.

pd.merge(df,df, on='Items', how='outer')

does not work because it gives extra unwanted pairs such as (Apple, Apple) and (Ball, Apple)

Question: How to join data frame with itself on a column and retain only the rows with unique values the two columns which are in the correct alphabetical order?

Stacker
  • 271
  • 2
  • 8
  • I think you want a `cross` join – Jeremy Jul 31 '22 at 05:49
  • @Jeremy A cross join gives the simple cartesian product of Items with itself it does not eliminate unwanted records such as (Apple, Apple) and (Ball, Apple) – Stacker Jul 31 '22 at 06:01
  • Duplicate of [How to list each pair of tuple only once irrespective of column order in SQL and relational algebra?](https://stackoverflow.com/q/42674254/3404097) – philipxy Jul 31 '22 at 07:37
  • Please before considering posting: Pin down code issues via a [mre]. Read the manual/reference & google any error message & many clear, concise & precise phrasings of your question/problem/goal, with & without your particular names/strings/numbers, 'site:stackoverflow.com' & tags; read many answers. Reflect research in posts. SO/SE search is poor & literal & unusual, read the help. Google re googling/searching, including in Q&A at [meta] & [meta.se]. [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) [ask] [Help] – philipxy Jul 31 '22 at 07:38

2 Answers2

2

You're asking for combinations~

from itertools import combinations
out = pd.DataFrame(combinations(df.Items.unique(), 2), columns=['Item_x', 'Item_y'])
print(out)

Output:

  Item_x Item_y
0  Apple   Ball
1  Apple    Bar
2  Apple    Cat
3   Ball    Bar
4   Ball    Cat
5    Bar    Cat
BeRT2me
  • 12,699
  • 2
  • 13
  • 31
1

Another option is to compute the cross join and filter:

orig = pd.DataFrame.from_dict(
  {"Items": ["Apple", "Ball", "Bar", "Cat"]}
)

prod = pd.merge(orig, orig, how='cross')
new_df = pd.DataFrame()

l = len(orig.index)
rem_ind = [j+l*i for i in range(l) for j in range(i+1)]
combination = prod.loc[~prod.index.isin(rem_ind)]
print(combination)

Output:

   Items_x Items_y
1    Apple    Ball
2    Apple     Bar
3    Apple     Cat
6     Ball     Bar
7     Ball     Cat
11     Bar     Cat
Jeremy
  • 661
  • 7
  • 19