Eliminate Column Repetition in Pandas Dataframe

Question

I have a data frame, and I am trying to find all possible combinations of it with a fraction of itself. The following data frames are a much scaled-down version of the ones I am running. The first data frame (fruit1) is a fraction of the second data frame (fruit2).

FruitSubDF     FruitFullDF
apple           apple
cherry          cherry
banana          banana
                peach
                plum

By running the following code

 df1 = pd.DataFrame(list(product(fruitDF.iloc[0:3,0], fruitDF.iloc[0:5,0])), columns=['fruit1', 'fruit2'])

the output is

    Fruit1 Fruit2
0    apple  banana
1    apple  apple
2    apple  cherry
3    apple  peach
4    apple  plum
5   cherry banana
6   cherry apple
7   cherry cherry
.
.
18   banana banana
19   banana peach
20   banana plum

My problem is that I want to remove rows containing the same two fruits regardless of which fruit is in which column, as shown below. That is, I consider (apple, cherry) and (cherry, apple) to be the same. I am unsure of an efficient way to weed out the unwanted rows without iterrows, as most pandas functions I have found remove duplicates based on column order.

    Fruit1 Fruit2
 0   apple banana
 1   apple cherry
 2   apple apple
 3   apple peach
 4   apple plum
 5  cherry banana
 6  cherry cherry
 .
 .
 15  banana plum

I see that your output has 5 of each fruit in the first column, as expected. So we have apple in rows 0-4, cherry in rows 5-9, some gap and then banana in the last rows, I suppose. What's the output in between? I tried running something similar here but I got 15 rows right away. It still has duplicates, but this many rows got me confused. — leo.barddal, Aug 05 '20 at 18:29
Sorry, that was a miscount on my part. You're correct: the original should have 15 rows, and the adjusted one would then have 12. — kl9537, Aug 05 '20 at 18:33

score 2 · Accepted Answer · answered Aug 05 '20 at 18:23

First, I wrote a piece of code to replicate your DataFrame, adapted from another Stack Overflow answer.

import pandas as pd

Fruit1 = ['apple', 'cherry', 'banana']
Fruit2 = ['banana', 'apple', 'cherry']

# Build every (Fruit1, Fruit2) combination as a MultiIndex, then turn it into columns
index = pd.MultiIndex.from_product([Fruit1, Fruit2], names=["Fruit1", "Fruit2"])
df = pd.DataFrame(index=index).reset_index()

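Then, you can use lexicographical order to filter the DataFrame: keeping only the rows where Fruit1 <= Fruit2 retains one ordering of each pair, plus the rows where both fruits are the same.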

df[df['Fruit1']<=df['Fruit2']]

This gives the result you wanted to obtain.

EDIT: You edited your post, but this still seems to do the job.
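
One caveat with the comparison filter: it works for the data in the question because every fruit in the smaller frame sorts before the fruits that appear only in the larger one. If a Fruit1 value sorted after a value found only in Fruit2, that pair would be dropped even though its mirror image never exists. A minimal sketch of an order-independent alternative follows, assuming plain Python lists (`full` and `sub` are illustrative names, not from the question) in place of the original `fruitDF`. It sorts the two values within each row and keeps the first occurrence of each unordered pair.

```python
from itertools import product

import numpy as np
import pandas as pd

# Assumed stand-ins for the question's data: `sub` is the first three rows of `full`
full = ['apple', 'cherry', 'banana', 'peach', 'plum']
sub = full[:3]

# All 3 x 5 = 15 ordered combinations, as in the question
df = pd.DataFrame(list(product(sub, full)), columns=['Fruit1', 'Fruit2'])

# Sort the two values inside each row so (cherry, apple) and (apple, cherry)
# both become (apple, cherry); this is only used as a comparison key
key = pd.DataFrame(
    np.sort(df[['Fruit1', 'Fruit2']].to_numpy(), axis=1),
    index=df.index,
)

# Keep the first row seen for each unordered pair
deduped = df[~key.duplicated()].reset_index(drop=True)

print(len(df), len(deduped))
```

On this data the sketch goes from 15 rows to 12, which matches the counts agreed on in the comments under the question. Unlike the comparison filter, it leaves the surviving rows in their original column order.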

Can you tell us whether this still works after your last update? Thank you — Nathan e, Aug 05 '20 at 19:22

score 0 · Answer 2 · answered Aug 05 '20 at 17:54


You can use itertools to achieve this:

import itertools

import pandas as pd

fruits = ['banana', 'cherry', 'apple']
# Every ordered pair of two distinct fruits
pd.DataFrame(list(itertools.permutations(fruits, 2)), columns=['fruit1', 'fruit2'])
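
Note that `itertools.permutations` returns both orderings of each pair and never pairs a fruit with itself. As a hedged sketch of the difference, `itertools.combinations_with_replacement` yields each unordered pair once and includes the same-fruit pairs. This only covers pairing a single list with itself, not the subset-versus-full case in the question, and the names `ordered` and `unordered` are illustrative.

```python
import itertools

import pandas as pd

fruits = ['banana', 'cherry', 'apple']

# permutations: (banana, cherry) and (cherry, banana) both appear, no (banana, banana)
ordered = pd.DataFrame(
    list(itertools.permutations(fruits, 2)),
    columns=['fruit1', 'fruit2'],
)

# combinations_with_replacement: each unordered pair once, same-fruit pairs included
unordered = pd.DataFrame(
    list(itertools.combinations_with_replacement(fruits, 2)),
    columns=['fruit1', 'fruit2'],
)

print(ordered)
print(unordered)
```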

Tom Ron


Ah, sorry. I clarified my question above, so this method would not work. – kl9537 Aug 05 '20 at 18:06
