Is there a function or package in R for identifying similar rows within a dataframe?

Question

I had four separate .csv files containing records for appetizers, meals, and sides, and I combined the values from each file into all possible combinations in R.

My table looks like this:

| Entree 1       | Entree 2       | Side       | Appetizer    |
|----------------|----------------|------------|--------------|
| Orange Chicken | Sesame Chicken | Fried Rice | Crab Rangoon |
| Sesame Chicken | Orange Chicken | Fried Rice | Crab Rangoon |

The problem is that for meals where the customer can pick two entrees, I get rows that are practically the same. The first row of the example (Orange Chicken / Sesame Chicken / Fried Rice / Crab Rangoon) is the same meal as the second row, just with the two entrees swapped.

Is there a function in R that can create a dataframe of unique combos, treating the two entrees as interchangeable?

I have usually done this in Excel, but now I have a dataset of 46,000+ rows that I am trying to simplify.

In the example above, I would like to remove the second row and keep the first row.
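
To make the desired result concrete, here is a minimal base R sketch of the kind of output I am after. It assumes the two entree columns are character vectors (not factors); the column names `entree1`, `entree2`, `side`, and `appetizer` are placeholders, not the names in my real data. It builds an order-independent key with `pmin()`/`pmax()` and keeps the first occurrence of each key with `duplicated()`.

```r
# Toy data mirroring the example table (column names are assumptions).
meals <- data.frame(
  entree1   = c("Orange Chicken", "Sesame Chicken"),
  entree2   = c("Sesame Chicken", "Orange Chicken"),
  side      = c("Fried Rice", "Fried Rice"),
  appetizer = c("Crab Rangoon", "Crab Rangoon"),
  stringsAsFactors = FALSE
)

# Order-independent key: pmin()/pmax() return the alphabetically
# first/last entree in each row, so swapped entrees give the same key.
key <- data.frame(
  first     = pmin(meals$entree1, meals$entree2),
  second    = pmax(meals$entree1, meals$entree2),
  side      = meals$side,
  appetizer = meals$appetizer,
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each key.
unique_meals <- meals[!duplicated(key), ]
print(unique_meals)
```

This only handles exact matches up to the order of the two entrees; it would not catch spelling variants or near-matches.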

https://stackoverflow.com/questions/22980423/r-find-duplicated-rows-regardless-of-order ; https://stackoverflow.com/questions/9028369/removing-duplicate-combinations-irrespective-of-order — user20650, Dec 30 '18 at 22:50
Do you want one copy of each unique combination or do you also want to know how many there are for each unique combination? Also do you only care about exact matches (regardless of order) or near exact? — Elin, Dec 30 '18 at 23:37
This question is not a duplicate of "finding duplicate rows" in a data frame. Why are the suggestions pointing to these answers??? Perhaps it's a duplicate of: https://stackoverflow.com/questions/28460051/finding-similar-rows-not-duplicates-in-a-dataframe-in-r — Marian Minar, Dec 31 '18 at 01:12
@Alex I think you have two options. (1) You need to define a simple rule that will give you a simple way of identifying "similar" rows (perhaps two meals are "the same" if they differ by one ingredient? or ... if the "words" intersection of two meals is >90% for both?). (2) use `stringdist::clustering`, `quanteda::textstat_dist`, or similar to quantify the difference of two meals — Marian Minar, Dec 31 '18 at 01:19
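
As a rough illustration of option (1) in the comment above, the word overlap between two meals could be measured with a Jaccard similarity over their word sets. This is only a sketch in base R: `meal_words()` and `jaccard()` are hypothetical helper names, the third row is made-up data, and Jaccard similarity is one possible choice of rule, not necessarily the one the commenter had in mind.

```r
# Hypothetical helper: the set of lower-cased words across all items in a meal.
meal_words <- function(row) {
  unique(tolower(unlist(strsplit(row, "\\s+"))))
}

# Hypothetical helper: Jaccard similarity = |intersection| / |union|.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

row1 <- c("Orange Chicken", "Sesame Chicken", "Fried Rice", "Crab Rangoon")
row2 <- c("Sesame Chicken", "Orange Chicken", "Fried Rice", "Crab Rangoon")
row3 <- c("Orange Chicken", "Sesame Chicken", "White Rice", "Egg Roll")  # made-up row

sim_12 <- jaccard(meal_words(row1), meal_words(row2))  # same words, different order
sim_13 <- jaccard(meal_words(row1), meal_words(row3))  # partial overlap

print(c(sim_12 = sim_12, sim_13 = sim_13))
```

A threshold on this score (for example, the 90% mentioned in the comment) would then decide which rows count as "similar". Comparing every pair of rows this way grows quadratically with the number of rows, which may matter for 46,000+ rows.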

0 Answers