R - Finding identical rows or rows that only differ by x columns

Question

I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:

Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,

The below image makes it easier to see what it represents:

What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differs by no more than one.

The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)

ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))

This post explained how to find duplicates and it seems to work

duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])

This returns the following indexes:

1 2 4 5 6 7

That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.

Again, that seems to work ok but if the modes of transportation and number of people are significant, it becomes really difficult finding/knowing not just which rows are duplicates, but which indexes actually matched. In this case that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.

Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?

Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?

Happy to elaborate further and very grateful for any and all help.

Please add your data using `dput` i.e `dput(ex)` and show expected output for the shared data. — Ronak Shah, May 29 '20 at 12:25
So your ideal output would be a list of all the people (now rows after we transpose) and for each person a column that showed who they were identical to, and another column that showed who they were close to +/- 1? Roughly how many people and are they unique (only one Paul) and how many modes of transportation? — Chuck P, May 29 '20 at 13:51

Chuck P · Accepted Answer · 2020-06-01T14:29:11.290

1

library(dplyr)
library(tidyr)

ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", )

ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")

# head(ext)
ext <- ext %>%
  group_by(person) %>%
  filter(value == "X") %>%
  summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
  arrange(desc(Modalities), Which) %>% 
  mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})


ext
#> # A tibble: 6 x 4
#>   person Modalities Which                     IdenticalGrp
#>   <chr>       <int> <chr>                            <int>
#> 1 Paul            3 Scooter, Skateboard, Boat            1
#> 2 Don             2 Car, Skateboard                      2
#> 3 Mike            2 Car, Skateboard                      2
#> 4 Mary            2 Scooter, Skateboard                  3
#> 5 Peter           2 Scooter, Skateboard                  3
#> 6 Stan            2 Scooter, Skateboard                  3

To get a membership list in any particular IndenticalGrp you can just pull like this.

ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary"  "Peter" "Stan"

edited Jun 01 '20 at 14:29

answered May 29 '20 at 14:49

Chuck P

3,862
3
9
20

1

Hi Chuck P, Thanks heaps for your reply, much appreciated! You're spot on with your earlier comment but I don't even need to know the modes of transportation. With the example I posted, I can see which rows are duplicates but not that Peter, Paul and Stan matched. Names are distinct, as are modes of transportation. Basically you've nailed it with this " So your ideal output would be a list of all the people (now rows after we transpose) and for each person a column that showed who they were identical to, and another column that showed who they were close to +/- 1? " – c95mbq May 31 '20 at 11:48
Okay I'm going to edit my answer to solve the identical question. Without understanding your real dimensions (unique people by unique modes of transportation) I'm going to hold off on the plus or minus 1 question. You're probably going to better off just doing that on a summary matrix and then applying it. – Chuck P Jun 01 '20 at 14:26
Thanks Chuck P, much appreciated. I'll give your suggestion a try in a minute and post back - I really am so green with R. I think my examples were a poor attempt at simplifying things and I should have used a real example instead. This relates to an unwieldy security model with a large number of positions (unique) and security groups (also unique). Each position is associated with one or more security groups. There are positions that have exactly the same security groups and they should be consolidated. Also if they only differ by say one security group, maybe they too should be considered – c95mbq Jun 01 '20 at 21:24
Can't thank you enough Chuck P, that's absolutely brilliant. I'll try to find something on summary matrices and see if the plus/minus 1 is possible, that's a great tip. Again, thanks for all your help! – c95mbq Jun 01 '20 at 21:38
You're welcome. This will definitely find the identicals and help you sort the many's which should inform your thinking about consolidation – Chuck P Jun 01 '20 at 21:43

R - Finding identical rows or rows that only differ by x columns

1 Answers1