Self iteration in R

Question

I have a column in excel (created a sample data below) called Name that is in the format below

Name(Column Header)
A,Y - B,X
C,Q - W,R
and list goes on and
B,X - A,Y
W,R - C,Q
T,U - K,M

each row has corresponding 20 other columns. I need to remove all the rows that are present in Name column in reverse format. Such as "B,X - A,Y" and "W,R - C,Q" and all its corresponding column values but the rows "A,Y - B,X" and "C,Q - W,R" should not be deleted.

What does your data look like when you import it into R? Is it a data frame? How many column? Is `"A,Y"` one column and `"B,X"` the next or is `"A,Y - B,X"` one column? Or is it 4 columns? Please share sample data in valid R syntax. — Gregor Thomas, May 08 '18 at 18:10
Yes, I am planning to import the excel data in data frame. Regarding the value position, it can be anywhere in that column. Don't have a specific fixed value position. "A,Y - B,X" is one value — Joe, May 08 '18 at 18:11
I'm not asking *"are you planning to put the data in R?"*, I'm asking you to share the data in a way that makes it easy for people to help you: something **we** can copy/paste into R. See [how to make a reproducible example in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) if you need guidance - `dput()` or code to simulate the data makes it easy. — Gregor Thomas, May 08 '18 at 18:15
structure(list(Name = c("A,Y - B,X", "C,Q - W,R", "E,F - B,T", "E,T - F, B", "Q,P - P,O", "B,X - A,Y", "W,R - C,Q", "T,U - K,M" )), .Names = "Name", class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L)) — Joe, May 08 '18 at 18:58

Gregor Thomas · Accepted Answer · 2018-05-08T21:57:28.693

1

I'd do it like this:

library(tidyr)
library(dplyr)
dat %>% separate(Name, into = c("pair1", "pair2"), sep = " - ", remove = FALSE) %>%
    mutate(p1 = pmin(pair1, pair2), p2 = pmax(pair1, pair2)) %>%
    distinct(p1, p2, .keep_all = TRUE) %>%
    select(-pair1, -pair2, -p1, -p2)
# # A tibble: 6 x 1
#         Name
#        <chr>
# 1  A,Y - B,X
# 2  C,Q - W,R
# 3  E,F - B,T
# 4 E,T - F, B
# 5  Q,P - P,O
# 6  T,U - K,M

Separate the pairs, use pmin and pmax to sort them consistently, and finally deduplicate on the sorted pairs (keeping the first occurrence only).

edited May 08 '18 at 21:57

answered May 08 '18 at 19:11

Gregor Thomas

136,190
20
167
294

Thanks Gregor! A quick question, will it ignore the cases where Let's say we have a rows Q,T - B,N and A,Y - N,O – Joe May 08 '18 at 19:19
Yup. Only if both pairs match a previous row will it remove the row. But don't take my word for it - add another row to your sample data and see! – Gregor Thomas May 08 '18 at 19:26
Hi Gregor, I tried using the above code for my actual whole database, but it's not removing all the reversed values, just a few. – Joe May 08 '18 at 19:59
As long as the space is consistent, it shouldn't matter. Run up to each line and see that it looks to be working. If you need help debugging, share a reproducible example that illustrates the problem. – Gregor Thomas May 08 '18 at 20:02
I have updated the dput code in the question. For that data, i am not able to remove all the duplicates from the above code. For example- after running the code, Ut, Y - Pa, J still exists in the column. There are similar cases to this one in the final output – Joe May 08 '18 at 20:17
@Joe Well, I'm very confused about why the original didn't work. If I can't figure it out I may need to ask a question myself about it. But something about the `filter(!duplicated())` was a problem. Switching to `dplyr::distinct` seems to fix it. Answer updated. – Gregor Thomas May 08 '18 at 21:58

score 0 · Answer 2 · answered May 08 '18 at 18:28

0

You can use the strsplit function to create two new columns in your dataframe - one column of the stuff to the left of the "-" and one column for the stuff to the right. Then you just subset the dataframe to keep the rows that you want.

answered May 08 '18 at 18:28

Adrian Martin

780
7
21

I need to remove the reversed duplicates rows from that column. By splitting it, I am not sure how will I identify the rows which are reversed – Joe May 08 '18 at 18:33
You can then add an index column (starting at 1, to however many rows you have). Then you can say that if a row from column B (after the hyphen) is duplicated in column A (before the hyphen), drop which ever row has the higher index value. – Adrian Martin May 09 '18 at 18:50

Self iteration in R

2 Answers2