0

I have a column in excel (created a sample data below) called Name that is in the format below

Name(Column Header)
A,Y - B,X
C,Q - W,R
and list goes on and
B,X - A,Y
W,R - C,Q
T,U - K,M

each row has corresponding 20 other columns. I need to remove all the rows that are present in Name column in reverse format. Such as "B,X - A,Y" and "W,R - C,Q" and all its corresponding column values but the rows "A,Y - B,X" and "C,Q - W,R" should not be deleted.

Joe
  • 183
  • 5
  • 16
  • 2
    What does your data look like when you import it into R? Is it a data frame? How many column? Is `"A,Y"` one column and `"B,X"` the next or is `"A,Y - B,X"` one column? Or is it 4 columns? Please share sample data in valid R syntax. – Gregor Thomas May 08 '18 at 18:10
  • Yes, I am planning to import the excel data in data frame. Regarding the value position, it can be anywhere in that column. Don't have a specific fixed value position. "A,Y - B,X" is one value – Joe May 08 '18 at 18:11
  • 3
    I'm not asking *"are you planning to put the data in R?"*, I'm asking you to share the data in a way that makes it easy for people to help you: something **we** can copy/paste into R. See [how to make a reproducible example in R](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) if you need guidance - `dput()` or code to simulate the data makes it easy. – Gregor Thomas May 08 '18 at 18:15
  • structure(list(Name = c("A,Y - B,X", "C,Q - W,R", "E,F - B,T", "E,T - F, B", "Q,P - P,O", "B,X - A,Y", "W,R - C,Q", "T,U - K,M" )), .Names = "Name", class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L)) – Joe May 08 '18 at 18:58

2 Answers2

1

I'd do it like this:

library(tidyr)
library(dplyr)
dat %>% separate(Name, into = c("pair1", "pair2"), sep = " - ", remove = FALSE) %>%
    mutate(p1 = pmin(pair1, pair2), p2 = pmax(pair1, pair2)) %>%
    distinct(p1, p2, .keep_all = TRUE) %>%
    select(-pair1, -pair2, -p1, -p2)
# # A tibble: 6 x 1
#         Name
#        <chr>
# 1  A,Y - B,X
# 2  C,Q - W,R
# 3  E,F - B,T
# 4 E,T - F, B
# 5  Q,P - P,O
# 6  T,U - K,M

Separate the pairs, use pmin and pmax to sort them consistently, and finally deduplicate on the sorted pairs (keeping the first occurrence only).

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thanks Gregor! A quick question, will it ignore the cases where Let's say we have a rows Q,T - B,N and A,Y - N,O – Joe May 08 '18 at 19:19
  • Yup. Only if both pairs match a previous row will it remove the row. But don't take my word for it - add another row to your sample data and see! – Gregor Thomas May 08 '18 at 19:26
  • Hi Gregor, I tried using the above code for my actual whole database, but it's not removing all the reversed values, just a few. – Joe May 08 '18 at 19:59
  • As long as the space is consistent, it shouldn't matter. Run up to each line and see that it looks to be working. If you need help debugging, share a reproducible example that illustrates the problem. – Gregor Thomas May 08 '18 at 20:02
  • I have updated the dput code in the question. For that data, i am not able to remove all the duplicates from the above code. For example- after running the code, Ut, Y - Pa, J still exists in the column. There are similar cases to this one in the final output – Joe May 08 '18 at 20:17
  • @Joe Well, I'm very confused about why the original didn't work. If I can't figure it out I may need to ask a question myself about it. But something about the `filter(!duplicated())` was a problem. Switching to `dplyr::distinct` seems to fix it. Answer updated. – Gregor Thomas May 08 '18 at 21:58
0

You can use the strsplit function to create two new columns in your dataframe - one column of the stuff to the left of the "-" and one column for the stuff to the right. Then you just subset the dataframe to keep the rows that you want.

Adrian Martin
  • 780
  • 7
  • 21
  • I need to remove the reversed duplicates rows from that column. By splitting it, I am not sure how will I identify the rows which are reversed – Joe May 08 '18 at 18:33
  • You can then add an index column (starting at 1, to however many rows you have). Then you can say that if a row from column B (after the hyphen) is duplicated in column A (before the hyphen), drop which ever row has the higher index value. – Adrian Martin May 09 '18 at 18:50