I want to keep only the rows of a data frame in which two given columns contain the same value. For example, given:

df <- data.frame(
  BGC1 = c("BGC1", "BGC1", "BGC1", "BGC2", "BGC2", "BGC2", "BGC3", "BGC3", "BGC3", "BGC4", "BGC4", "BGC4"),
  BGC2 = c("BGC2", "BGC3", "BGC4", "BGC1", "BGC3", "BGC4", "BGC1", "BGC2", "BGC4", "BGC1", "BGC2", "BGC3"),
  Family1 = c("Strepto_10", "Strepto_20", "Strepto_30", "Strepto_20", "Strepto_20", "Strepto_50", "Strepto_20", "Strepto_30", "Strepto_30", "Strepto_30", "Strepto_50", "Strepto_40"),
  Family2 = c("Strepto_10", "Strepto_10", "Strepto_10", "Strepto_20", "Strepto_20", "Strepto_20", "Strepto_30", "Strepto_30", "Strepto_30", "Strepto_40", "Strepto_40", "Strepto_40")
)

Example DF

BGC1    BGC2    Family1          Family2
BGC1    BGC2    Strepto_10       Strepto_10
BGC1    BGC3    Strepto_20       Strepto_10
BGC1    BGC4    Strepto_30       Strepto_10
BGC2    BGC1    Strepto_20       Strepto_20
BGC2    BGC3    Strepto_20       Strepto_20
BGC2    BGC4    Strepto_50       Strepto_20
BGC3    BGC1    Strepto_20       Strepto_30
BGC3    BGC2    Strepto_30       Strepto_30
BGC3    BGC4    Strepto_30       Strepto_30
BGC4    BGC1    Strepto_30       Strepto_40
BGC4    BGC2    Strepto_50       Strepto_40
BGC4    BGC3    Strepto_40       Strepto_40

I want to keep the rows where Family1 and Family2 are the same, for example:

Expected Output

BGC1    BGC2    Family1          Family2
BGC1    BGC2    Strepto_10       Strepto_10
BGC2    BGC1    Strepto_20       Strepto_20
BGC2    BGC3    Strepto_20       Strepto_20
BGC3    BGC2    Strepto_30       Strepto_30
BGC3    BGC4    Strepto_30       Strepto_30
BGC4    BGC3    Strepto_40       Strepto_40
  • Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557 (and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `dput(head(x))` or `data.frame(...)`) directly. – r2evans May 27 '20 at 08:08
  • `mydata[ mydata$col0 %in% c(dat1$col1, dat2$col2),]` – r2evans May 27 '20 at 08:10
  • I really don't know, your image is blurry, not R, and ... an image. I won't spend time trying to transcribe images to data that you have as actual data on your computer. – r2evans May 27 '20 at 08:18
  • Here are some good pointers on what to include in a question to make it reproducible and easy for others to "play" with: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans May 27 '20 at 08:18
  • You don't have to *delete* the question, just remove the image and add usable data. – r2evans May 27 '20 at 08:25
  • What is your expected output? Do you want to keep rows where same values of `Family1` and `Family2` occur more than once? So something like this `df %>% group_by(Family1, Family2) %>% filter(n() >= 2)` ? – Ronak Shah May 27 '20 at 08:41
  • I just want to keep rows that have same matching values in family 1 and 2 in general, it does not matter on their number of occurrences :) – bioinformatics_student May 27 '20 at 08:44
  • For the data shared (`df`), can you update your post with expected output? – Ronak Shah May 27 '20 at 08:45
  • @ronaksha I have added my expected output, thanks once again – bioinformatics_student May 27 '20 at 08:48

2 Answers

You can subset with `[`, keeping the rows where `df$Family1 == df$Family2`.

df[df$Family1 == df$Family2,]
#   BGC1 BGC2    Family1    Family2
#1  BGC1 BGC2 Strepto_10 Strepto_10
#4  BGC2 BGC1 Strepto_20 Strepto_20
#5  BGC2 BGC3 Strepto_20 Strepto_20
#8  BGC3 BGC2 Strepto_30 Strepto_30
#9  BGC3 BGC4 Strepto_30 Strepto_30
#12 BGC4 BGC3 Strepto_40 Strepto_40
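
One caveat, offered as a sketch rather than as part of the original answer: if either column can contain `NA`, the comparison yields `NA` for that row, and `[` then returns an all-`NA` row in its place. Wrapping the comparison in `which()` keeps only the positions that are `TRUE`. The data frame `df_na` below is hypothetical and exists only to illustrate the point; it is not the data from the question.

```r
# Hypothetical data with a missing value, for illustration only
df_na <- data.frame(
  Family1 = c("Strepto_10", NA, "Strepto_30"),
  Family2 = c("Strepto_10", "Strepto_20", "Strepto_40"),
  stringsAsFactors = FALSE
)

# Plain logical indexing returns an extra all-NA row where the comparison is NA
with_na <- df_na[df_na$Family1 == df_na$Family2, ]

# which() returns only the TRUE positions, so the NA comparison is dropped
same_family <- df_na[which(df_na$Family1 == df_na$Family2), ]
```

With the question's data, which has no missing values, both forms should give the same rows.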
GKi

You could subset where Family1 and Family2 are equal.

subset(df, Family1 == Family2)

#   BGC1 BGC2    Family1    Family2
#1  BGC1 BGC2 Strepto_10 Strepto_10
#4  BGC2 BGC1 Strepto_20 Strepto_20
#5  BGC2 BGC3 Strepto_20 Strepto_20
#8  BGC3 BGC2 Strepto_30 Strepto_30
#9  BGC3 BGC4 Strepto_30 Strepto_30
#12 BGC4 BGC3 Strepto_40 Strepto_40

Using dplyr's filter:

dplyr::filter(df, Family1 == Family2)
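
A side note that may matter on older setups, sketched under the assumption that the two columns are factors: before R 4.0.0, `data.frame()` converted strings to factors by default, and `==` refuses to compare two factors whose level sets differ. Comparing the values as character avoids that. The data frame `df_fac` is hypothetical and only illustrates the situation.

```r
# Hypothetical factor columns with different level sets, for illustration only
df_fac <- data.frame(
  Family1 = factor(c("Strepto_10", "Strepto_20", "Strepto_30")),
  Family2 = factor(c("Strepto_10", "Strepto_10", "Strepto_40"))
)

# Compare as character so the differing factor levels do not raise an error
same_family <- subset(df_fac, as.character(Family1) == as.character(Family2))
```

If the columns are already character vectors, as they are by default in R 4.0.0 and later, the conversion is unnecessary but harmless.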
Ronak Shah