Finding duplicated words

Question

I have a data frame with several distinct columns.

Each column has several different gene names.

I would like to know:

- whether any gene names are repeated anywhere in the data frame, and
- if possible, how many times each gene is repeated.

This is part of my data frame:

DS_struct <-
  structure(
    list(
      `12941` = c("", "", "", "", ""),
      `14520` = c("ABAT",
                  "ABCA6", "ABCA8", "ABCB4", "ABCG2"),
      `22405` = c("ACSL4", "ADFP",
                  "ADH1A", "ADH1B", "ADH1C"),
      `25097` = c("AATF", "ABCB8", "ABLIM3",
                  "ACCN2", "ACSM3"),
      `33006` = c("ADAMTS1", "ADAMTS13", "ADGRA3",
                  "ADGRG7", "ADH1B"),
      `36376` = c("ACAA2", "ACACB", "ACAD11", "ACOT12",
                  "ACSL1"),
      `39791` = c("ABAT", "ACACB", "ACSL4", "ACSM5", "ADAMTSL2"),
      `41804` = c("A2M-AS1", "A2MP1", "AADAT", "ABCA8", "ACADL"),
      `46408` = c("A1CF", "A2M", "AADAT", "AASS", "ABAT"),
      `50579` = c("AASS",
                  "ABAT", "ABCA8", "ABCB10", "ABLIM2"),
      `55191` = c("", "",
                  "", "", ""),
      `57555` = c("", "", "", "", ""),
      `57957` = c("ACSL4",
                  "ACSM3", "ADAMTSL2", "ADGRG2", "ADH1B"),
      `57958` = c("",
                  "", "", "", ""),
      `58043` = c("", "", "", "", ""),
      `60502` = c("ABAT",
                  "ABCA6", "ABCA8", "ABCB4", "ABT1"),
      `62232` = c("AADAT",
                  "AASS", "AASS", "ABCA8", "ABCC4"),
      `76427` = c("ADGRG7",
                  "ADIRF", "ALPL", "ANXA10", "ASPDH"),
      `84005` = c("", "",
                  "", "", ""),
      `84402` = c("AADAT", "AASS", "ABAT", "ABCA6",
                  "ABCA8"),
      `89186` = c("", "", "", "", ""),
      `101685` = c("AADAT",
                   "AASS", "ABAT", "ABCA9", "ABCC4"),
      `101728` = c("5-??", "5_8S_rRNA",
                   "A1BG", "A2M", "AACS"),
      `113996` = c("", "", "", "", ""),
      `117361` = c("", "", "", "", ""),
      `121248` = c("ABI3BP",
                   "ACADL", "ACOT12", "ACSL4", "ACSM3"),
      `136247` = c("", "",
                   "", "", ""),
      `138178` = c("", "", "", "", ""),
      `166163` = c("",
                   "", "", "", "")
    ),
    row.names = 2:6,
    class = "data.frame"
  )

Please provide an example of your data set so that we have more to go on. — Phil, Mar 31 '22 at 14:19
Phil, I have inserted a table, as an example. So, please, see if now you could help me solve the issue. — Fábio Seiva, Mar 31 '22 at 17:26
[See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with, including data we can work with, not a picture of a table. You should also include the type of calculation you're trying to do, because it's unclear from just your description. — camille, Mar 31 '22 at 17:55
Roughly, I would do something like `mydf |> dplyr::mutate(id = row_number()) |> tidyr::pivot_longer(-id) |> dplyr::group_by(id) |> janitor::get_duplicates(value)` — Phil, Mar 31 '22 at 18:51
Camille, could you please inform me what else I need to provide, in order to get my question reopened? — Fábio Seiva, Apr 01 '22 at 14:55
@Phil ``get_dupes()`` doesn't recognise groups so you can just do a column-wise operation like ``janitor::clean_names(DS_struct) %>% janitor::get_dupes(everything())`` — user438383, Apr 01 '22 at 15:52
@Phil, thanks so much for your time and attention. I did as you suggested and had this output message: "No duplicate combinations found of: x12941, x14520, x22405, x25097, x33006, x36376, x39791, x41804, x46408, ... and 20 other variables". But as you can see in the data frame, there are duplicated gene names (for example, the gene AADAT appears in columns 25097, 36376, and 39791). Would you have another suggestion? — Fábio Seiva, Apr 01 '22 at 19:47
@FábioSeiva Try `DS_struct |> dplyr::mutate(id = dplyr::row_number()) |> tidyr::pivot_longer(-id) |> dplyr::count(value) |> dplyr::arrange(dplyr::desc(n))` — Phil, Apr 02 '22 at 01:53
`library(tidyverse) DS_struct %>% pivot_longer( cols = everything(), names_to = 'distinct_columns', values_to = 'gene_names' ) %>% filter(gene_names != "") %>% group_by(gene_names) %>% add_count() %>% distinct(gene_names, .keep_all = TRUE) %>% ggplot(aes(x=fct_reorder(gene_names, n), y=n, fill=distinct_columns)) + geom_col()+ coord_flip()+ xlab("duplicated gene names")+ geom_text(aes(label =n), hjust = -0.5)+ theme_classic()` — TarJae, Apr 02 '22 at 22:49

score 0 · Answer 1 · answered Apr 04 '22 at 02:19

First of all, let's convert those blank values to NAs. That way we won't be counting blanks as actual genes when we go to count them up.

# Replace every empty string with NA (logical-matrix indexing)
DS_struct[DS_struct == ""] <- NA
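
To see what this replacement does, here is a minimal sketch on a small made-up data frame. The name `toy` and its contents are assumptions for illustration only, not part of the original data.

```r
# Hypothetical toy data frame standing in for DS_struct
toy <- data.frame(
  a = c("ABAT", "", "AASS"),
  b = c("", "AASS", "ABCA8"),
  stringsAsFactors = FALSE
)

# toy == "" is a logical matrix; assigning through it
# turns every empty-string cell into NA
toy[toy == ""] <- NA

# Number of missing cells per column after the replacement
colSums(is.na(toy))
```

After this step the empty cells are `NA`, so they no longer look like a gene called `""`.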

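Now we can count how many times each gene name occurs in the whole data frame. `unlist()` flattens all columns into a single vector, and `table()` ignores `NA` by default.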

gene_counts <- sort(table(unlist(DS_struct)))
gene_counts
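
As a rough illustration of the counting step, the sketch below applies the same `table(unlist(...))` idea to a small made-up data frame. The name `toy` and its values are hypothetical, and `decreasing = TRUE` is my own addition so the most frequent gene comes first.

```r
# Hypothetical toy data frame with NA in place of blanks
toy <- data.frame(
  a = c("ABAT", NA, "AASS"),
  b = c(NA, "AASS", "ABCA8"),
  stringsAsFactors = FALSE
)

# unlist() flattens the columns into one character vector;
# table() tallies each distinct value and drops NA by default
gene_counts <- sort(table(unlist(toy)), decreasing = TRUE)
gene_counts
```

Here `AASS` is the only name that occurs twice, so it is listed first.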

We can test if there are repeated gene names in the data frame.

repeated_genes <- any(gene_counts > 1)
repeated_genes

We can also see which gene names are repeated, along with their counts.

gene_counts[gene_counts > 1]
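
Note that these counts are per cell, so a gene listed twice within one column (as `AASS` is in column `62232`) counts twice. If you instead want the number of distinct columns a gene appears in, one possible sketch is to de-duplicate within each column before counting. The data frame `toy` and the names `total` and `n_cols` below are hypothetical.

```r
# Hypothetical toy data frame; "AASS" occurs twice in c1 and once in c2
toy <- data.frame(
  c1 = c("AASS", "AASS", "ABAT"),
  c2 = c("AASS", "ABCA8", NA),
  stringsAsFactors = FALSE
)

# Total occurrences across all cells (NA is dropped by table())
total <- table(unlist(toy))

# Number of distinct columns each gene appears in:
# unique() within each column first, then count across columns
n_cols <- table(unlist(lapply(toy, unique)))

total["AASS"]
n_cols["AASS"]
```

In this toy example `AASS` has a total count of 3 but appears in only 2 columns.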
