How to parse a df to group elements and remove duplicates

Question

I have a data frame (df) that looks like this:

structure(list(A = c("KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism", 
"KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism", 
"KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism", 
"KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism"
), B = c("KEGG 09101 PATHWAY Carbohydrate metabolism", "KEGG 09101 PATHWAY Carbohydrate metabolism", 
"KEGG 09103 PATHWAY Lipid metabolism", "KEGG 09105 PATHWAY Amino acid metabolism", 
"KEGG 09108 PATHWAY Metabolism of cofactors and vitamins", "KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism", 
"KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism", 
"KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism"
), C = c("KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis", "KEGG 00620 PATHWAY Pyruvate metabolism", 
"KEGG 00071 PATHWAY Fatty acid degradation", "KEGG 00350 PATHWAY Tyrosine metabolism", 
"KEGG 00830 PATHWAY Retinol metabolism", "KEGG 00625 PATHWAY Chloroalkane and chloroalkene degradation", 
"KEGG 00626 PATHWAY Naphthalene degradation", "KEGG 00980 PATHWAY Metabolism of xenobiotics by cytochrome P450"
), KO_DEFINITION = c("KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

#23k more rows...

I want a new column (or a new df) holding one string for each unique KO_DEFINITION, keeping every unique string found in A, B and C. This is an example:


structure(list(KO_DEFINITION = c("KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 
"KO K00002 DEFINITION AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]"
), new_string = c("KEGG 09100 PATHWAY Metabolism KEGG 09101 PATHWAY Carbohydrate metabolism KEGG 09103 PATHWAY Lipid metabolism KEGG 09105 PATHWAY Amino acid metabolism KEGG 09108 PATHWAY Metabolism of cofactors and vitamins KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis KEGG 00620 PATHWAY Pyruvate metabolism KEGG 00071 PATHWAY Fatty acid degradation KEGG 00350 PATHWAY Tyrosine metabolism KEGG 00830 PATHWAY Retinol metabolism KEGG 00625 PATHWAY Chloroalkane and chloroalkene degradation KEGG 00626 PATHWAY Naphthalene degradation KEGG 00980 PATHWAY Metabolism of xenobiotics by cytochrome P450 KEGG 00982 PATHWAY Drug metabolism - cytochrome P450", 
"KEGG 09100 PATHWAY Metabolism KEGG 09160 PATHWAY Human Diseases KEGG 09180 PATHWAY Brite Hierarchies KEGG 09101 PATHWAY Carbohydrate metabolism KEGG 09103 PATHWAY Lipid metabolism KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism KEGG 09161 PATHWAY Cancer: overview KEGG 09183 PATHWAY Protein families: signaling and cellular processes KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis KEGG 00040 PATHWAY Pentose and glucuronate interconversions KEGG 00053 PATHWAY Ascorbate and aldarate metabolism KEGG 00620 PATHWAY Pyruvate metabolism KEGG 00561 PATHWAY Glycerolipid metabolism KEGG 00930 PATHWAY Caprolactam degradation KEGG 05208 PATHWAY Chemical carcinogenesis - reactive oxygen species KEGG 04147 PATHWAY Exosome [BR:ko04147]"
)), row.names = 1:2, class = "data.frame")

As you can see, for each unique KO_DEFINITION only the unique cells from columns A, B and C are kept.

I don't know a good way to do this, but I have an idea: combine the text of A, B and C, group_by() KO_DEFINITION (from dplyr), and then use another function to find and remove the duplicated patterns in the big string. I'm sure there is a less complex way.

Thank you!

How do you plan to use the output next? At the moment you have KEGG entries across A, B and C that are sufficiently unique as subprocesses tied to a unique KO definition. Can it be done? Likely. Will it actually help? – Chris, Aug 07 '23 at 21:07
@Chris This df is the output of several steps I have already done. My final objective is to know each KEGG entry (the map number) and the pathway name associated with each KO (number and definition). With that I will enrich the annotation of my files. – Alexander Rivero, Aug 07 '23 at 22:26

William Wong · Accepted Answer · 2023-08-09T17:42:03.473

1

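This can be done with nest() from tidyr, which is part of the tidyverse. Group by KO_DEF, nest the remaining columns, then flatten each nested table with unlist(), drop repeated values with unique(), and paste what is left into one string.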

library(tidyverse)

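## Prepare the sample dataset. matrix() fills column by column, so each
## record below becomes a column, and t() turns the records back into rows.
matrix(data = c("M1", "P1", "K1", "G1",
                "M1", "P2", "K2", "G1",
                "M1", "P2", "K3", "G1",
                "M1", "P3", "K4", "G1",
                "M1", "P4", "K5", "G1",
                "M2", "P5", "K6", "G1",
                "M2", "P6", "K7", "G1",
                "M1", "P1", "K1", "G2",
                "M1", "P2", "K2", "G2",
                "M2", "P5", "K6", "G2"), nrow = 4) %>%
  t() %>%
  as_tibble() %>%
  dplyr::rename(A = V1, B = V2, C = V3, KO_DEF = V4) %>%
  ## The lines above only prepare the sample dataset.
  ## The solution starts here: one nested tibble of A, B and C per KO_DEF,
  ## flattened and de-duplicated into a single string.
  group_by(KO_DEF) %>%
  nest() %>%
  mutate(new_string = paste(unique(unlist(data)), collapse = " "))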
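
For a data frame that already has the columns A, B, C and KO_DEFINITION, as in the question, the t() and rename() steps are not needed. The following is a minimal sketch under that assumption. The df below is a hypothetical miniature with shortened values, not the real data, and the summarise() variant at the end is an alternative that is not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the question's data: same column names,
# shortened values, two KO definitions.
df <- tibble(
  A = c("KEGG 09100", "KEGG 09100", "KEGG 09100", "KEGG 09160"),
  B = c("KEGG 09101", "KEGG 09101", "KEGG 09103", "KEGG 09161"),
  C = c("KEGG 00010", "KEGG 00620", "KEGG 00071", "KEGG 05208"),
  KO_DEFINITION = c("KO K00001", "KO K00001", "KO K00001", "KO K00002")
)

# Same technique as the answer: nest A, B and C per KO_DEFINITION, then
# flatten each nested tibble and keep every distinct value once.
result <- df %>%
  group_by(KO_DEFINITION) %>%
  nest() %>%
  mutate(new_string = paste(unique(unlist(data)), collapse = " ")) %>%
  ungroup() %>%
  select(KO_DEFINITION, new_string)

# Alternative without nest() (an assumption, not from the answer):
# concatenate the three columns inside each group and de-duplicate.
result_alt <- df %>%
  group_by(KO_DEFINITION) %>%
  summarise(new_string = paste(unique(c(A, B, C)), collapse = " "))

print(result)
```

In both variants the values of A come first, then B, then C, because unlist() and c() walk through the columns in that order. This matches the ordering in the question's expected output.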

edited Aug 09 '23 at 17:42

answered Aug 08 '23 at 00:41


Thank you so much! The code works perfectly! I'm commenting late because I was trying to understand some very basic things, such as that I don't have to transpose my data with t() or rename it. Running the code one step at a time gave me a lot of insight into how it works! – Alexander Rivero Aug 09 '23 at 14:06
I am glad that you found it useful. This was my first attempt at posting an answer. I have added some comment lines to separate the sample data from the solution. In general, it is advisable to also post sample data to avoid confusion. Please see this thread: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – William Wong Aug 09 '23 at 17:45
Thank you for your suggestions! I added an MRE following the post's guidance! – Alexander Rivero Aug 09 '23 at 19:32
