I have a data frame (df) that looks like this:
structure(list(A = c("KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism",
"KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism",
"KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism",
"KEGG 09100 PATHWAY Metabolism", "KEGG 09100 PATHWAY Metabolism"
), B = c("KEGG 09101 PATHWAY Carbohydrate metabolism", "KEGG 09101 PATHWAY Carbohydrate metabolism",
"KEGG 09103 PATHWAY Lipid metabolism", "KEGG 09105 PATHWAY Amino acid metabolism",
"KEGG 09108 PATHWAY Metabolism of cofactors and vitamins", "KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism",
"KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism",
"KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism"
), C = c("KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis", "KEGG 00620 PATHWAY Pyruvate metabolism",
"KEGG 00071 PATHWAY Fatty acid degradation", "KEGG 00350 PATHWAY Tyrosine metabolism",
"KEGG 00830 PATHWAY Retinol metabolism", "KEGG 00625 PATHWAY Chloroalkane and chloroalkene degradation",
"KEGG 00626 PATHWAY Naphthalene degradation", "KEGG 00980 PATHWAY Metabolism of xenobiotics by cytochrome P450"
), KO_DEFINITION = c("KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
#23k more rows...
I want to get, in a new column or a new data frame, one string for each unique KO_DEFINITION that keeps all the unique strings from columns A, B and C. This is an example of the desired output:
structure(list(KO_DEFINITION = c("KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]",
"KO K00002 DEFINITION AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]"
), new_string = c("KEGG 09100 PATHWAY Metabolism KEGG 09101 PATHWAY Carbohydrate metabolism KEGG 09103 PATHWAY Lipid metabolism KEGG 09105 PATHWAY Amino acid metabolism KEGG 09108 PATHWAY Metabolism of cofactors and vitamins KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis KEGG 00620 PATHWAY Pyruvate metabolism KEGG 00071 PATHWAY Fatty acid degradation KEGG 00350 PATHWAY Tyrosine metabolism KEGG 00830 PATHWAY Retinol metabolism KEGG 00625 PATHWAY Chloroalkane and chloroalkene degradation KEGG 00626 PATHWAY Naphthalene degradation KEGG 00980 PATHWAY Metabolism of xenobiotics by cytochrome P450 KEGG 00982 PATHWAY Drug metabolism - cytochrome P450",
"KEGG 09100 PATHWAY Metabolism KEGG 09160 PATHWAY Human Diseases KEGG 09180 PATHWAY Brite Hierarchies KEGG 09101 PATHWAY Carbohydrate metabolism KEGG 09103 PATHWAY Lipid metabolism KEGG 09111 PATHWAY Xenobiotics biodegradation and metabolism KEGG 09161 PATHWAY Cancer: overview KEGG 09183 PATHWAY Protein families: signaling and cellular processes KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis KEGG 00040 PATHWAY Pentose and glucuronate interconversions KEGG 00053 PATHWAY Ascorbate and aldarate metabolism KEGG 00620 PATHWAY Pyruvate metabolism KEGG 00561 PATHWAY Glycerolipid metabolism KEGG 00930 PATHWAY Caprolactam degradation KEGG 05208 PATHWAY Chemical carcinogenesis - reactive oxygen species KEGG 04147 PATHWAY Exosome [BR:ko04147]"
)), row.names = 1:2, class = "data.frame")
As you can see, only the unique values from columns A, B and C that fall under each unique KO_DEFINITION are kept.
I don't know a good way to do this, but I have this idea: combine the text of A, B and C, group_by KO_DEFINITION (from dplyr), and then use another function to search for and remove duplicated patterns in the big string... I'm sure there is a less complex way.
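For reference, here is a rough sketch of that idea, assuming dplyr is installed. The name df_small is hypothetical: it is a two-row stand-in for the real data, not the actual 23k-row table. Taking unique(c(A, B, C)) before pasting is only one possible way to drop the duplicates, and it avoids searching the big string afterwards:

```r
library(dplyr)

# Hypothetical two-row stand-in for the real data (assumption, not the real table)
df_small <- data.frame(
  A = c("KEGG 09100 PATHWAY Metabolism",
        "KEGG 09100 PATHWAY Metabolism"),
  B = c("KEGG 09101 PATHWAY Carbohydrate metabolism",
        "KEGG 09101 PATHWAY Carbohydrate metabolism"),
  C = c("KEGG 00010 PATHWAY Glycolysis / Gluconeogenesis",
        "KEGG 00620 PATHWAY Pyruvate metabolism"),
  KO_DEFINITION = rep(
    "KO K00001 DEFINITION E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]", 2
  ),
  stringsAsFactors = FALSE
)

result <- df_small %>%
  group_by(KO_DEFINITION) %>%
  # Stack A, B and C within each group, drop repeats, then join with spaces.
  # The order is: unique values of A first, then B, then C.
  summarise(new_string = paste(unique(c(A, B, C)), collapse = " "))

print(result$new_string)
```

On this toy input, each KO_DEFINITION ends up with a single row, and the repeated A and B values appear only once in new_string. I have not checked how this behaves on the full table (for example with NA values in A, B or C).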
Thank you!