I have a text file with data like the example below. I would like to process it in R so that the values given in parentheses are summed within each category, leaving one row per category as in the expected output. How can I import and process the text file to get the desired result? I am new to programming.
The text file looks like this:

Carbohydrate metabolism
00010 Glycolysis / Gluconeogenesis (27)
00020 Citrate cycle (TCA cycle) (22)
00030 Pentose phosphate pathway (19)
Energy metabolism
00190 Oxidative phosphorylation (68)
00710 Carbon fixation in photosynthetic organisms (16)
00720 Carbon fixation pathways in prokaryotes (10)

I need the output as a data frame. After summing the values in parentheses under each category, it should look like this:

V1                       V2
Carbohydrate metabolism  68
Energy metabolism        94
Umar
  • It’s unclear how your data are structured in R after you import the text file. Could you edit your question to include the output when you run `dput(your_data)`? – jpsmith Jan 08 '23 at 16:58
  • @jpsmith, I edited, kindly look into it – Umar Jan 08 '23 at 17:06
  • For each heading you want to retain in your dataframe (e. g. "Carbohydrate metabolism") there are several entries with values in parentheses. Which do you want to keep? The first per section? All, in one "cell"? All, spread over separate columns? ... – I_O Jan 08 '23 at 17:44
  • Welcome to Stack Overflow! Can you please read and incorporate elements from [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/1082435). Especially the aspects of using `dput()` for the input and then an explicit example of your expected dataset? – wibeasley Jan 08 '23 at 17:49
  • @I_O, it's the output of a KEGG analysis. There are three entries under "Carbohydrate metabolism", and similarly many entries under the other classes. I have to sum all the entries under each class, such as "Carbohydrate metabolism", and compare the totals with other annotated genomes. Doing this manually is difficult because I have 100 or more text files, so I posted here hoping for a solution – Umar Jan 08 '23 at 17:59

1 Answer

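This is tricky because you essentially have two dataframes stacked together, with each category heading sitting in the middle of the data. One way to achieve your goal is to 1) create a grouping variable that records which category (energy or carbohydrate) each row belongs to, 2) split each entry into its pathway name and its value (which is stuck inside parentheses, so we also need to get rid of those parentheses), and 3) use summarise() to sum everything up by group. Note that the heading is split at the first space, so the groups end up labelled "Carbohydrate" and "Energy" rather than with the full heading.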

library(tidyverse)

tt <- read_delim("
Carbohydrate metabolism
00010 Glycolysis / Gluconeogenesis (27)
00020 Citrate cycle (TCA cycle) (22)
00030 Pentose phosphate pathway (19)
Energy metabolism
00190 Oxidative phosphorylation (68)
00710 Carbon fixation in photosynthetic organisms (16)
00720 Carbon fixation pathways in prokaryotes (10)", 
                 col_names = c("id", "name"))

tt <- tt %>%
  # create a grouping variable to "divide" your two dataframes
  mutate(meta = as.character(replace(id, str_detect(id, "^[:digit:]+$"), NA))) %>%
  fill(meta, .direction = "down") %>%
  # get rid of "column name" stuck in middle of dataframe
  filter(name != "metabolism") %>%
  # split up name of the metabolism and the value by the parenthesis
  extract("name", c("char", "value"), "(\\D*)(\\d.*)") %>%
  # get rid of parenthesis by subtracting last character in the column "value"
  mutate(value = as.numeric(substring(value, 1, nchar(value)-1))) %>% 
  # sum up by grouping variable
  group_by(meta) %>%
  summarise(sumvalue = sum(value))

print(tt)
# A tibble: 2 × 2
  meta         sumvalue
  <chr>           <dbl>
1 Carbohydrate       68
2 Energy             94
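
Since the question mentions 100 or more text files and asks for the full category names, here is an alternative sketch in base R. It is not part of the tidyverse pipeline above. The function name `sum_by_category` and the file names in the comments are made up for illustration. It assumes every pathway row starts with a numeric ID followed by a space and ends with a count in parentheses, and that the first non-blank line of each file is a category heading.

```r
# Hypothetical helper: sum the bracketed counts under each category heading.
# Assumes entry rows start with a numeric ID and end with "(<count>)",
# and that the first non-blank line is a heading.
sum_by_category <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                    # drop blank lines
  is_entry <- grepl("^[0-9]+ ", lines)             # pathway rows vs. headings
  # carry the most recent heading down onto every line that follows it
  category <- lines[!is_entry][cumsum(!is_entry)]
  # the count is the number inside the final pair of parentheses
  count <- as.numeric(sub("^.*\\(([0-9]+)\\)$", "\\1", lines[is_entry]))
  out <- aggregate(count, by = list(V1 = category[is_entry]), FUN = sum)
  names(out)[2] <- "V2"
  out
}

example <- c(
  "Carbohydrate metabolism",
  "00010 Glycolysis / Gluconeogenesis (27)",
  "00020 Citrate cycle (TCA cycle) (22)",
  "00030 Pentose phosphate pathway (19)",
  "Energy metabolism",
  "00190 Oxidative phosphorylation (68)",
  "00710 Carbon fixation in photosynthetic organisms (16)",
  "00720 Carbon fixation pathways in prokaryotes (10)"
)
result <- sum_by_category(example)
print(result)

# For a file on disk (hypothetical path):
#   result <- sum_by_category(readLines("kegg_output.txt"))
# For a folder of files (hypothetical folder name):
#   files <- list.files("kegg_dir", pattern = "\\.txt$", full.names = TRUE)
#   results <- lapply(files, function(f) sum_by_category(readLines(f)))
```

On the example data this should give 68 for "Carbohydrate metabolism" and 94 for "Energy metabolism", matching the totals in the question. Taking the count from the last pair of parentheses keeps names such as "Citrate cycle (TCA cycle)" from interfering with the value.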
jrcalabrese