Redundant gene list simplification

Question

I need to convert this format when basically you have for 1 gene ID (first col) up to 5 possible values associated in the 3rd column:

TMCS09g1008676  fleshy  0.000234939
TMCS09g1008676  fleshy  1.38379E-05
TMCS09g1008676  fleshy  0.00331883
TMCS09g1008677  fleshy  0.0481578
TMCS09g1008678  fleshy  0.0350491
TMCS09g1008679  fleshy  0.0335639
TMCS09g1008680  fleshy  0.0167087
TMCS09g1008681  fleshy  0.00301089
TMCS09g1008682  fleshy  0.00519838
TMCS09g1008682  fleshy  0.0399833
TMCS09g1008682  fleshy  0.0122184
TMCS09g1008683  fleshy  0.00202427
TMCS09g1008683  fleshy  0.00199513
TMCS09g1008683  fleshy  0.0350491
TMCS09g1008683  fleshy  0.00331883
TMCS09g1008683  fleshy  0.0399833

to this where you have the all possible values associated to each gene ID (first col) separated by tab (so max 5 fields will be filled):

TMCS09g1008676  0.000234939 1.38379E-05 0.00331883      
TMCS09g1008677  0.0481578               
TMCS09g1008678  0.0350491               
TMCS09g1008679  0.0335639               
TMCS09g1008680  0.0167087               
TMCS09g1008681  0.00301089              
TMCS09g1008682  0.00519838  0.0399833   0.0122184       
TMCS09g1008683  0.00202427  0.00199513  0.0350491   0.00331883  0.0399833

I would very much appreciate some indications with R.

Is that a single column of values, or multiple columns? It will help if you make things clear in your post. — Gopala, May 02 '18 at 14:38

Gopala · Answer 1 · 2018-05-02T15:14:35.870

Here is one solution using dplyr package. You can adjust the code based on your column names, which you have not provided.

library(dplyr)

df <- read.table(text = "TMCS09g1008676  fleshy  0.000234939
    TMCS09g1008676  fleshy  1.38379E-05
    TMCS09g1008676  fleshy  0.00331883
    TMCS09g1008677  fleshy  0.0481578
    TMCS09g1008678  fleshy  0.0350491
    TMCS09g1008679  fleshy  0.0335639
    TMCS09g1008680  fleshy  0.0167087
    TMCS09g1008681  fleshy  0.00301089
    TMCS09g1008682  fleshy  0.00519838
    TMCS09g1008682  fleshy  0.0399833
    TMCS09g1008682  fleshy  0.0122184
    TMCS09g1008683  fleshy  0.00202427
    TMCS09g1008683  fleshy  0.00199513
    TMCS09g1008683  fleshy  0.0350491
    TMCS09g1008683  fleshy  0.00331883
    TMCS09g1008683  fleshy  0.0399833")

df %>% group_by(V1) %>% slice(1:5) %>% summarise(V3 = paste(V3, collapse = ' '))

# A tibble: 8 x 2
  V1             V3                                                  
  <fctr>         <chr>                                               
1 TMCS09g1008676 0.000234939 1.38379e-05 0.00331883                  
2 TMCS09g1008677 0.0481578                                           
3 TMCS09g1008678 0.0350491                                           
4 TMCS09g1008679 0.0335639                                           
5 TMCS09g1008680 0.0167087                                           
6 TMCS09g1008681 0.00301089                                          
7 TMCS09g1008682 0.00519838 0.0399833 0.0122184                      
8 TMCS09g1008683 0.00202427 0.00199513 0.0350491 0.00331883 0.0399833

In the above, slice is limiting values per group to 5.

EDIT: Seems like the desired output is multiple columns for each of the five values. Here is one solution for that:

library(tidyr)
library(dplyr)

df %>% select(-V2) %>% group_by(V1) %>% slice(1:5) %>% mutate(valueId = paste0('value', row_number())) %>% spread(valueId, V3)

This can be written out to a file with tab delimit, and read back as a tab delimited file.

thank you very much! to separate the results by tab in the V3? is that possible? — mightaskalot, May 02 '18 at 14:47
When you do, it shows '\t' in the data frame view representation. However, when you write the data frame to a file, it will preserve tabs. Example, try this: `write.table(df %>% group_by(V1) %>% slice(1:5) %>% summarise(V3 = paste(V3, collapse = '\t')), 'df.txt', row.names = FALSE, quote = FALSE)` — Gopala, May 02 '18 at 14:51
something strange happens, if i try to reload the data `df_summarized <- read.delim("df.txt", sep = "\t", header = FALSE, skip = 1)` then it doesnt read properly the file importing less fields and moving others — mightaskalot, May 02 '18 at 15:04
Looks like what you want is not a single column of tab separated values, but actually multiple columns, including the first column that you want to write to a tab delimited file. Once again, it helps to ask a more clear question with expected output clearly defined. In your case, what you need is my edited version. See above. — Gopala, May 02 '18 at 15:13

Redundant gene list simplification

1 Answers1