2

I have a output data, where in each row there are multiple isoforms for each gene. Isoforms are seperated by comma ','. When I import the table to R, data frame looks like as below.

Df:
gene isoform                sample1_read_number        p-value
A    'A1','A2','A3'         0:23,1:12,2:122            0.9,0.01,0.5
B    'B1','B2','B3'         0:3,1:45,2:76              0.43,0.001,0.12
C    'C1','C2','C3','C4'    0:5,1:56,2:166,3:7         0.004,0.002,0.23,0.12
D    'D1','D2'              0:43,1:100                 0.1,0.0003

For each gene, there are multiple isoforms. For each isoform, I have read numbers, seperated by comma (0:23 read for A1 meaning A1 read is 23) and p-values seperated by comma (p-value for A1 is 0.9 and A2 is 0.01). So everything is in an order by comma separation in each object.

For example when I call, df[1,2] the result is [1] 'A1','A2','A3''
or df[1,4] the result is [1] 0.9,0.01,0.5 as one object. I couldn't figure how to make R to separate those values in df[X,Y].

The reason I want to do this is because, I want to filter this data to based on p-value or read number. To be able to do that, first I should be able to break this data frame by each isoform and to do that I need to find a way to separate values on each spot.

Final data frame should be like that (only showing for gene A and B here):

Df_I:
gene isoform sample1_read_number  p-value 
A    A1      0:23                 0.9
A    A2      1:12                 0.01
A    A3      2:122                0.5
B    B1      0:3                  0.43
B    B2      1:45                 0.001
B    B3      2:76                 0.12

Anybody can give me ideas to make this second data frame? Any help would be appreciated a lot!

Cheers! A

AGG
  • 35
  • 4
  • Please provide example data. http://stackoverflow.com/q/5963269/4552295 – s_baldur Aug 05 '16 at 16:54
  • Your data violates "Normalization" logic in database concept. You should first enter each meaningful data to a separate cell. Perhaps, you should define more columns for isoform column, similarly for sample1_read_number and p-value columns. If you had obeyed the classical normalization rules, then transformation from state 1 to state 2 will be rather easier. – Erdogan CEVHER Aug 05 '16 at 16:58
  • 1
    Have you tried the solution I posted? – akrun Aug 05 '16 at 17:39

2 Answers2

3

There is the separate_rows function from tidyr package written for your case:

tidyr::separate_rows(df, isoform:p.value, sep = ",")

#   gene isoform sample1_read_number p.value
#1     A      A1                0:23     0.9
#2     A      A2                1:12    0.01
#3     A      A3               2:122     0.5
#4     B      B1                 0:3    0.43
#5     B      B2                1:45   0.001
#6     B      B3                2:76    0.12
#7     C      C1                 0:5   0.004
#8     C      C2                1:56   0.002
#9     C      C3               2:166    0.23
#10    C      C4                 3:7    0.12
#11    D      D1                0:43     0.1
#12    D      D2               1:100  0.0003
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • It can be used for as many columns as you want as long as all the columns have the same number of elements and the correspondence as you have specified. The data you are showing, for example, has 3 columns to separate. – Psidom Aug 05 '16 at 17:35
  • Very useful function! Thank you. However when I use it, it only separate `isoforms` and `p.value`, doesn't work on `sample1_read_number`. Sample1_read_number is still looks like this `0:23,1:12, 2:122` for each line for gene A. Any idea why? – AGG Aug 05 '16 at 17:38
  • Did you copy and paste my code? Notice the `:` between `isoform` and `p.value`, which gives the range of columns starting from `isoform` and ending with `p.value`. – Psidom Aug 05 '16 at 17:39
  • Okay, just understand the function of `:` between columns. That was the reason! Thanks a lot!! – AGG Aug 05 '16 at 17:43
  • The only issue I haven't resolve with this solution is, all the columns turn in to 'chracter' after the function, so I can not put filter for numeric values. – AGG Aug 05 '16 at 19:33
  • Find the solution for numeric: http://stackoverflow.com/questions/2288485/how-to-convert-a-data-frame-column-to-numeric-type – AGG Aug 05 '16 at 19:38
  • Besides, `filter(as.numeric(col)...)` should work for a few of columns. – Psidom Aug 05 '16 at 19:40
1

This can be easily done with cSplit from splitstackshape

library(splitstackshape)
na.omit(cSplit(Df, 2:ncol(Df), ",", "long"))
#    gene isoform sample1_read_number p.value
# 1:    A      A1                0:23  0.9000
# 2:    A      A2                1:12  0.0100
# 3:    A      A3               2:122  0.5000
# 4:    B      B1                 0:3  0.4300
# 5:    B      B2                1:45  0.0010
# 6:    B      B3                2:76  0.1200
# 7:    C      C1                 0:5  0.0040
# 8:    C      C2                1:56  0.0020
# 9:    C      C3               2:166  0.2300
#10:    C      C4                 3:7  0.1200
#11:    D      D1                0:43  0.1000
#12:    D      D2               1:100  0.0003
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Actually, this even solved my issue better! Thank you for teaching me this function! – AGG Aug 05 '16 at 17:47
  • Do you know if there is any way to get rid of `0:` part in sample1_read_number column? I want to put a filter based on read number but since there is isoform identifier (0:,1:,2:,4:) I can't treat them as number. – AGG Aug 05 '16 at 18:55
  • I actually have just realize this solution didn't work. It gives me only the first isoform for each gene. – AGG Aug 05 '16 at 19:30