
I was wondering if anyone could offer advice on speeding up the following in R.

I’ve got a table in a format like this

chr1, A, G, v1,v2,v3;w1,w2,w3, ...
...

The header is

chr, ref, alt, sample1, sample2 ...(many samples)

In each row, each sample has 3 values for v and 3 values for w, separated by ";".

I want to extract v1 and w1 for each sample and make a table that can be plotted using ggplot; it would look like this:

chr, ref, alt, sam, v1, w1

I am doing this with strsplit and rbind, row by row, like the following:

varsam <- c()
for (i in 1:n.var) {
    chrm <- variants[i, 1]
    pos  <- variants[i, 2]    # position (column index assumed)
    ref  <- as.character(variants[i, 3])
    alt  <- as.character(variants[i, 4])
    amp  <- as.character(variants[i, 5])
    for (j in 1:n.sam) {
        # split the concatenated sample field, then each comma-separated part
        vs  <- strsplit(as.character(vcftable[i, j + 6]), split = ":")[[1]]
        vsc <- strsplit(vs[1], split = ",")[[1]]
        vsp <- strsplit(vs[2], split = ",")[[1]]
        varsam <- rbind(varsam, c(chrm, pos, ref, j, vsc[1], vsp[1]))
    }
}

This is very slow as you would expect. Any idea how to speed this up?

Brian Tompsett - 汤莱恩

2 Answers


As noted by others, the first thing you need is some timings, so that you have a baseline to compare against if you intend to optimize. This would be my first step:

  • Create some timings
  • Play around with different aspects of your code to see where the main time is being used.
  • Basic timing analysis can be done with the system.time() function (an illustration follows this list).
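
As an illustration (my own sketch, not part of the original answer), system.time() can be used to compare two ways of building the same result, for example growing a matrix with rbind() inside a loop versus a single vectorised call:

# Hypothetical comparison: grow a result with rbind() in a loop vs. build it
# in one vectorised step. The data here is made up for the example.
n <- 5000
x <- data.frame(a = seq_len(n), b = runif(n))

system.time({
    out <- c()
    for (i in 1:n) out <- rbind(out, c(x$a[i], x$b[i]))  # re-allocates on every iteration
})

system.time({
    out <- cbind(x$a, x$b)  # single allocation
})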

Beyond that, there are some candidates you might like to consider to improve performance, but it is important to get the timings first so that you have something to compare against.

  • The dplyr library contains a mutate() function which can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = ...), where you fill in how each new column value should be calculated. There are lots of examples on the internet.
  • In order to use dplyr, you would need to perform a call to library(dplyr) in your code.
  • You may need to install.packages("dplyr") if not already installed.
  • In order to use dplyr, you might be best converting your table into the appropriate type of table for dplyr, e.g. if your current table is a data frame, then use table = tbl_df(df). A sketch putting this together follows this list.
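
Putting those points together, here is a minimal sketch for a single sample column. The column name sample1 and the ";" separator are assumptions taken from the question, not something this answer specifies:

# Minimal sketch, assuming a data frame "df" with columns chr, ref, alt and a
# sample column "sample1" holding strings like "v1,v2,v3;w1,w2,w3".
library(dplyr)

result <- tbl_df(df) %>%
    mutate(
        v1 = sapply(strsplit(as.character(sample1), ";"),
                    function(x) strsplit(x[1], ",")[[1]][1]),
        w1 = sapply(strsplit(as.character(sample1), ";"),
                    function(x) strsplit(x[2], ",")[[1]][1])
    )

Note that this only handles one sample column at a time, which is the limitation raised in the comment below.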

As noted, these are just some possible areas. The important thing is to get timings, explore the performance to work out where best to focus, and make sure you can measure any improvement.

Mike Curry
  • Thanks very much for your reply. mutate doesn't quite solve this, as my table contains many columns, each representing one of my samples. I could mutate one column of course, but I don't know how to do that efficiently for a large number of columns. Maybe I should try melt before mutate. – Zhihao Ding Feb 24 '15 at 14:21

Thanks for the comments. I think I've found a way to improve this. I used melt from the "reshape" package to first convert my input table to

chr, ref, alt, variable

I can then use apply to modify the melted column, each row of which contains a concatenated string. This achieves good speed.
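
A rough sketch of that idea (using the newer reshape2 package rather than reshape, and sapply rather than apply; the column names and the ";" separator are assumptions based on the question):

# Melt all sample columns into one long column, then split each concatenated
# string once per row of the long table instead of inside nested loops.
library(reshape2)

long <- melt(vcftable, id.vars = c("chr", "ref", "alt"))
names(long)[names(long) == "variable"] <- "sam"

parts      <- strsplit(as.character(long$value), ";")
long$v1    <- sapply(parts, function(x) strsplit(x[1], ",")[[1]][1])
long$w1    <- sapply(parts, function(x) strsplit(x[2], ",")[[1]][1])
long$value <- NULL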