
I was wondering if anyone could offer advice on speeding up the following in R.

I’ve got a table in a format like this

chr1, A, G, v1,v2,v3;w1,w2,w3, ...
...

The header is

chr, ref, alt, sample1, sample2 ...(many samples)

In each row, each sample has 3 values for v and 3 values for w, separated by ";".

I want to extract v1 and w1 for each sample and make a table that can be plotted using ggplot; it would look like this:

chr, ref, alt, sam, v1, w1

I am doing this with strsplit and rbind, row by row, like the following:

varsam <- c()
for (i in 1:n.var) {
    chrm <- variants[i, 1]
    pos  <- variants[i, 2]    # position (column index assumed)
    ref  <- as.character(variants[i, 3])
    alt  <- as.character(variants[i, 4])
    amp  <- as.character(variants[i, 5])
    for (j in 1:n.sam) {
        # split the concatenated sample field, then each comma-separated part
        vs  <- strsplit(as.character(vcftable[i, j + 6]), split = ":")[[1]]
        vsc <- strsplit(vs[1], split = ",")[[1]]
        vsp <- strsplit(vs[2], split = ",")[[1]]
        varsam <- rbind(varsam, c(chrm, pos, ref, j, vsc[1], vsp[1]))
    }
}

This is very slow as you would expect. Any idea how to speed this up?

Brian Tompsett - 汤莱恩

2 Answers


As noted by others, the first thing you need is some timings, so that you have a baseline to compare against if you intend to optimize. This would be my first step:

  • Create some timings
  • Play around with different aspects of your code to see where the main time is being used.
  • Basic timing analysis can be done with the system.time() function (an illustration follows this list).
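
As an illustration (my own sketch, not part of the original answer), system.time() can be used to compare two ways of building the same result, for example growing a matrix with rbind() inside a loop versus a single vectorised call:

# Hypothetical comparison: grow a result with rbind() in a loop vs. build it
# in one vectorised step. The data here is made up for the example.
n <- 5000
x <- data.frame(a = seq_len(n), b = runif(n))

system.time({
    out <- c()
    for (i in 1:n) out <- rbind(out, c(x$a[i], x$b[i]))  # re-allocates on every iteration
})

system.time({
    out <- cbind(x$a, x$b)  # single allocation
})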

Beyond that, there are some candidates you might like to consider to improve performance, but it is important to get the timings first so that you have something to compare against.

  • The dplyr library contains a mutate() function which can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = ...), where you fill in how each new column value should be calculated. There are lots of examples on the internet.
  • In order to use dplyr, you would need to perform a call to library(dplyr) in your code.
  • You may need to install.packages("dplyr") if not already installed.
  • In order to use dplyr, you might be best converting your table into the appropriate type of table for dplyr, e.g. if your current table is a data frame, then use table = tbl_df(df). A sketch putting this together follows this list.
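
Putting those points together, here is a minimal sketch for a single sample column. The column name sample1 and the ";" separator are assumptions taken from the question, not something this answer specifies:

# Minimal sketch, assuming a data frame "df" with columns chr, ref, alt and a
# sample column "sample1" holding strings like "v1,v2,v3;w1,w2,w3".
library(dplyr)

result <- tbl_df(df) %>%
    mutate(
        v1 = sapply(strsplit(as.character(sample1), ";"),
                    function(x) strsplit(x[1], ",")[[1]][1]),
        w1 = sapply(strsplit(as.character(sample1), ";"),
                    function(x) strsplit(x[2], ",")[[1]][1])
    )

Note that this only handles one sample column at a time, which is the limitation raised in the comment below.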

As noted, these are just some possible areas. The important thing is to get timings, explore the performance to work out where best to focus, and make sure you can measure any improvement.

Mike Curry
  • Thanks very much for your reply. mutate doesn't quite solve this, as my table contains many columns, each representing one of my samples. I could mutate one column of course, but I don't know how to do that efficiently for a large number of columns. Maybe I should try melt before mutate. – Zhihao Ding Feb 24 '15 at 14:21

Thanks for the comments. I think I've found a way to improve this. I used melt from the "reshape" package to first convert my input table to

chr, ref, alt, variable

I can then use apply to modify the melted column, each row of which contains a concatenated string. This achieves good speed.
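
A rough sketch of that idea (using the newer reshape2 package rather than reshape, and sapply rather than apply; the column names and the ";" separator are assumptions based on the question):

# Melt all sample columns into one long column, then split each concatenated
# string once per row of the long table instead of inside nested loops.
library(reshape2)

long <- melt(vcftable, id.vars = c("chr", "ref", "alt"))
names(long)[names(long) == "variable"] <- "sam"

parts      <- strsplit(as.character(long$value), ";")
long$v1    <- sapply(parts, function(x) strsplit(x[1], ",")[[1]][1])
long$w1    <- sapply(parts, function(x) strsplit(x[2], ",")[[1]][1])
long$value <- NULL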