I'm trying to optimize a for loop in my R code.

Summary: I have a data frame (say, df) with 19 million rows, one per gene, and 2 columns: 'Chromosome', holding the gene's chromosome, and 'Position', holding its position. I want to create a new column 'chr_pos' that gives each gene an alternative name of the form Chromosome_Position. Example: a gene A located on chromosome 1 at position 123456 would get the alternative name 1_123456.

Here is my code to do this:

for (i in seq_len(nrow(df))) {df$chr_pos[i] = paste0(df$Chromosome[i], "_", df$Position[i])}

I tried to optimise it using vectorisation, but it is still slow.

Can this be optimised further?

Phil
Huy Nguyen
  • 1
    `paste0` is already vectorized, so you don't need a `for` loop: `df$chr_pos <- paste0(df$Chromosome, "_", df$Position)`. Or, with `paste`: `df$chr_pos <- paste(df$Chromosome, df$Position, sep = '_')` – Ronak Shah Dec 07 '20 at 03:04
  • Okay, so I will run my code without the for loop, but will that speed it up much? – Huy Nguyen Dec 07 '20 at 03:06
  • Oh, if I combine paste0 and apply, it will definitely speed up my code. – Huy Nguyen Dec 07 '20 at 03:10
  • 2
    You don't need `apply` here either. Did you try the code in my comment above? – Ronak Shah Dec 07 '20 at 03:12
  • 1
    Oh wow, I am really sorry for not trying your solution first. It finished in several seconds. Thanks a lot. Please post your solution as an answer so I can mark it as accepted! – Huy Nguyen Dec 07 '20 at 03:16
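
For readers who want to try the suggestion from the comments, here is a minimal sketch on a toy data frame. The three rows and their values are made up for illustration and only stand in for the 19-million-row `df`; the column names follow the question. It builds the column with one vectorised `paste0` call and checks that an element-by-element loop over every row gives the same result.

```r
# Toy data frame standing in for the real `df` (values are hypothetical).
df <- data.frame(
  Chromosome = c("1", "1", "X"),
  Position   = c(123456L, 789012L, 42L),
  stringsAsFactors = FALSE
)

# Vectorised: one call to paste0 builds the whole column at once.
df$chr_pos <- paste0(df$Chromosome, "_", df$Position)

# Loop version, for comparison only. seq_len(nrow(df)) visits every row;
# writing `i in nrow(df)` would visit just the last row.
chr_pos_loop <- character(nrow(df))
for (i in seq_len(nrow(df))) {
  chr_pos_loop[i] <- paste0(df$Chromosome[i], "_", df$Position[i])
}

# Both approaches should agree.
stopifnot(identical(df$chr_pos, chr_pos_loop))
print(df$chr_pos)
```

On a large data frame the vectorised call avoids assigning into `df$chr_pos` once per row, which is presumably why it finished in seconds in the thread above.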
