
I am currently processing a large amount of data using R in a Unix environment:

nrow = 32 793 730
ncol = 17

The script loads the data into R, selects a subset of the variables, and creates new variables, using the data.table package and the RDS format.

Everything goes well until I run this line of code:

all.RDS <- file[ ,`:=`(TIME=ifelse(TIME==100000, "10:00",
                                   ifelse(nchar(TIME)==5, 
                                          paste0("0",substr(TIME,1,1),
                                                 ":",substr(TIME,2,3)),
                                          paste0(substr(TIME,1,2),":",
                                                 substr(TIME,3,4)))))]
all.RDS[,`:=`(DATETIME=paste0(format(DATE,format="%d%b%Y"),":",TIME))]

This code is only supposed to transform a time variable from the format 100000 to the format 10:00:00. However, I get the following error:

Error: segfault from C stack overflow

I am supposed to produce a data.table with around 600 rows and 5800 columns. I have tried several things, following the answers to related questions on this platform: "segfault from C stack overflow" and "segfault from C stack overflow in R using data.table".

I have tried reinstalling the packages, in case the installation was corrupted. I don't think it is a space problem. The script was developed in a Windows environment (R 3.2.2), where this line of code runs fine. The problem seems related to the fact that I am running the script in a Unix environment (R 3.2.0). However, that does not make sense to me, since my Unix environment is much more powerful than my Windows environment in terms of computational capabilities and available memory.

I am obviously constrained by the fact that I have to use R and Unix to develop my solution.

Does anyone in the community know a solution to this problem?

Many thanks in advance!

Trancavel

1 Answer


You haven't given a reproducible example, and your code is a mess, so it's hard to follow. I'm not sure where your segfault is coming from (have you installed the latest development version of data.table?), but here's another approach to your problem:

file[ , TIME := sprintf('%06d', TIME)]
file[ , TIME := gsub('(.{2})(.{2})(.{2})', '\\1:\\2:\\3', TIME)]

Hopefully that averts the issue of the segfault. Otherwise you'll have to help us narrow down where the error's coming from by providing a reproducible example.
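As a rough sketch of what a small reproducible example might look like, here are the two steps above applied to a toy table. The table `dt` and its values are made up for illustration, and it assumes TIME is stored as an integer in HHMMSS form; if your column is character or has another layout, the `sprintf` call would need adjusting.

```r
library(data.table)

# Hypothetical toy data standing in for the real table (assumption:
# TIME is an integer in HHMMSS form, DATE is a Date column).
dt <- data.table(
  DATE = as.Date(c("2017-04-25", "2017-04-25", "2017-04-26")),
  TIME = c(93000L, 100000L, 154500L)
)

# Step 1: left-pad with zeros to six digits, e.g. 93000 -> "093000".
dt[, TIME := sprintf("%06d", TIME)]

# Step 2: insert colons between the three two-digit groups.
dt[, TIME := gsub("(.{2})(.{2})(.{2})", "\\1:\\2:\\3", TIME)]

print(dt$TIME)
# [1] "09:30:00" "10:00:00" "15:45:00"
```

If a table this small runs cleanly on your Unix machine, growing it step by step (more rows, then your real columns) is one way to find where the segfault starts.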

MichaelChirico
Thank you very much for your answer. It helped a lot and allowed me to narrow down the segfault issue. The line all.RDS[,`:=`(DATETIME=paste0(format(DATE,format="%d%b%Y"),":",TIME))] is the cause of the segfault. I tried separating the format step from the paste step, but both seem to fail. – Trancavel Apr 26 '17 at 07:42