36

I have a problem running some R scripts on our cluster. The problems appeared suddenly (all the scripts were working just fine but one day they started giving a caught segfault error). I cannot provide reproducible code because I can't even reproduce the error on my own computer - it only happens on the cluster. I am also using the same code for two sets of data - one is quite small and runs fine, the other one works with bigger data frames (about 10 million rows) and collapses at certain points. I am only using packages from CRAN repository; R and all the packages should be up to date. The error shows up at totally unrelated actions, see the examples below:

Session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Writing variable to NetCDF file

# code snippet
library(ncdf4)
library(reshape2)

input <- read.csv("input_file.csv")
species <- "no2"
dimX <- ncdim_def(name="x", units = "m", vals = unique(input$x), unlim = FALSE)
dimY <- ncdim_def(name="y", units = "m", vals = unique(input$y), unlim = FALSE)
dimTime <- ncdim_def(name = "time", units = "hours", unlim = TRUE)

varOutput <- ncvar_def(name = species, units = "ug/m3",
                dim = list(dimX, dimY, dimTime), missval = -9999, longname = species)

nc_file <- nc_create(filename = "outFile.nc", vars = list(varOutput), force_v4 = T)

ncvar_put(nc = nc_file, varid = species, vals = acast(input, x~y), start = c(1,1,1),
      count = c(length(unique(input$x)), length(unique(input$y)), 1))

At this point, I get the following error:

 *** caught segfault ***
address 0x2b607cac2000, cause 'memory not mapped'

Traceback:
 1: id(rev(ids), drop = FALSE)
 2: cast(data, formula, fun.aggregate, ..., subset = subset, fill = fill,     drop = drop, value.var = value.var)
 3: acast(result, x ~ y)
 4: ncvar_put(nc = nc_file, varid = species, vals = acast(result, x ~     y), start = c(1, 1), count = c(length(unique(result$x)),     length(unique(result$y))))
An irrecoverable exception occurred. R is aborting now ...
/opt/sge/default/spool/node10/job_scripts/122270: line 3: 13959 Segmentation fault      (core dumped)

Complex code with parallel computation

 *** caught segfault ***
address 0x330d39b40, cause 'memory not mapped'

Traceback:
 1: .Call(gstat_fit_variogram, as.integer(fit.method), as.integer(fit.sills),     as.integer(fit.ranges))
 2: fit.variogram(experimental_variogram, model = vgm(psill = psill,     model = model, range = range, nugget = nugget, kappa = kappa),     fit.ranges = c(fit_range), fit.sills = c(fit_nugget, fit_sill),     debug.level = 0)
 3: doTryCatch(return(expr), name, parentenv, handler)
 4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 5: tryCatchList(expr, classes, parentenv, handlers)
 6: tryCatch(expr, error = function(e) {    call <- conditionCall(e)          if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        msg <- conditionMessage(e)        sm <- strsplit(msg, "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && identical(getOption("show.error.messages"),         TRUE)) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
 7: try(fit.variogram(experimental_variogram, model = vgm(psill = psill,     model = model, range = range, nugget = nugget, kappa = kappa),     fit.ranges = c(fit_range), fit.sills = c(fit_nugget, fit_sill),     debug.level = 0), TRUE)
 8: getModel(initial_sill - initial_nugget, m, initial_range, k,     initial_nugget, fit_range, fit_sill, fit_nugget, verbose = verbose)
 9: autofitVariogram(lmResids ~ 1, obsDf, model = "Mat", kappa = c(0.05,     seq(0.2, 2, 0.1), 3, 5, 10, 15), fix.values = c(NA, NA, NA),     start_vals = c(NA, NA, NA), verbose = F)
10: main_us(obsDf[obsDf$class == "rural" | obsDf$class == "rural-nearcity" |     obsDf$class == "rural-regional" | obsDf$class == "rural-remote",     ], grd_alt, grd_pop, lm_us, fitvar_us, logTransform, plots,     "RuralSt", period, preds)
11: doTryCatch(return(expr), name, parentenv, handler)
12: tryCatchOne(expr, names, parentenv, handlers[[1L]])
13: tryCatchList(expr, classes, parentenv, handlers)
14: tryCatch(main_us(obsDf[obsDf$class == "rural" | obsDf$class ==     "rural-nearcity" | obsDf$class == "rural-regional" | obsDf$class ==     "rural-remote", ], grd_alt, grd_pop, lm_us, fitvar_us, logTransform,     plots, "RuralSt", period, preds), error = function(e) e)
15: eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv)
16: eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv)
17: doTryCatch(return(expr), name, parentenv, handler)
18: tryCatchOne(expr, names, parentenv, handlers[[1L]])
19: tryCatchList(expr, classes, parentenv, handlers)
20: tryCatch(eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv),     error = function(e) e)
21: (function (args) {    lapply(names(args), function(n) assign(n, args[[n]], pos = .doSnowGlobals$exportenv))    tryCatch(eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv),         error = function(e) e)})(quote(list(timeIndex = 255L)))
22: do.call(msg$data$fun, msg$data$args, quote = TRUE)
23: doTryCatch(return(expr), name, parentenv, handler)
24: tryCatchOne(expr, names, parentenv, handlers[[1L]])
25: tryCatchList(expr, classes, parentenv, handlers)
26: tryCatch(do.call(msg$data$fun, msg$data$args, quote = TRUE),     error = handler)
27: doTryCatch(return(expr), name, parentenv, handler)
28: tryCatchOne(expr, names, parentenv, handlers[[1L]])
29: tryCatchList(expr, classes, parentenv, handlers)
30: tryCatch({    msg <- recvData(master)    if (msg$type == "DONE") {        closeNode(master)        break    }    else if (msg$type == "EXEC") {        success <- TRUE        handler <- function(e) {            success <<- FALSE            structure(conditionMessage(e), class = c("snow-try-error",                 "try-error"))        }        t1 <- proc.time()        value <- tryCatch(do.call(msg$data$fun, msg$data$args,             quote = TRUE), error = handler)        t2 <- proc.time()        value <- list(type = "VALUE", value = value, success = success,             time = t2 - t1, tag = msg$data$tag)        msg <- NULL        sendData(master, value)        value <- NULL    }}, interrupt = function(e) NULL)
31: slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
32: parallel:::.slaveRSOCK()
An irrecoverable exception occurred. R is aborting now ...

Is it likely that the issue lies with the cluster rather than with the code (or R)? I don't know whether it is related, but for some time now we've been getting error messages like these:

Message from syslogd@master1 at Mar  8 13:51:37 ...
 kernel:[Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.

Message from syslogd@master1 at Mar  8 13:51:37 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@master1 at Mar  8 13:51:37 ...
 kernel:[Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c08400067080a13

Message from syslogd@master1 at Mar  8 13:51:37 ...
kernel:[Hardware Error]: MC4_ADDR: 0x000000048f32b490

Message from syslogd@master1 at Mar  8 13:51:37 ...
 kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

I have tried to uninstall and reinstall packages based on this question but it didn't help.

Janina

5 Answers

14

The problem is a mismatch between currently installed shared libraries and the libraries that were built to install R or packages.

I got this error for the first time today (see below). I've solved it and can explain the situation.

This is an Ubuntu system that was recently upgraded from 17.10 to 18.04, running R-3.4.4. A lot of C and C++ libraries were replaced. But not all programs were replaced. Immediately I noticed that lots of programs were getting segmentation faults. Anything that touched the tidyverse was a fail. The stringi package could not find the shared libraries with which it was compiled.
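One way to confirm this kind of breakage is to ask the dynamic linker which shared libraries a package's compiled code can no longer resolve. The sketch below is only an illustration, not part of the original fix: the helper names `missing_from_ldd` and `check_package_libs` are made up here, and it assumes a Linux system with `ldd` on the PATH.

```r
# Extract the names of shared libraries that ldd reports as "not found".
missing_from_ldd <- function(ldd_lines) {
  hits <- grep("=> not found", ldd_lines, fixed = TRUE, value = TRUE)
  trimws(sub("=>.*$", "", hits))
}

# Run ldd on every compiled library shipped with an installed package.
# Assumption: Linux, with `ldd` available on the PATH.
check_package_libs <- function(pkg) {
  lib_dir <- system.file("libs", package = pkg)
  if (!nzchar(lib_dir)) return(character(0))  # package has no compiled code
  so_files <- list.files(lib_dir, pattern = "\\.so$",
                         full.names = TRUE, recursive = TRUE)
  missing <- lapply(so_files, function(so) {
    missing_from_ldd(system2("ldd", so, stdout = TRUE, stderr = TRUE))
  })
  as.character(unique(unlist(missing)))
}

# Example call (commented out so the sketch has no side effects):
# check_package_libs("stringi")
```

A non-empty result would suggest that the package was built against libraries that are no longer installed and should be rebuilt.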

The example here is a bit interesting because it happens when running the "R CMD check" for a package, which, at least in theory, should be safe. I found the fix was to remove the packages "RCurl" and "url" and rebuild them.

Here's the symptom, anyway

* checking for file ‘kutils.gitex/DESCRIPTION’ ... OK
* preparing ‘kutils’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* looking to see if a ‘data/datalist’ file should be added
* re-saving image files
* building ‘kutils_1.40.tar.gz’
Warning: invalid uid value replaced by that for user 'nobody'
Warning: invalid gid value replaced by that for user 'nobody'

Run check: OK? (y or n)y
* using log directory ‘/home/pauljohn/GIT/CRMDA/software/kutils/package/kutils.Rcheck’
* using R version 3.4.4 (2018-03-15)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* using option ‘--as-cran’
* checking for file ‘kutils/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘kutils’ version ‘1.40’
* checking CRAN incoming feasibility ...
 *** caught segfault ***
address 0x68456, cause 'memory not mapped'

Traceback:
 1: curlGetHeaders(u)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch(curlGetHeaders(u), error = identity)
 6: .fetch(u)
 7: .check_http_A(u)
 8: FUN(X[[i]], ...)
 9: lapply(urls[pos], .check_http)
10: do.call(rbind, lapply(urls[pos], .check_http))
11: check_url_db(url_db_from_package_sources(dir), remote = !localOnly)
12: doTryCatch(return(expr), name, parentenv, handler)
13: tryCatchOne(expr, names, parentenv, handlers[[1L]])
14: tryCatchList(expr, classes, parentenv, handlers)
15: tryCatch(check_url_db(url_db_from_package_sources(dir), remote = !localOnly),     error = identity)
16: .check_package_CRAN_incoming(pkgdir, localOnly)
17: check_CRAN_incoming(!check_incoming_remote)
18: tools:::.check_packages()
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault
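The remove-and-rebuild step described above can be sketched as follows. This is a rough outline under stated assumptions: `compiled_packages` is a hypothetical helper that lists every non-base package containing compiled code, which goes further than the two packages rebuilt in this answer, and the reinstall lines are left commented out because they need network access and a working compiler toolchain.

```r
# Hypothetical helper: names of installed, non-base packages that contain
# compiled code and are therefore linked against system libraries.
compiled_packages <- function(pkgs = installed.packages()) {
  needs <- pkgs[, "NeedsCompilation"]
  prio <- pkgs[, "Priority"]
  keep <- !is.na(needs) & needs == "yes" & (is.na(prio) | prio != "base")
  unname(pkgs[keep, "Package"])
}

# Rebuild from source so the packages link against the libraries that are
# installed now (commented out: requires network access and compilers).
# to_rebuild <- compiled_packages()
# install.packages(to_rebuild, type = "source")
```

Rebuilding only the packages named in the traceback is usually quicker; the broader list is a fallback when many packages fail at once.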
pauljohn32
7

It's not really an explanation of the problem or a satisfactory answer, but I examined the code more closely and figured out that in the first example the problem appears when using acast from the reshape2 package. I removed it in this case because I realized it isn't actually needed there, but it can be replaced with base R's reshape function from the stats package (as shown in another question): reshape(input, idvar="x", timevar="y", direction="wide")[-1].
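To illustrate the replacement on toy data, here is a minimal sketch. The data frame and its `value` column are assumptions made for the example; the real input has other columns and far more rows.

```r
# Toy long-format input: one value per (x, y) grid cell.
input <- data.frame(
  x = rep(1:2, each = 3),
  y = rep(1:3, times = 2),
  value = as.numeric(1:6)
)

# Base R reshape to wide format; [-1] drops the leading "x" id column.
wide <- reshape(input, idvar = "x", timevar = "y", direction = "wide")[-1]

# Convert to a plain matrix with x along rows and y along columns,
# which is the layout that acast(input, x ~ y) would have produced.
grid <- as.matrix(wide)
dimnames(grid) <- list(unique(input$x), unique(input$y))
```

This avoids the reshape2 code path entirely, although it does not explain why acast crashed.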

As for the second example, it's not easy to find the exact cause of the problem, but what helped as a workaround in my case was lowering the number of cores used for parallel computation. The cluster has 48 cores; I was using only 15, since even before this issue R ran out of memory when the code was run on all 48. When I reduced the number of cores to 10, it suddenly started working as before.
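A minimal sketch of capping the worker count by a memory budget is shown below. The helper `pick_workers` and the memory figures are hypothetical, chosen only for illustration, and the sketch uses the base parallel package rather than the doSNOW setup visible in the traceback.

```r
library(parallel)

# Hypothetical helper: limit workers so that the estimated memory per
# worker fits into the memory available on the node.
pick_workers <- function(total_mem_gb, per_worker_gb,
                         max_workers = detectCores()) {
  by_memory <- as.integer(floor(total_mem_gb / per_worker_gb))
  max(1L, min(as.integer(max_workers), by_memory))
}

# Example numbers are assumptions, not measurements from the cluster.
n_workers <- pick_workers(total_mem_gb = 64, per_worker_gb = 6,
                          max_workers = 48)

# Starting the cluster with the capped count (commented out here):
# cl <- makeCluster(n_workers)
# res <- parLapply(cl, 1:4, function(i) i^2)
# stopCluster(cl)
```

Each worker holds its own copy of the exported data, so memory use grows roughly with the number of workers.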

Janina
  • 1
    I reduced the number of cores per your advice, and that is the only thing that worked for me here. Thanks a lot! – Kim Jan 28 '19 at 19:38
1

To add to @pauljohn32's response, this can also happen if you are using Rcpp's sourceCpp to source C++ code, say A.cpp, that relies on other C++ code, say B.cpp and C.cpp, that was compiled against an older or different library.

An easy solution, on Linux, is to remove the B.o and C.o files before running sourceCpp("A.cpp"). This also seems to recompile the dependent files automatically, assuming you have the headers included in A.cpp.
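The cleanup step can be sketched like this. The helper `remove_object_files` is a made-up name, and the final call is commented out because it needs Rcpp, a compiler, and an actual A.cpp.

```r
# Hypothetical helper: delete stale object files (*.o) below a source
# directory and return the paths that were removed.
remove_object_files <- function(src_dir) {
  objs <- list.files(src_dir, pattern = "\\.o$",
                     full.names = TRUE, recursive = TRUE)
  removed <- file.remove(objs)
  objs[removed]
}

# Typical use before sourcing (commented out: needs Rcpp and A.cpp):
# remove_object_files(".")
# Rcpp::sourceCpp("A.cpp", rebuild = TRUE)
```

The `rebuild = TRUE` argument asks sourceCpp to recompile A.cpp itself even if a cached build exists.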


EDIT with more details in response to Matt Nolan: Regarding the original question, there the problem is most likely similar, with shared libraries having been compiled for an older version of the OS or a different system. What I am saying here is that even if you have written and compiled the entire project yourself, this could still happen if you forget to clean up outdated files.

To give an analogy relevant to the question: Digging into the source code for ncdf4 package referenced in the question, we find the following snippet in src\ncdf.c

#include <stdio.h>
#include <netcdf.h>
#include <string.h>
#include <stdlib.h>

#include <Rdefines.h>
#include <R_ext/Rdynload.h>

Let's say the file R_ext/Rdynload.h is part of the microsoft-r-open project. This is a header file and the corresponding Rdynload.c can be found here.

Suppose ncdf4 and microsoft-r-open were all part of a single project and you have already compiled the files in open/blob/master/source/src/main, which would have produced the object file Rdynload.o there, among other things. Then, before compiling src\ncdf.c, you upgrade the operating system (not sure if this would necessarily cause a problem) or copy the entire source code, including the object files created so far, to a different machine. This can happen inadvertently.

For example, you have automatic sync going on and the directory is synced with a different machine. On this different machine you then try to compile and link src\ncdf.c. The compiler/linker does not recompile Rdynload.c since the object file Rdynload.o is already there. It compiles src\ncdf.c to produce src\ncdf.o and then links it with Rdynload.o to build a final executable.

I am not an expert here, but perhaps because Rdynload is a dynamically linked library, the linking goes OK with no errors. At runtime, however, you get the segmentation fault due to a version mismatch between the object code for the compiled library Rdynload and the object code for ncdf (?). Someone with better knowledge of low-level machine execution can correct me here.

The solution is to purge all the object files, i.e., the files with extension *.o in all the source directories and let the compiler recompile everything from scratch. The *.o extension is assuming you are on a Linux machine. Other operating systems perhaps use a different extension.

In the case of a project you don't own, perhaps the solution is to reinstall the relevant libraries (assuming that they are not precompiled and get recompiled on the new machine at installation).

passerby51
0

For me the issue was a discrepancy in quotation types. I had a different quotation character where R wanted a ". Fixing this solved the issue.

haff
-5

It is highly recommended to clean the workspace; a saved workspace is probably the core problem:

unlink(".RData")

ecp
  • Could you clarify at which point I should clean the workspace? I never save my workspace. I run the scripts from terminal using `R --no-save < script.R` – Janina Mar 12 '18 at 14:51
  • You should be able to do this at the terminal. – ecp Mar 12 '18 at 15:02
  • Sorry, this doesn't work in any way... But I think I've tracked the problem to the package `reshape2`, I'll look more into it. – Janina Mar 13 '18 at 08:07
  • 1
    Segfaults are "by definition" a reason to contact the package maintainer. – IRTFM Mar 14 '18 at 18:30
  • 2
    @42- even if the problem is not reproducible? And in case of the second example, where I solved the problem by reducing the number of cores for parallel computation, maintainer of _which_ package should I contact? :-) – Janina Mar 15 '18 at 08:40
  • In my case (using BLINK, or rMVP package) it helped. It seems that indeed `.RData` causes this problem – boczniak767 Jul 22 '20 at 16:40