The following consistently crashes my R session.
Tested on two machines, Ubuntu and Mac OS X with similar results on both.

Brief Description:
Calling write.table on a data.frame with a factor column that prints as all NA's.

The original data set is rather large, and I've managed to isolate the offending column and then create a similar vector, named PROBLEM_DATA below, which causes the same crash.

Interestingly, sometimes R crashes outright; other times it simply throws the following error:

Error in write.table(x, file, nrow(x), p, rnames, sep, eol, na, dec, as.integer(quote),  : 
  'getCharCE' must be called on a CHARSXP

Any thoughts as to the cause of the crash or should it be submitted as a bug?

Offending data and call:

PROBLEM_DATA <- structure(114:116, .Label = c("String1", "String2", "String3", "String4", "String5", "String6", 
                   "String7", "String8", "String9", "String10", "String11", "String12", "String13", "String14", "String15"), class = "factor")

# This will cause a crash
write.table(PROBLEM_DATA, file=path.expand("~/test.csv"))

# This will also crash
write.table(PROBLEM_DATA, file=path.expand("~/test.csv"), fileEncoding="UTF-8")

SESSION INFO OF EACH MACHINE

UBUNTU

R version 2.15.3 (2013-03-01)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C        
 [5] LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C           
 [9] LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gdata_2.12.0     ggplot2_0.9.3    stringr_0.6.1    RMySQL_0.9-3     DBI_0.2-5       
[6] data.table_1.8.8

loaded via a namespace (and not attached):
 [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-0   dichromat_1.2-4   
 [5] digest_0.5.2       grid_2.15.3        gtable_0.1.1       gtools_2.7.0      
 [9] labeling_0.1       munsell_0.4        plyr_1.7.1         proto_0.3-9.2     
[13] reshape2_1.2.1     scales_0.2.3       tools_2.15.3

Mac OS X

R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
Ricardo Saporta
  • I can only report that it also happens for me with R version 3.0.0: `> head(PROBLEM_DATA) [1] 15 Levels: String1 String2 String3 String4 String5 String6 String7 ... String15 > write.table(PROBLEM_DATA, file=path.expand("~/test.csv")) *** caught segfault *** address 0x1, cause 'memory not mapped' Traceback: 1: write.table(PROBLEM_DATA, file = path.expand("~/test.csv")) Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace ` – vodka Apr 04 '13 at 16:35
  • Same here on 2.15.1 on Linux - also happens with a smaller vector `PD=structure(11:12,.Label=c("Foo","Bar"),class="factor")`. I say check the changelog and nightly R and then report as a bug. – Spacedman Apr 04 '13 at 16:38
  • Same problem (alternating crashing and complaining about 'getCharCE') on R 2.15.2 64 bits @ Win7 – Ferdinand.kraft Apr 04 '13 at 17:38
  • Same here on Win7 R64 2.15.3. It worked the first time; the output file contained a column named "x" with values 78, 79, 80. The next time I got the error message, and the file contained "x" and the value 1. Can all you commenters post what output, if any, showed up? – Carl Witthoft Apr 04 '13 at 17:58
  • I'm guessing the fact that you force levels 114:116 to exist but only define labels for levels 1:15 has a lot to do with it. Take a look at `as.numeric(PROBLEM_DATA)` as well as `as.numeric(as.character(PROBLEM_DATA))` (per the R_FAQ). You end up with a bunch of levels which have the same (nonexistent) name. – Carl Witthoft Apr 04 '13 at 18:11
  • I don't get a crash under vanilla R, but I did get one after having loaded the `gdata` package. Perhaps related to http://stackoverflow.com/q/10939516/892313 ? In general, crashes are always bugs. The question is if it is a bug in `gdata` or base R. – Brian Diggs Apr 04 '13 at 20:06
  • I am also able to reproduce a crash on R-3.0.0, but now it is giving the above error message sometimes... these are big tables with 1.1 to 1.8 million rows. I'm using --vanilla and it still crashes... very frustrating - it was working fine until this morning - I don't know what has changed at all! – Sean Apr 30 '13 at 09:54
  • ah. for me it seems to be related to write.csv() on a data.table rather on a plain data.frame – Sean Apr 30 '13 at 10:50
  • @Sean, very interesting. Did you use `rbindlist` at some point in creating the larger DT? – Ricardo Saporta Apr 30 '13 at 13:50
  • @RicardoSaporta yes indeed I did... I can't reproduce it for tiny data.tables though, even with rbindlist() – Sean Apr 30 '13 at 13:56
  • @Sean Can you post your code to a new question? (you can link back to this one) – Ricardo Saporta Apr 30 '13 at 14:01
  • @RicardoSaporta I posted as requested at http://stackoverflow.com/questions/16315551/r-crash-on-write-csv-for-a-data-table – Sean May 01 '13 at 09:13

2 Answers

This is a nice reproducible bug and should be reported to R-devel or with bug.report(). FWIW, I see it on

> sessionInfo()
R version 3.0.0 Patched (2013-04-03 r62485)
Platform: x86_64-unknown-linux-gnu (64-bit)

If on Linux I configure R with CFLAGS="-g -O0" I can

R -d gdb
(gdb) break Rf_error
(gdb) run

then paste your lines above and end up at

> write.table(PROBLEM_DATA, file=path.expand("~/test.csv"))

Breakpoint 1, Rf_error (format=0x7ffff7a8f0f0 "'%s' must be called on a CHARSXP") at /home/mtmorgan/src/R-3-0-branch/src/main/errors.c:753
753     RCNTXT *c = R_GlobalContext;
(gdb) up 3
#3  0x00007ffff1b9bfb3 in EncodeElement2 (x=0x31ccf50, indx=113, quote=TRUE, qmethod=TRUE, buff=0x7fffffffbdc0, cdec=46 '.')
    at /home/mtmorgan/src/R-3-0-branch/src/library/utils/src/io.c:938
938     p0 = translateChar(STRING_ELT(x, indx));
(gdb) call Rf_PrintValue(x)
 [1] "String1"  "String2"  "String3"  "String4"  "String5"  "String6" 
 [7] "String7"  "String8"  "String9"  "String10" "String11" "String12"
[13] "String13" "String14" "String15"
(gdb) p indx
$1 = 113

which shows R trying to read the 114th element of the factor's levels (indx is zero-based, hence 113) from a vector that holds only 15 strings -- clearly things have gone wrong because the factor has integer codes beyond the length of its levels.
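
To make that diagnosis checkable from R itself, without calling write.table on the malformed object, here is a minimal sketch. The helper name has_valid_codes is hypothetical (it is not part of base R), and the example uses the levels attribute directly in place of .Label; it only tests whether every non-NA integer code points at an existing level.

```r
# Hypothetical helper (not part of base R): TRUE if every non-NA integer
# code of a factor lies between 1 and the number of levels.
has_valid_codes <- function(f) {
  stopifnot(is.factor(f))
  codes <- as.integer(f)  # underlying integer codes, attributes dropped
  all(is.na(codes) | (codes >= 1L & codes <= nlevels(f)))
}

# Same shape as PROBLEM_DATA: codes 114:116 but only 15 levels.
bad <- structure(114:116,
                 levels = paste0("String", 1:15),
                 class = "factor")

# A factor built the usual way always has consistent codes.
good <- factor(c("a", "b", "a"))

has_valid_codes(bad)   # FALSE
has_valid_codes(good)  # TRUE
```

Running such a check on each factor column before write.table would at least point to the offending column; it does not explain how the factor became inconsistent in the first place.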

Martin Morgan
Not an answer, but a long comment:

PROBLEM_DATA <- structure(c(1:5,114:116), .Label = c("String1", "String2", "String3",'string4','str5','str6','str7'),class='factor')
Rgames> as.numeric(PROBLEM_DATA)
[1]   1   2   3   4   5 114 115 116
Rgames> as.numeric(as.character(PROBLEM_DATA))
[1] NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion 
Rgames> levels(PROBLEM_DATA)
[1] "String1" "String2" "String3" "string4" "str5"    "str6"    "str7"   
Rgames> write.table(PROBLEM_DATA, file=path.expand("~/ctest.csv"))
Error in write.table(x, file, nrow(x), p, rnames, sep, eol, na, dec, as.integer(quote),  : 
  'getCharCE' must be called on a CHARSXP

ctest.csv contains: (each line is a single cell so far as Excel is concerned)

x
1 "String1"
2 "String2"
3 "String3"
4 "string4"
5 "str5"
6

So you can see something going bad when there's a 'gap' in the levels' underlying numbering. Hope this provides a clue to someone who understands factors a lot more than I do.
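
As a possible workaround while the underlying bug is open, the factor can be rebuilt so that codes with no matching level become NA before anything is written. This is only a sketch under that assumption: repair_factor is a hypothetical helper, not a base R function, and turning the stray codes into NA is one choice among several.

```r
# Hypothetical helper: set codes without a matching level to NA and
# rebuild a well-formed factor that keeps the original levels.
repair_factor <- function(f) {
  stopifnot(is.factor(f))
  lev <- levels(f)
  codes <- as.integer(f)
  out_of_range <- !is.na(codes) & (codes < 1L | codes > length(lev))
  codes[out_of_range] <- NA_integer_
  factor(lev[codes], levels = lev)
}

# Same malformed factor as above, built via the levels attribute.
broken <- structure(c(1:5, 114:116),
                    levels = c("String1", "String2", "String3",
                               "string4", "str5", "str6", "str7"),
                    class = "factor")

fixed <- repair_factor(broken)
as.integer(fixed)  # 1 2 3 4 5 NA NA NA

# Writing the repaired factor goes through the normal code path.
out <- tempfile(fileext = ".csv")
write.table(fixed, file = out)
readLines(out)
```

The repaired factor keeps all seven levels, so the first five rows are written as before and the three stray codes appear as NA.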

Carl Witthoft