
Can fread from "data.table" be forced to successfully use "." as a sep value?

I'm trying to use fread to speed up my concat.split functions in "splitstackshape". See this Gist for the general approach I'm taking, and this question for why I want to make the switch.

The problem I'm running into is treating a dot (".") as a value for sep. Whenever I do so, I get an "unexpected character" error.

The following simplified example demonstrates the problem.

library(data.table)

y <- paste("192.168.1.", 1:10, sep = "")

x1 <- tempfile()
writeLines(y, x1)
fread(x1, sep = ".", header = FALSE)
# Error in fread(x1, sep = ".", header = FALSE) : Unexpected character (
# 192) ending field 2 of line 1

The workaround I have in my current function is to substitute "." with another character that is hopefully not present in the original data, say "|", but that seems risky to me since I can't predict what is in someone else's dataset. Here's the workaround in action.

x2 <- tempfile()
z <- gsub(".", "|", y, fixed=TRUE)
writeLines(z, x2)
fread(x2, sep = "|", header = FALSE)
#      V1  V2 V3 V4
#  1: 192 168  1  1
#  2: 192 168  1  2
#  3: 192 168  1  3
#  4: 192 168  1  4
#  5: 192 168  1  5
#  6: 192 168  1  6
#  7: 192 168  1  7
#  8: 192 168  1  8
#  9: 192 168  1  9
# 10: 192 168  1 10
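One way to make the substitution workaround less risky is to test the data for each candidate replacement before using it. The sketch below assumes a hypothetical helper, `pick_sep`, and an arbitrary candidate list; neither comes from "data.table" or "splitstackshape".

```r
library(data.table)

# Hypothetical helper (not part of any package): return the first
# candidate separator that does not occur anywhere in x.
pick_sep <- function(x, candidates = c("|", "\t", ";", "~", "^")) {
  for (cand in candidates) {
    # fixed = TRUE so that "|" and "^" are matched literally, not as regex
    if (!any(grepl(cand, x, fixed = TRUE))) return(cand)
  }
  stop("none of the candidate separators is safe for this input")
}

y <- paste("192.168.1.", 1:10, sep = "")

safe_sep <- pick_sep(y)                    # chosen only after checking the data
z <- gsub(".", safe_sep, y, fixed = TRUE)  # swap "." for the safe separator

x3 <- tempfile()
writeLines(z, x3)
res <- fread(x3, sep = safe_sep, header = FALSE)
```

If every candidate already appears in the data, the helper stops with an error instead of silently splitting on the wrong character. The cost is one extra pass over the input per candidate tested.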

For the purposes of this question, assume that the data are balanced (each line will have the same number of "sep" characters). I'm aware that using a "." as a separator is not the best idea, but I'm just trying to account for what other users might have in their datasets, based on other questions I've answered here on SO.

A5C1D2H2I1M1N2O1R2T1
  • I haven't looked much at the source for `fread`, so not to ask the obvious, but have you tried escaping the `\\.` ? – Ricardo Saporta Oct 08 '13 at 05:16
  • @RicardoSaporta, yes. You'll get an error: `Error in fread(x1, sep = "\\.", header = FALSE) : 'sep' must be 'auto' or a single character`. – A5C1D2H2I1M1N2O1R2T1 Oct 08 '13 at 05:17
  • I just noticed that after my comment. hmmm... I have no idea. Maybe @MattDowle can chime in? – Ricardo Saporta Oct 08 '13 at 05:19
  • @RicardoSaporta, that's what I'm hoping--then I can also ask him whether `fread` would support a `text` argument like `read.table` does :) – A5C1D2H2I1M1N2O1R2T1 Oct 08 '13 at 05:20
  • it's unclear to me whether this should be read as 4 columns or 2 columns (of doubles), but either way seems like a bug - file a bug report? – eddi Oct 08 '13 at 16:16
  • If the sep "character" is allowed to be a string of multiple characters then you can make your workaround more robust by `z <- gsub(".", "|||||", y, fixed=TRUE)` `fread(x2, sep = "|||||", header = FALSE)` – Dean MacGregor Feb 04 '14 at 23:00

2 Answers


Now implemented in v1.9.5 on GitHub.

> input = paste( paste("192.168.1.", 1:5, sep=""), collapse="\n")
> cat(input,"\n")
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
192.168.1.5 

Setting sep='.' results in ambiguity with the new argument dec (by default '.') :

> fread(input,sep=".")
Error in fread(input, sep = ".") : 
  The two arguments to fread 'dec' and 'sep' are equal ('.')

Therefore choose something else for dec :

> fread(input,sep=".",dec=",")
    V1  V2 V3 V4
1: 192 168  1  1
2: 192 168  1  2
3: 192 168  1  3
4: 192 168  1  4
5: 192 168  1  5

You may get a warning :

> fread(input,sep=".",dec=",")
     V1  V2 V3 V4
 1: 192 168  1  1
 2: 192 168  1  2
 3: 192 168  1  3
 4: 192 168  1  4
 5: 192 168  1  5
Warning message:
In fread(input, sep = ".", dec = ",") :
  Run again with verbose=TRUE to inspect... Unable to change to a locale
  which provides the desired dec. You will need to add a valid locale name
  to getOption("datatable.fread.dec.locale"). See the paragraph in ?fread.

Either ignore or suppress the warning, or read the paragraph and set the option :

options(datatable.fread.dec.locale = "fr_FR.utf8")

This ensures there can be no ambiguity.
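If you prefer to silence only the locale warning rather than every warning, a calling handler can filter on the message text. This is a minimal sketch; it assumes the warning message contains the word "locale", as in the output shown above. Versions of data.table that do not emit the warning are unaffected by the handler.

```r
library(data.table)

input <- paste(paste("192.168.1.", 1:5, sep = ""), collapse = "\n")

res <- withCallingHandlers(
  fread(input, sep = ".", dec = ",", header = FALSE),
  warning = function(w) {
    # Muffle only warnings that mention the locale; all others still surface
    if (grepl("locale", conditionMessage(w), fixed = TRUE)) {
      invokeRestart("muffleWarning")
    }
  }
)
```

Unlike a blanket `suppressWarnings()`, this keeps unrelated warnings from `fread` visible.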

Matt Dowle

The issue seems to be related to the numeric value of the text itself:

library(data.table)

y <- paste("Hz.BB.GHG.", 1:10, sep = "")

xChar <- tempfile()
writeLines(y, xChar)
fread(xChar, sep = ".", header = FALSE)
#     V1 V2  V3 V4
#  1: Hz BB GHG  1
#  2: Hz BB GHG  2
#  3: Hz BB GHG  3
#  4: Hz BB GHG  4
#  5: Hz BB GHG  5
#  6: Hz BB GHG  6
#  7: Hz BB GHG  7
#  8: Hz BB GHG  8
#  9: Hz BB GHG  9
# 10: Hz BB GHG 10

However, trying again with the original values gives the same error:

fread(x1, sep = ".", header = FALSE, colClasses="numeric", verbose=TRUE)
fread(x1, sep = ".", header = FALSE, colClasses="character", verbose=TRUE)

 Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
 Looking for supplied sep '.' on line 10 (the last non blank line in the first 'autostart') ... found ok
 Found 4 columns
 First row with 4 fields occurs on line 1 (either column names or first row of data)
 Error in fread(x1, sep = ".", header = FALSE, colClasses = "character",  : 
   Unexpected character (192.) ending field 2 of line 1

This, however, does work:

read.table(x1, sep=".")
#     V1  V2 V3 V4
# 1  192 168  1  1
# 2  192 168  1  2
# 3  192 168  1  3
# 4  192 168  1  4
# ... <cropped>
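Because `read.table` accepts a `text` argument (mentioned in the comments on the question), the same split can be done without writing a temporary file. A minimal base-R sketch, reusing the `y` vector from the question:

```r
y <- paste("192.168.1.", 1:10, sep = "")

# read.table splits on sep before converting column types,
# so "." as a separator does not clash with the default dec = "."
res <- read.table(text = y, sep = ".", header = FALSE)
```

This avoids the error, but it does not address the speed concern that motivated the switch to `fread`.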
Nimantha
Ricardo Saporta
  • Hmmm. This is interesting. By extension, if we had `y <- paste("Hz.BB.GHG.", 1:10, 11:20, sep = "")`, again we would get an error. Any idea why? – A5C1D2H2I1M1N2O1R2T1 Oct 08 '13 at 05:29
  • Regarding your edit (`read.table`), that is what I presently use in one of the versions of `concat.split`. See `splitstackshape:::read.concat`. – A5C1D2H2I1M1N2O1R2T1 Oct 08 '13 at 05:38
  • It's almost 7am in London, I don't know what Matt is doing not being on stackoverflow ;) Good luck with this, I am off to bed (i'll delete this answer in the morning) – Ricardo Saporta Oct 08 '13 at 05:39
  • I'm guessing that the problem with numeric values is that it assumes the `.` is a decimal. – A5C1D2H2I1M1N2O1R2T1 Oct 08 '13 at 06:28
  • Using any of `%+-` as separators instead of `.` works fine. Using `y <- paste(".A.168.1.", 1:10, sep = "", collapse="\n")`, i.e. with an extra `.` at the beginning, also produces the same error, so it might not be related to numeric treatment at all. In my error message there is a line break after `Unexpected character (`, so there could be some non-printing character creeping in from somewhere? – Peter Oct 08 '13 at 10:41
  • @AnandaMahto `fread(x1, sep = ".", header = FALSE, colClasses="numeric", verbose=TRUE)` works fine with me (`data.table` 1.8.10). Also (sorry prob it's stupid) if `fread(x1, sep = ".", header = FALSE)` is already working for both of you what is the real problem? – Michele Oct 08 '13 at 11:29
  • @Michele, with which `x1`? The one from Ricardo's answer or the one from my question? Ricardo, I'm editing your answer to avoid confusion. – A5C1D2H2I1M1N2O1R2T1 Oct 08 '13 at 11:39
  • `data.table 1.8.10` here as well, 1006 tests (latest on CRAN). Using `y <- paste("192.168.1.", 1:10, sep = "")` fails on both R 2.15.3 (Revolution) and 3.0.1 with the error message as above. Using `y <- paste(".A.168.1.", 1:10, sep = "", collapse="\n")` and passing to `fread()` as a text rather than file, gives `... Unexpected character ( .A.1) ending field 4 of line 1`, (with the line break before `.A.1`, hence my comment about not being related to `.` as decimal (two decimals in my error message). – Peter Oct 08 '13 at 11:49
  • @AnandaMahto oh sorry, I mean the Ricardo's. (all of them work). In particular: in the one where `colClasses="numeric"` it says normally `Column 1 ('V1') has been detected as type 'character'. Ignoring request from colClasses to read as 'numeric' (a lower type) since NAs would result.` and where `colClasses="character"` it says: `Column 4 ('V4') was detected as type 'integer' but bumped to 'character' as requested by colClasses`. – Michele Oct 08 '13 at 11:50