
In the post "aggregation using ffdfdply function in R", there is a line like this:

splitby <- as.character(data$Date, by = 250000)

Just out of curiosity, I wonder what the by argument means. It seems to be related to ff data frames, but I'm not sure. A Google search and the R documentation for as.character and as.vector provided no useful information.

I tried some examples, but the calls below all give the same result.

d <- seq.Date(Sys.Date(), Sys.Date()+10000, by = "day")
as.character(d, by=1)
as.character(d, by=10)
as.character(d, by=100)

If anybody could tell me what it is, I'd appreciate it. Thank you in advance.

dixhom

2 Answers


as.character.ff uses the default as.character internally, and because ff vectors can be larger than RAM, the data has to be processed in chunks. The partition into chunks is handled by the chunk function; in this case, the relevant method is chunk.ff_vector. By default, this calculates the chunk size by dividing getOption("ffbatchbytes") by the record size. However, this behaviour can be overridden by supplying the chunk size using by.

In the example you give, the ff vector will be converted to character 250000 members at a time.

The end result will be the same for any value of by, or with by omitted altogether. Larger values will lead to greater temporary use of RAM but potentially quicker operation.
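To make the chunking idea concrete, here is a minimal base R sketch. It does not use ff at all: chunked_as_character is a hypothetical helper written only for illustration, and it assumes nothing about how as.character.ff is actually implemented beyond "convert by elements at a time". It shows why the chunk size changes the amount of data handled per step but not the result.

```r
# Hypothetical helper (not part of ff or ffbase): convert x to character
# `by` elements at a time, the way a chunked conversion would.
chunked_as_character <- function(x, by) {
  n <- length(x)
  if (n == 0) return(character(0))
  out <- character(n)
  for (from in seq(1, n, by = by)) {
    to <- min(from + by - 1, n)
    # Only this slice of x is converted in the current step
    out[from:to] <- as.character(x[from:to])
  }
  out
}

d <- seq(as.Date("2015-01-01"), by = "day", length.out = 1000)
res_small <- chunked_as_character(d, by = 10)   # 100 small chunks
res_large <- chunked_as_character(d, by = 250)  # 4 larger chunks

identical(res_small, res_large)        # same output for either chunk size
identical(res_small, as.character(d))  # and the same as an unchunked call
```

With an in-memory vector the chunk size is irrelevant; it only matters for ff vectors, where each chunk has to be pulled into RAM before it is converted.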

Nick Kennedy

First, that call dispatches to the as.character.ff method from ffbase, not plain old base::as.character.

See http://www.inside-r.org/packages/cran/ffbase/docs/as.character.ff which says

as.character(x, ...)

Arguments:
x: a ff vector
...: other parameters passed on to chunk

So the by argument is being passed through to some chunk function. Then you need to figure out which package's chunk function is being used. Type ?chunk, tell us which one, then go read its doc to see what its by argument does.
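As a rough illustration of how an argument such as by can travel through ... to a chunking function, here is a base R sketch. toy_chunk and toy_wrapper are hypothetical stand-ins invented for this example, and the default of 100 is made up; the real chunk methods in ff and bit are more involved.

```r
# Hypothetical stand-in for a chunking function: returns the start and end
# index of each chunk for a vector of length n.
toy_chunk <- function(n, by = 100) {
  starts <- seq(1, n, by = by)
  data.frame(from = starts, to = pmin(starts + by - 1, n))
}

# Hypothetical stand-in for a method that has no `by` argument of its own
# but forwards everything in `...` to the chunking function via do.call.
toy_wrapper <- function(x, ...) {
  do.call(toy_chunk, c(list(n = length(x)), list(...)))
}

x <- 1:1000
nrow(toy_wrapper(x))            # 10 chunks with the made-up default of 100
nrow(toy_wrapper(x, by = 250))  # 4 chunks once `by` is forwarded
```

The wrapper never mentions by, yet the value still reaches toy_chunk, which is the same pattern as "other parameters passed on to chunk" in the documentation quoted above.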

smci
  • The `by` is being passed to `chunk.default`, so the documentation for it in `?chunk` is as accurate as it gets: `by: increment of the sequence` (to be used as in `?seq`). – SimonG Jun 27 '15 at 20:48
  • Then it's just being ignored. In R all unknown or unused args get ignored, silently. This might have been a cut-and-paste typo by the original poster, e.g. from copying a seq(...) command. – smci Jun 27 '15 at 20:54
  • I don't think it is being ignored. Both `chunk.ff_vector` and `chunk.ffdf` capture `...` and pass it on to `chunk.default` via `do.call`. If `by` is specified, then `chunk.default` will use it (as the documentation suggests, in a similar way as `seq` uses it). – SimonG Jun 27 '15 at 20:56
  • Ok, but I can't find a documentation page for `ffbase::chunk.default` , they could do with adding one... – smci Jun 27 '15 at 21:09
  • The documentation for this function (`as.character.ff`) isn't particularly helpful in this regard. Neither your answer nor that of @smci really explain what the end result is of varying `by` in this case. To really understand what it was doing I had to look at the code for this function as well as the code for two methods of `chunk`! – Nick Kennedy Jun 27 '15 at 21:12
  • @NickK yes, I couldn't find the source for `ffbase::chunk.default`, I've never used it and I wasn't going to install yet another package just to answer this question. Really someone needs to add the missing doc pages for that package. A common story in R packages... I did point the OP towards how to solve it for themselves. – smci Jun 27 '15 at 21:16
  • @smci entirely agree that the problem is the documentation! The relevant function is in the `bit` package. The problem is that even with the documentation for that package, it's not clear what passing `by` to `chunk.ff_vector` does since it's not mentioned in the help page. Also `?chunk` doesn't give the full answer because it's the `ff_vector` method rather than `default` one that gets called (although that in turn calls `chunk.default`!) – Nick Kennedy Jun 27 '15 at 21:21
  • `ffbase::chunk.default` does not exist because `chunk.default` is in `base`. – SimonG Jun 27 '15 at 21:21
  • @SimonG no it's specifically not, that's what the OP's issue is here. They're in [`bit::chunk`](http://www.inside-r.org/packages/cran/bit/docs/chunk) and `bit::chunk.default`. Evidently `bit` package is a dependency of `ffbase` which gets imported. Like I said there is also a `base::chunk` and that is **not** what's getting called here. – smci Jun 27 '15 at 23:09