
Given data.table dt:

dt <- structure(list(V1 = c("1544018118438041139", "1544018118466235879", 
"1544018118586849680", "1544018118601169211", "1544018118612947335", 
"1544018118614422179"), V2 = c("162", "162", "161.05167", "158.01309", 
"157", "157"), V3 = c("38", "38", "36.051697", "33.01306", "32", 
"32"), V4 = c("0.023529414", "0.025490198", "0.023529414", "0.027450982", 
"0.03137255", "0.03137255"), V5 = c("1", "1", "1", "1", "1", 
"1"), V6 = c("2131230815", "2131230815", "2131230815", "2131230815", 
"2131230815", "2131230815"), V7 = c("1", "0", "0", "0", "0", 
"-1")), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L))

I want the first column converted with bit64::as.integer64() and the rest of the columns with as.numeric()

I am trying to do this:

dt <- dt[ ,V1 := bit64::as.integer64(V1)]

dt[, lapply(.SD, as.numeric), .SDcols = -c("V1")]

But it doesn't seem to do what I want. How can I change specific columns to one class (say integer64) and the rest to another class (say numeric)?
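
The reason the second line appears to do nothing: `:=` modifies dt by reference, but a plain `dt[, lapply(.SD, ...)]` only returns a converted copy. A minimal check (using the dt above):

# The first statement converts V1 in place ...
dt[, V1 := bit64::as.integer64(V1)]

# ... but this one only returns a converted copy; dt itself is unchanged
out <- dt[, lapply(.SD, as.numeric), .SDcols = -c("V1")]
sapply(out, class)   # all "numeric"
sapply(dt, class)    # V2..V7 still "character"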

  • But I want it to be the first column and the rest of the columns; I can't guarantee I will always get these column names, so I want it to be general. @markus – SteveS Dec 14 '18 at 22:24
  • Again @markus, the first column, not "V1" – SteveS Dec 14 '18 at 22:27
  • 1
    `.SDcols = setdiff(names(dt), names(dt)[1])` – markus Dec 14 '18 at 22:28
  • Still doesn't work, it shows the first column as int64 but the rest still char, maybe I need to inject with = FALSE? @markus – SteveS Dec 14 '18 at 22:30
  • 1
    You need to do `dt <- dt[, lapply(...]` but not in the first case because with `:=` you change the `V1` column by reference. – markus Dec 14 '18 at 22:34
  • @markus would you be so kind as to explain what it means to change a column by reference? – SteveS Dec 14 '18 at 22:37
  • 1
    https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html under **b) The `:=` operator** – markus Dec 14 '18 at 22:39
  • And it still doesn't work....please post the whole solution that works for you. I am getting V2 - V7 without V1 @markus – SteveS Dec 14 '18 at 22:39
  • Take a look at [Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)](https://stackoverflow.com/questions/20459519/apply-function-on-a-subset-of-columns-sdcols-whilst-applying-a-different-func) – markus Dec 14 '18 at 22:50
  • @markus tried it already, it didn't work. dt[, c(list(V1 = as.integer64(V1), lapply(.SD, as.numeric))), .SDcols = setdiff(names(dt), names(dt)[1])] – SteveS Dec 14 '18 at 22:57
  • Does this work? `dt[, names(dt)[-1] := lapply(.SD, as.numeric), .SDcols = names(dt)[-1]]` – Mike H. Dec 14 '18 at 23:02
  • @MikeH. yes sure, it works - all cols besides the first one became dbl, the first one is still char – SteveS Dec 14 '18 at 23:04
  • @MikeH. please explain the solution – SteveS Dec 14 '18 at 23:12
  • @MikeH. Please be so kind as to help me understand how to deal with these big 19-digit numbers in R; I have tried the gmp package and bit64, but when I run them R crashes. – SteveS Dec 14 '18 at 23:28
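
Putting the comment thread together, a minimal general sketch (first column selected by position, so the column names don't matter; both statements modify dt by reference, no reassignment needed):

library(data.table)
library(bit64)

# First column -> integer64, whatever it is called
dt[, names(dt)[1] := lapply(.SD, bit64::as.integer64), .SDcols = names(dt)[1]]

# All remaining columns -> numeric
dt[, names(dt)[-1] := lapply(.SD, as.numeric), .SDcols = names(dt)[-1]]

sapply(dt, class)   # first column "integer64", the rest "numeric"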

3 Answers


From the comments above it seems like you want to be able to do this all in one step rather than convert the first to integer64 and then the rest to double. One way you can do this is with:

dt[, names(dt) := Map(function(fun, x) fun(x),
                      rep(list(bit64::as.integer64, as.numeric),
                          times = c(1, length(.SD) - 1)),
                      .SD),
   .SDcols = names(dt)]

The Map function iterates through your inputs together. That is, it takes the first elements of your first and second vectors and passes them as arguments to our function. Then it takes the second elements of both vectors and passes those to the function.

In our Map call we have:

  • A main function to apply. This is an anonymous function which takes two arguments: (1) fun and (2) x. The result of our function is the result of applying fun to x, i.e. fun(x). For a concrete example try:

    myfun <- function(fun, x) {
      fun(x)
    }
    res <- myfun(as.numeric, c("1", "1")); class(res)  # "numeric"
    
  • A list of functions to pass to our main function. These will be used as fun in our main function. In this case it's list(as.integer64, as.numeric, as.numeric, ...).

  • A list of vectors to pass to our main function. These will be used as x in our main function. In this case each column of our dt.

A quick and dirty visual aid of how this works (assuming custom_function takes two arguments):

[diagram from the original answer: Map pairing each element of the function list with each input vector]
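
In place of the graphic, a minimal sketch of the pairing (nothing here modifies dt; it only shows which function meets which column):

# Starting from the all-character dt in the question:
funs <- rep(list(bit64::as.integer64, as.numeric), times = c(1, ncol(dt) - 1))

# Map pairs funs[[1]] with dt[[1]], funs[[2]] with dt[[2]], and so on
converted <- Map(function(fun, x) fun(x), funs, dt)

sapply(converted, class)   # V1 "integer64", V2 through V7 "numeric"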

– Mike H.
  • 1
    @steves let me know if its unclear. I think I might add a graphic about `Map` – Mike H. Dec 14 '18 at 23:27
  • it's super clear!!!! You definitely solved this kind of issue for me, but the biggest issue is still dealing with these big 19-digit timestamp numbers. I have tried almost everything but nothing helped me deal with these big numbers without losing precision. Maybe I can show you my issues somehow; maybe you've already seen the same issues. – SteveS Dec 14 '18 at 23:33
  • @steves I'm sorry, it's hard to say without seeing specifics. This might be better as a new question. Also, I don't have as much experience storing large numbers. Maybe consider taking the log or some other transformation? – Mike H. Dec 14 '18 at 23:51
  • These are nanosecond timestamps. All transformations, including casting to double, lose information. – Dirk Eddelbuettel Dec 15 '18 at 02:24
  • @DirkEddelbuettel please have a look at my answer; I managed to deal with it. – SteveS Dec 15 '18 at 16:57

It looks to me like you have a data.table object with integer64 nanosecond timestamps since the epoch. I use the same at work to represent high-resolution timestamps.

The good news is that data.table supports this -- by relying on our package nanotime, which itself uses bit64 for the integer64 type. However, I create my timestamps differently, typically from compiled code where I retrieve the data.

I described this in some detail at the Rcpp Gallery in this post. So some good news: this can be done. Some bad news: I don't think you can do it the way you want it, because we can only go via double, which has only 16 decimals of precision, not 19. But maybe I am missing a trick, so if a simpler solution exists I'd be all ears. (And I keep forgetting whether there is a 'parse int64 from string' approach. I never went that route because you can't do that at scale -- I deal with pretty sizeable data sets too.)
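
For what it's worth, bit64 does parse integer64 from strings: bit64::as.integer64() on a character vector is exact (this is what the question's conversion relies on). A small sketch; the nanotime call assumes that package is installed and that the value is nanoseconds since the epoch:

library(bit64)

x <- "1544018118438041139"   # one of the question's 19-digit values

# Going via double silently rounds: a double carries ~16 significant digits
format(as.numeric(x), digits = 19)   # trailing digits no longer match

# Parsing from string is exact
as.integer64(x)   # integer64: 1544018118438041139

# nanotime (built on bit64) can then interpret it as a timestamp
library(nanotime)
nanotime(as.integer64(x))   # 2018-12-05T13:55:18.438041139+00:00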

– Dirk Eddelbuettel

Thanks guys. @dirk_eddelbuettel, I managed to do this:

1) Load all the JSON files (in my case) and pass bigint_as_char = TRUE to the fromJSON command.

2) Now you have a big table with all columns as characters.

3) Convert the timestamp column with bit64::as.integer64() - you get the numbers I want.

4) Convert the rest to desired types.

5) When I want to perform calculations, for example timestamp - lag(timestamp), I add lag_timestamp = lag(timestamp) as a new column (with dplyr::mutate) and then add a diff column, storing the result with as.character().

6) You are almost done: the new diff column stores the value I want as a string/character, and now you can convert it with as.numeric() where needed or apply ifelse() to deal with irrelevant values.

7) That's all, it works perfectly for me and doesn't crash RStudio (the steps are sketched in code below).

Before I arrived at this solution, RStudio kept crashing.
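
A sketch of steps 1) through 6) in code. The file name and the timestamp column name are made up for illustration; jsonlite is assumed as the JSON reader (its fromJSON() takes the bigint_as_char argument), and data.table::shift stands in for dplyr::lag since shift preserves integer64:

library(jsonlite)
library(data.table)
library(bit64)

# 1)-2) Parse, keeping 19-digit integers as character strings
dt <- as.data.table(fromJSON("ticks.json", bigint_as_char = TRUE))

# 3) Timestamp column to integer64: exact, all 19 digits kept
dt[, timestamp := as.integer64(timestamp)]

# 4) Everything else to numeric, by reference
rest <- setdiff(names(dt), "timestamp")
dt[, (rest) := lapply(.SD, as.numeric), .SDcols = rest]

# 5)-6) Lagged difference, stored as character first, numeric where needed
dt[, lag_timestamp := shift(timestamp)]
dt[, diff_chr := as.character(timestamp - lag_timestamp)]
dt[, diff_num := as.numeric(diff_chr)]   # NA for the first row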

– SteveS
  • JSON? _shudders_ Text conversion is slow; in essence your answer is the same as mine, as you too have to resort to a helper package. – Dirk Eddelbuettel Dec 15 '18 at 17:02
  • Just a little promo: I am a data scientist who does the initial research using files in AWS S3. Afterwards I don't need to deal with it, because ETL professionals provide me with a relational db with all the columns I need. I just connect to it and skip the data preparation stage. – SteveS Dec 15 '18 at 17:16