
Question:

I'm working in R. I want the shared columns of 2 data.tables (shared meaning same column name) to have matching classes. I'm struggling with a way to generically convert an object of unknown class to the unknown class of another object.


More context:

I know how to set the class of a column in a data.table, and I know about the as function. Also, this question isn't entirely data.table specific, but it comes up often when I use data.tables. Further, assume that the desired coercion is possible.

I have 2 data.tables. They share some column names, and those columns are intended to represent the same information. For the column names shared by table A and table B, I want the classes in A to match those in B (or the other way around).


Example data.tables:

A <- structure(list(year = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L)), .Names = c("year", "stratum"), row.names = c(NA, -45L), class = c("data.table", "data.frame"))

B <- structure(list(year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L), bt = c(-9.95187702337873, -9.48946944434626, -9.74178662514147, -5.36167545158338, -4.76405522202426, -5.41964239804882, -0.0807951335119085, 0.520481719699774, 0.0393874225863578, 5.40557402913123, 5.47927931969583, 5.37228402911139, 9.82774396910091, 9.89629694010177, 9.98105260936272, -9.82469892896284, -9.42530210357904, -9.66171049964775, -5.17540952901709, -4.81859082470115, -5.3577146169737, -0.0685310909609001, 0.441383303157166, -0.0105897444321987, 5.24205882775199, 5.65773605162835, 5.40217185632441, 9.90299445851434, 9.78883672575814, 9.98747998379124, -9.69843398105195, -9.31530717395811, -9.77406601252698, -4.83080164375344, -4.89056304189872, -5.3904000267275, -0.121508487954861, 0.493798577602088, -0.118550709142654, 5.23654772583187, 5.87760447006892, 5.22478092346285, 9.90949768116403, 9.85433376398086, 9.91619307289277), yr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("year", "stratum", "bt", "yr"), row.names = c(NA, -45L), class = c("data.table", "data.frame"), sorted = c("year", "stratum"))

Here's what they look like:

> A  
    year stratum
 1:    1       1
 2:    1       2
 3:    1       3
 4:    1       4

> B
    year stratum          bt yr
 1:    1       1 -9.95187702  1
 2:    1       2 -9.48946944  1
 3:    1       3 -9.74178663  1
 4:    1       4 -5.36167545  1

Here are the classes:

> sapply(A, class)
     year   stratum 
"integer" "integer"

> sapply(B, class)
     year   stratum        bt        yr 
"numeric" "integer" "numeric" "numeric"
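
As a side note, the mismatched shared columns can be listed programmatically. This is only a sketch on small stand-in tables (hypothetical data, not the full A and B above); it compares the first element of class() for each shared column:

```r
# Small stand-ins for A and B (hypothetical data with the same column classes)
A <- data.frame(year = 1:3, stratum = 1:3)
B <- data.frame(year = c(1, 2, 3), stratum = 1:3, bt = c(-9.9, -5.4, 0.5))

shared <- intersect(names(A), names(B))

# class(x)[1] keeps the comparison length-1 even for multi-class columns
cls_A <- vapply(shared, function(nm) class(A[[nm]])[1], character(1))
cls_B <- vapply(shared, function(nm) class(B[[nm]])[1], character(1))

mismatch <- shared[cls_A != cls_B]
mismatch   # "year"
```

Because it only uses `[[`, the same code should behave the same way on data.tables.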

Manually, I can accomplish the desired task through the following:

A[,year:=as.numeric(year)]

This is easy when there's only 1 column to change, you know that column ahead of time, and you know the desired class ahead of time. I also know how to convert an arbitrary set of columns to any single given class.
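
For reference, the "several columns, one known class" case might look like the following sketch. The small table and the choice of `cols` are made up for illustration:

```r
library(data.table)

A <- data.table(year = 1:3, stratum = 1:3)   # hypothetical stand-in for A
cols <- c("year", "stratum")                 # columns chosen ahead of time

# Convert every column in `cols` with the same, known coercion function
A[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

sapply(A, class)
#      year   stratum 
# "numeric" "numeric" 
```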


My Failed Attempt:

(EDIT: This actually works; see my answer)

s2c <- function (x, type = "list") 
{
    as.call(lapply(c(type, x), as.symbol))
}

# In this case, I can assume all columns of A can be found in B
# I am also able to assume that the desired conversion is possible
B.class <- sapply(B[,eval(s2c(names(A)))], class) 
for(col in names(A)){
    set(A, j=col, value=as(A[[col]], B.class[col]))
}

But this still returns the year column as "integer", not "numeric":

> sapply(A, class)
     year   stratum 
"integer" "integer" 

The problem in the above example is that class(as(1L, "numeric")) still returns "integer". On the other hand, class(as.numeric(1L)) returns "numeric"; however, I don't know ahead of time that as.numeric is what's needed.
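
A small base-R illustration of the distinction involved, as far as I understand it: an integer vector already counts as "numeric" for is.numeric(), so the difference between the two classes is really one of storage type. This sketch deliberately avoids as(), whose result is the thing in doubt:

```r
x <- 1L

is.numeric(x)             # TRUE: integers already pass the "numeric" test
class(x)                  # "integer"
typeof(x)                 # "integer"

y <- as.numeric(x)
class(y)                  # "numeric"
typeof(y)                 # "double"
```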


Question, Restated:

How do I make the column classes match, when neither columns nor the to/from classes are known ahead of time?


Additional Thoughts:

In a way, the question is mostly about arbitrary class matching. I run into this issue often with data.table because it's very vocal about class matching. E.g., I run into similar problems when I need to insert an NA of the appropriate type (NA_real_ vs NA_character_, etc.), depending on the class of the column (see related question/issue in This Question).
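
For the typed-NA case specifically, one possible trick is to index the column itself with NA_integer_, which returns a single NA of the column's own type. The helper name `na_like` is hypothetical:

```r
# Hypothetical helper: a length-1 NA with the same type as x
na_like <- function(x) x[NA_integer_]

typeof(na_like(c(1.5, 2)))   # "double"
typeof(na_like("a"))         # "character"
typeof(na_like(1:3))         # "integer"
```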

Again, this question can be seen as a general issue of converting between arbitrary classes that aren't known in advance. In the past, I've written functions using switch to do something like switch(class(x), double = as.numeric(...), character = as.character(...), ...), but that seems a bit ugly. The only reason I'm bringing this up in the context of data.table is because it's where I most often encounter the need for this type of functionality.
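
Spelled out, the switch-style approach might look like the sketch below. The function name `coerce_like` is made up, and it dispatches on typeof() rather than class(), so it only covers basic vector types and ignores attributes such as factor levels or Date classes:

```r
# Hypothetical helper: coerce x to the basic type of `template`
coerce_like <- function(x, template) {
  switch(typeof(template),
    logical   = as.logical(x),
    integer   = as.integer(x),
    double    = as.numeric(x),
    character = as.character(x),
    stop("no coercion defined for type: ", typeof(template))
  )
}

class(coerce_like(1L, 2.5))   # "numeric"
class(coerce_like(1, "a"))    # "character"
```

Every new type needs its own branch, which is exactly the ugliness mentioned above.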

rbatt
  • Maybe do `lapply(A, . %>% as.character %>% type.convert)` or similar on each of them. (Without library(magrittr), this is `lapply(A, function(x) type.convert(as.character(x)))`). This is a very crude way, though, and will fail with fancy classes. – Frank Dec 04 '15 at 15:45
  • @Frank I don't necessarily want them to both share an arbitrary class (character), I need A to match the class of B. That's different from your suggestion, right? – rbatt Dec 04 '15 at 16:02
  • The idea behind my suggestion was to give them the same class, not necessarily character. `type.convert` is the function used when reading in data from a text file to determine the class of each column (e.g., in data.table's `fread`). I have a variant of this idea, but it's longer so I'll put it in an answer – Frank Dec 04 '15 at 16:05
  • How can one know which column has to be converted? If `df1$A` is of class `X` and `df2$A` is `Y`, should I convert `df1$A` to `Y` or the other way around? – nicola Dec 04 '15 at 16:15
  • @nicola I think the OP is saying that one of them is given primacy over the other, yeah. Like their attempted function, switches A to have B's classes (I think). – Frank Dec 04 '15 at 16:17
  • How about `storage.mode`? For instance `storage.mode(df1$A)<-storage.mode(df2$A)` (or similar). – nicola Dec 04 '15 at 16:25
  • Rbaat, a FR on GitHub page pointing this link would be great! Perhaps we can export a function to make these operations effortless.. – Arun Dec 04 '15 at 16:39
  • If these are actually in files, you can read the first file `A` then try `fread(file, colClasses = sapply(A, class)[match(names(B), names(A))])` on B. This worked when I tried it. – Rich Scriven Dec 04 '15 at 17:23
  • @RichardScriven Good suggestion that might cover some cases, but it doesn't happen to work here. Sometimes A and B originate from files, but they also originate from the output of statistical analyses, or from simulations with complicated output structure. – rbatt Dec 04 '15 at 17:31
  • Here's an issue to watch out for if an approach ends up relying on as(): http://stackoverflow.com/questions/34093056/asx-double-and-as-doublex-are-inconsistent – rbatt Dec 04 '15 at 17:47
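
nicola's `storage.mode` suggestion from the comments can be sketched on plain vectors as follows. The vectors are stand-ins for A$year and B$year; note that storage.mode only describes the underlying type, so this would not carry over class attributes such as factor or Date:

```r
x <- 1:3              # stands in for A$year (integer)
template <- c(1, 2)   # stands in for B$year (double)

# Copy the storage mode of the template onto x
storage.mode(x) <- storage.mode(template)

class(x)    # "numeric"
typeof(x)   # "double"
```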

3 Answers


This is one very crude way to ensure common classes:

library(magrittr)

cols = intersect(names(A), names(B))
r    = rbindlist(list(A = A, B = B[, ..cols]), idcol = TRUE)
r[, (cols) := lapply(.SD, . %>% as.character %>% type.convert), .SDcols=cols]
B[, (cols) := r[.id=="B", ..cols]]
A[, (cols) := r[.id=="A", ..cols]]

sapply(A, class); sapply(B, class)
#      year   stratum 
# "integer" "integer" 
#      year   stratum        yr 
# "integer" "integer" "numeric" 

I don't like this solution:

  • I routinely use all-integer codes for IDs (like "00001", "02995"), and this would coerce those to actual integers, which is bad.
  • Who knows what this will do to fancy classes like Date or factor? This won't matter so much if you do this col-classes normalization as soon as you read data in, I suppose.
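
Both caveats are easy to reproduce in base R. A small sketch with made-up values:

```r
# Caveat 1: all-digit ID codes lose their leading zeros
ids <- c("00001", "02995")
ids_converted <- type.convert(ids, as.is = TRUE)
ids_converted          # 1 2995
class(ids_converted)   # "integer"

# Caveat 2: a Date does not survive the character round trip
d <- as.Date("2015-12-04")
d_converted <- type.convert(as.character(d), as.is = TRUE)
class(d_converted)     # "character"
```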

Data:

# slightly tweaked from OP
A <- setDT(structure(list(year = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), stratum = c(1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 12L, 13L, 14L, 15L)), .Names = c("year", "stratum"), row.names = 
c(NA, -45L), class = c("data.frame")))

B <- setDT(structure(list(year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L), yr = c(1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("year", "stratum", 
"yr"), row.names = c(NA, -45L), class = c("data.frame")))

Comment. If you have something against magrittr, use function(x) type.convert(as.character(x)) in place of the . %>% bit.

MichaelChirico
Frank
  • So, I'm waiting to see a better idea. Just expanding on my original comment on the question. – Frank Dec 04 '15 at 16:14
  • For reference, [here is the C source underlying `type.convert`](https://github.com/wch/r-source/blob/b156e3a711967f58131e23c1b1dc1ea90e2f0c43/src/library/utils/src/io.c#L550-L759), should anyone seek inspiration – rbatt Dec 04 '15 at 16:19
  • The "character" intermediate makes me a bit uneasy. I have a particular situation that requires A to perfectly match B, and I know that they *should* match (A is indirectly derived from B, somehow); so your point about "weird" classes rings true for me here. But I'm going to try it, because I really like `type.convert`, and I didn't know about it previously; maybe it'll work out. – rbatt Dec 04 '15 at 16:26
  • Does this use `showMethods(coerce)`? Given all the options from that, it seems like it's possible to construct a pretty generic approach to the conversion, without necessarily using the character intermediate. Still thinking. – rbatt Dec 04 '15 at 17:43

Not very elegant but you may 'build' the as.* call like this:

# [[ ]] extracts the column vector from both data.frames and data.tables
for (x in colnames(A)) { A[[x]] <- eval( call( paste0("as.", class(B[[x]])[1]), A[[x]]) )}
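
The same idea can be wrapped into a function that only touches shared columns and updates a data.table by reference. This is a sketch rather than a tested general solution: `match_classes` is a made-up name, it looks up an `as.<class>` function by name via match.fun(), and it will stop with an error for classes that have no such function or whose coercion needs extra arguments:

```r
library(data.table)

# Hypothetical helper: give the shared columns of `target` the classes found in `template`
match_classes <- function(target, template) {
  for (col in intersect(names(target), names(template))) {
    cls <- class(template[[col]])[1]
    if (class(target[[col]])[1] == cls) next    # already matching
    coerce <- match.fun(paste0("as.", cls))     # e.g. as.numeric; errors if absent
    set(target, j = col, value = coerce(target[[col]]))
  }
  invisible(target)
}

A <- data.table(year = 1:3, stratum = 1:3)
B <- data.table(year = c(1, 2, 3), stratum = 1:3, bt = c(-9.9, -5.4, 0.5))

match_classes(A, B)
sapply(A, class)
#      year   stratum 
# "numeric" "integer" 
```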
Tensibai
  • `data.table` approach (?): `for (col in names(A)) set(A, j=col, value=eval(call(paste0("as.",B.class[col]), A[[col]])))`. See my question for the definitions of `B.class` and the `s2c` function – rbatt Dec 04 '15 at 16:45
  • FWIW, I would treat this as preliminary work on the dataset, I've absolutely no clue if doing it with a DT call would be really faster (no object growing there, and unless you have very huge amount of observations it should not take too much time). But I may be absolutely wrong :) – Tensibai Dec 04 '15 at 16:49

Based on the discussion in this question, and comments in this answer, I'm thinking I may have had it right, and just landed on an odd exception.

Note that the class doesn't change, but for the particular use case that prompted the question this turns out not to matter. Below I repeat my "failed approach" and follow it through to the merge; the classes of the columns in the merged data.table show why the approach works: integers simply get promoted.

s2c <- function (x, type = "list") 
{
    as.call(lapply(c(type, x), as.symbol))
}

# In this case, I can assume all columns of A can be found in B
# I am also able to assume that the desired conversion is possible
B.class <- sapply(B[,eval(s2c(names(A)))], class)
for(col in names(A)){
    set(A, j=col, value=as(A[[col]], B.class[col]))
}

# Below here is new from what I tried in question
# merge() dispatches to the data.table method, so ::: is not needed
AB <- merge(A, B, all=TRUE, by=c("stratum","year"))

sapply(AB, class)
  stratum      year        bt        yr 
"integer" "numeric" "numeric" "numeric" 

Although the problem in the question isn't solved by this answer, I figured I'd post to point out that the failure to convert "integer" to "numeric" might not be a problem in many situations, so this is a straightforward, albeit circumstantial, solution.
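
The promotion I'm relying on is ordinary R behaviour and can be seen on plain vectors. This sketch only illustrates base R; it doesn't check how any particular data.table version treats integer and double key columns in a merge:

```r
x <- 1:3          # integer
y <- c(1, 2, 3)   # double

class(c(x, y))    # "numeric": combining promotes the integers
class(x + y)      # "numeric"
all(x == y)       # TRUE: the values still compare equal across types
```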

rbatt