0

Say I have a dataframe or datatable.

For example:

try <- data.frame(AA=c(1,2,3,1,2,3,4,5,NA),BB=c(1,2,2,NA,
2,1,2,2,NA), CC=c("A","B", NA, NA, "A","B", "A","C","B"))
setDT(try)

 AA BB CC
 1  1   A
 2  2   B
 3  2   NA
 1 NA   NA
 2  2   A
 3  1   B
 4  2   A
 5  2   C
 NA NA  B

I want to summarize the values in order to export them to an Excel file for further manipulation later.

I could create a table for each column but, in real life, some variable could have too many different values (such as the weight or DOB of people).

I can get the first six value for a single column with:

table(try$BB,  useNA ="ifany") 

1    2 <NA> 
2    5    2 

But when I try to do it automatically for all the columns at once it doesn't work as expected:

try[,lapply(.SD,function(x) table(x,useNA="ifany")[1:6] )]

because the table() command generates a 2 rows result and only one is used to create the final summary table.

What procedure do you suggest to keep that information?

For example I could try to convert that single-variable tables to something like

"1":2    "2":5 "NA":2

But I don't know how to do it. Maybe converting it to factors, maybe pasting the values. I'm not even able to extract the rows of the table for further manipulation. Any solution with base data.frame or date.table is welcome.

Or I could even order that table to get the most common values first.

PD: I want somethin like this:

AA  "1":2   "2":2   "3":2   "4":1   "5":1   "NA": 1
BB  "1":2   "2":5   "NA": 2         
CC  "A":3   "B":3   "C":1   "NA": 2     

PD2: I've tried this

try[,lapply(.SD, function(x) { tmp <- table(x, 
     useNA ="ifany") ;  mapply(paste0, names( tmp ), 
     rep(":", length(tmp)), tmp )} )
   ]

But it's too long and it doesn't work well

 AA   BB   CC
1:2  1:2  A:3
2:2  2:5  B:3
3:2 NA:2  C:1
4:1  1:2  NA:2
5:1  1:2  A:3
NA:1  2:5  B:3

It fills the last values with fake values.

Another option would be to interleave the names and the values.

In this example I should get:

  AA   BB   CC
 "1:2"  "1:2"  "A:3"
 "2:2"  "2:5"  "B:3"
 "3:2"  "NA:2" "C:1"
 "4:1"    NA   "NA:2"
 "5:1"    NA    NA
 "NA:1"   NA    NA

The problem is that the list is converted internally to a datatable by the command as.data.table.list() and the different size vectors are recycled instead of filled with NAs.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
skan
  • 7,423
  • 14
  • 59
  • 96
  • So what exactly would the output look like with multiple columns? You are trying to create a plain text file with one row for each column? – MrFlick Dec 07 '16 at 21:31
  • `lapply(try, function(x) table(x, useNA="ifany"))` might be closer to what you want. Then decide how to make the data tabular or write each list element to a separate worksheet. – lmo Dec 07 '16 at 21:32
  • I agree it's not clear what you want, but I'd consider `melt(try)[, .N, by=.(variable, value)]` – Frank Dec 07 '16 at 21:33
  • A long format result is nice but it doesn't fit with my already existing Excel table that describes the variables, each row contains information of a single variable, and I want to add this result on the right of the existing ones. A list has the same problem. I've updated my question. – skan Dec 07 '16 at 23:17
  • I've tried to give an answer but it's not completly well. – skan Dec 08 '16 at 00:22
  • It's still not clear what output you actually want, as an R object. I think you'll need to figure that out and either dput it or provide some code to get there. – Frank Dec 08 '16 at 00:26
  • To extend the length without recycling, you can use `length(x) <- n` and the extra entries will be NAs. To avoid the conversion to data.table, maybe `lapply(DT, ...)` instead of `DT[, lapply(.SD, ...)]`. – Frank Dec 08 '16 at 00:45
  • I'm fine having a data.table. Thanks for the trick to change the length, I was trying something much more complex such as ifelse( length(x) – skan Dec 08 '16 at 00:55
  • length(x) <- n it's ok because it feels with NA if it's too short and it also cuts the vector if it's too long and I want to keep some values. If I'm in the case of cutting... how can I remove some values but forcing it to keep the NA:xx values? Where in my code should I use it? – skan Dec 08 '16 at 00:58

2 Answers2

1

You can get your desired output with

library(magrittr)
tab = try %>% lapply(table, useNA = "ifany")
len = max(lengths(tab))

tab %>% lapply(
    . %>% 
    { paste0(names(.), ":", .) } %>% 
    `length<-`(len)
) %>% setDF %>% print


    AA   BB   CC
1  1:2  1:2  A:3
2  2:2  2:5  B:3
3  3:2 NA:2  C:1
4  4:1 <NA> NA:2
5  5:1 <NA> <NA>
6 NA:1 <NA> <NA>

I haven't learned purrr, but if you like using pipes, that might offer somewhat cleaner code.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • I didn't really understand the most recent comments under the OP so figured I'd post an answer... – Frank Dec 08 '16 at 01:27
1

This is my data.table solution with some ideas from Frank.

siz <- 6
try[,lapply(.SD, function(x) { tmp <- table(x, 
useNA ="ifany") ;  tmp2 <- c(tmp[is.na(names(tmp))],
rev(sort(tmp[!is.na(names(tmp))]))); 
tmp3 <- mapply(paste0, names( tmp2 ),rep(":",
length(tmp2)),tmp2); length(tmp3)<-siz; tmp3})
]

It places the NAs always at the beginning and order the other elements from the most common to the least common.

Maybe there are some simpler ways to summarize the information.

skan
  • 7,423
  • 14
  • 59
  • 96
  • Re your last sentence, you're presumably exporting to Excel because you prefer to do your analysis there... makes more sense to do that step after exporting, I think. – Frank Dec 08 '16 at 03:00
  • I'm doing my analysis in R with data.table because my data is very large, a 3GB csv file. I'm just using Excel to export the results, all kind of summaries and use them for further manipulation, such as selecting columns of the original csv that fulfill some criteria. – skan Dec 08 '16 at 13:17
  • How can I force it to show first the number of NAs in that lists? – skan Dec 12 '16 at 20:21
  • I don't think you can without contortions. While NA can be a factor level, it always appears last. http://stackoverflow.com/q/27195956/ – Frank Dec 12 '16 at 20:41