
I am using R to read a CSV file and summarise groups of columns whose values are zeros and ones, indicating whether or not a subject had an allergic reaction. The file contains 538 variables. Initially these are integers, so I convert them all to factors, which serves my purpose. So far I can only use the table function to summarise every factor column at once, but I need to group the columns and apply table to each group for a group-by-group summary. Could anyone please help me with this?

My code is as follows:

egg1 <- read.csv("egg.csv", header = TRUE)

str(egg1)

egg1[sapply(egg1, is.integer)] <- lapply(egg1[sapply(egg1, is.integer)], as.factor)

# Tabulate every factor column (non-factor columns yield NULL)
lapply(egg1, function(col) {
  if (is.factor(col)) return(table(col))
})

Here I want to pass ranges of the CSV file's variables to table, group by group. Please have a look at my sample CSV, which contains 3 groups that I have coloured for clarity. Q1: I want to calculate the distribution of yes/no (1/0) for dose1, dose2 and dose3 respectively, where 3 symptoms are listed for each. Q2: I then want to compare the symptoms across all 3 doses.

table does well at summarising all columns, but I need a group-wise summary.

sample data

lmo
Usman
  • It's generally inadvisable to turn numbers into factors, as it has the potential to introduce bugs down the line (unless you're very careful) due to the fact that factors are stored as integers. If you're just trying to make a table of each column, all you need is `lapply(egg1, table)` – alistaire Jul 03 '16 at 03:04
  • I need to make a table of almost each column but in groups. Obviously I would need to skip some columns for example date of birth and weight. But grouping matters the most for me because that particular group would belong to particular section in csv file – Usman Jul 03 '16 at 04:39
  • You're not talking about a CSV anymore, you're talking about a data.frame. Really, though all I can do at this point is speculate about what you need; you need to [read this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) about how to make a minimal (obviously not with 538 columns, but a representative subset) reproducible (with data!) example and edit your question. – alistaire Jul 03 '16 at 04:56
  • @Usman, it's courteous to mark one of the answers as accepted by selecting the checkmark next to the appropriate answer. If your issue is not satisfactorily resolved, please comment. – r2evans Jul 05 '16 at 05:20
  • Sorry, I am new to this forum, I did not know the procedure. – Usman Jul 06 '16 at 18:36

2 Answers


As @alistaire said, we're missing a reproducible example, but perhaps this will sufficiently guess at the structure and your intent.

I'll fabricate some data that I hope is closely reminiscent of your real data. Instead of factors, I think you should be able to work with logicals, since you said the columns of interest contain only 0 or 1.

set.seed(4)
egg1 <- data.frame(
  v1 = sample(0:1, size=20, replace=TRUE),
  v2 = sample(0:1, size=20, replace=TRUE),
  v3 = sample(c('a','b','c'), size=20, replace=TRUE),
  v4 = sample(0:1, size=20, replace=TRUE),
  stringsAsFactors = FALSE)
str(egg1)
# 'data.frame': 20 obs. of  4 variables:
#  $ v1: int  1 0 0 0 1 0 1 1 1 0 ...
#  $ v2: int  1 1 1 0 1 1 0 1 1 1 ...
#  $ v3: chr  "c" "a" "b" "a" ...
#  $ v4: int  1 0 1 1 0 1 0 1 1 1 ...

(I included v3 with the assumption that not all columns are 0/1 boolean.)
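
One caveat with selecting columns by type alone: a numeric column such as weight would also pass is.numeric. A minimal sketch of a stricter filter, assuming the columns of interest contain only 0, 1, or NA (is_binary is a hypothetical helper name, and the demo data frame is made up):

```r
# Hypothetical helper: TRUE for numeric columns holding only 0/1 (NA allowed)
is_binary <- function(x) is.numeric(x) && all(x %in% c(0, 1, NA))

# Made-up data: one 0/1 column, one continuous column, one character column
demo <- data.frame(
  v1 = c(1L, 0L, 1L),
  weight = c(3.2, 4.1, 3.8),
  v3 = c("a", "b", "c"),
  stringsAsFactors = FALSE)

# Keep only the names of the 0/1 columns
binary_cols <- names(Filter(is_binary, demo))
binary_cols
# [1] "v1"
```

Swapping is_binary in for is.numeric in the calls below would then skip columns like weight automatically.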

This is a first attempt:

sapply(Filter(is.numeric, egg1),
       function(egg) table(egg == 1))
#       v1 v2 v4
# FALSE  9  7 10
# TRUE  11 13 10

Unfortunately, it has one slight flaw: it assumes all results are of the same length, which is not always true:

set.seed(105966)
egg1 <- data.frame(
  v1 = sample(0:1, size=20, replace=TRUE),
  v2 = sample(0:1, size=20, replace=TRUE),
  v3 = sample(c('a','b','c'), size=20, replace=TRUE),
  v4 = sample(0:1, size=20, replace=TRUE),
  stringsAsFactors = FALSE)
sapply(Filter(is.numeric, egg1),
       function(egg) table(egg == 1))
# $v1
# FALSE  TRUE 
#     9    11 
# $v2
# FALSE  TRUE 
#     8    12 
# $v4
# TRUE 
#   20 

(That is, it's returning a list because not all returned elements are of length 2: v4 had all 1s.) The fix is to ensure you always count at least one of each level and then make sure to not count that in your results:

sapply(Filter(is.numeric, egg1),
       function(egg) table(c(TRUE, FALSE, egg == 1)) - 1)
#       v1 v2 v4
# FALSE  9  8  0
# TRUE  11 12 20
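
An alternative sketch that avoids the add-then-subtract step is to fix the levels up front with factor(), so that table() reports zero counts on its own. The same helper can then be applied to one group of columns at a time, which seems to be what the question is after. The column names and the groups list below are made up for illustration, since the real file is not shown:

```r
# Count 0s and 1s, always reporting both levels (even when one is absent)
count01 <- function(x) table(factor(x, levels = c(0, 1)))

# Made-up data: two groups of 0/1 columns
egg <- data.frame(
  d1_s1 = c(1, 0, 1, 1), d1_s2 = c(0, 0, 0, 0),
  d2_s1 = c(1, 1, 1, 1), d2_s2 = c(0, 1, 0, 1))

# Hypothetical grouping: group name -> column names in that group
groups <- list(dose1 = c("d1_s1", "d1_s2"),
               dose2 = c("d2_s1", "d2_s2"))

# One 2-row matrix of counts per group
by_group <- lapply(groups, function(cols) sapply(egg[cols], count01))
by_group$dose1
#   d1_s1 d1_s2
# 0     1     4
# 1     3     0
```

Because every call to count01 returns exactly two counts, sapply can always simplify each group to a matrix.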
r2evans

Using your screenshot sample, consider reshaping your data frame. First, melt() dose symptom columns from wide to long, then dcast() to migrate no/yes into separate columns. You can even split the dose_symp columns to separate dose and symp fields for two groupings:

library(reshape2)

df <- read.csv("Input.csv", stringsAsFactors = FALSE)

# MELT (LEAVING OUT TIME COLS)
mdf <- melt(df[!grepl("time", names(df))], id.vars = c("id", "DOB", "weight"), 
            variable.name = "symp_type")
mdf$key <- 1    
# CAST (FOR NO/YES COLUMNS, SUMMED ON KEY)
mdf <- dcast(mdf, id + DOB + weight + symp_type ~ value, sum, value.var = "key")

# UPDATE COLUMNS
names(mdf)[5:6] <- c("no", "yes")

mdf$symp_type <- as.character(mdf$symp_type)
mdf$dose <- sapply(strsplit(as.character(mdf$symp_type),"_"), "[", 1)
mdf$symp <- sapply(strsplit(as.character(mdf$symp_type),"_"), "[", 2)
mdf$symp_type <- NULL

# GROUP AGGREGATION (DATA REPEATS DUE TO REPLICATED DATA IN SAMPLE)
aggdf <- aggregate(.~symp, mdf[c("symp", "no", "yes")], FUN = sum)
aggdf
#    symp no yes
# 1 symp1 18  12
# 2 symp2 18  12
# 3 symp3 18  12

aggdf <- aggregate(.~dose, mdf[c("dose", "no", "yes")], FUN = sum)
aggdf
#    dose no yes
# 1 dose1 18  12
# 2 dose2 18  12
# 3 dose3 18  12

aggdf <- aggregate(.~symp + dose, mdf[c("symp", "dose", "no", "yes")], FUN = sum)
aggdf
#    symp  dose no yes
# 1 symp1 dose1  6   4
# 2 symp2 dose1  6   4
# 3 symp3 dose1  6   4
# 4 symp1 dose2  6   4
# 5 symp2 dose2  6   4
# 6 symp3 dose2  6   4
# 7 symp1 dose3  6   4
# 8 symp2 dose3  6   4
# 9 symp3 dose3  6   4
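
For comparison, a rough base-R sketch of the same kind of counts without reshaping. It assumes the 0/1 columns are named like dose1_symp1 (inferred from the strsplit() on "_" above, since the actual file is not shown); the small data frame is made up:

```r
# Made-up data with 0/1 columns named <dose>_<symp>
df <- data.frame(
  id = 1:4,
  dose1_symp1 = c(1, 0, 0, 1), dose1_symp2 = c(0, 0, 1, 0),
  dose2_symp1 = c(1, 1, 0, 0), dose2_symp2 = c(0, 0, 0, 0))

# Pick the dose/symptom columns and split their names into two parts
cols <- grep("^dose[0-9]+_symp[0-9]+$", names(df), value = TRUE)
parts <- strsplit(cols, "_")

# One row per column, with counts of 0s (no) and 1s (yes)
counts <- data.frame(
  dose = sapply(parts, "[", 1),
  symp = sapply(parts, "[", 2),
  no   = unname(colSums(df[cols] == 0, na.rm = TRUE)),
  yes  = unname(colSums(df[cols] == 1, na.rm = TRUE)),
  stringsAsFactors = FALSE)

# Totals per dose across its symptoms
per_dose <- aggregate(cbind(no, yes) ~ dose, counts, FUN = sum)
per_dose
#    dose no yes
# 1 dose1  5   3
# 2 dose2  6   2
```

Replacing dose with symp in the aggregate() formula gives the per-symptom totals instead.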
Parfait
  • Thanks guys for your help, I will give your codes a try. Although I am new to R but I will try to implement it. – Usman Jul 03 '16 at 21:55