0

I want to convert a table that has several categorical variables, as well as summary of the result of a binary experiment to long format to easily run a logistic regression model.

Is there an easy way to do this that does not involve just making a bunch of vectors with rep() and then combining those into a dataframe? Ideally, I'd like one function that does this automatically, but maybe I'll just need to make my own.

For example, if I start with this summary table:

test   group    success  n 
A      control  1        2
A      treat    2        3
B      control  3        5
B      treat    1        3

I want to be able to switch it back to the following format:

test   group     success
A      control   1
A      control   0
A      treat     1
A      treat     1
A      treat     0
B      control   1
B      control   1
B      control   1
B      control   0
B      control   0
B      treat     1
B      treat     0
B      treat     0

Thanks!

user3658457
  • 275
  • 3
  • 11
  • Related: [Elegantly convert rate summary rows into long binary-response rows?](http://stackoverflow.com/questions/18791355/elegantly-convert-rate-summary-rows-into-long-binary-response-rows/18793811#18793811). – Henrik Jan 11 '17 at 22:41
  • Thanks, the technique for doing the weighted regression on the initial dataframe is useful, but I'm still wondering how to switch the format for cases where I'd want to create other types of models. – user3658457 Jan 12 '17 at 00:42

1 Answers1

3

The reshape package is your friend, here. In this case, melt() and untable() are useful for normalizing the data.

If the example summary data.frame is in a variable called df, an abbreviated answer is:

# replace total n with number of failures
df$fail = df$n - df$success
df$n = NULL

# melt and untable the data.frame
df = melt(df)
df = untable(df, df$value)

# recode the results, e.g., here by creating a new data.frame
df = data.frame(
  test = df$test, 
  group = df$group, 
  success = as.numeric(df$variable == "success")
)

This is a great example of a very general problem. The idea is to back calculate the list of data that underlies a cross-tabulation. Given the cross-tabulation, the back-calculated list of data has one row for each datum and contains the attributes of each datum. Here is a post to the inverse of this question.

In "data geek" parlance, this is a question of putting tabulated data in First Normal Form -- if that is helpful to anyone. You can google data normalization, which will help you design agile data.frames that can be cross-tabulated and analyzed in many different ways.

In detail, for melt() and untable() to work here, the raw data need to be tweaked a bit to include fail (number of failures) rather than total n, but that is simple enough:

df$fail <- df$n - df$success
df$n <- NULL

which gives:

  test   group success fail
1    A control       1    1
2    A   treat       2    1
3    B control       3    2
4    B   treat       1    2

Now we can "melt" the table. melt() can back-calculate the original list of data that was used to create a cross tabulation.

df <- melt(df)

In this case, we get new column called variable that contains either "success" or "fail", and a column called value that contains the datum from the original success or fail column.

  test   group variable value
1    A control  success     1
2    A   treat  success     2
3    B control  success     3
4    B   treat  success     1
5    A control     fail     1
6    A   treat     fail     1
7    B control     fail     2
8    B   treat     fail     2

The untable() function repeats each row of a table according to the value of a numeric "count" vector. In this case, df$value is the count vector, because it contains the number of successes and fails.

df <- untable(df, df$value)

which will yield one record for each datum, either a "success" or a "fail":

    test   group variable value
1      A control  success     1
2      A   treat  success     2
2.1    A   treat  success     2
3      B control  success     3
3.1    B control  success     3
3.2    B control  success     3
4      B   treat  success     1
5      A control     fail     1
6      A   treat     fail     1
7      B control     fail     2
7.1    B control     fail     2
8      B   treat     fail     2
8.1    B   treat     fail     2

This is the solution. If required, the data can now be recoded to replace "success" with 1 and "fail" with 0 (and get rid of the the extraneous value and variable columns...)

  df <- data.frame(
    test = df$test, 
    group = df$group, 
    success = as.numeric(df$variable == "success")
  )

This returns the requested solution, tho the rows are sorted differently:

   test   group success
1     A control       1
2     A   treat       1
3     A   treat       1
4     B control       1
5     B control       1
6     B control       1
7     B   treat       1
8     A control       0
9     A   treat       0
10    B control       0
11    B control       0
12    B   treat       0
13    B   treat       0

Obviously, the data.frame can be resorted, if necessary. How to sort a data.frame in R.

Community
  • 1
  • 1
Geoffrey Poole
  • 1,102
  • 8
  • 10