Creating a particular cross-tab in R?

Question

I have a dataframe that contains information about financial contributions to political candidates (denoted in the data by "cand") and political organizations (denoted in the data by "comm"). The dataframe also includes a unique ID for each contributor, with each row in the data denoting one contribution made. What I want to do is get a crosstab that shows, for each political (non-candidate) organization, how many donors to those organizations also contributed to each political candidate in the dataframe. The dataframe looks like this:

 contributor ID .      organization
 1                     cand1
 2                     cand2
 3                     comm1
 3                     cand1
 4                     cand1
 5                     cand2
 5                     cand1
 5                     comm2

What I want to be able to create is something like this:

  Comm .              Cand
               Cand1 .     Cand2
  Comm1        1           0
  Comm2        1           1

(Because 1 person -- ID #3 -- contributed to both comm1 and cand1, and 1 person -- ID #5 -- contributed to comm1, cand1, and cand2.)

I have thought about ways to do this using aggregate or dplyr, but I'm not sure. Does anyone have any tips?

If memory is available, you could start with `crossprod(table(dat))` -as in [here](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses)- and subset accordingly like `crossprod(table(dat))[startsWith(levels(dat$org), "comm"), startsWith(levels(dat$org), "cand")]` — alexis_laz, Nov 26 '16 at 22:03
Thanks for this. I get the following error with this code: Error in table(table) : attempt to make a table with >= 2^31 elements. Do you have any suggestions? — hans91, Nov 26 '16 at 22:37
Along the same lines, you might give the sparse alternative a try -- `library(Matrix); tab = xtabs( ~ contributorID + organization, dat, sparse = TRUE); crossprod(tab[, startsWith(colnames(tab), "comm")], tab[, startsWith(colnames(tab), "cand")])` — alexis_laz, Nov 27 '16 at 11:38
Did you solve the problem? If so, could you accept one of the answers? — Joe, Jul 30 '19 at 14:34

score 0 · Answer 1 · answered Nov 26 '16 at 22:41

Yoou need to use something like tidyr. You need to make a variable for each committee and one for each candidate. Your data is already in long format, but what you now need to do is create a wide data frame using organization and the donor ID as the unique id. The you can do crosstabs.

Gregory Demin · Answer 2 · 2016-11-26T23:48:28.857

dfs = read.table(text = "contributor organization
1 cand1
2 cand2
3 comm1
3 cand1
4 cand1
5 cand2
5 cand1
5 comm2", sep = " ", stringsAsFactors = FALSE, header = TRUE)

# select only comms with their contributor
comms = dfs[grep("^comm", dfs$organization), ]
colnames(comms)[2] = "comms"
# select only cands
cands = dfs[grep("^cand", dfs$organization), ]
colnames(cands)[2] = "cands"

# combine comms and candidates
new_dfs = merge(comms, cands, all = TRUE)
with(new_dfs, table(comms, cands))

Update. Try to avoid creation of large matrix with `table`

library(tidyr)
library(dplyr)
dfs = read.table(text = "contributor organization
1 cand1
2 cand2
3 comm1
3 cand1
4 cand1
5 cand2
5 cand1
5 comm2", sep = " ", stringsAsFactors = FALSE, header = TRUE)

# select only comms with their contributor
comms = dfs %>% filter(grepl("^comm", organization))

# select only cands
cands = dfs %>% 
    filter(grepl("^cand", organization)) %>% 
    mutate(
        value = 1
    ) %>% 
    spread(key  = organization, value = value, fill = 0)

left_join(comms, cands)

I understand the intuition behind this code, but am running into a similar as with other suggestions: Error in table(comms, cands) : attempt to make a table with >= 2^31 elements — hans91, Nov 26 '16 at 23:19

Joe · Answer 3 · 2016-11-27T00:28:45.737

Here's one possible solution using tidyr, dplyr and table(). First we calculate a summary of the number of contributors to a cand and a com.

library(tidyr)
library(dplyr)

df_summary <- 
df %>% mutate(ct = 1) %>% spread(organization, ct) %>% 
transmute(
  comm1_cand1 = ifelse(cand1 + comm1 > 0, 1, 0),
  comm2_cand1 = ifelse(cand1 + comm2 > 0, 1, 0),
  comm1_cand2 = ifelse(cand2 + comm1 > 0, 1, 0),
  comm2_cand2 = ifelse(cand2 + comm2 > 0, 1, 0)) %>%
gather() %>%
separate(key, into = c("comm", "cand"), sep = "_")

This gives a two way categorical data frame that looks like this:

#    comm  cand value
#1  comm1 cand1    NA
#2  comm1 cand1    NA
#3  comm1 cand1     1
#4  comm1 cand1    NA
#5  comm1 cand1    NA
#6  comm2 cand1    NA
#7  comm2 cand1    NA
# etc

Now we make our two-way table from the data.

table(df_summary)

#   cand
#comm    cand1 cand2
#  comm1     1     0
#  comm2     1     1

This solution seems great, but I did get the following error: "Error: Duplicate identifiers for rows..." Do you have any thoughts on this? — hans91, Nov 28 '16 at 04:20
OK. Can you provide a bigger sample of your data that causes this problem? I only used your original 8 observations thinking it would be scaleable, but it seems not. — Joe, Nov 28 '16 at 11:14

Creating a particular cross-tab in R?

3 Answers3

Update. Try to avoid creation of large matrix with table

Update. Try to avoid creation of large matrix with `table`