0

I have a dataframe that contains information about financial contributions to political candidates (denoted in the data by "cand") and political organizations (denoted in the data by "comm"). The dataframe also includes a unique ID for each contributor, with each row in the data denoting one contribution made. What I want to do is get a crosstab that shows, for each political (non-candidate) organization, how many donors to those organizations also contributed to each political candidate in the dataframe. The dataframe looks like this:

 contributor ID .      organization
 1                     cand1
 2                     cand2
 3                     comm1
 3                     cand1
 4                     cand1
 5                     cand2
 5                     cand1
 5                     comm2

What I want to be able to create is something like this:

  Comm .              Cand
               Cand1 .     Cand2
  Comm1        1           0
  Comm2        1           1

(Because 1 person -- ID #3 -- contributed to both comm1 and cand1, and 1 person -- ID #5 -- contributed to comm1, cand1, and cand2.)

I have thought about ways to do this using aggregate or dplyr, but I'm not sure. Does anyone have any tips?

hans91
  • 1
  • 2
  • If memory is available, you could start with `crossprod(table(dat))` -as in [here](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses)- and subset accordingly like `crossprod(table(dat))[startsWith(levels(dat$org), "comm"), startsWith(levels(dat$org), "cand")]` – alexis_laz Nov 26 '16 at 22:03
  • Thanks for this. I get the following error with this code: Error in table(table) : attempt to make a table with >= 2^31 elements. Do you have any suggestions? – hans91 Nov 26 '16 at 22:37
  • Along the same lines, you might give the sparse alternative a try -- `library(Matrix); tab = xtabs( ~ contributorID + organization, dat, sparse = TRUE); crossprod(tab[, startsWith(colnames(tab), "comm")], tab[, startsWith(colnames(tab), "cand")])` – alexis_laz Nov 27 '16 at 11:38
  • Did you solve the problem? If so, could you accept one of the answers? – Joe Jul 30 '19 at 14:34

3 Answers3

0

Yoou need to use something like tidyr. You need to make a variable for each committee and one for each candidate. Your data is already in long format, but what you now need to do is create a wide data frame using organization and the donor ID as the unique id. The you can do crosstabs.

Elin
  • 6,507
  • 3
  • 25
  • 47
0
dfs = read.table(text = "contributor organization
1 cand1
2 cand2
3 comm1
3 cand1
4 cand1
5 cand2
5 cand1
5 comm2", sep = " ", stringsAsFactors = FALSE, header = TRUE)

# select only comms with their contributor
comms = dfs[grep("^comm", dfs$organization), ]
colnames(comms)[2] = "comms"
# select only cands
cands = dfs[grep("^cand", dfs$organization), ]
colnames(cands)[2] = "cands"

# combine comms and candidates
new_dfs = merge(comms, cands, all = TRUE)
with(new_dfs, table(comms, cands))

Update. Try to avoid creation of large matrix with table

library(tidyr)
library(dplyr)
dfs = read.table(text = "contributor organization
1 cand1
2 cand2
3 comm1
3 cand1
4 cand1
5 cand2
5 cand1
5 comm2", sep = " ", stringsAsFactors = FALSE, header = TRUE)

# select only comms with their contributor
comms = dfs %>% filter(grepl("^comm", organization))

# select only cands
cands = dfs %>% 
    filter(grepl("^cand", organization)) %>% 
    mutate(
        value = 1
    ) %>% 
    spread(key  = organization, value = value, fill = 0)

left_join(comms, cands)
Gregory Demin
  • 4,596
  • 2
  • 20
  • 20
  • I understand the intuition behind this code, but am running into a similar as with other suggestions: Error in table(comms, cands) : attempt to make a table with >= 2^31 elements – hans91 Nov 26 '16 at 23:19
0

Here's one possible solution using tidyr, dplyr and table(). First we calculate a summary of the number of contributors to a cand and a com.

library(tidyr)
library(dplyr)

df_summary <- 
df %>% mutate(ct = 1) %>% spread(organization, ct) %>% 
transmute(
  comm1_cand1 = ifelse(cand1 + comm1 > 0, 1, 0),
  comm2_cand1 = ifelse(cand1 + comm2 > 0, 1, 0),
  comm1_cand2 = ifelse(cand2 + comm1 > 0, 1, 0),
  comm2_cand2 = ifelse(cand2 + comm2 > 0, 1, 0)) %>%
gather() %>%
separate(key, into = c("comm", "cand"), sep = "_")

This gives a two way categorical data frame that looks like this:

#    comm  cand value
#1  comm1 cand1    NA
#2  comm1 cand1    NA
#3  comm1 cand1     1
#4  comm1 cand1    NA
#5  comm1 cand1    NA
#6  comm2 cand1    NA
#7  comm2 cand1    NA
# etc

Now we make our two-way table from the data.

table(df_summary)

#   cand
#comm    cand1 cand2
#  comm1     1     0
#  comm2     1     1
Joe
  • 8,073
  • 1
  • 52
  • 58
  • This solution seems great, but I did get the following error: "Error: Duplicate identifiers for rows..." Do you have any thoughts on this? – hans91 Nov 28 '16 at 04:20
  • OK. Can you provide a bigger sample of your data that causes this problem? I only used your original 8 observations thinking it would be scaleable, but it seems not. – Joe Nov 28 '16 at 11:14