
I have two datasets (which might become three or four later) containing survey data, with some identical items that were put to respondents 30 years apart. I now want to produce graphs to compare the findings. I browsed some articles on Pew Research for inspiration, and I think the most suitable way to present the data is with pairs of bars of standardized length (one for time t1 and, below it, the other for t2), with differently colored segments representing proportions. I would make them horizontal so that I have ample space for labeling each pair of bars. So it would be, or at least could look like, a common `geom_bar(position = "fill") + coord_flip()` graph.

Here is some sample data:

country <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
var <- c(1, 3, 2, 1, 2, 2, 3, 3, 1, 3, NA, NA)
wght <- c(0.8, 0.9, 1.2, 1.5, 0.5, 1, 0.7, 0.9, 0.8, 1.1, 1, 0.8)

df <- cbind.data.frame(country, var, wght)
df2 <- cbind.data.frame(country, var, wght)

`country` is a country code, `var` is the variable I'm interested in, and `wght` is a post-stratification weight supplied with the dataset. In this example the two datasets are identical, of course, so there's no real point in visualizing them for comparison, but that should not make a difference for my question.

The simplest graph I would want to make is a country-specific one that contains two horizontal bars, one for the weighted responses at t1 and the other at t2. Later, I'd also want to make more complex ones, such as having in one graph pairs of bars for all countries, or within one country the responses separated by gender, age categories, education level, etc. For the most basic one, if there were no weights, I would do the following:

df$time <- 1
df2$time <- 2

varfull <- c(df$var, df2$var)
timefull <- c(df$time, df2$time)

newdf <- cbind.data.frame(timefull, varfull)
newdf$varfull <- as.factor(newdf$varfull)

library(ggplot2)

ggplot(newdf, aes(factor(timefull), fill = varfull)) + geom_bar(position = "fill") + coord_flip()

The graph would still need to be formatted, but the general structure is there. But the data is unweighted, and I can only think of very tedious ways to get to a graph using the weights (add up each individual's weight grouped by original response values, then calculate the proportions per group of the total sum).
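For reference, the manual route described above (sum the weights per response value, then divide by the total per time point) could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the names `stacked`, `agg`, `wsum` and `prop` are made up here, and respondents with a missing `var` are assumed to be dropped rather than shown as their own category.

```r
# Sample data from above, stacked into one data frame with a time indicator
# (the two waves are identical copies in this toy example).
country <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
var     <- c(1, 3, 2, 1, 2, 2, 3, 3, 1, 3, NA, NA)
wght    <- c(0.8, 0.9, 1.2, 1.5, 0.5, 1, 0.7, 0.9, 0.8, 1.1, 1, 0.8)

stacked <- rbind(
  data.frame(country, var, wght, time = 1),
  data.frame(country, var, wght, time = 2)
)

# Step 1: add up the weights per time point and response value.
# The formula interface of aggregate() omits rows with NA by default,
# so the two respondents without a response are dropped (assumption).
agg <- aggregate(wght ~ time + var, data = stacked, FUN = sum)
names(agg)[names(agg) == "wght"] <- "wsum"

# Step 2: divide each weighted sum by the total weight of its time point.
agg$prop <- agg$wsum / ave(agg$wsum, agg$time, FUN = sum)

agg
```

The resulting `agg` has one row per time point and response value, so it could be passed to something like `ggplot(agg, aes(factor(time), prop, fill = factor(var))) + geom_col() + coord_flip()`.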

If anyone can help in adding the weights in an easier fashion, I'd be grateful!

SpecialK201
  • 1
    Just a note that the `cbind` is completely unnecessary. `df <- cbind.data.frame(country, var, wght)` is a long way to write `df <- data.frame(country, var, wght)`. – Gregor Thomas Apr 19 '23 at 15:35
  • 1
    Generally, if you want similar geoms on your data sets, it makes the most sense to combine them into a single data set with a new column to identify the source. If you have multiple data sets I'd strongly suggest importing them [into a `list` of data frames](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames) rather than separate objects. This makes combining them very easy - especially if you use `dplyr::bind_rows()` or `data.table::rbindlist()`. – Gregor Thomas Apr 19 '23 at 15:39
  • 2
    I'd also recommend doing data manipulation "by group" with `dplyr` or `data.table` to handle the weights. – Gregor Thomas Apr 19 '23 at 15:41
  • @GregorThomas Appreciate the responses, and thanks for the heads-up regarding `cbind`. Must have come across it somewhere and then just kept using it. I didn't even know that data frames can be put in a list, or what exactly happens when I do that; I will look into it. Can you be a bit more specific about the last part though? I'm using `dplyr`, but how would I use it for weighting? – SpecialK201 Apr 20 '23 at 15:54
  • 1
    I mean, what you call *"very tedious ways to get to a graph using the weights (add up each individual's weight grouped by original response values, then calculate the proportions per group of the total sum)"*, I think would not be too tedious with `dplyr`. – Gregor Thomas Apr 20 '23 at 20:30
  • @GregorThomas Ah okay. I actually assumed there's a smarter way that has R do that work, because "tedious" also usually means "susceptible to errors due to inattention" etc. But I'll go with that then... – SpecialK201 Apr 21 '23 at 10:14
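
The suggestions in the comments could be sketched along these lines. This is a hedged illustration, not a tested answer for the real survey files. It assumes `dplyr` and `ggplot2` are installed, keeps the waves in a named list (the names `waves`, `combined`, `props`, `t1` and `t2` are made up here), and drops respondents with a missing `var`. Two routes are shown: letting `geom_bar()` sum the weights through its `weight` aesthetic, and computing the weighted proportions by group explicitly.

```r
library(dplyr)
library(ggplot2)

country <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
var     <- c(1, 3, 2, 1, 2, 2, 3, 3, 1, 3, NA, NA)
wght    <- c(0.8, 0.9, 1.2, 1.5, 0.5, 1, 0.7, 0.9, 0.8, 1.1, 1, 0.8)

# Keep the waves in a named list; bind_rows() turns the names into `time`.
waves <- list(
  t1 = data.frame(country, var, wght),
  t2 = data.frame(country, var, wght)
)

combined <- bind_rows(waves, .id = "time") %>%
  filter(!is.na(var)) %>%        # assumption: drop missing responses
  mutate(var = factor(var))

# Option A: geom_bar() sums the `weight` aesthetic instead of counting rows.
p_weight <- ggplot(combined, aes(x = time, fill = var, weight = wght)) +
  geom_bar(position = "fill") +
  coord_flip()

# Option B: compute the weighted proportions by group, then plot them as-is.
props <- combined %>%
  count(time, var, wt = wght, name = "wsum") %>%
  group_by(time) %>%
  mutate(prop = wsum / sum(wsum)) %>%
  ungroup()

p_col <- ggplot(props, aes(x = time, y = prop, fill = var)) +
  geom_col() +
  coord_flip()
```

Both plots should show the same segments for this data. Option B keeps the numbers available for labels or tables, and adding `country` to the grouping (or a `facet_wrap(~ country)` to Option A) would extend either route to per-country pairs of bars.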
