0

I am analysing a longitudinal survey (https://microdata.worldbank.org/index.php/catalog/3712) with around 2k participating households (dropping with each round). There were 11 waves/rounds, each divided into around 6-8 datasets based on theme of the questions. To analyse it, I need it in proper panel data format, with each theme being in one file, combining all the waves.

Please see the excel snippets (with most columns removed for simplicity) of how it looks: Round 1 vs round 9 (The levels of categorical variables have change names, from full names to just numbers but it's the same question). Basically, the format looks something like this:

head(round1.csv)
ID INCOME SOURCE ANSWER CHANGE
101 1.Business 1. YES 3. Reduced
101 2.Pension 2. NO
102 1.Business 1. YES 2. No change
102 2. Assistance 1. YES 3. Reduced

So far I have only been analysing seperate waves by their own, but I do not know how to:

  1. Combine so many data frames together.
  2. Convert it to the format where each ID appears only once per wave. I used spread to use modelling in single files. I think I can imagine what the data frame would look like if the question was only whether they receive the income source (maybe like this?:
WAVE ID Business Pension
:1 101 1. YES 1. NO
:1 102 1. YES 1. YES
:2 101 1. NO 1. YES
:2 102 1. YES 1. YES

), but I do not understand how it is supposted to look like with also the change to that income included.

  1. How to deal with weights - there are weights added to one of the files for each wave. Some are missing, and they change per wave, as fewer households agree to participate each round. I am happy to filter and only use houesholds that participated in every round, to make it easier.

I looked for an aswer here: Panel data, from wide to long with multiple variables and Transpose a wide dataset into long with multiple steps but I think my problem is different.

I am a student, definitely not an expert, so I apologise for my skill-level.

liborm
  • 2,634
  • 20
  • 32
Kamila
  • 1

1 Answers1

0

There's too many questions at once, I'll ignore the weights (it should be a separate question, after the merging is resolved).

How to merge? For sure you'll be doing something called a left join. The leftmost dataset should be the longest one (the first wave). The others will be joined by ID, and the IDs missing in the next ones will get NAs instead of the values. I'll be using tidyverse code examples - left_join docs here`.

You'll have to deal with a few things on the way.

  • duplicate column names
    • you can use the suffix argument like suffix = c(".wave1", ".wave2")
  • different coding of the data (seen in your samples eg s7q1 1. YES vs 1)
    • use something like extract() to get the same representation

When you're done with the joining, you need to re-shape your data. That would be something like pivot_longer(), followed by extract() to get the .wave# suffix into a separate column. Then you can pivot_wider() back into a wider format, keeping your wave column.

R-like pseudocode, illustrates how it could be done .. does not work (as I don't have your datasets):

library(tidyverse)
library(readxl)

read_excel("wave1.xlsx") -> d_w1
read_excel("wave2.xlsx") -> d_w2

d_w1 %>%
  extract(s7q1, into = "s7q1", regex = "([0-9]+)") %>%
  d_w1fix

d_w1fix %>%
  left_join(d_w2, by = "ID", suffix = c(".wave1", ".wave2")) %>%
  pivot_longer(-ID, names_to = "question", values_to = "answer") %>%
  extract(question, into = c("question", "wave"), regex = "([[:alnum:]]+).wave([0-9])") %>%
  pivot_wider(names_from = "question", values_from = "answer") ->
  d_final

liborm
  • 2,634
  • 20
  • 32