I have a balanced panel data table containing thousands of firms with observations in two years each (1 & 2). For the data cleaning process I need to exclude observations where there is a shift between the years in a supposedly time-constant variable.

library(data.table)

example <- matrix(c(1,1,2,2,3,3,4,4,1,2,1,2,1,2,1,2,1,1,0,0,0,1,1,0), ncol = 3)
colnames(example) <- c('id', 'year', 'supposedly time-constant')
example.table <- data.table(example)
example.table

   id year supposedly time-constant
1:  1    1                        1
2:  1    2                        1
3:  2    1                        0
4:  2    2                        0
5:  3    1                        0
6:  3    2                        1
7:  4    1                        1
8:  4    2                        0

So, in the above table, firms 3 and 4 both show a change in the supposedly time-constant variable, so they need to be removed; firms 1 and 2 are what I wish to keep. I need code or a function that lets me clean my data this way.

I seem to have reached the limits of my R knowledge and hope I can find help here - thanks in advance!

M_R_

4 Answers

We can use dplyr and keep only the groups that have a single unique value:

library(dplyr)
example.table %>%
   group_by(id) %>%
   filter(n_distinct(`supposedly time-constant`) == 1)


#     id  year `supposedly time-constant`
#  <dbl> <dbl>                      <dbl>
#1     1     1                          1
#2     1     2                          1
#3     2     1                          0
#4     2     2                          0

The same logic in base R using ave would be:

example.table[with(example.table, ave(`supposedly time-constant`, id, 
             FUN = function(x) length(unique(x))) == 1), ]
Ronak Shah

You can use data.table and its chaining concept to identify which ids/firms show such a change:

example.table[, .(unq_val = length(unique(`supposedly time-constant`))),
              by = .(id)][unq_val >= 2, .(id)]

The above line of code can be broken down as follows:

1. For each id (in the 'by' argument),
2. create a variable called unq_val that counts the number of unique values of `supposedly time-constant`,
3. then select only the firms/ids for which this variable is >= 2.

The output of the code is:

   id
1:  3
2:  4

This is a data.table, which you can then use to filter the flagged firms out of your original data.
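For instance, here is a minimal sketch of that last filtering step, using a data.table anti-join (the name `changed.ids` for the intermediate result is mine, not from the answer above):

```r
library(data.table)

example <- matrix(c(1,1,2,2,3,3,4,4,1,2,1,2,1,2,1,2,1,1,0,0,0,1,1,0), ncol = 3)
colnames(example) <- c('id', 'year', 'supposedly time-constant')
example.table <- data.table(example)

# ids whose supposedly time-constant variable changes between the years
changed.ids <- example.table[, .(unq_val = uniqueN(`supposedly time-constant`)),
                             by = .(id)][unq_val >= 2, .(id)]

# anti-join: keep only rows whose id does NOT appear in changed.ids
cleaned <- example.table[!changed.ids, on = "id"]
cleaned
#    id year supposedly time-constant
# 1:  1    1                        1
# 2:  1    2                        1
# 3:  2    1                        0
# 4:  2    2                        0
```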

ds_newbie

Another dplyr approach with all, which keeps the groups that are all 0s or all 1s. This will be inefficient if you have many such supposedly time-constant variables:

    library(dplyr)
    example.table %>% 
       group_by(id) %>% 
       filter(all(`supposedly time-constant`==1) | all(`supposedly time-constant`==0))
    # A tibble: 4 x 3
    # Groups:   id [2]
         id  year `supposedly time-constant`
      <dbl> <dbl>                      <dbl>
    1     1     1                          1
    2     1     2                          1
    3     2     1                          0
    4     2     2                          0
NelsonGon

We can use data.table methods, as it is a data.table. Grouped by 'id', check whether the number of unique elements in `supposedly time-constant` equals 1, and subset the .SD (Subset of Data.table):

library(data.table)
example.table[, .SD[uniqueN(`supposedly time-constant`) == 1], by = id]
#     id year supposedly time-constant
#1:  1    1                        1
#2:  1    2                        1
#3:  2    1                        0
#4:  2    2                        0

Or a slightly faster option would be to use .I:

example.table[example.table[, .I[uniqueN(`supposedly time-constant`) == 1],
              by = id]$V1]
#   id year supposedly time-constant
#1:  1    1                        1
#2:  1    2                        1
#3:  2    1                        0
#4:  2    2                        0
akrun