
Hi, I want to use filter in R to keep all the rows with selected country codes. The data covers consecutive years from 1950 to 2014 and looks like this:

  countrycode       country currency_unit year   rgdpe   rgdpo      pop      emp      avh
1         USA United States     US Dollar 1950 2279787 2274197 155.5635 62.83500 1983.738
2         USA United States     US Dollar 1951 2440076 2443820 158.2269 65.08094 2024.002
3         USA United States     US Dollar 1952 2530524 2526412 160.9597 65.85582 2020.183
4         USA United States     US Dollar 1953 2655277 2642977 163.6476 66.78711 2014.500
5         USA United States     US Dollar 1954 2640868 2633803 166.5511 65.59514 1991.019
6         USA United States     US Dollar 1955 2844098 2834914 169.5189 67.53133 1997.761

My code is:

dat_10 <- filter(data_all_country,countrycode == c("USA","CHN","GBR","IND","JPN","BRA","ZAF","FRA","DEU","ARG"))

The surprising thing is that dat_10 looks like this:

  countrycode   country  currency_unit year     rgdpe     rgdpo      pop       emp
1         ARG Argentina Argentine Peso 1954  51117.46  51031.80 18.58889  6.970472
2         ARG Argentina Argentine Peso 1964  69836.62  68879.08 21.95909  7.962999
3         ARG Argentina Argentine Peso 1974 113038.73 110358.46 25.64450  9.135211
4         ARG Argentina Argentine Peso 1984 148994.61 149928.59 29.92091 10.345933
5         ARG Argentina Argentine Peso 1994 379470.19 372903.00 34.55811 12.075872
6         ARG Argentina Argentine Peso 2004 517308.94 499958.94 38.72878 14.669195

Even though the time series is complete, only every 10th year is kept, and 10 is exactly the number of country codes I passed in the comparison.

Why does this happen, and is there a way to fix it?

exteral

1 Answer


Why Should We Use %in%, Not ==?

Let's look at the difference between == and %in% in more detail.

Assume that we have a vector that looks like this.

sample_vec <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR")

We want to flag every USA, CHN, and GBR in the vector. The desired output is the following logical vector, which is useful for subsetting or filtering.

#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

If we use == with c("USA", "CHN", "GBR"), we get the following.

sample_vec == c("USA", "CHN", "GBR")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Looks good? Wait, it is not doing what we think.

Let's test this code after adding one more country code to the original vector.

# Add one more country
sample_vec2 <- c(sample_vec, "IND")
sample_vec2 ==  c("USA", "CHN", "GBR")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

Warning message: In sample_vec2 == c("USA", "CHN", "GBR") : longer object length is not a multiple of shorter object length

The result may look good, but pay attention to the warning message. It turns out that when == compares two vectors of different lengths, R recycles the shorter vector to the length of the longer one. The above code is doing something like the following. Each pair of strings is compared separately.

Position  1     2     3     4     5     6     7     8     9    10 
Vector1 "USA" "CHN" "GBR" "IND" "JPN" "BRA" "USA" "CHN" "GBR" "IND" 
Vector2 "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"
Result   TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE

R checks whether the strings from Vector1 and Vector2 at Position 1 are the same. If they are, it returns TRUE, otherwise FALSE, and then it moves on to Position 2, and so on. The length of sample_vec2 is 10, while the length of the target vector is only 3, so R needs to recycle the elements of the target vector to perform a one-to-one comparison. The warning appears because 10 is not a multiple of 3, so the last recycling pass is cut short. In the first example the lengths were 9 and 3, so R recycled silently.
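The recycling described above can be reproduced by hand. The following is a minimal sketch using base R's rep_len; the names target, recycled, manual, and implicit are only illustrative and are not part of the original code.

```r
sample_vec2 <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR", "IND")
target <- c("USA", "CHN", "GBR")

# Recycle the short vector by hand to the length of the long one
recycled <- rep_len(target, length(sample_vec2))
recycled
#[1] "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"

# Element-wise comparison against the hand-recycled vector
manual <- sample_vec2 == recycled

# Letting == recycle implicitly gives the same answer (warning suppressed here)
implicit <- suppressWarnings(sample_vec2 == target)
identical(manual, implicit)
#[1] TRUE
```

The hand-recycled vector is exactly the Vector2 row in the table above.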

Once we realize that == recycles and then compares one-to-one, it is clear that it is not suitable for checking whether each element of a vector belongs to a set of values. Let's see the following example.

sample_vec == c("CHN", "GBR", "USA")
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

The code is almost the same as sample_vec == c("USA", "CHN", "GBR"), except that I changed the order of the country codes. But it returns all FALSE! This is because, after recycling, the one-to-one comparison finds no position where the two strings are the same. This is probably not the result we want.

However, suppose we use the following code.

sample_vec %in% c("CHN", "GBR", "USA")
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

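It returns the expected results. This is because %in% is an interface to the match function in R. For each element of the left-hand vector, it returns TRUE if that element appears anywhere in the right-hand vector and FALSE otherwise, regardless of position or order.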
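Applied to the question, the same reasoning explains the "every 10 years" result: the 10 codes are recycled down the rows, so a row survives only when its code happens to line up with the recycled position. Here is a small sketch in base R, assuming a made-up data frame toy (two countries, 20 years each) in place of data_all_country; the names toy, codes, wrong, and right are hypothetical.

```r
codes <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "ZAF", "FRA", "DEU", "ARG")

# Hypothetical toy data: two countries, 20 consecutive years each
toy <- data.frame(
  countrycode = rep(c("USA", "ARG"), each = 20),
  year = rep(1950:1969, times = 2)
)

# == recycles the 10 codes along the 40 rows, so only rows whose code
# coincides with the recycled code at that position are kept
wrong <- toy[toy$countrycode == codes, ]
wrong$year
#[1] 1950 1960 1959 1969

# %in% tests membership row by row, so every USA and ARG row is kept
right <- toy[toy$countrycode %in% codes, ]
nrow(right)
#[1] 40

# With dplyr loaded, the corresponding fix for the question would be:
# dat_10 <- filter(data_all_country, countrycode %in% codes)
```

Note that no warning is raised here, because 40 is a multiple of 10. That would explain why the original filter call could drop rows silently.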

www