I am getting some weird output when subsetting a data.frame in R.

Here is the file I used:

https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+2T2017+type@asset+block/WHO.csv

After reading the data into R, there are 194 observations of 13 variables.

> str(WHO)
'data.frame':   194 obs. of  13 variables:
$ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
$ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
$ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
$ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
$ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
$ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
$ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
$ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
$ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
$ GNI                          : num  1140 8820 8310 NA 5230 ...
$ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
$ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...

But the result of subsetting with the subset function differs from subsetting with df[,], as in the example below.

> Outliers <- WHO[WHO$GNI > 10000 & WHO$FertilityRate > 2.5,]
> nrow(Outliers)
[1] 27
> Outliers
Country                Region Population Under15 Over60 FertilityRate LifeExpectancy ChildMortality CellularSubscribers
NA                 <NA>                  <NA>         NA      NA     NA            NA             NA             NA                  NA
23             Botswana                Africa       2004   33.75   5.63          2.71             66           53.3              142.82
NA.1               <NA>                  <NA>         NA      NA     NA            NA             NA             NA                  NA
NA.2               <NA>                  <NA>         NA      NA     NA            NA             NA             NA                  NA
(trimmed ...)

There are a lot of all-NA rows: 27 rows are returned instead of 7.

Using the subset function, on the other hand, yields the correct result.

> Outliers <- subset(WHO, GNI > 10000 & FertilityRate > 2.5)
> nrow(Outliers)
[1] 7
> Outliers
          Country                Region Population Under15 Over60 FertilityRate LifeExpectancy ChildMortality CellularSubscribers
23           Botswana                Africa       2004   33.75   5.63          2.71             66           53.3              142.82
56  Equatorial Guinea                Africa        736   38.95   4.53          5.04             54          100.3               59.15
63              Gabon                Africa       1633   38.49   7.38          4.18             62           62.0              117.32
83             Israel                Europe       7644   27.53  15.15          2.92             82            4.2              121.66
88         Kazakhstan                Europe      16271   25.46  10.04          2.52             67           18.7              155.74
131            Panama              Americas       3802   28.65  10.13          2.52             77           18.5              188.60
150      Saudi Arabia Eastern Mediterranean      28288   29.69   4.59          2.76             76            8.6              191.24
(trimmed ...)

1 Answer

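The condition can evaluate to NA where GNI or FertilityRate is missing, and indexing with WHO[condition, ] returns an all-NA row for every NA in the logical index, whereas subset treats NA as FALSE. What about making sure you exclude the NAs first?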

Outliers <- WHO[!is.na(WHO$GNI) & WHO$GNI > 10000 &
                !is.na(WHO$FertilityRate) & WHO$FertilityRate > 2.5, ]
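
To see the behaviour in isolation, here is a minimal sketch on a toy data frame. The data frame df and its columns name, x and y are made up for illustration and are not taken from WHO.csv. It shows the NA rows produced by logical indexing, and which() as one possible shorter way to drop them, since which() returns only the positions that are TRUE.

```r
# Toy data (hypothetical values, not from WHO.csv)
df <- data.frame(
  name = c("a", "b", "c", "d"),
  x    = c(20000, NA, 15000, 500),
  y    = c(3.0, 4.0, NA, 5.0)
)

# NA > 10000 is NA, and NA & TRUE is NA, so the condition contains NAs:
# cond is TRUE, NA, NA, FALSE
cond <- df$x > 10000 & df$y > 2.5

# Logical indexing returns one all-NA row for each NA in cond
with_na <- df[cond, ]

# subset() treats NA in the condition as FALSE
by_subset <- subset(df, x > 10000 & y > 2.5)

# which() keeps only the positions where cond is TRUE
by_which <- df[which(cond), ]

nrow(with_na)    # 3 (one real row plus two NA rows)
nrow(by_subset)  # 1
nrow(by_which)   # 1
```

Note that FALSE & NA is FALSE, so a row with a missing value only turns into an NA row when the other half of the condition is TRUE or is also NA.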
Damiano Fantini