I have daily minimum temperature, maximum temperature, minimum dew point, and maximum dew point data. The data contains NaN values, so I want to calculate, for each column, the percent of missing (NaN) values in each year, along with the total percent missing over the whole period (1948-2018).

My data is

    Station  Date       Month  Day  Year  MaxTemp  MinTemp  MaxDewPoint  MinDewPoint
    ORD      1/1/1948   1      1    1948  35.6     26.6     34.16        -27.4
    ORD      1/2/1948   1      2    1948  -2       -16      -16.96       -27.04
    ORD      1/3/1948   1      3    1948  -4       -26      -12          -26
    ORD      1/4/1948   1      4    1948  -5       -26      -15          -26
    ORD      1/5/1948   1      5    1948  8        -25      3            NaN
    ORD      1/6/1948   1      6    1948  -11      -25      -24          -25
    ORD      1/7/1948   1      7    1948  1        -23      NaN          -23
    ORD      1/8/1948   1      8    1948  1        -22      -9           NaN
    ORD      1/9/1948   1      9    1948  NaN      -22      -5           -22
    ORD      1/10/1948  1      10   1948  10       NaN      -2           -22
    ORD      1/11/1948  1      11   1948  -11      -21      -23          -21
    ORD      1/12/1948  1      12   1948  3        -12      -7.96        -20.92
    ORD      1/13/1948  1      13   1948  6.98     -7.6     -7.6         -20.2
    ORD      1/14/1948  1      14   1948  3.92     -9.4     -11.2        NaN
    ORD      1/15/1948  1      15   1948  6        -7       -5.98        NaN
    ORD      1/16/1948  1      16   1948  3        -11      -7.96        -20.02

My code so far:

    install.packages("dplyr")    # only needed once
    library(dplyr)
    install.packages("stringr")  # only needed once
    library(stringr)

    # Set the working directory
    setwd("D:/Climate Data Analysis/Asignment 1")

    # Read the CSV file
    data <- read.csv("chiacagost.csv", header = TRUE, sep = ",")

    # Store the variables in a data frame
    dframe <- data.frame(data)

    # Percent of missing values in each column over the whole period
    MisMxTMP  <- dframe %>% summarise(NAMisMxTMP  = sum(is.na(Max.Temp)) / length(Max.Temp)) * 100
    misMnTMP  <- dframe %>% summarise(NAmisMnTMPL = sum(is.na(Min.Temp)) / length(Min.Temp)) * 100
    MisMxDTMP <- dframe %>% summarise(NAMisMxDTMP = sum(is.na(Max.Dew.Point)) / length(Max.Dew.Point)) * 100
    MisMnDTMP <- dframe %>% summarise(NAMisMnDTMP = sum(is.na(Min.Dew.Point)) / length(Min.Dew.Point)) * 100

I was able to calculate the total percent of missing data, but I also want it by year, so that I can exclude from my analysis the year with the highest percentage of missing data.

Rui Barradas
Lira

1 Answer

To calculate the percent of missing data by year and by variable:

> dframe %>% 
+     tidyr::gather(var, value, MaxTemp, MinTemp, MaxDewPoint, MinDewPoint) %>% 
+     dplyr::group_by(Year, var) %>% 
+     dplyr::summarise(pct_na = sum(is.nan(value)) / n())
# A tibble: 4 x 3
# Groups:   Year [?]
   Year var         pct_na
  <int> <chr>        <dbl>
1  1948 MaxDewPoint 0.0625
2  1948 MaxTemp     0.0625
3  1948 MinDewPoint 0.25  
4  1948 MinTemp     0.0625

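To get the percent of missing data for the whole year (all four variables combined), just change `group_by(Year, var)` to `group_by(Year)`. Note that `pct_na` is a fraction between 0 and 1; multiply it by 100 to get a percentage.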
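As a rough sketch of that per-year variant, and of how its result could be used to pick out the year with the most missing data (the goal stated in the question), the snippet below works on a small made-up data frame. The values and the names `toy`, `pct_by_year`, and `worst_year` are illustrative assumptions, not part of the original data. It uses `is.na()` and multiplies by 100 so the result is a percentage.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two years, two variables, four days each
toy <- data.frame(
  Year    = c(1948, 1948, 1948, 1948, 1949, 1949, 1949, 1949),
  MaxTemp = c(35.6, NaN, -4, -5, 8, 1, 1, 3),
  MinTemp = c(26.6, -16, NaN, NaN, -25, NA, -23, -22)
)

# Percent of missing values per year, all variables pooled together
pct_by_year <- toy %>%
  gather(var, value, MaxTemp, MinTemp) %>%
  group_by(Year) %>%
  summarise(pct_na = 100 * mean(is.na(value)))

# Year(s) with the highest percent of missing values
worst_year <- pct_by_year %>%
  filter(pct_na == max(pct_na))

print(pct_by_year)
print(worst_year)
```

With this toy input, 3 of the 8 values in 1948 and 1 of the 8 values in 1949 are missing, so 1948 is the year that would be flagged for exclusion.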

Data

dframe <- read.table(textConnection(gsub(" ORD ", "\nORD ", "Station Date Month Day Year MaxTemp MinTemp MaxDewPoint MinDewPoint ORD 1/1/1948 1 1 1948 35.6 26.6 34.16 -27.4 ORD 1/2/1948 1 2 1948 -2 -16 -16.96 -27.04 ORD 1/3/1948 1 3 1948 -4 -26 -12 -26 ORD 1/4/1948 1 4 1948 -5 -26 -15 -26 ORD 1/5/1948 1 5 1948 8 -25 3 NaN ORD 1/6/1948 1 6 1948 -11 -25 -24 -25 ORD 1/7/1948 1 7 1948 1 -23 NaN -23 ORD 1/8/1948 1 8 1948 1 -22 -9 NaN ORD 1/9/1948 1 9 1948 NaN -22 -5 -22 ORD 1/10/1948 1 10 1948 10 NaN -2 -22 ORD 1/11/1948 1 11 1948 -11 -21 -23 -21 ORD 1/12/1948 1 12 1948 3 -12 -7.96 -20.92 ORD 1/13/1948 1 13 1948 6.98 -7.6 -7.6 -20.2 ORD 1/14/1948 1 14 1948 3.92 -9.4 -11.2 NaN ORD 1/15/1948 1 15 1948 6 -7 -5.98 NaN ORD 1/16/1948 1 16 1948 3 -11 -7.96 -20.02")), header = T)
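One caveat, offered as a hedged note rather than a correction: the `summarise()` call above counts missing values with `is.nan()`, which is TRUE only for `NaN`. If the real CSV also contains blank cells or `NA` strings, `read.csv()` typically reads those as `NA`, and `is.nan()` would not count them, whereas `is.na()` is TRUE for both. The vector `x` below is a made-up example to show the difference.

```r
# Hypothetical vector with one NA and one NaN
x <- c(1.5, NA, NaN, 4)

nan_only <- is.nan(x)  # TRUE only for NaN
na_any   <- is.na(x)   # TRUE for NA and NaN

frac_nan <- mean(nan_only)  # fraction counted by is.nan()
frac_na  <- mean(na_any)    # fraction counted by is.na()

print(frac_nan)
print(frac_na)
```

For the sample data in this answer the two functions give the same result, because every missing value there is a `NaN`.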
C. Braun
  • Thanks C. Braun, this code counts the percent by year only. I also want to know what percent of each variable is missing over the whole period. For example, this code counts the missing percent of, say, max temp for years 1, 2, ..., n individually. I also want the total missing percent of max temp for years 1 through n. – Lira Jan 29 '19 at 17:07
  • Hi @Lira, you can get the total missing percent of all years by removing `Year` from the `group_by` call. – C. Braun Jan 29 '19 at 17:20
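
The suggestion in the last comment, grouping by the variable only, might look like the following sketch. It again uses a made-up data frame; `toy`, `total_by_var`, and `check` are hypothetical names, and the base R `colMeans()` line is included only as a cross-check of the same numbers.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two years, two variables, four days each
toy <- data.frame(
  Year    = c(1948, 1948, 1948, 1948, 1949, 1949, 1949, 1949),
  MaxTemp = c(35.6, NaN, -4, -5, 8, 1, 1, 3),
  MinTemp = c(26.6, -16, NaN, NaN, -25, NA, -23, -22)
)

# Percent of missing values per variable over all years
# (Year is dropped from group_by)
total_by_var <- toy %>%
  gather(var, value, MaxTemp, MinTemp) %>%
  group_by(var) %>%
  summarise(pct_na = 100 * mean(is.na(value)))

# Base R cross-check: column-wise percent of missing values
check <- 100 * colMeans(is.na(toy[c("MaxTemp", "MinTemp")]))

print(total_by_var)
print(check)
```

Both approaches should agree; the `colMeans()` form avoids reshaping when only whole-period totals are needed.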