My dataset is constructed as follows:

# A tibble: 20 x 8
   iso3   year  Var1   Var1_imp    Var2      Var2_imp Var1_type Var2_type
   <chr> <dbl> <dbl>      <dbl>   <dbl>         <dbl> <chr>     <chr>     
 1 ATG    2000    NA      144        NA           277 imputed   imputed   
 2 ATG    2001    NA      144        NA           277 imputed   imputed   
 3 ATG    2002    NA      144        NA           277 imputed   imputed   
 4 ATG    2003    NA      144        NA           277 imputed   imputed   
 5 ATG    2004    NA      144        NA           277 imputed   imputed   
 6 ATG    2005    NA      144        NA           277 imputed   imputed   
 7 ATG    2006    NA      144        NA           277 imputed   imputed   
 8 ATG    2007   144      144       277           277 observed  observed  
 9 ATG    2008    45       45        NA           301 observed  imputed   
10 ATG    2009    NA       71.3      NA           325 imputed   imputed   
11 ATG    2010    NA       97.7      NA           349 imputed   imputed   
12 ATG    2011    NA      124        NA           373 imputed   imputed   
13 ATG    2012    NA      150.       NA           397 imputed   imputed   
14 ATG    2013    NA      177.      421           421 imputed   observed  
15 ATG    2014    NA      203       434           434 imputed   observed  
16 ATG    2015    NA      229.      422           422 imputed   observed  
17 ATG    2016    NA      256.      424           424 imputed   observed  
18 ATG    2017   282      282       429           429 observed  observed  
19 ATG    2018    NA      282       435           435 imputed   observed  
20 EGY    2000    NA    38485        NA        146761 imputed   imputed

I am new to R and would like to create a line chart for each country showing the time series of Var1_imp and Var2_imp on the same chart (my database has 193 countries with data from 2000 to 2018). Observed data points should be drawn as filled circles and imputed ones as unfilled circles (based on Var1_type and Var2_type). Circles should be joined with a solid line when two consecutive data points are both observed, and with a dotted line otherwise.

The main goal is to check, country by country, whether the method used to impute missing data works well, judging by whether the time series contain outliers.

I have tried the following:

ggplot(df, aes(x = year, y = Var1_imp, group = Var1_type)) +
  geom_point(size = 2, shape = 21) +  # shape = 21: unfilled circle; shape = 19: filled circle
  geom_line(linetype = "dashed")      # drop the linetype argument for a solid line

I am having difficulty working out: 1/ how to draw one chart per country per variable; 2/ how to include both Var1_imp and Var2_imp on the same chart; 3/ how to make geom_point depend on a condition (imputed versus observed in Var1_type); 4/ how to make geom_line depend on a condition (solid line if two consecutive data points are observed, otherwise dotted).

Thank you very much for your help - I think this exercise is not easy and I would learn a lot from your inputs.

  • You need to look at `fill` and `color`. However, how exactly do you want the plots to appear? On one plot? See also `facet`ting, [this](https://www.r-graph-gallery.com/line-chart-several-groups-ggplot2.html) and [this](https://stackoverflow.com/questions/10349096/group-data-and-plot-multiple-lines?noredirect=1&lq=1) – NelsonGon May 18 '20 at 10:52
  • correction: 1/ how to do one single chart per country – Nellicopter May 18 '20 at 10:55

2 Answers

Plotting two variables at the same time in a meaningful way in a line chart is going to be a bit hard. It's easier if you use pivot_longer to create one column containing both the Var1_imp and Var2_imp values. You will then have a key column containing the names Var1_imp and Var2_imp, and a values column containing the values for those two. You can then plot using year as x and the new values column as y, with colour set to the key column (lines have no fill). You'll then get two lines per country.
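As a rough illustration of that reshaping step, here is a minimal sketch on made-up data. The data frame `toy` and the column names `key` and `values` are assumptions chosen for the example; only `Var1_imp`, `Var2_imp`, `iso3` and `year` come from the question.

```r
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the question's column names
toy <- data.frame(iso3 = "AAA", year = 2000:2002,
                  Var1_imp = c(10, 12, 14), Var2_imp = c(30, 31, 32))

# One row per country, year and variable: "key" holds the old column name,
# "values" holds the number that was in that column
long <- pivot_longer(toy, cols = c(Var1_imp, Var2_imp),
                     names_to = "key", values_to = "values")

# colour = key gives one line per variable
p <- ggplot(long, aes(x = year, y = values, colour = key)) +
  geom_line() +
  geom_point()
```

Printing `p` draws the chart; adding `facet_wrap(~ iso3)` would give one panel per country once more countries are present.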

However, looking for outliers in line charts for 193 countries isn't a very good idea. Use

outlier_values <- boxplot.stats(airquality$Ozone)$out

to get the outliers in a column, or something similar with sapply to cover multiple columns. Outliers are normally defined as values lying more than 1.5 * IQR beyond the quartiles, so it's easy to figure out which ones they are.
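To keep track of which country and year each outlier belongs to, the same rule can be applied within each country. This is a minimal sketch on made-up data: `toy` and the helper `is_outlier` are hypothetical, and only the column names are borrowed from the question.

```r
# Hypothetical toy data using the question's column names
toy <- data.frame(
  iso3 = rep(c("AAA", "BBB"), each = 6),
  year = rep(2000:2005, times = 2),
  Var1_imp = c(10, 11, 12, 11, 10, 90, 5, 6, 5, 6, 5, 6)
)

# TRUE where a value is reported as an outlier by boxplot.stats()
# (by default, more than 1.5 * IQR beyond the hinges)
is_outlier <- function(x) x %in% boxplot.stats(x)$out

# ave() applies the helper within each country and returns a vector
# aligned with the rows; it comes back as 0/1, hence as.logical()
toy$Var1_outlier <- as.logical(ave(toy$Var1_imp, toy$iso3, FUN = is_outlier))

# Flagged rows, with their country and year
flagged <- toy[toy$Var1_outlier, c("iso3", "year", "Var1_imp")]
```

With only 19 years per country the quartiles are rough estimates, so the flags are better read as a shortlist of series to inspect than as a verdict.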

  • This gives me a list of values that are outliers but hard to know for which country, which year..and I need to show this type of chart to countries.. – Nellicopter May 18 '20 at 21:43
  • Well, you can use the vector to subset the dataset if you wish. airquality[which(airquality$Ozone %in% outlier_values),] – RegressionSquirrel May 18 '20 at 23:30

You can use the following code

library(tidyverse)  # provides %>%, pivot_longer() and ggplot()

# Stack Var1_imp and Var2_imp into one "values" column, labelled by "variable"
df %>% 
  pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>% 
  ggplot(aes(x=year, y=values, group=variable)) + 
  geom_point(size=2, shape=21) + 
  geom_line(linetype = "dashed") + facet_wrap(iso3~., scales = "free") + 
  xlab("Year") + ylab("Imp")

It is better to distinguish the two variables by colour:

df %>% 
  pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>% 
  ggplot(aes(x=year, y=values, colour=variable)) + 
  geom_point(size=2, shape=21) + 
  geom_line() + facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")

Update

df %>% 
  pivot_longer(cols = -c(sl, iso3, year, Var1, Var2),  
               names_to = c("group", ".value"), 
               names_pattern = "(.*)_(.*)") %>%   
  ggplot(aes(x=year, y=imp, shape = type, colour=group)) + 
  geom_line(aes(group = group, colour = group), size = 0.5) +
  geom_point(aes(group = group, colour = group, shape = type),size=2)  +
  scale_shape_manual(values = c('imputed' = 21, 'observed' = 16)) + 
  facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")

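The update handles the filled and unfilled circles but draws every line solid. One possible way to get a solid line only between two consecutive observed points is to draw each year-to-year link as its own segment. This is a sketch under assumptions, not a tested solution for the full data set: `toy` is made-up data, and `next_year`, `next_imp` and `link` are helper columns introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the question's column layout
toy <- tibble(
  iso3 = "AAA",
  year = 2000:2004,
  Var1_imp = c(10, 12, 14, 16, 18),
  Var2_imp = c(30, 31, 32, 33, 34),
  Var1_type = c("observed", "observed", "imputed", "observed", "observed"),
  Var2_type = c("imputed", "imputed", "imputed", "observed", "observed")
)

long <- toy %>%
  pivot_longer(cols = c(Var1_imp, Var2_imp, Var1_type, Var2_type),
               names_to = c("group", ".value"),
               names_pattern = "(.*)_(.*)") %>%
  arrange(iso3, group, year) %>%
  group_by(iso3, group) %>%
  mutate(
    next_year = lead(year),  # end point of the segment starting at this row
    next_imp  = lead(imp),
    # a link counts as "observed" only when both of its end points are observed
    link = if_else(type == "observed" & lead(type) == "observed",
                   "observed", "imputed")
  ) %>%
  ungroup()

p <- ggplot(long, aes(x = year, y = imp, colour = group)) +
  # the last year of each series has no following point, so it starts no segment
  geom_segment(data = filter(long, !is.na(next_year)),
               aes(xend = next_year, yend = next_imp, linetype = link)) +
  # shape 21 is an open circle (white fill hides the line behind it), 16 is solid
  geom_point(aes(shape = type), size = 2, fill = "white") +
  scale_shape_manual(values = c(imputed = 21, observed = 16)) +
  scale_linetype_manual(values = c(imputed = "dotted", observed = "solid")) +
  facet_wrap(~ iso3, scales = "free") +
  labs(x = "Year", y = "Imp")
```

Segments are used instead of `geom_line` because the line type here is a property of a pair of points rather than of a single point.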

Data

df = structure(list(sl = 1:20, iso3 = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("ATG", "EGY"), class = "factor"), year = c(2000L, 
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 
2000L), Var1 = c(NA, NA, NA, NA, NA, NA, NA, 144L, 45L, NA, NA, 
NA, NA, NA, NA, NA, NA, 282L, NA, NA), Var1_imp = c(144, 144, 
144, 144, 144, 144, 144, 144, 45, 71.3, 97.7, 124, 150, 177, 
203, 229, 256, 282, 282, 38485), Var2 = c(NA, NA, NA, NA, NA, 
NA, NA, 277L, NA, NA, NA, NA, NA, 421L, 434L, 422L, 424L, 429L, 
435L, NA), Var2_imp = c(277L, 277L, 277L, 277L, 277L, 277L, 277L, 
277L, 301L, 325L, 349L, 373L, 397L, 421L, 434L, 422L, 424L, 429L, 
435L, 146761L), Var1_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("imputed", 
"observed"), class = "factor"), Var2_type = structure(c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 1L), .Label = c("imputed", "observed"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))
  • Thank you! Is there a way to have unfilled or filled circles based on var1_type and var2_type, i.e. unfilled if it is imputed and filled if it is observed? – Nellicopter May 18 '20 at 21:26
  • @Nellicopter See my update. Don't forget to accept it as [answer](https://stackoverflow.com/help/someone-answers). – UseR10085 May 19 '20 at 05:24
  • Thank you! I have accepted it as answer. Since I have 193 countries, I can hardly see the charts.. I have tried to use facet_wrap_paginate but it runs forever.. could you help on that one? Right now, I cannot see anything.. – Nellicopter May 19 '20 at 07:00
  • You can divide the data into parts e.g. 12 countries in one excel sheet and plot it. – UseR10085 May 19 '20 at 07:10
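The splitting suggested in the last comment can also be done inside R, without going through separate sheets. A minimal sketch, assuming 12 countries per page: `toy`, `per_page`, `pages` and the output file name are all made up for the example.

```r
library(ggplot2)

# Hypothetical toy data: 30 countries with 5 years each
set.seed(1)
toy <- data.frame(
  iso3 = rep(sprintf("C%02d", 1:30), each = 5),
  year = rep(2000:2004, times = 30),
  imp  = runif(150)
)

per_page <- 12
codes <- unique(toy$iso3)
# Consecutive chunks of at most `per_page` country codes
pages <- split(codes, ceiling(seq_along(codes) / per_page))

# One faceted plot per chunk
plots <- lapply(pages, function(page_codes) {
  ggplot(toy[toy$iso3 %in% page_codes, ], aes(x = year, y = imp)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ iso3, scales = "free_y")
})

# Write one PDF page per chunk (the file name is arbitrary)
out_file <- file.path(tempdir(), "countries.pdf")
pdf(out_file, width = 11, height = 8)
for (p in plots) print(p)
dev.off()
```

With 193 countries and 12 per page this would produce 17 pages, the last one holding a single country.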