0

I have a dataset of air pollution values each which corresponds to a particular station. I've computed the number of nulls for each air pollution variable and station code. Now I want to find the station for each pollutant with the minimum number of nulls.

Here is the dataset:

nulls_by_code
# A tibble: 34 x 9
   code    nox   no2    o3   so2    co pm10_raw  pm10  pm25
   <chr> <int> <int> <int> <int> <int>    <int> <int> <int>
 1 BL0    7195  1386  1234  4933 24408    24408  2981  2092
 2 BQ7     240   240   229 24392 24392      310   326 24392
 3 BQ8   24296 24296 24296 24296 24296    24296   323 24296
 4 CR8   24395 24395 24395 24395 24395    24395 24395   733
 5 CT3    1055  1055 24393 24393 24393     2365   971  2888
 6 EI3    1994  1994 24283 24283 24283    24283  3801 24283
 7 EN7     271   271 24392 24392 24392    24392 24392 24392
 8 HG4     205   205  1048 24392 24392    24392 24392 24392
 9 HP1   24391  8610  9596 24391 24391    24391  7987  8255
10 HP3   15633 15633 15633 15633 15633    15633 15633 15633
# … with 24 more rows

I've read a few posts about similar problems, but they're not quite what I want. There is no grouping variable in my data and I want to compute the minimum and the corresponding code for every column, something like:

nulls_by_code %>% 
     summarise(across(nox:pm25, ~ slice(which.min(.)))) %>% 
     mutate(across(nox:pm25, code)) ## I know this won't work, not sure which verb to use!

I've looked at: Find Minimum of a column and corresponding row of the minimum with condition on another column and minimum (or maximum) value of each row across multiple columns and How to select the row with the maximum value in each group but none of these have got me to dplyr dataframe nirvana.

The result should look something like this, or a transpose version of to ensure that columns have the same datatype and don't mix and .


+-----------+-----+-----+------+------+
| row_name  | nox | no2 | pm10 | pm25 |
+-----------+-----+-----+------+------+
| min_value | 205 | 205 | 323  | 733  |
| min_code  | HG4 | HG4 | BQ8  | CR8  |
+-----------+-----+-----+------+------+

I suspect dplyr grandmaster @akrun will be able to fix this in 10 seconds ... :-) Thanks for any help you are able to give.

Carl
  • 4,232
  • 2
  • 12
  • 24
LucieCBurgess
  • 759
  • 5
  • 12
  • 26

1 Answers1

1

Converting to long format via tidyr::pivot_longer and making use of dplyr::top_n this could be achieved like so:

As you want the minimum values for each pollutant we first group by pollutant and get the one row cointaining the minimum value per group by making use of top_n(1, -value) where value is a default name assigned by pivot_longer.

nulls_by_code <- read.table(text = "code    nox   no2    o3   so2    co pm10_raw  pm10  pm25
  1 BL0    7195  1386  1234  4933 24408    24408  2981  2092
2 BQ7     240   240   229 24392 24392      310   326 24392
3 BQ8   24296 24296 24296 24296 24296    24296   323 24296
4 CR8   24395 24395 24395 24395 24395    24395 24395   733
5 CT3    1055  1055 24393 24393 24393     2365   971  2888
6 EI3    1994  1994 24283 24283 24283    24283  3801 24283
7 EN7     271   271 24392 24392 24392    24392 24392 24392
8 HG4     205   205  1048 24392 24392    24392 24392 24392
9 HP1   24391  8610  9596 24391 24391    24391  7987  8255
10 HP3   15633 15633 15633 15633 15633    15633 15633 15633", header = TRUE)

library(dplyr)
library(tidyr)

nulls_by_code %>% 
  pivot_longer(-code, names_to = "pollutant") %>% 
  group_by(pollutant) %>% 
  top_n(1, -value)
#> # A tibble: 8 x 3
#> # Groups:   pollutant [8]
#>   code  pollutant value
#>   <chr> <chr>     <int>
#> 1 BL0   so2        4933
#> 2 BQ7   o3          229
#> 3 BQ7   pm10_raw    310
#> 4 BQ8   pm10        323
#> 5 CR8   pm25        733
#> 6 HG4   nox         205
#> 7 HG4   no2         205
#> 8 HP3   co        15633
stefan
  • 90,330
  • 6
  • 25
  • 51
  • Thanks stefan. Please can you explain your solution?What does top_n(1, -value) do? – LucieCBurgess Oct 15 '20 at 13:51
  • (: Sorry. `top_n` returns the top n rows (here the top 1) based on a column. If we want instead the bottom n rows with minimum values we can simply use -value. BTW: "value" is the name pivot_longer uses as default for the column containing the values. – stefan Oct 15 '20 at 13:54
  • Thanks for the explanation and the help. Not sure why it's been downvoted already as I spent 3 hours trying to fix it myself before posting on StackOverflow. I thought this was a pretty tenacious effort! I will look into pivot_longer – LucieCBurgess Oct 15 '20 at 14:37
  • Don't know either. I know myself that it's not easy to find the right answer on SO even it has probably already been answered. But don't worry about it. This happens sometimes. – stefan Oct 15 '20 at 14:44
  • Thanks for the help stefan, and the kind words, I really appreciate it! I even tried a little humour in my post to make SO a more enjoyable place to be :0) Thanks for taking the time to answer – LucieCBurgess Oct 15 '20 at 15:11