0

I have raw data in the format of df below that I'm needing to condense into a fewer row data.frame. The data represent eventual GIS point data (columns for coordinates are left off here) so I'm hoping to avoid plotting any duplicate points. Each row represents one point for which either or both of agency1 and agency2 publish data. A simpler case with fewer column formats is answered here.

For cases where there are rows with identical agency3_id, I'd like to condense those two rows into one row. AS an example, I want rows 3 and 4 in the original data.frame (which both have agency3_id abcde) to become a single row in the desired data.frame below (its row 3). I'm open to any R method. I'm sure there's a better title question - any help appreciated.

library(tidyverse)
library(lubridate)

agency1_id   <- as.double(c("1500", NA, "2007", NA, "4501", NA))
agency2_id   <- c(NA, "zxc", NA, "xcv", NA, "bnm")
agency3_id   <- c("qwert", "ertyu", "abcde", "abcde", NA, NA)
agency1_val  <- c(0.21, 1.5, 0.23, NA, 4.3, NA)
agency2_val  <- c(0.05, 4.4, NA, 6.3, NA, 2.0)
agency1_desc <- c("st", NA, "ko", NA, "ui", NA)
agency2_desc <- c(NA, "lo", NA, "vf", NA, "kl")
agency1_dtm  <- ymd_hm(c("2019-05-30 04:30", NA, "2019-05-30 04:35", 
                          NA, "2019-05-30 04:33", NA))
agency2_dtm  <- ymd_hm(c(NA, "2019-05-30 04:20", NA, "2019-05-30 
                          04:29", NA, "2019-05-30 04:31"))

df <- data.frame(agency1_id, agency2_id, agency3_id, agency1_val,
                 agency2_val, agency1_desc, agency2_desc, agency1_dtm,
                 agency2_dtm)
as_tibble(df)

# agency1_id agency2_id agency3_id agency1_val agency2_val agency1_desc agency2_desc agency1_dtm         agency2_dtm        
# <dbl>      <fct>      <fct>       <dbl>       <dbl>        <fct>        <fct>        <dttm>              <dttm>             
# 1 1500       NA         qwert      0.21        0.05        st           NA           2019-05-30 04:30:00 NA                 
# 2 NA         zxc        ertyu      1.5         4.4         NA           lo           NA                  2019-05-30 04:20:00
# 3 2007       NA         abcde      0.23         NA         ko           NA           2019-05-30 04:35:00 NA                 
# 4 NA         xcv        abcde       NA         6.3         NA           vf           NA                  2019-05-30 04:29:00
# 5 4501       NA         NA         4.3          NA         ui           NA           2019-05-30 04:33:00 NA                 
# 6 NA         bnm        NA          NA         2           NA           kl           NA                  2019-05-30 04:31:00

desired df:

# agency1_id agency2_id agency3_id agency1_val agency2_val agency1_desc agency2_desc agency1_dtm         agency2_dtm        
# <dbl>      <fct>      <fct>       <dbl>       <dbl>        <fct>        <fct>        <dttm>              <dttm>             
# 1 1500       NA         qwert      0.21        0.05        st           NA           2019-05-30 04:30:00 NA                 
# 2 NA         zxc        ertyu      1.5         4.4         NA           lo           NA                  2019-05-30 04:20:00
# 3 2007       xcv        abcde      0.23        6.3         ko           vf           2019-05-30 04:35:00 2019-05-30 04:29:00                 
# 4 4501       NA         NA         4.3          NA         ui           NA           2019-05-30 04:33:00 NA                 
# 5 NA         bnm        NA          NA         2           NA           kl           NA                  2019-05-30 04:31:00
dbo
  • 1,174
  • 1
  • 11
  • 19
  • What would you want to do with the values in the other columns? Is it always going to be a combination of 1 NA and 1 value? – Sven May 30 '19 at 16:55
  • @Sven good point that I left off - when there are non-unique `agency3_id`s, yep, it will always be that combination. – dbo May 30 '19 at 16:58
  • Possible duplicate of [combine rows in data frame containing NA to make complete row](https://stackoverflow.com/questions/45515218/combine-rows-in-data-frame-containing-na-to-make-complete-row) – akash87 May 30 '19 at 17:00

1 Answers1

1

You can make use of this. Might not be the most concise solution.

# Data with NA values in column - agency3_id
df_na <- df[is.na(df$agency3_id), ]

# Logic
df <- df[!is.na(df$agency3_id), ] %>% 
  group_by(agency3_id) %>% 
  summarise_all(list(~ if(all(is.na(.))) NA else .[!is.na(.)][1]))

# Merge dataframes
rbind(df, df_na)
# Result
# A tibble: 5 x 9
  agency3_id agency1_id agency2_id agency1_val agency2_val agency1_desc agency2_desc agency1_dtm        
* <fct>           <dbl> <fct>            <dbl>       <dbl> <fct>        <fct>        <dttm>             
1 abcde            2007 xcv               0.23        6.3  ko           vf           2019-05-30 04:35:00
2 ertyu              NA zxc               1.5         4.4  NA           lo           NA                 
3 qwert            1500 NA                0.21        0.05 st           NA           2019-05-30 04:30:00
4 NA               4501 NA                4.3        NA    ui           NA           2019-05-30 04:33:00
5 NA                 NA bnm              NA           2    NA           kl           NA                 
# … with 1 more variable: agency2_dtm <dttm>
skillsmuggler
  • 1,862
  • 1
  • 11
  • 16
  • thanks @skillsmuggler. this works for the example, but on my actual twenty or so column `df`, a handful of text and datetime columns are getting converted to numeric, losing their original values and being replaced by numeric values..not sure why.. – dbo May 30 '19 at 18:49