0

I have three datasets of ontario libraries for the past 3 years. The data sets have various information about the libraries, their address, city, card holders,etc. I created a dataset to combine all of the data sets into one new data set called data combined.


like so
data_2017<- read.csv("Downloads/2017.csv")
data_2016<- read.csv("Downloads/2016.csv")
data_2015<- read.csv("Downloads/2015.csv")

common_columns <- Reduce(intersect, list(colnames(data_2017), colnames(data_2016),colnames(data_2015)))
data_combined <- rbind(
  subset(data_2017, select = common_columns),
  subset(data_2016, select = common_columns),
  subset(data_2015, select = common_columns)
)

write.csv(data_combined, "Downloads.csv")

What I need help with is that I need write a sequence of code which will create a single data set that can be used to output a table that lists the number of libraries in each city for the last 3 years. In excel I would use the count function to see the amount of libraries each cities has... to create a new table. I need help with the equivalent in R. I want to make a new table that will have the cities names on the row header and the columns will be the sum of the libraries for each year 2015, 2016 and 2017. I want to make a new dataframe like this: INSTEAD OF 1999, 2000 and 2001.. I want it to say 2015, 2016 and 2017

Here is where you can find the data set for 2015, 2016 and 2017 here is where you can find the datasets.. only use 2015, 2016 and 2017

thanks

Hong Ooi
  • 56,353
  • 13
  • 134
  • 187

1 Answers1

1

This sounds like Calculate the mean by group for summarizing by group, then Reshape multiple value columns to wide format for pivoting from long to wide. However, this is complicated by the fact that some data have commas, rendering them as character instead of numeric, so rbinding them will be problematic. Here's a pipe that should take care of all of that.

I've downloaded those three files to my ~/Downloads/ directory, then

library(dplyr)

alldat <- lapply(grep("ontario", list.files("~/Downloads/", full.names=TRUE), value = TRUE), read.csv)
common_columns <- Reduce(intersect, sapply(alldat, names))

data_combined <- alldat %>%
  lapply(function(dat) as.data.frame(
    lapply(dat, function(z) if (all(grepl("^[0-9.,]*$", z))) type.convert(gsub(",", "", z), as.is = TRUE) else z)
    )) %>%
  lapply(subset, select = common_columns) %>%
  bind_rows() %>%
  tibble() %>%
  count(City = A1.10.City.Town, Year = Survey.Year.From) %>%
  tidyr::pivot_wider(City, names_from = Year, values_from = n)

data_combined
# # A tibble: 336 x 4
#    City         `2015` `2016` `2017`
#    <chr>         <int>  <int>  <int>
#  1 Addison           1      1      1
#  2 Ajax              1      1      1
#  3 Alderville        1      1      1
#  4 Algoma Mills      1      1      1
#  5 Alliston          2      2      2
#  6 Almonte           1      1      1
#  7 Amaranth          1      1      1
#  8 Angus             1      1      1
#  9 Apsley            1      1      1
# 10 Arnprior          2      2      2
# # ... with 326 more rows
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you!! I love your approach. It makes a lot of sense. For some reason when I run it this error occurs: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input Error in `group_by()`: ! Problem adding computed columns. Caused by error in `mutate()`: ! Problem while computing `City = A1.10.City.Town`. Caused by error in `mask$eval_all_mutate()`: ! object 'A1.10.City.Town' not found Run `rlang::last_error()` to see where the error occurred. – user19769766 Aug 15 '22 at 17:39
  • I don't know, you see my code, it doesn't have that error. Then again, I don't call `group_by`. – r2evans Aug 15 '22 at 17:53
  • ya your code works, I'll try to play around with it so I don't get that error – user19769766 Aug 15 '22 at 19:25
  • I know you don't call by group by I'm just confused as to how I get that error when I copy your code from the backtrace here are my results: Backtrace: 1. ... %>% ... 4. dplyr:::count.data.frame(., City = A1.10.City.Town, Year = Survey.Year.From) 6. dplyr:::group_by.data.frame(x, ..., .add = TRUE, .drop = .drop) 7. dplyr::group_by_prepare(.data, ..., .add = .add, caller_env = caller_env()) 8. dplyr:::add_computed_columns(...) 10. dplyr:::mutate_cols(...) 12. mask$eval_all_mutate(quo) – user19769766 Aug 15 '22 at 22:57