lapply For Multiple Files

Question

BorderData07 <- read_csv("Downloads/BorderData/BorderApprehension2007.csv")
BorderData08 <- read_csv("Downloads/BorderData/BorderApprehension2008.csv")
BorderData07[is.na(BorderData07)] = 0
BorderData08[is.na(BorderData08)] = 0
BorderData07$CITIZENSHIP <- str_to_title(BorderData07$CITIZENSHIP)
BorderData07$Region <- countrycode(sourcevar = BorderData07$CITIZENSHIP, origin = "country.name", destination = "region")
BorderData07[nrow(BorderData07), 26] <- "Total"
World_Region <- ddply(BorderData07,"Region",numcolwise(sum))
ggplot(World_Region, aes(x = Region, y = Total)) + geom_col(width = 0.5, position = position_dodge(3), fill = 'blue', alpha = 0.5) + scale_y_log10() + coord_flip() +  geom_text(aes(label=Total), alpha = 1.0, check_overlap = TRUE) +  ggtitle("Apprehension By World Region Totals in 2007")

I'm trying to use lapply to run this code on the csv file for each year of my border data. The only differences between the files are the year at the end of the file name and the title of the graph. My knowledge of lapply is very limited, and I'm having trouble getting it to work.

Hi, it's really not clear what your question is. You don't seem to have a list so I don't know why lapply makes sense. Perhaps describe what you are trying to do and provide a minimal example. — Elin, Jul 15 '21 at 01:09
If you're asking how to read those files with `lapply`, I think your question has already been answered here: https://stackoverflow.com/q/11433432/6288065. I agree with @Elin; the code that you provide only shows two files (good for a minimal example), but the other lines of code would be irrelevant... Or are you asking how to run all of that code within `lapply`? — LC-datascientist, Jul 15 '21 at 01:14
@LC-datascientist Yes, sorry, I may not have been clear in my question. I am trying to run all of that code within lapply. I only provided two files as an example, but there are 13 files, from 2007 to 2019. — josephtg1, Jul 15 '21 at 01:44

score 0 · Answer 1 · answered Jul 15 '21 at 02:19

Put everything that you want to apply to each file in a function:

apply_fun <- function(file) {
  x <- read_csv(file)
  year <- str_extract(file, '\\d+')
  x[is.na(x)] = 0
  x$CITIZENSHIP <- str_to_title(x$CITIZENSHIP)
  x$Region <- countrycode(sourcevar = x$CITIZENSHIP, origin = "country.name", destination = "region")
  x[nrow(x), 26] <- "Total"
  World_Region <- ddply(x,"Region",numcolwise(sum))
  ggplot(World_Region, aes(x = Region, y = Total)) + 
    geom_col(width = 0.5, position = position_dodge(3), fill = 'blue', alpha = 0.5) + 
    scale_y_log10() + coord_flip() +  
    geom_text(aes(label=Total), alpha = 1.0, check_overlap = TRUE) +  
    ggtitle(paste("Apprehension By World Region Totals in", year)) # paste() adds the space before the year
}

and then use lapply:

filename <- list.files('Downloads/BorderData/', pattern = '\\.csv$', full.names = TRUE)
list_plots <- lapply(filename, apply_fun)

Hi Ronak, thank you so much for your help! I understand what you did, I'm just a little confused about what the '\\d+' does, if you're able to explain! — josephtg1, Jul 15 '21 at 15:55
`\\d+` is a [regular expression](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) that matches a run of one or more digits. — LC-datascientist, Jul 15 '21 at 19:05
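As a small illustration of the comment above, here is a sketch of what `str_extract` returns for `\\d+`. The two paths are hypothetical, modelled on the file names in the question; nothing is read from disk.

```r
library(stringr)

# Hypothetical paths that follow the naming pattern from the question
files <- c(
  "Downloads/BorderData/BorderApprehension2007.csv",
  "Downloads/BorderData/BorderApprehension2008.csv"
)

# "\\d+" matches the first run of one or more digits in each string
years <- str_extract(files, "\\d+")
years
```

Because `str_extract` returns only the first match, this relies on the year being the first run of digits in the path.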

LC-datascientist · Answer 2 · 2021-07-15T21:55:04.123

library(plyr)        # ddply() and numcolwise(); load before tidyverse so dplyr's verbs are not masked
library(tidyverse)   # loads dplyr, tidyr, readr, stringr and ggplot2
library(countrycode) # countrycode()

list.files( # get multiple file paths
    path = "Downloads/BorderData", 
    pattern = "BorderApprehension\\d+\\.csv$", # `pattern` is a regular expression, not a glob
    full.names = TRUE
) %>%
    setNames(., paste0("BorderData", str_extract(., "\\d{2}(?=\\.csv)"))) %>% # (optional; provides names to file paths)
    lapply(function(file) {
        year <- str_extract(file, "\\d+(?=\\.csv)") # use in `ggtitle`
        df <- read_csv(file) %>% 
            mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>% # zero-fills NAs in numeric columns only (`BorderData07[is.na(BorderData07)] = 0` fills every column)
            mutate(
                CITIZENSHIP = str_to_title(CITIZENSHIP), 
                Region = countrycode(sourcevar = CITIZENSHIP, origin = "country.name", destination = "region")
            )
        df[nrow(df), 26] <- "Total" # BorderData07[nrow(BorderData07), 26] <- "Total"
        World_Region <- ddply(df, "Region", numcolwise(sum))
        ggplot(World_Region, aes(x = Region, y = Total)) + 
            geom_col(width = 0.5, position = position_dodge(3), fill = 'blue', alpha = 0.5) + 
            scale_y_log10() + 
            coord_flip() + 
            geom_text(aes(label=Total), alpha = 1.0, check_overlap = TRUE) +  
            ggtitle(paste("Apprehension By World Region Totals in", year))
    })

The output is a list of ggplots.

If you want the cleaned data frames instead of the plots, make return(df) the last line of the function passed to lapply; the call then returns a list of data frames rather than a list of ggplots.
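If both the data and the plot are needed, one option is to return them together in a small list. The following is only a sketch: the toy data frames stand in for the cleaned yearly data, and the names `yearly`, `results`, `all_data` and `all_plots` are made up for the illustration.

```r
library(ggplot2)

# Hypothetical stand-ins for the cleaned yearly data frames
yearly <- list(
  BorderData07 = data.frame(Region = c("A", "B"), Total = c(10, 20)),
  BorderData08 = data.frame(Region = c("A", "B"), Total = c(15, 25))
)

results <- lapply(names(yearly), function(nm) {
  df <- yearly[[nm]]
  p <- ggplot(df, aes(x = Region, y = Total)) +
    geom_col() +
    ggtitle(nm)
  list(data = df, plot = p) # return the data frame and the plot together
})
names(results) <- names(yearly)

# Pull each component back out as its own named list
all_data  <- lapply(results, `[[`, "data")
all_plots <- lapply(results, `[[`, "plot")
```

`all_plots$BorderData07` then prints the 2007 plot, and `all_data$BorderData07` gives the matching data frame.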

If you use the optional setNames (as shown in the code), the list will have names that correspond to "BorderData07", "BorderData08", etc.
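To see why the names carry through, here is a minimal sketch with hypothetical paths. `lapply` keeps the names of its input, so naming the file paths is enough to name the results; `basename` is used only as a stand-in for the real per-file function.

```r
library(stringr)

# Hypothetical paths that follow the naming pattern from the question
files <- c(
  "Downloads/BorderData/BorderApprehension2007.csv",
  "Downloads/BorderData/BorderApprehension2008.csv"
)

# Name each path "BorderData" plus the last two digits before ".csv"
named_files <- setNames(
  files,
  paste0("BorderData", str_extract(files, "\\d{2}(?=\\.csv)"))
)

# Stand-in for the real work: the result keeps the names of `named_files`
result <- lapply(named_files, basename)
names(result)
```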

str_extract(., "\\d{2}(?=\\.csv)") in the code's setNames uses a regular expression to extract the last two digits before ".csv".

str_extract(file, "\\d+(?=\\.csv)") in the code's lapply uses a regular expression to extract the run of one or more digits immediately before ".csv", which is the year in your example. The lookahead (?=\\.csv) is only needed if digits appear elsewhere in the file path: it requires ".csv" to follow the digits immediately, which makes the pattern more specific.
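A short sketch of the difference the lookahead makes. The path below is hypothetical and deliberately contains a stray digit in a folder name.

```r
library(stringr)

# Hypothetical path with a digit in the folder name
path <- "Downloads/Border2Data/BorderApprehension2007.csv"

first_digits <- str_extract(path, "\\d+")              # first run of digits anywhere
year_full    <- str_extract(path, "\\d+(?=\\.csv)")    # digits right before ".csv"
year_short   <- str_extract(path, "\\d{2}(?=\\.csv)")  # exactly two digits before ".csv"

c(first_digits, year_full, year_short)
# "2" "2007" "07"
```

Without the lookahead, the stray digit in the folder name is returned instead of the year.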

The pipe operator (%>%) comes from the magrittr package and is re-exported by dplyr; the mutate functions are from dplyr. Both packages are loaded with tidyverse. They cut down on repetition, such as retyping the data frame's name on every line.
