I am trying to create a loop over 30 Excel files imported from a folder.

In the code below I get the file paths from the folder and load the files into a list. Then I create an individual data frame from each file, because I need to manipulate each of them: remove several rows and keep only the columns I need.

Is it possible in R to do this without creating individual data frames, and instead clean each sheet right away inside a loop (for or otherwise)? I was trying to do that but had no luck at all, so I am sure I am missing something.

PS:

What if there are also Excel files that do not have exactly the same format as the previous ones and I need to slice more rows than in the previous files? Can I still use the same loop, or do I need to handle them as separate tasks?

I can provide more information if this is not enough.

This is the code I have for now:

library(readxl)
library(purrr)
library(tidyverse)
library(janitor)
library(stringr)
library(plyr)

  #### get list of paths to our .xlsx files ---- make sure to have full.names = TRUE - it prepends the directory path -----
  match_xlsx <- as_tibble(list.files("C:/Users/User/Desktop/astn", pattern = "\\.xlsx$", full.names = TRUE, all.files = FALSE))

  #### read specific sheet from excel ----
  match_list <- lapply(match_xlsx$value, read_xlsx, sheet = "Overview")

  #### create individual dataframes ----
  for (i in seq(match_list))
    assign(paste0("match", i), match_list[[i]])


#### create individual dataframes ----
for (i in seq(match_list))
  assign(paste0("match", i), match_list[[i]])

ruzomberok <- match1 %>% 
  janitor::clean_names() %>% 
  select(fortuna_liga_2_kolo_as_tn_ruzomberok_27_7_2019:abdul_zubairu) %>% 
  slice(2:n()) %>% # remove row 1
  slice(1:16) %>%  # select only the row 1 to 16
  dplyr::rename(sport_item = fortuna_liga_2_kolo_as_tn_ruzomberok_27_7_2019,
         uom = x2)


#### Senica ----
senica <- match2 %>% 
  janitor::clean_names() %>% 
  select(fortuna_liga_senica_as_trencin_1_kolo_20_7_2019:abdul_zubairu) %>% 
  slice(4:n()) %>% #  remove row 1 until 3 
  slice(1:16) %>%  # select only the row 1 to 16
  dplyr::rename(sport_item = fortuna_liga_senica_as_trencin_1_kolo_20_7_2019,
                uom = x2)

The two images below show examples of the files. What I need is to remove the rows marked with red squares in the image and keep only the purple part. However, in the 2nd image there is a difference of 1 more row. Does this help more?

[Screenshot: first example file, rows to remove marked in red, rows to keep marked in purple]

[Screenshot: second example file, same layout but with one additional row to remove]

    Why are you calling `for (i in seq(match_list)) assign(paste0("match", i), match_list[[i]])` twice? Also, why not put them in a list instead of saving dozens of data.frames to the global environment? Also could you give an example of the different contents of the excel file? **NOT** by uploading them but by adding a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) in an edit to your question (focus on the **minimal** part ;)) That way you can help others to help you! – dario Mar 10 '20 at 10:41
  • Thank you Dario for the feedback! Yeah, that is exactly what I want to do: avoid storing them in the global environment and instead keep them in a list and run a loop over that. Sorry for the incomplete question, beginner here. I have edited the question and added two images of example files. Thank you a lot, I will be more thorough with the question next time. I am sure the scripting can be done much better. Definitely – Mayo Racek MBA Mar 10 '20 at 10:58

1 Answer

In your case I would write a cleaning function. Assuming you want to write the clean data to disk for later use (and to save time), here is a hypothetical way to do so (since I don't have your data to test it):

Get your list of data as you did:

  library(readxl) # for read_xlsx

  #### get list of paths to our .xlsx files ---- make sure to have full.names = TRUE - it prepends the directory path -----
  match_xlsx <- list.files("C:/Users/User/Desktop/astn", pattern = "\\.xlsx$", full.names = TRUE, all.files = FALSE)

  #### read specific sheet from excel ----
  match_list <- lapply(match_xlsx, read_xlsx, sheet = "Overview")
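
Optionally, you can name the list elements after their source files, so you always know which workbook each element came from. A small sketch (assuming the paths stored in match_xlsx above):

  # name each list element after its source file, dropping the directory and the .xlsx extension
  names(match_list) <- tools::file_path_sans_ext(basename(match_xlsx))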

Create your processing function:

library(janitor)
library(dplyr)
library(purrr)

clean_fun <- function(file){

        # use the column names to detect which match the file belongs to
        if (any(stringr::str_detect(names(file), "as_tn_ruzomberok"))) {

                res <- file %>%
                        janitor::clean_names() %>%
                        dplyr::select(fortuna_liga_2_kolo_as_tn_ruzomberok_27_7_2019:abdul_zubairu) %>%
                        dplyr::slice(2:(2 + 15)) %>% # get only rows 2 to 17
                        dplyr::rename(sport_item = fortuna_liga_2_kolo_as_tn_ruzomberok_27_7_2019,
                                      uom = x2)

                utils::write.csv(x = res, file = "/data/ruzomberok.csv")

        } else if (any(stringr::str_detect(names(file), "as_trencin"))) {

                res <- file %>%
                        janitor::clean_names() %>%
                        dplyr::select(fortuna_liga_senica_as_trencin_1_kolo_20_7_2019:abdul_zubairu) %>%
                        dplyr::slice(4:(4 + 15)) %>% # get only rows 4 to 19
                        dplyr::rename(sport_item = fortuna_liga_senica_as_trencin_1_kolo_20_7_2019,
                                      uom = x2)

                utils::write.csv(x = res, file = "/data/senica.csv")

        } else {
                print("New file type detected, no process applied")
        }

}

Apply it to your list of files:

purrr::map(.x = match_list, .f = ~clean_fun(.x))
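
Since clean_fun() is called here for its side effect (writing the csv files), you could also use purrr::walk() instead, which does the same thing but returns its input invisibly, so nothing gets printed to the console:

# same call, but without printing a list of results
purrr::walk(.x = match_list, .f = ~clean_fun(.x))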

Bonus: anticipate problems with tryCatch (defensive programming). That way your process keeps going even when a file throws an error, and you can customize the error/warning messages:

clean_fun <- function(file){

        if (any(stringr::str_detect(names(file), "as_tn_ruzomberok"))) {

                tryCatch(expr = {
                                res <- file %>%
                                        janitor::clean_names() %>%
                                        dplyr::select(fortuna_liga_2_kolo_as_tn_ruzomberok_27_7_2019:abdul_zubairu) %>%
                                        dplyr::slice(2:(2 + 15)) %>% # get only rows 2 to 17
                                        dplyr::rename(sport_item = fortuna_liga_2_kolo_as_tn_ruzomberok_27_7_2019,
                                                      uom = x2)

                                # write inside the tryCatch, so a failure here cannot stop the loop either
                                utils::write.csv(x = res, file = "/data/ruzomberok.csv")
                        },
                        error = function(cond){ print("error in ruzomberok") },
                        warning = function(cond){ print("warning in ruzomberok") },
                        finally = print("end")
                )

        } else if (any(stringr::str_detect(names(file), "as_trencin"))) {

                tryCatch(expr = {
                                res <- file %>%
                                        janitor::clean_names() %>%
                                        dplyr::select(fortuna_liga_senica_as_trencin_1_kolo_20_7_2019:abdul_zubairu) %>%
                                        dplyr::slice(4:(4 + 15)) %>% # get only rows 4 to 19
                                        dplyr::rename(sport_item = fortuna_liga_senica_as_trencin_1_kolo_20_7_2019,
                                                      uom = x2)

                                # write inside the tryCatch, so a failure here cannot stop the loop either
                                utils::write.csv(x = res, file = "/data/senica.csv")
                        },
                        error = function(cond){ print("error in senica") },
                        warning = function(cond){ print("warning in senica") },
                        finally = print("end")
                )

        } else {
                print("New file type detected, no process applied")
        }

}

Finally: if you just need the result in memory with no writing to disk, just return res at the end of each branch instead of calling write.csv().
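
For example, a minimal sketch assuming clean_fun() is modified to return res instead of writing csv files:

cleaned_list <- purrr::map(.x = match_list, .f = ~clean_fun(.x))

# keep only the elements that were cleaned into data frames, then stack them
# (this assumes the cleaned files have compatible columns)
all_matches <- dplyr::bind_rows(purrr::keep(cleaned_list, is.data.frame), .id = "source")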

cbo