
I have 70 CSV files with the same columns that I want to run the same process on. Basically what I want is: import, clean, write the file out, and remove all variables, then repeat for the next one, because each file is about 0.5 GB.

How can I do that efficiently, without loading the packages again on every iteration?

library(tidyverse)
setwd("~/R/R-3.5.1/bin/i386")
df <- read.csv(file.choose(), header = TRUE, sep = ",")

# for each row where pc_no == "DELL", copy pc_no/cust_id into
# event_rep/loc_id of the preceding row, then drop the "DELL" rows
inds <- which(df$pc_no == "DELL")
df[inds - 1, c("event_rep", "loc_id")] <- df[inds, c("pc_no", "cust_id")]
df1 <- df[-inds, ]

write.csv(df1, "df1.csv")

rm(list=ls())

To do that, I think I should use the piece of code below, but I don't know where to use it exactly, i.e. how can I plug the code above into it?

list.files(pattern="^events.*?\\.csv", full.names=TRUE, recursive=FALSE)
lapply(files, function(x) {
files <- function(df1)

})
  • I suggest you make a list of the dataframes with `list.files`, and use `lapply` or `purrr::map` – Calum You Oct 22 '18 at 17:33
  • Fwiw, you might try just reading them all in. They may be 500 MB on disk but less in R. Btw, you might want `if (length(inds)){...}` since `df1[-which(FALSE),]` does not do what you expect. – Frank Oct 22 '18 at 17:33
  • @CalumYou Yes, I will add the code that I did above. But I don't know where to put the lapply function exactly. – kimi Oct 22 '18 at 17:36
  • @KadirŞenkaya: you can select which columns you want to read inside either `data.table::fread` or `readr::read_csv` (a quick sketch of both follows these comments). See this answer https://stackoverflow.com/a/48105838/ – Tung Oct 22 '18 at 17:37
  • @Tung I used fread() to import files once to a single dataframe. But will need to import and export one by one. – kimi Oct 22 '18 at 17:40
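
Following Tung's suggestion above, here is a minimal sketch of reading only the columns the cleaning step touches. The file name `events1.csv` and the character column types are assumptions; the column names come from the question.

library(data.table)
library(readr)

keep <- c("pc_no", "cust_id", "event_rep", "loc_id")

# data.table: skip unneeded columns while parsing
dt <- fread("events1.csv", select = keep)

# readr: cols_only() drops every column not listed
tb <- read_csv("events1.csv",
               col_types = cols_only(
                 pc_no     = col_character(),
                 cust_id   = col_character(),
                 event_rep = col_character(),
                 loc_id    = col_character()
               ))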

1 Answer


Per the comments above, you just need to loop through each file with `lapply` after assigning the file names to an object (here called `files`). Two small fixes are worth folding in: skip the row shift when `inds` is empty (otherwise `df[-integer(0), ]` drops every row, as Frank noted), and build the output name with `basename()`, since `full.names = TRUE` prefixes each path with `./`.

library(tidyverse)
setwd("~/R/R-3.5.1/bin/i386")

files <- list.files(pattern = "^events.*?\\.csv", full.names = TRUE, recursive = FALSE)

lapply(files, function(x) {

  df <- read.csv(x, header = TRUE, sep = ",")

  inds <- which(df$pc_no == "DELL")

  # only shift and drop rows when there is at least one "DELL" row;
  # df[-integer(0), ] would otherwise drop every row
  if (length(inds)) {
    df[inds - 1, c("event_rep", "loc_id")] <- df[inds, c("pc_no", "cust_id")]
    df <- df[-inds, ]
  }

  # basename() strips the "./" that full.names = TRUE adds to each path
  write.csv(df, paste0("cleaned_", basename(x)), row.names = FALSE)

})
  • After the loop, should I use rm(list=ls())? As mentioned, I have 70 CSV files. – kimi Oct 22 '18 at 17:51
  • 1
    Well the loop doesn't store 70 csv files, it reads in a file at a time and writes it back out to your directory as 'cleaned_filename.csv', then it reads in the next one. For each iteration, you're only storing one object 'dataset' that is just getting continuously updated. If you want that object cleared you could just do rm(dataset). Is your end goal to have all 70 files cleaned and read into R? Or just processed and outputed to your directory? – D.sen Oct 22 '18 at 17:54
  • Also, it seems we have two different objects, `dataset` and `df`... should those be aligned as the same object? – D.sen Oct 22 '18 at 18:00
  • My goal is to have all 70 files cleaned and exported. With the above code, we import and export one by one, right? I just want to remove the ones that have already been cleaned and exported, to avoid using too much memory. – kimi Oct 22 '18 at 18:01
  • Yes, then the above code should be fine for memory. And I think `dataset` should be changed to `df`, unless there are actually two different objects. – D.sen Oct 22 '18 at 18:03
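
For readers who, like the question, prefer an explicit cleanup step instead of relying on the anonymous function's scope, here is a sketch of a for-loop variant. It assumes `files` has already been built with `list.files()` as in the answer; `gc()` is optional and just returns freed memory sooner.

for (x in files) {
  df <- read.csv(x, header = TRUE, sep = ",")

  inds <- which(df$pc_no == "DELL")
  if (length(inds)) {
    df[inds - 1, c("event_rep", "loc_id")] <- df[inds, c("pc_no", "cust_id")]
    df <- df[-inds, ]
  }

  write.csv(df, paste0("cleaned_", basename(x)), row.names = FALSE)

  rm(df, inds)  # drop this file's objects before the next iteration
  gc()          # optional: trigger garbage collection right away
}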