I have a list containing 286 items.

length(l)
[1] 286

What I would like to do now is create a separate .csv file for each item in the list, each containing the matching subset of the data:

split_csv <- function(df, list) {

   setwd("dir")

    for (i in list)

    #print(i)
    df_temp <- df[df$club == i, ]
    name <- paste0("club_", i, ".csv")
    write.csv(df_temp, name)

 setwd("original_dir")

 }

The problem is that I get only one .csv file! It's strange, because when I uncomment the #print(i) line, it prints all the items in the list (so I assume the loop is working).

Any thoughts?

Tyler Rinker
Frank Gerritsen
    I think you're just missing brackets around your for loop - `for (i in list) {` with the closing bracket before you `setwd("original_dir")` – GregF Dec 28 '15 at 17:58

1 Answer

The main problem with your code is that you don't use curly braces to group multiple statements inside the loop. From R's point of view, only the first statement (df_temp <- df[df$club == i, ]) is evaluated inside the loop. The rest of the function, including the code that actually writes to a file, runs only after the loop has finished. Because a for loop in R does not create its own scope, the variables assigned inside it (i and df_temp) remain available in the function's environment after the loop ends, so no error is raised. Effectively, your file-writing code runs once, with the values from the last iteration.
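To see the effect in isolation, here is a minimal sketch. The names `results` and `current` are made up for this illustration and do not come from your code:

```r
# Without braces, only the statement directly after for(...) is the loop body.
results <- character(0)

for (i in c("A", "B", "C"))
  current <- i                    # loop body: runs three times

results <- c(results, current)    # NOT in the loop: runs once, after it ends

print(results)
# [1] "C"
```

Only the value from the last iteration is collected, which is the same thing that happens to your write.csv call.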

The fix for this issue is simple:

set.seed(123)

l <- data.frame(club=sample(LETTERS[1:10], 286, TRUE),
                visitors=as.integer(runif(286, 100, 1000))
                )

split_csv <- function(df, list) {
  setwd("dir")
  for (i in list) {
    #print(i)
    df_temp <- df[df$club == i, ]
    name <- paste0("club_", i, ".csv")
    write.csv(df_temp, name)
  }
  setwd("..")
}
split_csv(l, LETTERS[1:3])
list.files("dir/")
# [1] "club_A.csv" "club_B.csv" "club_C.csv"

But let's use your question as an opportunity to see how this code can be improved.

The by function splits a data frame into subsets that share the same value of a given factor (or factors, but let's keep it simple) and applies a function to each subset. That function can be anything, including a custom or anonymous one.

split_csv2 <- function(df, list) {
  by(df, df$club, function(x) {
      # `x` is subset of df with one value in `club`
      # assign current "club" value for further reference
      i <- x[1, "club"]
      # don't do anything else if current club is not in list of allowed clubs
      if (! i %in% list) return()

      name <- paste0("dir/club_", i, ".csv")
      write.csv(x, name)
    }
  )
}
invisible(split_csv2(l, LETTERS[2:4])) # discard output - it's not helpful anyway
list.files("dir/")
# [1] "club_B.csv" "club_C.csv" "club_D.csv"

There are two main advantages of this approach:

  1. We no longer compare the entire column of the data frame against a value in each loop iteration, which can make this faster. With a data frame of this size you won't notice any difference, but one day you might want to perform a similar operation on a much bigger data set.
  2. Loops are generally frowned upon in the R community[citation needed]. Thanks to the apply family of functions, they are rarely required. Familiarizing yourself with these functions is one of the most important steps on the journey to mastering R.
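As a rough illustration of the "group once, then apply" idea, the sketch below uses split() and vapply() on a small made-up data frame. The data and the names `visits`, `by_club` and `totals` are assumptions for this example only:

```r
# Small made-up data set, only for illustration.
visits <- data.frame(club = c("A", "B", "A", "C", "B"),
                     visitors = c(120L, 340L, 560L, 780L, 910L))

# split() groups the rows by club in a single pass and returns
# a named list with one data frame per club.
by_club <- split(visits, visits$club)

# vapply() then runs a function on every group; integer(1) states
# that each call must return exactly one integer.
totals <- vapply(by_club, function(x) sum(x$visitors), integer(1))

print(totals)
# A = 680, B = 1250, C = 780
```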

Additionally:

  • Inside your function, the second argument shadows the list function, which is used to create lists, one of the basic R data structures. In more complex cases this could lead to unexpected behavior and hard-to-debug issues. It is better to avoid it altogether.
  • This is highly subjective, but many developers would tell you that changing the working directory inside a function is not good practice.
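Putting these two remarks together, one possible variant could look like the sketch below. It is not the only way to do this. The function name `write_club_csvs` and the arguments `clubs` and `out_dir` are hypothetical names chosen for this example, and it uses split() with Map() rather than by():

```r
# Hypothetical helper: the second argument no longer shadows list(),
# and the working directory is never changed.
write_club_csvs <- function(df, clubs, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  # Keep only the requested clubs, then group the remaining rows by club.
  df <- df[df$club %in% clubs, , drop = FALSE]
  pieces <- split(df, df$club, drop = TRUE)
  if (length(pieces) == 0) return(invisible(character(0)))

  # Build full output paths instead of calling setwd().
  paths <- file.path(out_dir, paste0("club_", names(pieces), ".csv"))
  Map(function(piece, path) write.csv(piece, path, row.names = FALSE),
      pieces, paths)

  invisible(paths)
}

set.seed(123)
visits <- data.frame(club = sample(LETTERS[1:10], 286, TRUE),
                     visitors = as.integer(runif(286, 100, 1000)))

# tempdir() is used here only so the example does not touch your project.
written <- write_club_csvs(visits, LETTERS[1:3], file.path(tempdir(), "clubs"))
basename(written)
```

The function returns the paths it wrote, so the caller can check them without listing the directory.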
Mirek Długosz