
I have a data frame with 5 columns. The function below creates and outputs small, 3-column datasets, each featuring the first two columns of my dataset ("country" and "year") plus one of the other 3 columns.

library(dplyr)

# My data (sample)
country <- c("GR", "GR", "GR", "AL", "AL", "AL", "GE", "GE", "GE")
year <- c(1990, 1991, 1992, 1994, 1997, 1996, 1991, 1992, 1993)
pop <- c("i", "i", "j", "j", "j", "i", "i", "i", "i")
category <- c("1", "2", "2", "2", "2", "2", "1", "1", "2")
age <- c(14, 13, 12, 18, 19, 17, 20, 21, 19)

sample_data <- data.frame(country, year, pop, category, age)
rm(country, year, pop, category, age)

# My function
new.datasets <- function(df, na.rm = TRUE, ...){
  i=1
  for (c in df){
    new_df <- select(df, country, year, i)
    assign(paste("df_new_", i), new_df, envir = globalenv())
    i=i+1
  }
}
new.datasets(sample_data)

Using my current function, the first two datasets produced only contain two columns: "country" and "year". The next three datasets contain "country", "year", and one each of the remaining columns ("pop", "category", or "age").

I would like to modify my function so that it DOES NOT produce the first two datasets, which only contain "country" and "year". Rather than creating these first two and then removing them, I'd like them to never be produced at all, if possible. Can you help me out?

(Unfortunately, I can't take any easy workarounds like using rm() to remove these datasets afterward, because this is a very simplified version of my actual problem/code, which requires me to remove these datasets as such.)

Thanks! -- New R User

Anna Jones
  • Why do you assign them to your global environment? Much better to put them in a list and modify the list as per your requirements – Sotos Sep 03 '19 at 14:48
  • I agree, I'm not actually doing this in my real function, I just tried to create some kind of basic function that shows my problem for the purposes of this question. My overall goal is to see if there's some way to modify the "for (c in df)" bit to exclude the first two columns "c". – Anna Jones Sep 03 '19 at 14:53
  • `for (c in df)` - is that a typo? – Shinobi_Atobe Sep 03 '19 at 14:55
  • 1
    `for (c in df[-(1:2)])`? But I agree, forget that `assign` exists. – Roland Sep 03 '19 at 14:55
  • Is there any way to use the @Roland solution but to apply something like `[-(1:2)]` to `c` instead of to `df`? Your solution works super well in this situation but when I try to apply it to a for loop with many more steps, modifying the dataframe used with the `[-(1:2)]` makes some of the steps not work. – Anna Jones Sep 03 '19 at 15:41
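For reference, Roland's suggestion can also be expressed by looping over column *positions* rather than subsetting the data frame, which is closer to what Anna asks for in her follow-up comment. A sketch in base R (no `dplyr::select`, and with `paste0` swapped in so the generated names have no embedded space — both are my tweaks, not part of the original question):

```r
# sample_data as constructed in the question (abbreviated)
sample_data <- data.frame(
  country  = c("GR", "GR", "AL"),
  year     = c(1990, 1991, 1994),
  pop      = c("i", "i", "j"),
  category = c("1", "2", "2"),
  age      = c(14, 13, 18)
)

new.datasets <- function(df) {
  # Loop over column positions 3..ncol(df), so the two degenerate
  # "country"/"year"-only datasets are never created at all.
  for (i in 3:ncol(df)) {
    assign(paste0("df_new_", i - 2), df[, c(1, 2, i)], envir = globalenv())
  }
}
new.datasets(sample_data)
# df_new_1 holds country/year/pop, df_new_2 category, df_new_3 age
```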

1 Answer


In R it is often recommended to avoid explicit loops and instead use the apply family of functions, which are similar to a "for loop". You also don't need to create a function for this. You can `lapply` through the column names:

new.datasets <- lapply(colnames(sample_data)[3:ncol(sample_data)], function(x){
    new_df <- cbind(sample_data[, 1:2], sample_data[, x])
    assign(paste0("df_new_", x), new_df, envir = globalenv())
})

As Sotos mentioned in his comment, you don't need to assign each data frame to a new variable. Instead, you can keep them in a list and perform your next steps on that list. So what I would recommend is:

new.datasets <- lapply(colnames(sample_data)[3:ncol(sample_data)], function(x){
    cbind(sample_data[, 1:2], sample_data[, x])
})

This will give you a single list, `new.datasets`, containing the three data frames.
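One possible refinement of the list approach (a sketch, not part of the original answer): naming the list elements after the extra column, and subsetting by column *name* so the third column keeps its proper name in each result. `setNames` is base R; `keep` is just an illustrative variable name.

```r
# sample_data as constructed in the question (abbreviated)
sample_data <- data.frame(
  country  = c("GR", "GR", "AL"),
  year     = c(1990, 1991, 1994),
  pop      = c("i", "i", "j"),
  category = c("1", "2", "2"),
  age      = c(14, 13, 18)
)

# the non-key columns, used both to build each piece and to name the list
keep <- colnames(sample_data)[3:ncol(sample_data)]
new.datasets <- setNames(
  lapply(keep, function(x) sample_data[, c("country", "year", x)]),
  keep
)

new.datasets$age  # a data frame with columns country, year, age
```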

NelsonGon
Sbamo
    `apply` being faster than `for` is an old myth. [See here for lots of detail](https://stackoverflow.com/a/2276001/903061). In fact, since R version 3.6 included just-in-time compilation, `for` loops are often **slightly** faster than `apply`. But the speed differences are tiny. What's nice about `apply` is that, for simple functions, it can be shorter for a human to read and write. – Gregor Thomas Sep 03 '19 at 15:31
  • Please see `fortunes::fortune(98)` on why `apply` is normally "discouraged." In fact, this, as I found out, applies to Python too. `apply`, I read, is only good for a margin of 1. There is almost certainly a vectorised alternative. – NelsonGon Sep 03 '19 at 16:10
  • @Gregor, I am not an expert in R, but I have some experience in working with large datasets and I saw significant differences between for and apply loops. Maybe I am biased, but even the post you linked does not clearly state that "apply being faster than for is an old myth". – Sbamo Sep 04 '19 at 08:02
  • @Sbamo, the first sentence of the accepted/most upvoted answer is *"The apply functions in R don't provide improved performance over other looping functions (e.g. for).*" That seems pretty clear to me. The "old myth" part is my language - as that question and answer are almost 10 years old, I think the "old" part is justified too. – Gregor Thomas Sep 04 '19 at 13:19
  • If you find examples lying around in blogs, gists, etc., where `apply` family functions are faster than `for` loops, it's usually a poorly written for loop. Often the examples will use a vectorized function in apply, and a non-vectorized function in the for loop. Or a problem of not pre-allocating a container for the results: `c()` or `rbind()` or similar used in each iteration of a loop to accumulate a result will kill performance. There are plenty of bad ways to write a slow for loop, but that doesn't mean for loops are inherently slow. – Gregor Thomas Sep 04 '19 at 13:25
  • I would advise a slightly modified `lapply` approach here. `lapply(3:ncol(sample_data), function(x) sample_data[, c(1, 2, x)])` is a little more concise, and more importantly keeps nice column names in the result. In your version, the name of the 3rd column in each result is `sample_data[,x]` – Gregor Thomas Sep 04 '19 at 17:42
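Gregor Thomas's point about pre-allocation can be seen directly. A minimal sketch (my own illustration, not from the thread; `n` is an arbitrary size): both loops compute the same result, but growing a vector with `c()` copies it on every iteration, while the pre-allocated loop writes in place.

```r
n <- 10000

# Growing the result with c() copies the whole vector each iteration
grown <- c()
for (i in 1:n) grown <- c(grown, i^2)

# Pre-allocating the container lets the loop fill it in place
prealloc <- numeric(n)
for (i in 1:n) prealloc[i] <- i^2

identical(grown, prealloc)  # TRUE -- same values, very different cost profile
```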