How to keep only the data frames in a list that have more than a certain number of rows, and how to use functions on data frames in a list

Question

I need to take a directory full of files, read them, remove the NA values, and then keep only the files with more than a certain number of rows, so that correlations can be run on them. I have everything done up to the subsetting by row count, which I can't seem to manage.

corr <- function(directory, threshold = 0){

    # reads directory of files
    file_list <- list.files(path = getwd())

    # takes file_list and makes each file into a dataframe
    dflist <- lapply(file_list, read.csv)

    # returns list of dataframes, NA rows stripped
    nolist <- lapply(dflist, na.omit)

    # removes all with nrows < threshold
    abovelist <- c()

    for(file in nolist){
        if (nrow(file) > threshold) {
            append(abovelist, file)
        }
    }

}

As you can see, I've tried using a for loop, appending the dataframes with nrow > threshold. But whenever I run this step, abovelist is still NULL afterwards. I've also noticed the following interaction with square brackets:

 > nrow(nolist[1])
 NULL

 > nrow(nolist[[1]])
 117

It seems like some functions access the dataframes in nolist as one-unit lists, and others actually get at the dataframes themselves (which is what I want here). How do I make sure to do this, here and in general?

alex_jwb90 · Answer 1 · 2020-09-13T23:52:35.797

First of all, you're not assigning the appended list anywhere, which is why it just "disappears". Moreover, I assume you'd want to append the full file dataframe as a list item, so you'd have to wrap it in a list(). If you don't, you'll append all columns as items to your abovelist, which I assume is undesired behavior.
So, to fix your own code, this is what you'd want to do:
if (nrow(file) >= threshold) {abovelist <- append(abovelist, list(file))}
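
To see both issues in isolation, here is a minimal sketch on two toy data frames. The names `df_a`, `df_b`, `acc` and `cols` are made up for illustration and are not part of the original code.

```r
# Toy data frames standing in for the files read from disk (hypothetical data).
df_a <- data.frame(x = 1:3, y = 4:6)
df_b <- data.frame(x = 1:2, y = 3:4)

acc <- list()

# append() returns a new object; without assignment, acc is left unchanged.
append(acc, list(df_a))
length(acc)   # still 0

# Without list(), each column of the data frame becomes its own list element.
cols <- append(list(), df_a)
length(cols)  # 2, one element per column

# Assigning the result and wrapping in list() adds one data frame per step.
acc <- append(acc, list(df_a))
acc <- append(acc, list(df_b))
length(acc)   # 2 data frames
```

Because R functions generally return modified copies rather than changing their arguments in place, the result of append() has to be assigned back to the variable.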

Secondly, to your question on the difference of single and double [ brackets in R, take a look at this explanation.
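
As a short illustration of that difference, assuming a hypothetical list `dfs` of two small data frames: single brackets return a sub-list, while double brackets return the element itself.

```r
# A list of two small data frames (hypothetical data for illustration).
dfs <- list(data.frame(x = 1:3), data.frame(x = 1:5))

one_elem_list <- dfs[1]    # single bracket: a list of length 1
first_df      <- dfs[[1]]  # double bracket: the data frame itself

class(one_elem_list)  # "list"
class(first_df)       # "data.frame"

nrow(one_elem_list)   # NULL, because a plain list has no dim attribute
nrow(first_df)        # 3
```

A for loop over a list, as well as lapply() and sapply(), hands each element to your code the way `[[` would, so inside them you are already working with the data frames themselves.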

Finally, here's a very simple tidyverse-way of working through your files (without having to resort to for loops, intermediate lists and stepwise appending results at all).

library(dplyr)
library(purrr)
library(readr)

file_list <- list.files(path = getwd(), pattern = '\\.csv$')
threshold <- 2

filtered_file_list <- file_list %>%
  map(read_csv) %>%
  map(na.omit) %>%
  keep(~ nrow(.x) > threshold)

Ben Norris · Accepted Answer · 2020-09-14T10:30:51.423

Here is a simple way to do it:

abovelist <- nolist[sapply(nolist, function(x) nrow(x) > threshold)]
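
To make the one-liner concrete, here is a self-contained sketch on made-up data. It is a variation rather than the answer's exact code: it swaps sapply() for vapply(), which checks that every result is a single logical value. The toy `dflist` and the name `keep_flags` are assumptions for illustration.

```r
threshold <- 2

# Hypothetical stand-ins for the data frames read from the CSV files.
dflist <- list(
  data.frame(a = c(1, NA, 3), b = c(1, 2, 3)),      # 2 complete rows
  data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1))  # 4 complete rows
)

# Drop rows containing NA in each data frame.
nolist <- lapply(dflist, na.omit)

# One TRUE/FALSE per data frame; vapply() enforces a single logical each.
keep_flags <- vapply(nolist, function(x) nrow(x) > threshold, logical(1))

# Subsetting a list with a logical vector keeps the elements flagged TRUE.
abovelist <- nolist[keep_flags]
length(abovelist)  # 1
```

Note the single brackets in `nolist[keep_flags]`: the goal here is a shorter list of data frames, not one extracted data frame.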

edited Sep 14 '20 at 10:30

answered Sep 13 '20 at 23:54


You've skipped the step of omitting `NA`s in that snippet. So, should probably be on `nolist` instead of `dflist` – alex_jwb90 Sep 14 '20 at 00:02
