
I'm learning how to write an R function that reads a directory full of files and reports the number of completely observed cases in each data file.

My function works with one case, but with multiple cases the loop only shows the last record.

complete <- function(directory, id = 1:332) {
    files_list <- list.files(path = directory, full.names = TRUE)
    dat <- data.frame()
    for (i in id) {
            dat <- rbind(dat, read.csv(files_list[i]))
            }
    nobs <- sum(complete.cases(dat))
    id <- i
    data.frame(id, nobs)
}

My expected result when running

    > complete("specdata", 1:6)

    ##   id nobs
    ## 1 1   932
    ## 2 2   711
    ## 3 3   475
    ## 4 4   338
    ## 5 5   586
    ## 6 6   463

Instead of returning a data.frame with one row per file, when id = 1:6 it returns:

    > complete("Specdata", 1:6)

      id nobs
    1  6 3562

I suspect the problem is that the function is replacing the values each time as it loops through. I've searched SO and elsewhere for help with "only showing last record" problems and cannot figure out a solution from those other answers.

Thank you in advance for any help. I'm brand new to R as I'm sure is abundantly obvious.

Zheyuan Li
john1607
  • @ZheyuanLi: ah yeah, missed the dat inside the rbind call. oops. – Marc B Aug 18 '16 at 17:04
  • Please provide your **expected output** (à la [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)). Your code (as @ZheyuanLi stated) is perfectly clear in that you explicitly return a single-row dataframe. If you want to return `dat` then that needs to be in the last line (or within `return(...)`). – r2evans Aug 18 '16 at 17:05
  • @r2evans - I've edited the original post with my expected output, apologies if that was unclear and thanks for your help. – john1607 Aug 18 '16 at 17:11
  • I find it somewhat interesting that you are reading in all of these files, counting the rows and complete cases, and then discarding all of the data that you just read in. Though this works with small files, it is fairly inefficient and you will be punished if/when you get lots of files and/or large files. Are you intending to (1) load in the data for use, ***and*** (2) provide some summary stats on them for info? – r2evans Aug 18 '16 at 17:12
  • @r2evans - Thanks for your reply. I'm intending to read a directory full of files and report the number of completely observed cases in each data file, so yes I believe that's step #1 and #2 in your question. I know that sapply is a faster way of doing this but am curious to know how, if at all, my rudimentary function could be edited to achieve the expected results, understanding that this would never work well with large data sets. – john1607 Aug 18 '16 at 17:33
  • @john1607 It's primarily because you're returning the data frame after the for loop hence the most recent value of i is set and the sum of all is there. You must rbind at every iteration of for and return the final data frame – amrrs Aug 18 '16 at 17:34
  • If you want your function to return both the imported data and some summary statistics on it, then you likely need to return something like `list(data = dat, stats = data.frame(id, nobs))` (with some corrections). An alternative solution would be to read in all of the data and then get some summary stats on that data. Do you need the summary stats *stored* or just *displayed on the console*? – r2evans Aug 18 '16 at 17:36
  • It's been answered, but for your SA (learning programming), I'd recommend something like `dat <- lapply(files_list, read.csv)` and `stats <- data.frame(id = seq_along(dat), nobs = sapply(dat, function(x) sum(complete.cases(x))))`. This is a slight mod to @Aaron's answer in that it stores the imported data in `dat` and allows you to work with it (which is a good practice anyway when dealing with multiple similarly-structured CSV files). – r2evans Aug 18 '16 at 17:51
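
The last two comments can be combined into one small sketch that returns both the imported data and the per-file summary. `read_with_stats` is a hypothetical name that does not appear in the post, and the sketch assumes every file in the directory is a CSV that `read.csv` can parse with its defaults.

```r
# Hypothetical helper: keep the imported data and the summary together.
read_with_stats <- function(directory, id = NULL) {
    files_list <- list.files(path = directory, full.names = TRUE)
    # Assumption: with no id given, use every file in the directory
    if (is.null(id)) id <- seq_along(files_list)
    # One data frame per file, kept so the data can be used afterwards
    dat <- lapply(files_list[id], read.csv)
    # complete.cases() is applied to each file separately
    stats <- data.frame(
        id = id,
        nobs = vapply(dat, function(x) sum(complete.cases(x)), integer(1))
    )
    list(data = dat, stats = stats)
}
```

Calling `read_with_stats("specdata", 1:6)$stats` would then give one row per file, while `$data` holds the six data frames for further work.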

2 Answers


Yeah, there's a lot going on in your code that isn't clear. Specifically, the rbind doesn't make sense given your description, nor does having id as parameter in your function. The more R idiomatic way of doing what you describe would be something like this, where the sapply loops over the file list, and the anonymous function reads it in and returns the count of complete cases.

files_list <- list.files(path = directory, full.names = TRUE)
sapply(files_list, function(fi) sum(complete.cases(read.csv(fi))))
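
To get the `id`/`nobs` table from the question out of this, one option is to wrap the `sapply` call in a function. This is only a sketch: `complete_sapply` is a made-up name, and it assumes `id` indexes into the sorted output of `list.files` and never exceeds the number of files.

```r
# Hypothetical wrapper around the sapply approach
complete_sapply <- function(directory, id = 1:332) {
    files_list <- list.files(path = directory, full.names = TRUE)
    # Count complete rows in each selected file independently
    nobs <- sapply(files_list[id], function(fi) sum(complete.cases(read.csv(fi))))
    # unname() drops the file paths that sapply attaches as names
    data.frame(id = id, nobs = unname(nobs))
}
```

Because each file is read inside the anonymous function, nothing is accumulated across files, so the counts are per file rather than a running total.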
Aaron left Stack Overflow

This should work:

complete <- function(directory, id = 1:332) {
        files_list <- list.files(path = directory, full.names = TRUE)
        tmp <- data.frame()   # accumulates one row per file
        for (i in id) {
                # read only the current file, so nobs is per file, not a running total
                dat <- read.csv(files_list[i])
                nobs <- sum(complete.cases(dat))
                # append this file's row; id is not overwritten inside the loop
                tmp <- rbind(tmp, data.frame(id = i, nobs = nobs))
        }
        tmp
}

Details:

It's primarily because you build the result data frame after the for loop has finished: by then id holds only the last value of i, and nobs is the count over all the files combined. You must rbind a row at every iteration of the loop and return the accumulated data frame.
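
If growing a data frame with `rbind` on every pass becomes slow, a common variant is to collect the rows in a list and bind them once at the end. The following is a sketch of that alternative rather than the code above; `complete_list` is a hypothetical name.

```r
# Hypothetical variant: preallocate a list, bind once after the loop
complete_list <- function(directory, id = 1:332) {
    files_list <- list.files(path = directory, full.names = TRUE)
    rows <- vector("list", length(id))   # one slot per requested file
    for (k in seq_along(id)) {
        dat <- read.csv(files_list[id[k]])
        rows[[k]] <- data.frame(id = id[k], nobs = sum(complete.cases(dat)))
    }
    do.call(rbind, rows)   # single rbind after the loop
}
```

The result has the same shape, one row per file, but the data frame is assembled only once.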

amrrs
  • that worked, thanks very much. I now see how you used two data.frames dat and tmp to resolve the loop writing over itself. – john1607 Aug 18 '16 at 17:43
  • @john1607 It's a typical beginner problem. No worries. Keep coding! – amrrs Aug 18 '16 at 17:44