I have a small number of CSV files, each containing two columns of numeric values. I want to write a for loop that reads the files, sums the columns, and stores the total for each CSV in a numeric vector. This is the closest I've come:

allfiles <- list.files()
for (i in seq(allfiles)) {
     total <- numeric()
     total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
     total
}

My result is all NAs except for a value for the last file. I understand that I'm overwriting the result each time the for loop executes, and I think I need to do something with indexing.

Ferdi
DnstnRmsy
  • DnstnRmsy, if either of the answers meets your needs, it is customary on SO to "accept" the answer by selecting the checkmark to the left of it. Not only does it give some "thanks" and accolades (perhaps via an upvote) to the people who helped, it also effectively marks the question as "closed/resolved" for any similar questions that come after you. – r2evans Dec 22 '17 at 21:35
  • 10-4, r2Evans. Thanks for the help again! – DnstnRmsy Dec 22 '17 at 22:14
  • r2Evans, am I only allowed to "accept" one answer? – DnstnRmsy Dec 22 '17 at 22:52
  • You can "accept" only one, even if multiple answers meet your needs, that's how StackExchange works. You can "upvote" zero or more, meaning if you like all then you can upvote all. – r2evans Dec 22 '17 at 22:56

2 Answers

The first problem is that you are not pre-allocating the right length of (or properly appending to) total. Regardless, I recommend against that method.

There are several ways to do this, but the R-onic way (my term, based on pythonic ... I know, it doesn't flow well) is based on vectors/lists.

alldata <- sapply(allfiles, read.csv, simplify = FALSE)
totals <- sapply(alldata, function(a) sum(subset(a, select=Gift.1), subset(a, select=Gift.2)))

I often like to do that: keep the "raw/unaltered" data in one list and then repeatedly extract from it. For instance, if the files are huge and reading them takes a non-trivial amount of time, and you then realize you also need Gift.3, doing it your way means re-reading the entire dataset. With my method, you just update the second sapply to include the change and rerun it on the already-loaded data. (Most of my rationale is based on untrusted data, portions that are typically unused, or other factors that may not apply to you.)
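As a rough sketch of that workflow: the snippet below writes two throwaway CSV files to a temporary directory (the file names, the values, and the Gift.3 column are all made up for illustration), reads them once, and then computes two different sets of totals from the same in-memory list. It uses `$` to pull out columns, which here gives the same sums as the `subset()` calls above.

```r
# Hypothetical sample data: two small CSVs in a temporary directory.
tmp <- file.path(tempdir(), "gift_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Gift.1 = c(1, 2), Gift.2 = c(3, 4), Gift.3 = c(5, 6)),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(Gift.1 = c(10, 20), Gift.2 = c(30, 40), Gift.3 = c(50, 60)),
          file.path(tmp, "b.csv"), row.names = FALSE)

allfiles <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read every file exactly once and keep the raw data frames in a list.
alldata <- sapply(allfiles, read.csv, simplify = FALSE)

# First pass: totals over Gift.1 and Gift.2.
totals <- sapply(alldata, function(a) sum(a$Gift.1, a$Gift.2))

# Later change of mind: include Gift.3 too, without re-reading any file.
totals3 <- sapply(alldata, function(a) sum(a$Gift.1, a$Gift.2, a$Gift.3))
```

Both `totals` and `totals3` come back as vectors named by file path, so each sum stays attached to the file it came from.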

If you really wanted to reduce the code to a single step, something like:

totals <- sapply(allfiles, function(fn) {
  x <- read.csv(fn)
  sum(subset(x, select=Gift.1), subset(x, select=Gift.2))
})
r2evans
  • Thanks for the well thought out reply, r2evans. And double thanks for the practical recommendation; makes complete sense. For my personal understanding, what does setting the length do in regards to the for loop? As suggested above by Onyambu, I set the length, and instead of NAs and value of the last file, I got "0.000"s and the value of the last file. – DnstnRmsy Dec 22 '17 at 20:41
allfiles <- list.files()
total <- numeric()
for (i in seq_along(allfiles)) {
  x <- read.csv(allfiles[i])  # read each file once
  total[i] <- sum(subset(x, select=Gift.1), subset(x, select=Gift.2))
}
total

If possible, give total a known length beforehand, i.e. total <- numeric(length(allfiles)), and do so before the loop rather than inside it.
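A minimal sketch of the pre-allocated version, assuming two made-up CSV files written to a temporary directory so that the loop has something to read (the file names and values are illustrative only):

```r
# Hypothetical sample data so the loop has something to read.
tmp <- file.path(tempdir(), "gift_loop_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Gift.1 = c(1, 2), Gift.2 = c(3, 4)),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(Gift.1 = c(10, 20), Gift.2 = c(30, 40)),
          file.path(tmp, "b.csv"), row.names = FALSE)

allfiles <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Pre-allocate one slot per file, outside the loop.
total <- numeric(length(allfiles))

for (i in seq_along(allfiles)) {
  x <- read.csv(allfiles[i])            # read each file only once
  total[i] <- sum(x$Gift.1, x$Gift.2)   # fill slot i; earlier slots are kept
}
total
```

Because `total` is created once before the loop, each pass only replaces a single element and the earlier sums are preserved.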

Onyambu
  • Thanks for the quick reply, Onyambu. This is probably rudimentary, but what does setting the length of an object do in relation to the for loop? (And I set the length as you suggested, and my result was all "0.000" except for the last file, which gave me a value.) – DnstnRmsy Dec 22 '17 at 20:37
  • If this answers your question, you can go ahead and accept it in order to close down the question. You can also upvote. Thank you – Onyambu Dec 22 '17 at 20:38
  • DnstnRmsy, two things: (1) you are *overwriting* `total` on each pass through your loop, so you lose all previous information; at a minimum, the `total <- numeric()` should be outside of the `for` loop; (2) though you can grow vectors/lists dynamically, R copies the data each time you append to the vector. (This gets expensive in the long run.) To preempt that, if you know how big it will get (or at least an upper bound), pre-allocate the space, since replacing a single element of the vector does not incur a copy cost. – r2evans Dec 22 '17 at 20:48
  • Thanks. @r2evans has answered the questions you asked. – Onyambu Dec 22 '17 at 20:52
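The behaviour discussed in these comments (all NAs, or all zeros, except for the last file) can be reproduced without any files. This is only an illustration: `vals` is a made-up stand-in for the per-file sums, and the variable names are hypothetical.

```r
vals <- c(10, 100, 1000)  # stand-ins for the per-file sums

# Re-creating an empty vector inside the loop: only the last element survives,
# and assigning to position i pads the earlier positions with NA.
for (i in seq_along(vals)) {
  total_empty <- numeric()
  total_empty[i] <- vals[i]
}

# Re-creating a full-length vector inside the loop: it is reset to all zeros
# on every pass, so again only the last element survives.
for (i in seq_along(vals)) {
  total_zero <- numeric(length(vals))
  total_zero[i] <- vals[i]
}

# Creating the vector once, before the loop, keeps every element.
total_ok <- numeric(length(vals))
for (i in seq_along(vals)) {
  total_ok[i] <- vals[i]
}
```

After running this, `total_empty` is `NA NA 1000`, `total_zero` is `0 0 1000`, and `total_ok` is `10 100 1000`, which matches the two symptoms reported above and shows the fix.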