
Newbie here. I have 1000 compressed CSV files that I need to read and row bind. My problem is similar to this one, but with two differences:

a) File names are of different lengths and not sequential, in this form:

"members_[name of company]_[state code].csv"

I have two vectors, company and states with the required codes. So, I've built a vector of all the files I need with this code:

combinations <- expand.grid(company, states)
csvfiles <- paste0("members_" ,
            combinations$Var1, "_",
            combinations$Var2,".csv" )

so it has all the filenames I need (20 companies X 50 states). But I am lost as to how to cycle through all zip files. There are 10 other CSVs inside those zip files, but I only need the ones described above.

b) When decompressed, the files expand to a directory structure such as this:

/files/member_database/members/state/members_[name of company]_[state code].csv

but when I try to read the CSV from the zip file using

data <- read.csv(unz("members_GE_FL.zip", "members_GE_FL.csv"), header=F,  sep=":")

it returns the 'cannot open connection' message. Adding the path such as ./files/member_database/members/state/members_GE_FL.csv doesn't work either.

I'm also not sure whether passing the whole vector, as in read.csv(unz(csvfiles..., would make it read every name in csvfiles. I can't tell whether it fails because of the path problem above or because the command is wrong altogether.

Any help is appreciated -- insights, docs I should look at, etc. Again, I'm NOT trying to get people to do my work. As I type, I have 37 tabs open (many from SO), and have already spent 22 hours on this thing alone. I've learned from this post and others how to read a file within a ZIP, and from this post how to extract and import data. Still, I can't piece it all together. I only started with R a few months ago, and have no prior experience as a programmer.

  • What has all this got to do with [tag:batch-file]? Please read the [tag info](http://stackoverflow.com/tags/batch-file/info) **before** applying tags! Perhaps you meant [tag:batch-processing] (check the [tag info](http://stackoverflow.com/tags/batch-processing/info)!)? – aschipfl Mar 24 '17 at 13:45
  • Sorry, batch-file was suggested automatically. Yes, I would have meant batch-processing. – questionMarc Mar 24 '17 at 14:16
  • What does `unzip(members_GE_FL.zip, list=TRUE)` return? That should probably tell you what strings `unz()` is expecting. – MrFlick Mar 24 '17 at 14:55
  • It returns:```Error in open.connection(file, "rt") : cannot open the connection In addition: Warning message: In open.connection(file, "rt") : cannot locate file 'members_GE_FL.csv' in zip file 'members_GE_FL.zip'``` – questionMarc Mar 24 '17 at 15:02
  • Oops, i meant `unzip("members_GE_FL.zip", list=TRUE)`. That should list all the files in the archive. Did you still pass that to `unz()`? Or what did you run exactly? Because i'm not sure why `members_GE_FL.csv` would be in the error message. That seems odd. – MrFlick Mar 24 '17 at 15:18
  • That command shows indeed all 10 files within the zip. I've built a vector with the filenames of the ZIP archives and the filenames I need, but I am not sure how to cycle through those vectors. – questionMarc Mar 24 '17 at 15:22

1 Answer

I suspect all that was missing was the correct path to the file in the archive: neither "members_GE_FL.csv" nor "./files/member_database/members/state/members_GE_FL.csv" will work.
But "files/member_database/members/state/members_GE_FL.csv" (without the initial dot) should.

For completeness, here is a full example:

Let's create some dummy data, three files named out-1.csv, out-2.csv, out-3.csv and zip them in dummy-archive.zip:

if (!dir.exists("data")) dir.create("data")
if (!dir.exists("data/dummy-files")) dir.create("data/dummy-files")
for (i in 1:3)
  write.csv(data.frame(foo = 1:2, bar = 7:8), paste0("data/dummy-files/out-", i, ".csv"), row.names = FALSE)
zip("data/dummy-archive.zip", "data/dummy-files")

Now let's assume we're looking for three files, two of which are in the archive and one of which is not:

files_to_find <- c("out-2.csv", "out-3.csv", "out-4.csv")

List the files in the archive, and name them for the sake of clarity:

files_in_archive <- unzip("data/dummy-archive.zip", list = TRUE)$Name
files_in_archive <- setNames(files_in_archive, basename(files_in_archive))

#                  dummy-files                    out-2.csv 
#          "data/dummy-files/" "data/dummy-files/out-2.csv" 
#                    out-3.csv                    out-1.csv 
# "data/dummy-files/out-3.csv" "data/dummy-files/out-1.csv" 

Find the indices of files we're looking for in the archive, and read them like you intended to (with read.csv(unz(....))):

i <- basename(files_in_archive) %in% files_to_find
res <- lapply(files_in_archive[i], function(f) read.csv(unz("data/dummy-archive.zip", f)))

# $`out-2.csv`
#   foo bar
# 1   1   7
# 2   2   8
# 
# $`out-3.csv`
#   foo bar
# 1   1   7
# 2   2   8

Clean-up:

unlink(c("data/dummy-files/", "data/dummy-archive.zip"), recursive = TRUE)
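
The example above reads several files from a single archive. The question involves many archives, each holding one wanted CSV, so here is a minimal sketch of how the same idea might be looped and row-bound. It assumes each zip is named after the CSV it contains (as with members_GE_FL.zip in the question); the helper read_member_csv(), the dummy directory layout, and the dummy data are hypothetical, made up for illustration only.

```r
# Sketch only: read_member_csv(), the directory layout and the data are
# illustrative assumptions, not the asker's real files.

# Build two small dummy archives in a temporary directory
work  <- file.path(tempdir(), "zip-demo")
inner <- file.path("files", "members", "state")
dir.create(file.path(work, inner), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(work)

wanted <- c("members_GE_FL.csv", "members_GE_NY.csv")
for (f in wanted) {
  write.csv(data.frame(foo = 1:2, bar = 7:8), file.path(inner, f), row.names = FALSE)
  zip(sub("\\.csv$", ".zip", f), file.path(inner, f))
}

# Look up the full path of csv_name inside zipfile, then read it.
# Returns NULL when the archive or the file is missing, or the match is ambiguous.
read_member_csv <- function(zipfile, csv_name, ...) {
  if (!file.exists(zipfile)) return(NULL)
  listing <- unzip(zipfile, list = TRUE)$Name
  inner_path <- listing[basename(listing) == csv_name]
  if (length(inner_path) != 1) return(NULL)
  read.csv(unz(zipfile, inner_path), ...)
}

# Pair each archive with the CSV it should contain, read, drop misses, row-bind
zipfiles <- sub("\\.csv$", ".zip", wanted)
pieces   <- Map(read_member_csv, zipfiles, wanted)
pieces   <- Filter(Negate(is.null), pieces)
combined <- do.call(rbind, pieces)
rownames(combined) <- NULL

# Clean-up
setwd(old_wd)
unlink(work, recursive = TRUE)
```

Extra arguments are passed through to read.csv(), so the options from the question could be supplied as read_member_csv(zipfile, csv_name, header = FALSE, sep = ":"). Returning NULL for a missing file lets the loop skip company/state combinations that do not exist instead of stopping with an error.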
Aurèle
  • @a p o m: your suspicion is correct. The `list_files` command returned the path without "./" . Let me try this and get back to you. Thank you! – questionMarc Mar 27 '17 at 18:52