
I have some CSV files in an S3 bucket that I would like to merge into one dataframe. For the purposes of this exercise, let's use these five files, stored in a character vector called script_results:

[1] "05162021-test-america-distributed-a-script_results-03-16-2021-404806813.csv"
[2] "07262021-test-america-distributed-a-script_results-04-26-2021-534691276.csv"
[3] "09202021-test-arizona-script_results-08-19-2020--08-31-2020-354731384.csv"
[4] "09222021-test-wisconsin-script_results-08-19-2020--08-20-2020-068472211.csv"
[5] "01062021-av-america-script_results-01-5-2021-038680844.csv"

The first step is extracting the data from each CSV, using the following (with the first .csv listed above as the example):

extract_data <- aws.s3::s3read_using(read.csv, bucket = "test-america", object = "05162021-test-america-distributed-a-script_results-03-16-2021-404806813.csv")

Each of these unpacked CSVs would then be bound together into one dataframe using the rbind.fill function from the plyr package.

So the end result would be all five data sets from the CSVs listed in script_results merged into a single dataframe via rbind.fill.

I should also note that I will be receiving a new file on a daily basis, so the process needs to continuously pick up new files as they arrive in my S3 bucket.

Is this possible to do?

EDIT: Using @r2evans's comments:

L <- lapply(script_results, function(fn) aws.s3::s3read_using(read.csv, bucket="bucket-name", object=fn))

This works. But the issue arises when I try to combine the list into a dataframe using:

t <- do.call(rbind.data.frame, L)

I'm given the following error:

Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = FALSE,  : 
  numbers of columns of arguments do not match
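For context, this error can be reproduced locally without any S3 involvement: base `rbind` (which `rbind.data.frame` dispatches to) insists that every frame have exactly the same columns, while `plyr::rbind.fill` pads missing columns with `NA`. A minimal sketch, using two made-up frames whose columns only partially overlap:

```r
# Two frames mimicking CSVs whose columns only partially overlap
a <- data.frame(id = 1:2, score = c(10, 20))
b <- data.frame(id = 3:4, region = c("AZ", "WI"))

# Base rbind errors because the column sets differ
res <- try(do.call(rbind.data.frame, list(a, b)), silent = TRUE)

# plyr::rbind.fill instead fills the missing columns with NA
combined <- plyr::rbind.fill(a, b)
```

Here `res` is a try-error, while `combined` has all three columns (`id`, `score`, `region`) across four rows, with `NA` where a file lacked a column.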
wizkids121
  • `plyr::rbind.fill`, `dplyr::bind_rows`, `data.table::rbindlist`, or base `do.call(rbind, ...)` should all generally work so long as all frames have the same column names in the same order. Some of those functions are better at dealing with column differences. All of them start with the premise of reading all frames into a list-of-frames and then combining them into a single frame at the end; note that iteratively growing a frame scales poorly, and with larger data it will take a lot longer than you may want. See https://stackoverflow.com/a/24376207/3358227. – r2evans Oct 13 '21 at 19:41
  • @r2evans Thanks. The reason I'm using `rbind.fill` is because the column names will not be consistent file to file. Some will have the same column names, others won't. So what would this actually look like in code form? – wizkids121 Oct 13 '21 at 19:44
  • There's not enough information here for me to say for certain. Some samples: `dplyr::bind_rows(list(data.frame(a=1,b=2),data.frame(a=11,d=12)))` and `data.table::rbindlist(list(data.frame(a=1,b=2),data.frame(a=11,d=12)),use.names=TRUE,fill=TRUE)` both produce similar output. – r2evans Oct 13 '21 at 19:45
  • @r2evans What additional information would you need? I want to make sure I'm putting together a question someone can actually answer – wizkids121 Oct 13 '21 at 19:48
  • If you look at the link I included in the first comment, among other things it suggested `lapply(my_files, read.csv)`. Translating that to your question, I think it'd be `L <- lapply(script_results, function(fn) aws.s3::s3read_using(read.csv, bucket=..., object=fn))` to get your list of frames; from there apply one of the options for combining them all. – r2evans Oct 13 '21 at 20:02
  • @r2evans This almost worked. But can you check my edit to my question to see if you can solve the error issue. – wizkids121 Oct 13 '21 at 20:19
  • My second comment demonstrates some ways to work with this. I don't use `plyr`, so I cannot comment on it. That is, use `dplyr::bind_rows(L)` or `data.table::rbindlist(L,use.names=TRUE,fill=TRUE)`. – r2evans Oct 13 '21 at 21:46

1 Answer


I'm not familiar with the functions from aws.s3, but here's the general idea in base and maybe you can translate it to work for your case.

library(plyr)

# your old dataframe of bucket data from "yesterday" with a field that holds the
# name of each csv that a given record is associated with
df_old <- data.frame(otherdata = c(1,2,3), csv = "from_a_previous_bucket_dump.csv")

# get the names of files in the bucket "today"
scriptresults <- list.files()

# only read csvs that aren't already in df_old$csv
fs_new <- setdiff(scriptresults, df_old$csv)

# Read a CSV and add its name as a new field
read_csv_filename <- function(filename){
  ret <- read.csv(filename)
  ret$csv <- filename
  return(ret)
}
fsl <- lapply(fs_new, read_csv_filename)

df_new <- plyr::rbind.fill(c(fsl, list(df_old)))

read_csv_filename() is from When importing CSV into R how to generate column with name of the CSV?
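To see the incremental pattern work end to end without a bucket, the same logic can be exercised against CSVs written to a temporary directory (the file names and columns here are made up for the demo; in your case `read.csv` would be wrapped in `aws.s3::s3read_using` and the file list would come from the bucket):

```r
library(plyr)

# Simulate a "bucket" with two CSVs whose columns only partially overlap
dir <- file.path(tempdir(), "bucket")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:2, b = 3:4), file.path(dir, "day1.csv"), row.names = FALSE)
write.csv(data.frame(a = 5, c = 6),     file.path(dir, "day2.csv"), row.names = FALSE)

# Pretend day1.csv was already merged on a previous run
df_old <- data.frame(a = 1:2, b = 3:4, csv = "day1.csv")

scriptresults <- list.files(dir, pattern = "\\.csv$")
fs_new <- setdiff(scriptresults, df_old$csv)   # only day2.csv is new

# Read a CSV and record which file each row came from
read_csv_filename <- function(filename) {
  ret <- read.csv(file.path(dir, filename))
  ret$csv <- filename
  ret
}
fsl <- lapply(fs_new, read_csv_filename)

# Combine new frames with the working dataset; rbind.fill pads with NA
df_new <- plyr::rbind.fill(c(fsl, list(df_old)))
```

After this runs, `df_new` has three rows and columns `a`, `b`, `c`, `csv`, and re-running it would read nothing new, since `setdiff` filters out every file already recorded in `df_new$csv`.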

Skaqqs
  • Thank you but I'm rather confused by this. Can you explain what's going on in `df_old`? – wizkids121 Oct 13 '21 at 20:20
  • Sure. Based on my understanding, one of the stipulations of your question was to make sure you weren't loading the same csv from the bucket that was already incorporated into your dataset (i.e., from a previous time you ran this script). `df_old` is my example of your working dataset (data already compiled from the bucket). Let me know if that doesn't make sense. – Skaqqs Oct 13 '21 at 20:23
  • So what does `otherdata = c(1,2,3)` mean? What do the 1,2,3 represent? – wizkids121 Oct 13 '21 at 20:25
  • I am attempting (poorly) to illustrate that your working dataset is a data.frame, where each row contains any number of columns and rows, and each row is a record coming from a csv. The name of the origin file of each row of data is stored in `df$csv`. This is important because it is the basis for avoiding duplicated data in `df_new` when you run this script on future occasions to combine new data from the bucket with your working dataset. – Skaqqs Oct 13 '21 at 20:30
  • Perhaps I should ask a question to clarify my assumption. Is there a chance that the bucket will contain both new data (not in your working df) and old data (already in your working df) but not ALL of your old data? – Skaqqs Oct 13 '21 at 20:30
  • Hi @wizkids121, were you able to figure this out? – Skaqqs Oct 22 '21 at 19:24