
I have a folder with a couple hundred .csv files that I'd like to import and merge. Each file contains two columns of data, but the files have different numbers of rows, and the rows have different names. The columns don't have names (for this example, let's call them x and y).

How can I merge these all together? I'd like to just stick the y columns together, side by side, rather than matching on any criteria, so that the first rows are aligned across all data sets and missing rows are filled with NA. Column x should go away, but the rows should stay in the order they were originally in from the csv.

Here's an example:

Data frame 112_c1.csv:

x       y
1  -0.5604
3  -0.2301
4   1.5587
5   0.0705
6   0.1292


Data frame 112_c2.csv:

x         y
2   -0.83476
3   -0.82764
8    1.32225
9    0.36363
13   0.9373
42  -1.5567
50  -0.12237
51  -0.4837

Data frame 113_c1.csv:

x       y
5   1.5783
6   0.7736
9   0.28273
15  1.44565
23  0.999878
29 -0.223756

Desired result

112_c1.y   112_c2.y   113_c1.y
-0.5604    -0.83476    1.5783
-0.2301    -0.82764    0.7736
 1.5587     1.32225    0.28273
 0.0705     0.36363    1.44565
 0.1292     0.9373     0.999878
 NA        -1.5567    -0.223756
 NA        -0.12237    NA
 NA        -0.4837     NA

I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:

df <- do.call(plyr::rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))

Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.

  • 1
    Sticking them together side-by-side defies an underlying premise of a `data.frame`: that each row is an observation, each value on that row is fundamentally tied together. In a survey, each row is a respondent. In a data log, each row is a point-in-time. While not insurmountable, another issue is that since they have different rows, you will have some columns with more rows than the others, which is not how frames work; the way around this is to lengthen the shorter ones, filling with `NA`. – r2evans Mar 08 '20 at 00:19
  • What is it that you ultimately need to do with this data? There might be more appropriate methods or structures to use in place of a `data.frame`. – r2evans Mar 08 '20 at 00:23
  • @r2evans I understand that it's weird. In this case, each column (i.e. each csv file) is a participant. The rows have different names because they are observations at different times, and the analysis selects different values based on frame rates from a video. I want to combine them for further analysis. The first thing I'll do is create a standardized average score from each column and append that to another dataset. The next thing I'll do is analyse the time-series points for each participant. – socialresearcher Mar 08 '20 at 00:26
  • Would it be acceptable to pack them in named lists instead of `data.frame`? As r2evans already mentioned a `data.frame` makes only sense when the data within it follows *data-frame* logic... But maybe you have good reasons for this special structure and there are no alternatives... – dario Mar 08 '20 at 00:28
  • 1
    Perhaps: (1) add a column to each indicating the participant ID; (2) read them in and combine them by rows, so you'll have three columns (`id`, `x`, `y`). From there, analyzing them can be done by-id (`dplyr::group_by` or `data.table`'s `x[,,by=.(id)]` semantics, as well as some base-R methods) and/or other ways. (Perhaps add "row number" as a column as well, in case you need to impose that order.) This is very much a data-science-y kind of issue :-) – r2evans Mar 08 '20 at 00:31
  • 1
    @dario I'm open to other suggestions. I usually work with data frames in R and will do further analyses, so I just went to that out of habit. But, as I mentioned in my other comment, each column is a participant's time series data, with row 1 being time 0 and the last row being the last time point. – socialresearcher Mar 08 '20 at 00:31
  • So you want to do time series analysis stuff with it? – dario Mar 08 '20 at 00:35
  • @r2evans Hmmm... well, i've never worked with data like this before. The file names have the participant ID. Is there an easy way to number the rows based on the row numbers in the csv file? Or to import those row numbers? – socialresearcher Mar 08 '20 at 00:38
  • Just saw your older comment *The first thing I'll do is create a standardized average score from each columns and append that to another dataset. The next thing I'll do is analyse the time-series points for each participant.* After reading that my suggestion is to do all that stuff right when importing the data and save **that** information in a regular *observations* in rows and *variables* in columns `data.frame` – dario Mar 08 '20 at 00:39
  • @dario Yes, but first I need some average scores from each column. – socialresearcher Mar 08 '20 at 00:39
  • @dario Ok, I guess I'm not really sure how to do that. I've never worked with data like this before; usually I only have a few data frames and they are organized better. Do you have an example of what that might look like? i.e. how to create a single standardized score as I'm importing? – socialresearcher Mar 08 '20 at 00:42
  • I'll try my best ;) – dario Mar 08 '20 at 00:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/209229/discussion-between-socialresearcher-and-dario). – socialresearcher Mar 08 '20 at 01:01
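The id-column approach r2evans sketches in the comments could look something like this. This is a minimal sketch, not the asker's actual code: the column names `id` and `time` are my choices, and it assumes the files have no header row, as the question states. The demo writes two tiny example files to a temporary folder so it runs standalone; in practice you'd point `list.files()` at your own folder.

```r
# Self-contained demo: write two tiny example files, then read them all
# into ONE long data frame tagged with a participant id per row.
dir <- tempdir()
write.table(data.frame(x = c(1, 3), y = c(-0.5604, -0.2301)),
            file.path(dir, "112_c1.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(x = c(2, 3, 8), y = c(-0.83476, -0.82764, 1.32225)),
            file.path(dir, "112_c2.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
long <- do.call(rbind, lapply(files, function(f) {
  d <- read.csv(f, header = FALSE, col.names = c("x", "y"))
  data.frame(id   = sub("\\.csv$", "", basename(f)),  # e.g. "112_c1"
             time = seq_len(nrow(d)),                 # preserve original row order
             y    = d$y)
}))

# Per-participant analysis is then one grouped call, e.g.:
aggregate(y ~ id, data = long, FUN = mean)
```

With the data in this long shape, `dplyr::group_by(long, id)` or `data.table`'s `by = id` give the same per-participant grouping.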

3 Answers


Suggested solution using a function to calculate summary statistics right when loading data:

# read one csv and return one row of summary stats per column
readCalc <- function(file_path) {
  df <- read.csv(file_path)
  data.frame(file     = file_path,
             column   = names(df),
             averages = apply(df, 2, mean),
             N        = apply(df, 2, length),
             min      = apply(df, 2, min),
             stringsAsFactors = FALSE, row.names = NULL)
}

df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))

If we need the first or last value we could use `dplyr::first` or `dplyr::last`. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need that.
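For instance, `readCalc` could be extended to also record each column's first and last value. This is a hypothetical extension (the `readCalc2` name is mine, and it assumes dplyr is installed); the demo file at the end is just for illustration:

```r
# Hypothetical extension of readCalc: also record each column's first
# and last value via dplyr::first / dplyr::last.
readCalc2 <- function(file_path) {
  df <- read.csv(file_path)
  data.frame(file   = file_path,
             column = names(df),
             mean   = sapply(df, mean),
             first  = sapply(df, dplyr::first),  # first value of each column
             last   = sapply(df, dplyr::last),   # last value of each column
             stringsAsFactors = FALSE, row.names = NULL)
}

# demo on a tiny temporary file
f <- file.path(tempdir(), "112_c1.csv")
write.csv(data.frame(x = c(1, 3, 4), y = c(-0.5604, -0.2301, 1.5587)),
          f, row.names = FALSE)
readCalc2(f)
```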

dario

Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.

library(tidyverse)

# store csv file paths
data_path <- "data"   # path to the data
files <- dir(data_path, pattern = "\\.csv$") # get file names
files <- file.path(data_path, files)         # prepend the folder

# read csv files and combine into a single dataframe 
compiled_data <- tibble::tibble(File = files) %>% # create a tibble with one row per file
  tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% # pull the bare file name out of the path
  mutate(Data = lapply(File, readr::read_csv, col_names = FALSE)) %>% # read each file into a list-column of data
  tidyr::unnest(Data) %>% # unnest the list-column into rows (columns X1, X2)
  select(-File) %>% # remove the full file path
  na.omit() %>% # remove the NA rows
  spread(name, X2) %>% # reshape from long to wide, one column per file
  select(-X1) %>% # remove the x column
  mutate_all(~ .x[order(is.na(.x))]) # collapse the NA rows to the bottom of each column
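The final `mutate_all` step is the non-obvious part: `order(is.na(x))` sorts `FALSE` (non-NA) before `TRUE` (NA), and because `order()` is a stable sort, the non-NA values keep their original relative order while the NAs sink to the bottom. A tiny standalone illustration:

```r
# order() sorts FALSE (non-NA) before TRUE (NA); the sort is stable,
# so non-NA values keep their original relative order.
x <- c(1, NA, 2, NA, 3)
x[order(is.na(x))]
#> [1]  1  2  3 NA NA
```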
Caitlin

Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?

x <- c(1:6)
y <- c(1:3)
z <- c(1:10)

# cbind that pads shorter inputs with NA rows so all columns line up
cbind.fill <- function(...) {
  nm <- list(...)
  nm <- lapply(nm, as.matrix)   # coerce each input to a matrix
  n <- max(sapply(nm, nrow))    # length of the longest input
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))  # pad with NA rows
}

df <- as.data.frame(cbind.fill(x,y,z))

colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")

   112_c1.y 112_c2.y 113_c1.y
1         1        1        1
2         2        2        2
3         3        3        3
4         4       NA        4
5         5       NA        5
6         6       NA        6
7        NA       NA        7
8        NA       NA        8
9        NA       NA        9
10       NA       NA       10
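To connect this helper back to the original question, one could read each file, keep only its y column, and pass the list to `cbind.fill`. This is a sketch under my own assumptions (csv files with x/y columns and a header row); `cbind.fill` is repeated here, and two tiny files are written to a temp folder, so the demo runs standalone:

```r
# cbind.fill as in the answer above, repeated so this demo is standalone
cbind.fill <- function(...) {
  nm <- lapply(list(...), as.matrix)
  n  <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}

# two tiny example files standing in for the real folder
dir <- tempdir()
write.csv(data.frame(x = c(1, 3), y = c(-0.5604, -0.2301)),
          file.path(dir, "112_c1.csv"), row.names = FALSE)
write.csv(data.frame(x = c(2, 3, 8), y = c(-0.83476, -0.82764, 1.32225)),
          file.path(dir, "112_c2.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
ys <- lapply(files, function(f) read.csv(f)$y)  # keep only each y column
df <- as.data.frame(do.call(cbind.fill, ys))    # pad shorter columns with NA
colnames(df) <- paste0(sub("\\.csv$", "", basename(files)), ".y")
df
```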
ecology