
I have a folder with a couple hundred .csv files that I'd like to import and merge. Each file contains two columns of data, but the files have different numbers of rows, and the rows have different names. The columns don't have names (for this example, let's call them x and y).

How can I merge these all together? I'd like to just stick the y columns together, side by side, rather than matching on any criteria, so that the first rows are aligned across all data sets and missing rows are filled with NA. Column x should go away, but the rows should stay in the order they were originally in from the csv.

Here's an example:

Data frame 112_c1.csv:

x       y
1  -0.5604
3  -0.2301
4   1.5587
5   0.0705
6   0.1292


Data frame 112_c2.csv:

x         y
2   -0.83476
3   -0.82764
8    1.32225
9    0.36363
13   0.9373
42  -1.5567
50  -0.12237
51  -0.4837

Data frame 113_c1.csv:

x       y
5   1.5783
6   0.7736
9   0.28273
15  1.44565
23  0.999878
29 -0.223756

Desired result

112_c1.y   112_c2.y   113_c1.y
-0.5604    -0.83476    1.5783
-0.2301    -0.82764    0.7736
 1.5587     1.32225    0.28273
 0.0705     0.36363    1.44565
 0.1292     0.9373     0.999878
 NA        -1.5567    -0.223756
 NA        -0.12237    NA
 NA        -0.4837     NA

I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:

df <- do.call(plyr::rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))

Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.

  • 1
    Sticking them together side-by-side defies an underlying premise of a `data.frame`: that each row is an observation, each value on that row is fundamentally tied together. In a survey, each row is a respondent. In a data log, each row is a point-in-time. While not insurmountable, another issue is that since they have different rows, you will have some columns with more rows than the others, which is not how frames work; the way around this is to lengthen the shorter ones, filling with `NA`. – r2evans Mar 08 '20 at 00:19
  • What is it that you ultimately need to do with this data? There might be more appropriate methods or structures to use in place of a `data.frame`. – r2evans Mar 08 '20 at 00:23
  • @r2evans I understand that it's weird. In this case, each column (i.e. each csv file) is a participant. The rows have different names because they are observations at different times, and the analysis selects different values based on frame rates from a video. I want to combine them for further analysis. The first thing I'll do is create a standardized average score from each column and append that to another dataset. The next thing I'll do is analyse the time-series points for each participant. – socialresearcher Mar 08 '20 at 00:26
  • Would it be acceptable to pack them in named lists instead of `data.frame`? As r2evans already mentioned a `data.frame` makes only sense when the data within it follows *data-frame* logic... But maybe you have good reasons for this special structure and there are no alternatives... – dario Mar 08 '20 at 00:28
  • 1
    Perhaps: (1) add a column to each indicating the participant ID; (2) read them in and combine them by rows, so you'll have three columns (`id`, `x`, `y`). From there, analyzing them can be done by-id (`dplyr::group_by` or `data.table`'s `x[,,by=.(id)]` semantics, as well as some base-R methods) and/or other ways. (Perhaps add "row number" as a column as well, in case you need to impose that order.) This is very much a data-science-y kind of issue :-) – r2evans Mar 08 '20 at 00:31
  • 1
    @dario I'm open to other suggestions. I usually work with data frames in R and will do further analyses, so I just went to that out of habit. But, as I mentioned in my other comment, each column is a participant's time series data, with row 1 being time 0 and the last row being the last time point. – socialresearcher Mar 08 '20 at 00:31
  • So you want to do time series analysis stuff with it? – dario Mar 08 '20 at 00:35
  • @r2evans Hmmm... well, i've never worked with data like this before. The file names have the participant ID. Is there an easy way to number the rows based on the row numbers in the csv file? Or to import those row numbers? – socialresearcher Mar 08 '20 at 00:38
  • Just saw your older comment *The first thing I'll do is create a standardized average score from each columns and append that to another dataset. The next thing I'll do is analyse the time-series points for each participant.* After reading that my suggestion is to do all that stuff right when importing the data and save **that** information in a regular *observations* in rows and *variables* in columns `data.frame` – dario Mar 08 '20 at 00:39
  • @dario Yes, but first I need some average scores from each column. – socialresearcher Mar 08 '20 at 00:39
  • @dario Ok, I guess I'm not really sure how to do that. I've never worked with data like this before; usually I only have a few data frames and they are organized better. Do you have an example of what that might look like? i.e. how to create a single standardized score as I'm importing? – socialresearcher Mar 08 '20 at 00:42
  • I'll try my best ;) – dario Mar 08 '20 at 00:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/209229/discussion-between-socialresearcher-and-dario). – socialresearcher Mar 08 '20 at 01:01
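The id-column approach r2evans sketches in the comments could look something like this. This is a minimal sketch, not the asker's actual code: the column names `id` and `time` are my choices, and it assumes the files have no header row, as the question states. The demo writes two tiny example files to a temporary folder so it runs standalone; in practice you'd point `list.files()` at your own folder.

```r
# Self-contained demo: write two tiny example files, then read them all
# into ONE long data frame tagged with a participant id per row.
dir <- tempdir()
write.table(data.frame(x = c(1, 3), y = c(-0.5604, -0.2301)),
            file.path(dir, "112_c1.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(x = c(2, 3, 8), y = c(-0.83476, -0.82764, 1.32225)),
            file.path(dir, "112_c2.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
long <- do.call(rbind, lapply(files, function(f) {
  d <- read.csv(f, header = FALSE, col.names = c("x", "y"))
  data.frame(id   = sub("\\.csv$", "", basename(f)),  # e.g. "112_c1"
             time = seq_len(nrow(d)),                 # preserve original row order
             y    = d$y)
}))

# Per-participant analysis is then one grouped call, e.g.:
aggregate(y ~ id, data = long, FUN = mean)
```

With the data in this long shape, `dplyr::group_by(long, id)` or `data.table`'s `by = id` give the same per-participant grouping.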

3 Answers


Suggested solution using a function to calculate summary statistics right when loading data:

# read one csv and return one row of summary stats per column
readCalc <- function(file_path) {
  df <- read.csv(file_path)
  data.frame(file     = file_path,
             column   = names(df),
             averages = apply(df, 2, mean),
             N        = apply(df, 2, length),
             min      = apply(df, 2, min),
             stringsAsFactors = FALSE, row.names = NULL)
}

df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))

If we need the first or last value we could use `dplyr::first` or `dplyr::last`. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need that.
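For instance, `readCalc` could be extended to also record each column's first and last value. This is a hypothetical extension (the `readCalc2` name is mine, and it assumes dplyr is installed); the demo file at the end is just for illustration:

```r
# Hypothetical extension of readCalc: also record each column's first
# and last value via dplyr::first / dplyr::last.
readCalc2 <- function(file_path) {
  df <- read.csv(file_path)
  data.frame(file   = file_path,
             column = names(df),
             mean   = sapply(df, mean),
             first  = sapply(df, dplyr::first),  # first value of each column
             last   = sapply(df, dplyr::last),   # last value of each column
             stringsAsFactors = FALSE, row.names = NULL)
}

# demo on a tiny temporary file
f <- file.path(tempdir(), "112_c1.csv")
write.csv(data.frame(x = c(1, 3, 4), y = c(-0.5604, -0.2301, 1.5587)),
          f, row.names = FALSE)
readCalc2(f)
```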

dario

Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.

library(tidyverse)

# store csv file paths
data_path <- "data"   # path to the data
files <- dir(data_path, pattern = "\\.csv$") # get file names
files <- file.path(data_path, files)         # prepend the folder

# read csv files and combine into a single dataframe 
compiled_data <- tibble::tibble(File = files) %>% # create a tibble with one row per file
  tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% # pull the bare file name out of the path
  mutate(Data = lapply(File, readr::read_csv, col_names = FALSE)) %>% # read each file into a list-column of data
  tidyr::unnest(Data) %>% # unnest the list-column into rows (columns X1, X2)
  select(-File) %>% # remove the full file path
  na.omit() %>% # remove the NA rows
  spread(name, X2) %>% # reshape from long to wide, one column per file
  select(-X1) %>% # remove the x column
  mutate_all(~ .x[order(is.na(.x))]) # collapse the NA rows to the bottom of each column
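The final `mutate_all` step is the non-obvious part: `order(is.na(x))` sorts `FALSE` (non-NA) before `TRUE` (NA), and because `order()` is a stable sort, the non-NA values keep their original relative order while the NAs sink to the bottom. A tiny standalone illustration:

```r
# order() sorts FALSE (non-NA) before TRUE (NA); the sort is stable,
# so non-NA values keep their original relative order.
x <- c(1, NA, 2, NA, 3)
x[order(is.na(x))]
#> [1]  1  2  3 NA NA
```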
Caitlin

Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?

x <- c(1:6)
y <- c(1:3)
z <- c(1:10)

# cbind that pads shorter inputs with NA rows so all columns line up
cbind.fill <- function(...) {
  nm <- list(...)
  nm <- lapply(nm, as.matrix)   # coerce each input to a matrix
  n <- max(sapply(nm, nrow))    # length of the longest input
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))  # pad with NA rows
}

df <- as.data.frame(cbind.fill(x,y,z))

colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")

   112_c1.y 112_c2.y 113_c1.y
1         1        1        1
2         2        2        2
3         3        3        3
4         4       NA        4
5         5       NA        5
6         6       NA        6
7        NA       NA        7
8        NA       NA        8
9        NA       NA        9
10       NA       NA       10
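To connect this helper back to the original question, one could read each file, keep only its y column, and pass the list to `cbind.fill`. This is a sketch under my own assumptions (csv files with x/y columns and a header row); `cbind.fill` is repeated here, and two tiny files are written to a temp folder, so the demo runs standalone:

```r
# cbind.fill as in the answer above, repeated so this demo is standalone
cbind.fill <- function(...) {
  nm <- lapply(list(...), as.matrix)
  n  <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}

# two tiny example files standing in for the real folder
dir <- tempdir()
write.csv(data.frame(x = c(1, 3), y = c(-0.5604, -0.2301)),
          file.path(dir, "112_c1.csv"), row.names = FALSE)
write.csv(data.frame(x = c(2, 3, 8), y = c(-0.83476, -0.82764, 1.32225)),
          file.path(dir, "112_c2.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
ys <- lapply(files, function(f) read.csv(f)$y)  # keep only each y column
df <- as.data.frame(do.call(cbind.fill, ys))    # pad shorter columns with NA
colnames(df) <- paste0(sub("\\.csv$", "", basename(files)), ".y")
df
```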
ecology