0

This one is a doozy. I've been trying to figure this out for a while, but I keep hitting the wall. So, I'm crowd sourcing this in the name of science.

A Brief Introduction

I have about 93 files with unique names in a directory. I read this files in to a list using r.

files.measurements <- as.character(list.files(path = "~/measurements/", full.names = TRUE))

So, what this is doing is just finding the names of all files in the directory. All these files are .csv. Saves me a lot of hassle.

I then read the names of the files. measurements.filenames <- gsub(".csv", "", basename(files.measurements))

The reason to read these files is because each file name represents the name of the measurement. The same item in the file may or may not exist in multiple files.

For Example There are 5 file names, viz., NameA, NameB, NameC, NameD, NameE. Each file has 8 column names: id, name, sex, dob, ..., measurement. (This name is same for each file name)

Of course, the id is unique, but may or may not exist int NameB, if it exists in NameA.

Need So, what I need to do is merge these 93 files to a single dataframe such that the dataframe contains id, name, sex, dob, ... and instead of measurement the name of the file - NameA, for example. The value should be the same for the same id, and if the id doesn't exist, rbind to the dataframe with additional column, else if the id exists, just add the measurement to the column with the new column name - NameB.

Can you please help? This is to gather the data for cardiovascular and HIV diseases for research.

EDIT DATA NameA

id  gender  dob status  date    measurement
1   F   5/24/1942   Rpt 1/12/2018   2.9
2   F   12/1/2017   Rpt 1/12/2018   0.622
3   M   11/15/1957  Rpt 1/11/2018   3.6
4   M   5/17/1947   Rpt 1/11/2018   3.5
5   F   7/17/1955   Rpt 1/11/2018   2.7

NameB

id  gender  dob status  date    measurement
1   F   5/24/1942   Rpt 1/12/2018   3.5
2   F   12/1/2017   Rpt 1/12/2018   2.5
8   M   11/15/1957  Rpt 1/11/2018   1.9
10  M   5/17/1947   Rpt 1/11/2018   0.8
11  F   7/17/1955   Rpt 1/11/2018   1.2

Explanation

So, as you see, all the columns in both tables are the same, but the last measurement is different. Please ignore gender, dob, status and date columns for now. Let's focus on id and measurement. As you can see, id 1 and 2 are in both tables NameA and NameB. if that's the case, then measurement from NameB should be added to the dataframe right next to the measurement from NameB with name (like NameB-measurement). And for all the id's that doesn't exist in NameA from NameBshould be added as new row withmeasurementfromNameAas blank butNameB-measurement` added.

I know it's convoluted, but that's how the researchers gave me the data. I need to clean this up somehow.

user1828605
  • 1,723
  • 1
  • 24
  • 63
  • Still, as in the previous post, a mix of `measurments` and `measurements`... – vaettchen Jan 13 '18 at 04:32
  • 1
    The operation you describe is a full outer join of multiple data frames. First rename the `measurement` column in each of the data frames, then use an answer from the dupe, like `merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)`. – Gregor Thomas Jan 13 '18 at 15:04
  • This did it. Thanks @Gregor. – user1828605 Jan 14 '18 at 06:50

1 Answers1

7

Try the following:

# collecting all the csv files in a given folder
files.measurments <- base::list.files(path = ".", include.dirs = FALSE)

# reading all csv files into a list of dataframes
files.combined <- purrr::map(files.measurements, read.csv)

# combining the individual dataframes into a single dataframe
finaldf <- plyr::rbindfill(files.combined)
Indrajeet Patil
  • 4,673
  • 2
  • 20
  • 51
  • 3
    Nice answer! Welcome to the site! The more modern version would be `dplyr::bind_rows()` instead of `plyr::rbindfill`. – Gregor Thomas Jan 13 '18 at 03:55
  • 2
    Good start on SO. – MKR Jan 13 '18 at 04:15
  • So close. This is an amazing answer. I already learned so much. Although, the last line didn't work for me. I got an error regarding `rbindfill` not exposed to `namespace: plyr`. But, I used @Gregor 's, and that fixed the issue. However, it's still not what I was looking for. the df from the next file is supposed to add to the data frame but if the id exists, then it should do cbind instead of rbind. I'm also looking into it and reading some materials – user1828605 Jan 13 '18 at 04:27
  • `cbind` doesn't make sense - you can't `cbind` a few rows onto a longer data frame - it wouldn't be rectangular. Can you show a small reproducible with sample input and desired output? Do you want to `merge` them? Are you just looking for [this r-faq on merging a list of data frames](https://stackoverflow.com/q/8091303/903061)? – Gregor Thomas Jan 13 '18 at 04:36
  • @Gregor, I updated the question with a sample data. – user1828605 Jan 13 '18 at 05:23