This one is a doozy. I've been trying to figure this out for a while, but I keep hitting the wall. So, I'm crowd sourcing this in the name of science.
A Brief Introduction
I have about 93 files with unique names in a directory. I read this files in to a list using r.
files.measurements <- as.character(list.files(path = "~/measurements/", full.names = TRUE))
So, what this is doing is just finding the names of all files in the directory. All these files are .csv
. Saves me a lot of hassle.
I then read the names of the files.
measurements.filenames <- gsub(".csv", "", basename(files.measurements))
The reason to read these files is because each file name represents the name of the measurement. The same item in the file may or may not exist in multiple files.
For Example
There are 5 file names, viz., NameA
, NameB
, NameC
, NameD
, NameE
. Each file has 8 column names: id, name, sex, dob, ..., measurement. (This name is same for each file name)
Of course, the id is unique, but may or may not exist int NameB
, if it exists in NameA
.
Need
So, what I need to do is merge these 93 files to a single dataframe such that the dataframe contains id
, name
, sex
, dob
, ... and instead of measurement
the name of the file - NameA
, for example. The value should be the same for the same id, and if the id doesn't exist, rbind
to the dataframe with additional column, else if the id exists, just add the measurement
to the column with the new column name - NameB
.
Can you please help? This is to gather the data for cardiovascular and HIV diseases for research.
EDIT
DATA
NameA
id gender dob status date measurement
1 F 5/24/1942 Rpt 1/12/2018 2.9
2 F 12/1/2017 Rpt 1/12/2018 0.622
3 M 11/15/1957 Rpt 1/11/2018 3.6
4 M 5/17/1947 Rpt 1/11/2018 3.5
5 F 7/17/1955 Rpt 1/11/2018 2.7
NameB
id gender dob status date measurement
1 F 5/24/1942 Rpt 1/12/2018 3.5
2 F 12/1/2017 Rpt 1/12/2018 2.5
8 M 11/15/1957 Rpt 1/11/2018 1.9
10 M 5/17/1947 Rpt 1/11/2018 0.8
11 F 7/17/1955 Rpt 1/11/2018 1.2
Explanation
So, as you see, all the columns in both tables are the same, but the last measurement is different. Please ignore gender
, dob
, status
and date
columns for now. Let's focus on id
and measurement
. As you can see, id 1
and 2
are in both tables NameA
and NameB
. if that's the case, then measurement from NameB
should be added to the dataframe right next to the measurement
from NameB
with name (like NameB-measurement
). And for all the id's that doesn't exist in NameA
from NameBshould be added as new row with
measurementfrom
NameAas blank but
NameB-measurement` added.
I know it's convoluted, but that's how the researchers gave me the data. I need to clean this up somehow.