I have about 60 csv
files that I want to merge. One challenge is inconsistent naming of columns, although essentially all files (should) have same data.
To deal with this issue, I would like to first test which files have certain column names (and which don't). I have a vector of strings, where each element reflects a column name, to be checked whether it exists in each csv file.
I'm trying to achieve a dataframe that will have:
- columns: each column corresponds to a column name I'm looking to test whether exists
- rows: each row corresponds to one csv file
- values: in each cell, either
0
or1
to mark whether the csv file has the column name
3 CSVs for example
library(tidyverse)
df_1 <-
tribble(~ date, ~ name, ~ age, ~ gender,
"2020-11-29", "sarah", 43, "female")
df_2 <-
tribble(~ createdAt, ~ person, ~ age, ~ is_female,
"2020-10-10", "bob", 25, 0)
df_3 <-
tribble(~ date, ~ name, ~ age_value, ~ gender,
"2010-01-07", "wendy", 70, "female")
write_csv(df_1, "csv_1.csv")
write_csv(df_2, "csv_2.csv")
write_csv(df_3, "csv_3.csv")
Vector with names
Now let's say that I'm ignorant of the column names in the 3 CSVs created above. I believe that each CSV should have column names of either date
, name
, age
, age_value
, gender
.
col_names_to_test <-
c(
"date",
"name",
"age",
"age_value",
"gender"
)
A basis for the solution
This is just my direction, based on this nice solution that defines a reading & editing function, then uses list.files
with purrr::map_df
on the function defined.
read_plus <-
function(flnm) {
read_csv(flnm, col_types = cols(.default = "c")) # %>%
## here some testing against the vector `col_names_to_test` ?
}
tbl_with_sources <-
list.files(path = //folder-with-csv-files,
pattern = "*.csv",
full.names = TRUE,
recursive = TRUE) %>%
map_df(~ read_plus(.))
This is just a general idea... I'm used to tidyverse approaches but I'll be happy with any solution.
Desired Output
filename date name age age_value gender
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 csv_1 1 1 1 0 1
2 csv_2 0 0 1 0 0
3 csv_3 1 0 0 1 1