2

I have about 60 csv files that I want to merge. One challenge is inconsistent naming of columns, although essentially all files (should) have same data.

To deal with this issue, I would like to first test which files have certain column names (and which don't). I have a vector of strings, where each element reflects a column name, to be checked whether it exists in each csv file.

I'm trying to achieve a dataframe that will have:

  • columns: each column corresponds to a column name I'm looking to test whether exists
  • rows: each row corresponds to one csv file
  • values: in each cell, either 0 or 1 to mark whether the csv file has the column name

3 CSVs for example

library(tidyverse)

df_1 <-
  tribble(~ date, ~ name, ~ age, ~ gender,
        "2020-11-29", "sarah", 43, "female")

df_2 <-
  tribble(~ createdAt, ~ person, ~ age, ~ is_female,
        "2020-10-10", "bob", 25, 0)

df_3 <- 
  tribble(~ date, ~ name, ~ age_value, ~ gender,
        "2010-01-07", "wendy", 70, "female")
write_csv(df_1, "csv_1.csv")
write_csv(df_2, "csv_2.csv")
write_csv(df_3, "csv_3.csv")

Vector with names

Now let's say that I'm ignorant of the column names in the 3 CSVs created above. I believe that each CSV should have column names of either date, name, age, age_value, gender.

col_names_to_test <-
  c(
    "date",
    "name",
    "age",
    "age_value",
    "gender"
  )

A basis for the solution

This is just my direction, based on this nice solution that defines a reading & editing function, then uses list.files with purrr::map_df on the function defined.

read_plus <- 
  function(flnm) {
  read_csv(flnm, col_types = cols(.default = "c")) # %>%
  ## here some testing against the vector `col_names_to_test` ?
  }

tbl_with_sources <-
   list.files(path = //folder-with-csv-files,
              pattern = "*.csv", 
              full.names = TRUE,
              recursive = TRUE) %>% 
  map_df(~ read_plus(.))

This is just a general idea... I'm used to tidyverse approaches but I'll be happy with any solution.

Desired Output

  filename  date  name   age age_value gender
  <chr>    <dbl> <dbl> <dbl>     <dbl>  <dbl>
1 csv_1        1     1     1         0      1
2 csv_2        0     0     1         0      0
3 csv_3        1     0     0         1      1
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
Emman
  • 3,695
  • 2
  • 20
  • 44

2 Answers2

2

Define a function ok which given file name f returns a named 0/1 vector having the same length as col_names_to_test having a 1 if the corresponding component of col_names_to_test exists as a column name in that file and 0 otherwise. Then define a vector of file names files. Give it names without the extension and apply ok to it using map_dfr.

This is reasonably compact and only uses purrr.

library(purrr)

ok <- function(f) +setNames(col_names_to_test %in% names(read.csv(f)), col_names_to_test)
files <- Sys.glob("csv_*.csv")
shortnames <- sub("\\.csv$", "", basename(files))
files %>% setNames(shortnames) %>% map_dfr(ok, .id = "file")

giving:

# A tibble: 3 x 6
  file   date  name   age age_value gender
  <chr> <int> <int> <int>     <int>  <int>
1 csv_1     1     1     1         0      1
2 csv_2     0     0     1         0      0
3 csv_3     1     1     0         1      1

Update

Have completely revised.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • awesome. I'll just add that in my case, I assign `files` with `files <- list.files(path = //folder-with-csv-files, pattern = "*.csv", full.names = TRUE, recursive = TRUE)`. – Emman Nov 29 '20 at 18:03
  • also, would you mind explaining the role of `+` before `setNames` in the function `ok`? – Emman Nov 29 '20 at 18:05
  • 1
    That converts the logical vector to a 0/1 vector. Try `+ c(TRUE, FALSE)` – G. Grothendieck Nov 29 '20 at 18:15
  • Also have added shortnames line to the solution which handles full path names. – G. Grothendieck Nov 29 '20 at 18:21
  • I have another question that I wonder if requires a separate post or can be added here -- is it simple enough to tweak `ok` function (or create `ok_2` instead) so that it will return `1` for *partial matching* too? For example, if `col_names_to_test` has `age`, and `csv_3` has `age_value`, we will consider this a match because `age_value` contains `age` as a substring. This will be useful in cases where we don't know the exact wording for column names across CSVs, but can guesstimate that "age" columns are likely to have `age` at least as a substring. – Emman Nov 30 '20 at 08:16
  • Look at `?grepl` – G. Grothendieck Nov 30 '20 at 11:45
0

If you want only the index of columns matching col_names_to_test, you can use this approach:

library('data.table')
library('dplyr')

col_names_to_test = c('date', 'name', 'age', 'age_value', 'gender')

# define columns indexes matching the pattern
DefCols = function(input_path, patterns) {
  pattern = patterns %>%
    str_flatten('|')
  cols = input_path %>%
    fread(nrows = 1) %>%
    colnames() %>%
    str_which(pattern)
  return(cols)
}

# define the input directory
input_dir = ''

cols = input_dir %>%
  dir(pattern = '.*.csv$', full.names = TRUE, recursive = TRUE) %>%
  lapply(DefCols, col_names_to_test)

However, if you also want to load the dataframes with only the matching columns, you can extend it, as follows:

library('data.table')
library('dplyr')

col_names_to_test = c('date', 'name', 'age', 'age_value', 'gender')

# define columns indexes matching the pattern
LoadDF = function(input_path, patterns) {
  pattern = patterns %>%
    str_flatten('|')
  cols = input_path %>%
    fread(nrows = 1) %>%
    colnames() %>%
    str_which(pattern)
  df = input_path %>%
    fread(drop = -cols) %>%
    as.data.frame()
  return(df)
}

# define the input directory
input_dir = ''

dfs = 'input_dir' %>%
  dir(pattern = '.*.csv$', full.names = TRUE, recursive = TRUE) %>%
  lapply(LoadDF, col_names_to_test)

Note: when I'm loading the data to check the column names, I keep only the first row (nrows = 1), because I don't care about the values in each cell.

rodolfoksveiga
  • 1,181
  • 4
  • 17