How to efficiently handle column name mismatches when merging csv files?

Question

I'm trying to merge all csv files in a folder, and have slightly different versions of some extracts. For whatever reason, subtle changes to column names seem to have been made. What is the best way to take on the task of matching and merging the data from these columns?

I tried using the code below, after examining this answer: How to import multiple .csv files at once?

tbl <-
  list.files(path = "//folder/anotherfolder/",
             pattern = "*.csv", 
             full.names = T) %>% 
  map_df(~read_csv(., col_types = cols(.default = "c")))

The resulting tbl has way more columns than any of the original files, the culprit being the fact that I have columns with names as follows in the tbl after merging the csvs:

I have many of these merges ahead of me, and noted that the differences in column names don't always manifest in the same form (i.e. sometimes the problem is a "/" or some other character).

Can I expect to have to make these corrections manually, or is there a means of implementing some code to match columns? I'm thinking that fuzzy matching would be difficult given the fact that "Email Address Type-2" should only be merged with "Email.Address.Type.2", and not with "Email Address-2", etc.

Is there a means of altering the code to strip all space or punctuation characters from the column titles? This would solve my problem, because the only issues appear to be with characters other than letter and numbers.

My attempt thus far at removing special characters fails with the code below. My idea here was to try and remove characters in the column names strings within the map_df statement.

tbl <-
  list.files(path = "//folder/anotherfolder/",
             pattern = "*.csv", 
             full.names = T) %>% 
  map_df(~read_csv(., col_names = gsub("[^[:alnum:] ]", "", .),trim_ws=TRUE, 
  col_types = cols(.default = "c")))

One more attempt below, and I receive an error message

tbl <-
  list.files(path = "//folder/anotherfolder/",
             pattern = "*.csv", 
             full.names = T) %>% 
  map_df(~read_csv(., {colnames<-c(gsub("[^[:alnum:] ]", "", x=names(.)))}, 
col_names = colnames,
                    col_types = cols(.default = "c")))

The error message received says:

Error: $ operator is invalid for atomic vectors

Thank you in advance for any help you can provide!

Are you sure all your csv file have the same encoding? Try to read each file separetely and if there is still change in column name it's mean you should look at csv encoding.Also, if new column on tbl are empty you can just delet it — akhetos, May 23 '19 at 14:23
@akhetos thanks for the suggestion! I opened some of the files that seem to have read with different titles, and to my disappointment (?) I found that `colnames` appears the same for all sheets. — Pake, May 23 '19 at 20:27
Don't try to do the reading and the renaming in a single step. (1) Read the CSV files into a list, (2) fix the names, (3) combine them. Your `map_df` approaches are failing because `.` is the name *of the file*, so `gsub` on `.` or `names(.)` doesn't make any sense. It doesn't have column names you could edit until *after* you read it in. — Gregor Thomas, May 24 '19 at 17:36

How to efficiently handle column name mismatches when merging csv files?

0 Answers0