How to pop out non-matching column names in a series of csv files?

Question

I am reading multiple CSV files (20 files) and combining them into one data frame. I checked the column names by eye and they looked the same. However, for some reason I get the error below.

Error in match.names(clabs, names(xi)) : names do not match previous names

This is the code that I wrote

# Get all matching files from the directory (update the path as required).
# full.names = TRUE returns full paths, so read.csv() works from any working directory.
fnames <- list.files("C:/Users/code", pattern = "^La", full.names = TRUE)
csv <- lapply(fnames, read.csv)  # read all the files
source_DF <- do.call(rbind, lapply(csv, "[", 1:8))  # this is the line that throws the error

Please note that I use 1:8 because R sometimes reads a different number of columns. For example, all my CSV files have only 8 columns, but when they are read, some end up with 12 columns and some with as many as 50. To avoid that, I subset with 1:8. Any other approach to reading only the first 8 columns is also welcome.

How can I find out which of the CSV files has this naming issue, and which columns are causing it?

Any help resolving this error would be appreciated.

For future reference: it would be helpful to include a reproducible example, as well as expected output, in order for others to be able to answer your question accurately. — Mikko Marttila, Sep 05 '19 at 09:11
Sure. But in this case, other than my code, I wasn't sure how to share an example — The Great, Sep 05 '19 at 09:13
The core of the problem is that you have a list of data frames, some of which have different names: simply create some fake data like that. A good read: https://stackoverflow.com/a/5963610/4550695 — Mikko Marttila, Sep 05 '19 at 09:15

Mikko Marttila · Answer 1 · 2019-09-05T08:27:32.947


I would use a loop here, and check each set of names against the previous ones:

dfs <- list(
  data.frame(foo = 1, bar = 2),
  data.frame(foo = 2, bar = 2),
  data.frame(foo = 3, baz = 2),
  data.frame(foo = 4, bar = 2)
)

for (i in seq_len(length(dfs) - 1)) {
  different <- names(dfs[[i]]) != names(dfs[[i + 1]])
  if (any(different)) {
    message("Names of column(s) ", paste(which(different), collapse = ", "),
            " in data frame ", i + 1, " differ from the previous ones.")
  }
}
#> Names of column(s) 2 in data frame 3 differ from the previous ones.
#> Names of column(s) 2 in data frame 4 differ from the previous ones.

Or, if you just wanted to store the mismatches:

mismatches <- list(integer())
for (i in seq_len(length(dfs) - 1)) {
  different <- names(dfs[[i]]) != names(dfs[[i + 1]])
  mismatches[[i + 1]] <- which(different)
}

str(mismatches)
#> List of 4
#>  $ : int(0) 
#>  $ : int(0) 
#>  $ : int 2
#>  $ : int 2

Created on 2019-09-05 by the reprex package (v0.3.0.9000)
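To tie the column positions back to files, one option is to name the list by file and report the names themselves rather than their indices. The following is a minimal sketch under assumptions, not part of the original answer: the file names are made up for illustration (in practice they might come from `basename(fnames)`), and the names are compared as sets with `setdiff()`, so a pure reordering of columns is not flagged and data frames with different numbers of columns do not trigger recycling warnings.

```r
# Hypothetical file names used as list names (assumption for illustration).
dfs <- list(
  "La_01.csv" = data.frame(foo = 1, bar = 2),
  "La_02.csv" = data.frame(foo = 2, bar = 2),
  "La_03.csv" = data.frame(foo = 3, baz = 2),
  "La_04.csv" = data.frame(foo = 4, bar = 2)
)

# Compare each data frame's names with those of the one before it.
report <- list()
for (i in seq_along(dfs)[-1]) {
  prev <- names(dfs[[i - 1]])
  curr <- names(dfs[[i]])
  unexpected <- setdiff(curr, prev)  # present here, absent in the previous file
  missing    <- setdiff(prev, curr)  # present in the previous file, absent here
  if (length(unexpected) > 0 || length(missing) > 0) {
    report[[names(dfs)[i]]] <- list(unexpected = unexpected, missing = missing)
  }
}

str(report)
```

With this toy data, `report` has entries for `La_03.csv` and `La_04.csv` only, each listing the column name that appeared and the one that disappeared relative to the previous file.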

edited Sep 05 '19 at 08:27

answered Sep 05 '19 at 07:00


But isn't there any other way to do this? I have more than 20 dataframes and each df has more than 30 columns.. Example was 8 though – The Great Sep 05 '19 at 07:02
I don't see what the number of data frames or columns has to do with this? The solution is the same -- this explicitly answers your question of which files have differing names, and which columns are the ones that differ. If you wanted to compute on that information, you'd just also save it for each iteration. – Mikko Marttila Sep 05 '19 at 08:13
The difference between my answer and @ronakshah's is the "source of truth": mine takes the previous names as the "correct" ones, which matches what causes the `rbind()` error, while theirs takes the set of names that occur in all data frames: meaning that if there are any differing names anywhere, you'll get a "mismatch" for _all_ data frames. – Mikko Marttila Sep 05 '19 at 08:30
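The distinction drawn in the comment above can be seen on the toy data from the answer. This is a sketch only: `dfs` is redefined so the block runs on its own, and `vs_previous` and `vs_common` are illustrative names, not taken from either answer.

```r
dfs <- list(
  data.frame(foo = 1, bar = 2),
  data.frame(foo = 2, bar = 2),
  data.frame(foo = 3, baz = 2),
  data.frame(foo = 4, bar = 2)
)

# Reference = names of the previous data frame.
# The first data frame has nothing before it, so it is never flagged.
vs_previous <- c(FALSE, vapply(
  seq_along(dfs)[-1],
  function(i) !identical(names(dfs[[i]]), names(dfs[[i - 1]])),
  logical(1)
))

# Reference = names common to all data frames.
common <- Reduce(intersect, lapply(dfs, names))
vs_common <- vapply(
  dfs,
  function(x) length(setdiff(names(x), common)) > 0,
  logical(1)
)

vs_previous
vs_common
```

Here only `foo` is common to all four data frames, so `vs_common` flags every one of them, while `vs_previous` flags only the third and fourth.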

Ronak Shah · Accepted Answer · 2019-09-05T07:22:42.390

One way to check would be to subset the first 8 columns of each data frame, get the column names common to all the data frames, and then use setdiff to find any column names that fall outside that common set.

list_df <- lapply(csv, '[', 1:8)
cols <- Reduce(intersect, lapply(list_df, names))
lapply(list_df, function(x) setdiff(names(x), cols))

If all your column names are the same, you should get character(0) as the output for each data frame. If there is a mismatch, setdiff will return the names of the offending columns.

Another quick check: is length(cols) equal to 8?
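The approach can be tried end to end on throwaway files. The sketch below is an illustration under assumptions: it writes three small CSV files to a temporary directory as stand-ins for the real ones (the file and column names are invented), and labels each data frame with its file name so the output shows which file each result belongs to.

```r
# Write three small CSV files to a temporary directory (stand-ins for the real files).
demo_dir <- file.path(tempdir(), "la_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1, value = 2), file.path(demo_dir, "La_a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3, value = 4), file.path(demo_dir, "La_b.csv"), row.names = FALSE)
write.csv(data.frame(id = 5, Value = 6), file.path(demo_dir, "La_c.csv"), row.names = FALSE)

fnames <- list.files(demo_dir, pattern = "^La", full.names = TRUE)
list_df <- lapply(fnames, read.csv)
names(list_df) <- basename(fnames)  # label each data frame with its file name

# Names common to every file, and the names in each file outside that set.
cols <- Reduce(intersect, lapply(list_df, names))
extra <- lapply(list_df, function(x) setdiff(names(x), cols))

# Keep only the files that have at least one name outside the common set.
Filter(length, extra)
```

Because the reference is the intersection, all three files are listed here: `value` is reported for `La_a.csv` and `La_b.csv`, and `Value` for `La_c.csv`. The differing spelling is what identifies the odd file.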
