Intersection of many CSV files based on a specific column

Question

I have many csv files in one directory. All the files have the same columns structure such as:

Class    val1  val2
A        6.5   2.3
F        9     6.8

I would like to find the name of the classes that exist in all the CSV files. In other words, I would like to get the intersection of all the files based on the Class column and regardless to the values in val1 and val2.

I applied the following:

temp = list.files(pattern="*.csv")
myfiles = lapply(temp, read.delim)
x <- Reduce(intersect,myfiles)

But this will retrieve similar classes along with similar values, which I don't want. I only want the name of classes that are in all files.

`myclasses = lapply(myfiles, '[[', "Class")` to extract the class column from each data frame, then `Reduce(intersect, myclasses)` — Gregor Thomas, Jun 25 '19 at 14:50
@Gregor `myclasses` returns null. Do you have any idea why is that? — Adam Amin, Jun 25 '19 at 15:56
@AdamAmin, that might mean there is no Class in ALL datasets! Please post the `dput` of your *myfiles* object: `dput(myfiles[1:5])` for [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5965451), so we can run on our end. — Parfait, Jun 25 '19 at 16:24
@AdamAmin if everything's as described in your question, my comment should work. A reproducible example would be key to debugging, I would suggest `dput(lapply(myfiles[1:5], head))`, which should make a copy/pasteable version of the first 6 rows of your first 5 data frames. — Gregor Thomas, Jun 25 '19 at 17:05

Parfait · Answer 1 · 2019-06-25T15:16:20.580

Simply run the chain merge without all=TRUE. If final data frame is empty then, NO Class matches ALL datasets. To avoid name conflicts with the merge, a step is needed to rename the columns dynamically. Below adds underscore suffixes by corresponding list item number.

temp <- list.files(pattern="*.csv")
myfiles <- lapply(temp, read.delim)

# RENAME val1 and val2 COLUMNS TO AVOID merge CONFLICT
myfiles <- lapply(seq_along(myfiles), function(i) 
              setNames(myfiles[[i]], c("Class", paste0("val1_", i), paste0("val2_", i)))) 

# CHAIN MERGE
final_df <- Reduce(function(x, y) merge(x, y, by="Class"), myfiles)

# UNIQUE Class NAMES
unique(final_df$Class)

Alternatively, stack all data frames and return the frequency that is equal to length of myfiles:

temp <- list.files(pattern="*.csv")
myfiles <- lapply(temp, read.delim)

# ROW BIND ALL DFs
stack_df <- do.call(rbind, myfiles)

# RETURN VECTOR OF CLASS VALUES AND COUNTS
freqs <- table(stack$Class)

# RETURN NAME WITH VALUE COUNT EQUAL LENGTH OF myfiles
names(freqs[freqs == length(myfiles)])

Rextester demo

Intersection of many CSV files based on a specific column

1 Answers1