join data frames for specific column

Question

I have several data frame with format like below. I want to join/merge the data frames by species and extracting kmers from all data frames such that the out contains one column with species and multiple column with kmers, one form each of the files. The kmers column will then be give the name of the file from which it originated. df1

reads taxReads kmers species
232 2323 23234 Bacteria
555 12 4545 Virus

df2

reads taxReads kmers species
12 23 56 Bacteria
932 1213 12 Virus

out

species df1 df2
Bacteria 23234 56
Virus 4545 12

I have tried making a script using join_all, but it does not select the correct column (kmers):

file_list = list.files(pattern="tsv$")    

datalist = lapply(file_list, function(x){
  dat = read.csv(file=x, header=T, sep = "\t")
  names(dat)[2] = x
  return(dat)
})
joined <- join_all(dfs = datalist,by = "species",type ="full" )

score 2 · Accepted Answer · answered Oct 22 '21 at 19:58

2

I'll assume that you've read in the files into a list of frames, named by the basename of the file (with the extension removed). Naming the list-of-frames as dfs, we have

dfs <- list(df1 = structure(list(reads = c(232L, 555L), taxReads = c(2323L, 12L), kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)), df2 = structure(list(reads = c(12L, 932L), taxReads = c(23L, 1213L), kmers = c(56L,12L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)))

dfs
# $df1
#   reads taxReads kmers  species
# 1   232     2323 23234 Bacteria
# 2   555       12  4545    Virus
# $df2
#   reads taxReads kmers  species
# 1    12       23    56 Bacteria
# 2   932     1213    12    Virus

From here, two steps:

Rename the kmers columns to the filename (sans extension), and filter out unneeded columns,

dfs <- Map(function(x, nm) { names(x)[names(x) == "kmers"] <- nm; x[, c("species", nm)]; }, dfs, names(dfs))
dfs
# $df1
#    species   df1
# 1 Bacteria 23234
# 2    Virus  4545
# $df2
#    species df2
# 1 Bacteria  56
# 2    Virus  12

Reduce with merge.
```
Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)
#    species   df1 df2
# 1 Bacteria 23234  56
# 2    Virus  4545  12
```
This could be code-golfed here with just Reduce(merge, dfs), but I broke it out with a two-arg anon-func so that you can control some of merge's options.

answered Oct 22 '21 at 19:58

r2evans

141,215
6
77
149

I got an error regarding zero length input, however, all my files should contain rows: Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) : zero-length inputs cannot be mixed with those of non-zero length – user2300940 Oct 22 '21 at 20:23
It's hard to troubleshoot something like that when I don't have your data. Start looking at each of the frames after you read them in, for instance `all(sapply(dfs, function(z) c("kmers","species") %in% names(z)))`. – r2evans Oct 22 '21 at 20:25
> all(sapply(datalist, function(z) c("kmers","species") %in% names(z))) [1] TRUE – user2300940 Oct 22 '21 at 20:26
My data frames are denoted as [[1]] and [[2]] in my datalist that reads all the data frames (dfs in your example) – user2300940 Oct 22 '21 at 20:28
I don't seem to have names of the data frames when I input them using my formula: datalist = lapply(file_list, function(x){ dat = read.csv(file=x, header=T, sep = "\t",check.names = F) }) – user2300940 Oct 22 '21 at 20:30
1

Perhaps `names(dfs) <- tools::file_path_sans_ext(basename(file_list))`? – r2evans Oct 22 '21 at 20:31

join data frames for specific column

1 Answers1