0

I have several data frame with format like below. I want to join/merge the data frames by species and extracting kmers from all data frames such that the out contains one column with species and multiple column with kmers, one form each of the files. The kmers column will then be give the name of the file from which it originated. df1

reads taxReads kmers species
232 2323 23234 Bacteria
555 12 4545 Virus

df2

reads taxReads kmers species
12 23 56 Bacteria
932 1213 12 Virus

out

species df1 df2
Bacteria 23234 56
Virus 4545 12

I have tried making a script using join_all, but it does not select the correct column (kmers):

file_list = list.files(pattern="tsv$")    

datalist = lapply(file_list, function(x){
  dat = read.csv(file=x, header=T, sep = "\t")
  names(dat)[2] = x
  return(dat)
})
joined <- join_all(dfs = datalist,by = "species",type ="full" )  
user2300940
  • 2,355
  • 1
  • 22
  • 35

1 Answers1

2

I'll assume that you've read in the files into a list of frames, named by the basename of the file (with the extension removed). Naming the list-of-frames as dfs, we have

dfs <- list(df1 = structure(list(reads = c(232L, 555L), taxReads = c(2323L, 12L), kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)), df2 = structure(list(reads = c(12L, 932L), taxReads = c(23L, 1213L), kmers = c(56L,12L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)))

dfs
# $df1
#   reads taxReads kmers  species
# 1   232     2323 23234 Bacteria
# 2   555       12  4545    Virus
# $df2
#   reads taxReads kmers  species
# 1    12       23    56 Bacteria
# 2   932     1213    12    Virus

From here, two steps:

  1. Rename the kmers columns to the filename (sans extension), and filter out unneeded columns,

    dfs <- Map(function(x, nm) { names(x)[names(x) == "kmers"] <- nm; x[, c("species", nm)]; }, dfs, names(dfs))
    dfs
    # $df1
    #    species   df1
    # 1 Bacteria 23234
    # 2    Virus  4545
    # $df2
    #    species df2
    # 1 Bacteria  56
    # 2    Virus  12
    
  2. Reduce with merge.

    Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)
    #    species   df1 df2
    # 1 Bacteria 23234  56
    # 2    Virus  4545  12
    

    This could be code-golfed here with just Reduce(merge, dfs), but I broke it out with a two-arg anon-func so that you can control some of merge's options.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • I got an error regarding zero length input, however, all my files should contain rows: Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) : zero-length inputs cannot be mixed with those of non-zero length – user2300940 Oct 22 '21 at 20:23
  • It's hard to troubleshoot something like that when I don't have your data. Start looking at each of the frames after you read them in, for instance `all(sapply(dfs, function(z) c("kmers","species") %in% names(z)))`. – r2evans Oct 22 '21 at 20:25
  • > all(sapply(datalist, function(z) c("kmers","species") %in% names(z))) [1] TRUE – user2300940 Oct 22 '21 at 20:26
  • My data frames are denoted as [[1]] and [[2]] in my datalist that reads all the data frames (dfs in your example) – user2300940 Oct 22 '21 at 20:28
  • I don't seem to have names of the data frames when I input them using my formula: datalist = lapply(file_list, function(x){ dat = read.csv(file=x, header=T, sep = "\t",check.names = F) }) – user2300940 Oct 22 '21 at 20:30
  • 1
    Perhaps `names(dfs) <- tools::file_path_sans_ext(basename(file_list))`? – r2evans Oct 22 '21 at 20:31