I have about 70 data frames in a list, each with a column named SNP. I want to find the SNPs that are common to all of the data frames. This is the code I used:

setwd("~")
library(data.table)
library(purrr)  # provides map(), reduce() and %>%

files <- list.files()
dflist <- list()
for(i in 1:length(files)){
 dflist[[i]] <- fread(files[i])
}

map(dflist, ~.$SNP) %>% 
reduce(intersect) 

However, this returns an empty result:

character(0)

Here is a sample of the data, from dput(lapply(dflist[1:3], head, 4)):

list(structure(list(`10:103391446` = c("10:115562764:TTTC_",
"10:115562765:TTC_T", "10:14188623_CCTGA_C", "10:15988900:G_GGT"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
)), structure(list(SNP = c("rs34394051",
"rs11121177", "rs10799615", "rs590013")), row.names = c(NA, -4L
), class = c("data.table", "data.frame")),
    structure(list(SNP = c("rs34394051", "rs11121177", "rs10799615",
    "rs590013")), row.names = c(NA, -4L), class = c("data.table",
    "data.frame")))

Can you help please?

r2evans
rkl
  • 3
    It means that no values are common to the columns of all of the list elements. – akrun Feb 23 '21 at 18:58
  • 1
    For safety (and a little code-golf): `dflist <- lapply(setNames(nm=files), fread)`. BTW, you may also get `NULL` if `SNP` is not a column in *all* of them; if it is missing in one, it will kill the rest of your output. – r2evans Feb 23 '21 at 19:02
  • I definitely have SNPs in common between the data frames. I just assumed the code is not working. Some SNPs (i.e. genetic variants) do not have a name and have the following format 1:234564. Can that interfere? – rkl Feb 23 '21 at 19:02
  • rkl, please [edit] your question and add some known context, perhaps the output from `dput(lapply(dflist[1:3], head, 4))` (assuming that that sampling has matches). – r2evans Feb 23 '21 at 19:02
  • See my previous comment, and note that the column name in `dflist[[1]]` is `10:103391446`, not `SNP`. (It suggests that one of your files is not structured the same, having no column name(s) and very different-looking contents.) – r2evans Feb 23 '21 at 19:06
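
The check suggested in the comments above can be sketched in base R. This is only an illustration: the file names `a.txt`, `b.txt` and `c.txt` are hypothetical, and the toy data frames stand in for what `fread` would return. Keeping the list named by file makes it easy to see which input lacks an `SNP` column.

```r
# Toy stand-ins for three files; the file names are hypothetical.
dflist <- list(
  "a.txt" = data.frame(`10:103391446` = c("10:115562764:TTTC_", "10:115562765:TTC_T"),
                       check.names = FALSE),
  "b.txt" = data.frame(SNP = c("rs34394051", "rs11121177")),
  "c.txt" = data.frame(SNP = c("rs34394051", "rs11121177"))
)

# TRUE for every list element that has a column called SNP
has_snp <- vapply(dflist, function(d) "SNP" %in% names(d), logical(1))

# Names of the elements (files) that need a closer look
missing_snp <- names(dflist)[!has_snp]
print(missing_snp)
```

With real data, the same `vapply` call would run on the list produced by `lapply(setNames(nm=files), fread)`.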

1 Answer

Your problems appear to be two-fold:

  1. One of your frames is missing SNP as a column name. That will often cause problems:

    setdiff(mtcars$QUUX, mtcars$cyl)
    # NULL
    

    This is not hard to fix (names(dflist[[1]]) <- "SNP"), but does not resolve all of the problems.

  2. Your first frame has completely different-looking data. When I skip the first frame, it works.

    map(dflist[-1], ~.$SNP) %>%
      reduce(intersect)
    # [1] "rs34394051" "rs11121177" "rs10799615" "rs590013"  
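
Both points can be reproduced without any files. The following base-R sketch uses the values from the question's sample as plain character vectors (the names `f1`, `f2`, `f3` are made up for illustration) and uses `Reduce` in place of purrr's `reduce`. It also counts how many vectors each ID appears in, which can help locate the odd one out when the intersection comes back empty.

```r
# Values copied from the sample in the question; f1..f3 are illustrative names.
snps <- list(
  f1 = c("10:115562764:TTTC_", "10:115562765:TTC_T",
         "10:14188623_CCTGA_C", "10:15988900:G_GGT"),
  f2 = c("rs34394051", "rs11121177", "rs10799615", "rs590013"),
  f3 = c("rs34394051", "rs11121177", "rs10799615", "rs590013")
)

# Intersecting everything is empty, because f1 shares nothing with the rest
common_all <- Reduce(intersect, snps)

# Dropping the first element gives the four rs IDs
common_rest <- Reduce(intersect, snps[-1])

# How many of the vectors does each ID occur in?
counts <- table(unlist(lapply(snps, unique)))
print(sort(counts, decreasing = TRUE))
```

IDs whose count equals `length(snps)` are the ones present everywhere; here none reach 3, which matches the empty intersection.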
    
r2evans
  • Thanks so much. When I read the files using dflist <- lapply(setNames(nm=files), fread), the SNP is the header for all data frames. Can you briefly explain why this happens and what the difference is in importing the files using these different ways? – rkl Feb 23 '21 at 19:20
  • The differences between `lapply` and a `for` loop are minor, and some of them are stylistic. For both, I often suggest skimming through https://stackoverflow.com/a/24376207/3358227 for a discussion on keeping frames in `list`s, though some of it (just the *concept* of it) you're already doing. I would not expect my code here to produce different column names than yours, so I wonder if your `files` has a mixed bag of data structures (i.e., `"10:115562764:TTTC_"` vs `"rs34394051"`). – r2evans Feb 23 '21 at 19:39
  • 1
    BTW, the use of `for (i in 1:length(files))` is a little fragile: when `files` is empty for some reason, while you might expect the `for` loop to do nothing, it will instead fire twice, because `1:length(.)` resolves to `1:0` resolves to `c(1L, 0L)`. It's better to use `seq_len(length(files))` or, better yet in this case, `seq_along(files)`. In both of those cases, a zero-length `files` will result in the `for` loop doing nothing, which is intuitively what should happen. – r2evans Feb 23 '21 at 19:40
  • Hello, yes it does. The column contains entries such as 10:115562764:TTTC_ and rs453.. When there is no rsid (i.e. name for the genetic variant), the data refers to it by its chromosome and position (i.e. 10:115562764). – rkl Feb 24 '21 at 13:50
  • Hi @r2evans is the code still OK if there are values such as these in (10:115562764:TTTC_ and rs453) in the column ? My final code is: ``` files <- list.files() dflist <- lapply(setNames(nm=files), fread) for(i in 1:length(files)){ dflist[[i]] <- fread(files[i]) } df<- map(dflist, ~.$SNP) %>% reduce(intersect) %>% as.data.frame() ``` – rkl Feb 24 '21 at 15:25
  • The method I have here does not care if the strings start with `"rs"` or `"10:"` or anything else. The premise is looking for commonality. – r2evans Feb 24 '21 at 16:10
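
The `1:length(files)` pitfall mentioned in the comments above can be seen with an empty vector. A small sketch, where the counter names `n_colon` and `n_along` are made up for illustration:

```r
files <- character(0)  # pretend list.files() found nothing

# 1:length(files) is 1:0, i.e. c(1L, 0L), so the body runs twice
n_colon <- 0L
for (i in 1:length(files)) n_colon <- n_colon + 1L

# seq_along(files) is integer(0), so the body never runs
n_along <- 0L
for (i in seq_along(files)) n_along <- n_along + 1L

print(c(colon = n_colon, along = n_along))
```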