Merge two datasets based on a common column

Question

I have two files: sampleattributes and genecount. I have filtered the sample attributes file, and it has a column named SAMPID; genecount also has a column named SAMPID. I am trying to merge the two files on the common SAMPID column. This is what I have written:

library(readr)  # read_delim(), read_table2()
library(dplyr)  # select(), filter(), %>%

GTEx_Analysis_v8_Annotations_SampleAttributesDS <- read_delim("/new_gtex/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt", delim = "\t", escape_double = FALSE, trim_ws = TRUE)
sample_attributes <- select(GTEx_Analysis_v8_Annotations_SampleAttributesDS, SAMPID, SMTS, SMTSD, SMAFRZE)

sample_attributes_braindata <- sample_attributes %>% filter(sample_attributes$SMTS == "Brain" & sample_attributes$SMAFRZE == "RNASEQ")
sample_attributes_braindata <- data.frame(sample_attributes_braindata)

GTEx_Analysis_gene_reads <- read_table2("/new_gtex/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct")
GTEx_Analysis_gene_reads <- data.frame(GTEx_Analysis_gene_reads)

gene_count <- data.frame(t(GTEx_Analysis_gene_reads[,-c(1:2)]))
colnames(gene_count) <- GTEx_Analysis_gene_reads$Name

The first rows of sample_attributes_braindata and gene_count are included as dput() output at the end of this question.

I tried to name the first column of gene_count with the GTEX IDs using this command:

colnames(gene_count) <- GTEx_Analysis_gene_reads$Name

But it does not work.

I also tried this command to rename the first column to SAMPID:

colnames(gene_count)[1] <- "SAMPID"

What I want to do is merge the two datasets by the common column SAMPID (the GTEX ID):

genecount2 <- merge(sample_attributes_braindata,gene_count, by=SAMPID)

dput(gene_count[1:5, 1:4])
structure(list(ENSG00000223972.5 = c(0, 0, 0, 0, 0), ENSG00000227232.5 = c(187,
109, 143, 251, 113), ENSG00000278267.1 = c(0, 0, 1, 0, 0), ENSG00000243485.5 = c(1,
0, 0, 1, 0)), row.names = c("GTEX.1117F.0226.SM.5GZZ7", "GTEX.1117F.0426.SM.5EGHI",
"GTEX.1117F.0526.SM.5EGHJ", "GTEX.1117F.0626.SM.5N9CS", "GTEX.1117F.0726.SM.5GIEN"
), class = "data.frame")

dput((sample_attributes_braindata[1:5, 1:4]))
structure(list(SAMPID = c("GTEX-1117F-3226-SM-5N9CT", "GTEX-111FC-3126-SM-5GZZ2",
"GTEX-111FC-3326-SM-5GZYV", "GTEX-1128S-2726-SM-5H12C", "GTEX-1128S-2826-SM-5N9DI"
), SMTS = c("Brain", "Brain", "Brain", "Brain", "Brain"), SMTSD = c("Brain - Cortex",
"Brain - Cortex", "Brain - Cerebellum", "Brain - Cortex", "Brain - Cerebellum"
), SMAFRZE = c("RNASEQ", "RNASEQ", "RNASEQ", "RNASEQ", "RNASEQ"
)), row.names = c(NA, 5L), class = "data.frame")

Looks like you overwrote `gene_count` with a vector of names, rather than a data frame/tibble? That is probably also why you first got this type of error (`Error: x and y must share the same src, set copy = TRUE (may be slow).`). – langtang Feb 21 '22 at 23:05
The error message means that `SAMPID` is not a column in both `sample_attributes_braindata` and `gene_count`. It would be easier for us to help you with a fix if you included a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Especially: remove the parts of your code that are not relevant to the error, and provide some example data with your question. – jpiversen Feb 22 '22 at 08:04
It would be great if you could share your sample data as copy/pasteable code, not as pictures. (We can't test and demonstrate code on pictures of data.) `dput()` is a great way to share copy/pasteable sample data including all class and structure information, e.g., `dput(gene_count[1:5, 1:4])` for the first 5 rows and 4 columns of `gene_count` (share a similar illustrative subset of the other data frame). – Gregor Thomas Feb 22 '22 at 21:00
Reproducible data will help a lot since you seem to have multiple issues: `gene_count` seems to have row names rather than a column to match on, the matches are not exact as one data frame has `-` while the other has `.`, etc. If you need more help making a reproducible example, see the link shared by jpiversen; it has lots of good information, explanation, and examples. – Gregor Thomas Feb 22 '22 at 21:01
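
A likely source of the `-` versus `.` mismatch mentioned in the comments (an assumption based on the code in the question, not something the asker confirmed) is that `data.frame()` runs column names through `make.names()` when `check.names = TRUE`, which is its default. The sketch below uses one sample ID in the annotation file's format to show the effect; the variable names are illustrative only.

```r
# A sample ID in the format used by the annotation file
id <- "GTEX-1117F-0226-SM-5GZZ7"

# make.names() replaces characters that are invalid in R names with "."
syntactic_id <- make.names(id)

# data.frame() applies the same rule to column names by default ...
checked <- data.frame(`GTEX-1117F-0226-SM-5GZZ7` = 1)

# ... unless check.names = FALSE is passed
unchecked <- data.frame(`GTEX-1117F-0226-SM-5GZZ7` = 1, check.names = FALSE)

print(syntactic_id)
print(names(checked))
print(names(unchecked))
```

If this is what happened, the dots only appear after the `data.frame()` call on the imported counts, so the IDs can either be preserved with `check.names = FALSE` or converted back afterwards.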

score 1 · Accepted Answer · answered Feb 23 '22 at 14:07

Looking at your gene_count data, it doesn't have a column for the SAMPID; those were imported as row names. We'll convert them to an actual column, replace the "." with "-" so they match the braindata format, and then we can join. Your sample data doesn't have any elements in common so I use a full_join, but you may prefer a left, right, or inner join; I'm not really sure what your use case is.

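library(dplyr)
library(tibble)  # rownames_to_column() comes from tibble, not dplyr

gene_count %>%
  # move the sample IDs from the row names into a real column
  rownames_to_column(var = "SAMPID") %>%
  # replace "." with "-" so the IDs match sample_attributes_braindata
  mutate(SAMPID = gsub(pattern = ".", replacement = "-", x = SAMPID, fixed = TRUE)) %>%
  full_join(sample_attributes_braindata, by = "SAMPID")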
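
For readers who would rather not load extra packages, the same three steps (row names to a column, "." to "-", then join) can be sketched in base R. This is only an illustration on a cut-down copy of the `dput()` data from the question; `gene_count_ids` and `merged` are names introduced here, and `merge(..., all = TRUE)` is used as the base R counterpart of a full join.

```r
# Cut-down copies of the dput() data from the question
gene_count <- data.frame(
  ENSG00000223972.5 = c(0, 0, 0),
  ENSG00000227232.5 = c(187, 109, 143),
  row.names = c("GTEX.1117F.0226.SM.5GZZ7",
                "GTEX.1117F.0426.SM.5EGHI",
                "GTEX.1117F.0526.SM.5EGHJ")
)
sample_attributes_braindata <- data.frame(
  SAMPID = c("GTEX-1117F-3226-SM-5N9CT", "GTEX-111FC-3126-SM-5GZZ2"),
  SMTS   = c("Brain", "Brain"),
  SMTSD  = c("Brain - Cortex", "Brain - Cortex")
)

# 1. Move the row names into a real SAMPID column
gene_count_ids <- data.frame(
  SAMPID = rownames(gene_count),
  gene_count,
  row.names = NULL,
  check.names = FALSE
)

# 2. Replace "." with "-" so the IDs use the annotation file's format
gene_count_ids$SAMPID <- gsub(".", "-", gene_count_ids$SAMPID, fixed = TRUE)

# 3. Keep every row from both tables (the base R analogue of full_join)
merged <- merge(gene_count_ids, sample_attributes_braindata,
                by = "SAMPID", all = TRUE)

print(merged)
```

Note that `by = "SAMPID"` is quoted; with real data that shares IDs, dropping `all = TRUE` gives the equivalent of an inner join.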

Gregor Thomas

136,190
20
167
294

Yes, this worked. I wanted to join them by SAMPID. They do have some ID in common. – Rhea Bedi Feb 24 '22 at 21:34
I'm sure your real data has elements in common, but the sample data you `dput()` in your question doesn't. Normally I would assume `inner_join()` or `left_join()`, but without any rows in common in the example, `full_join()` is the only option to demonstrate that it works. – Gregor Thomas Feb 24 '22 at 21:51
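
To make the difference between the join types discussed in these comments concrete, here is a small sketch on hypothetical toy data (the IDs `GTEX-A`, `GTEX-B`, `GTEX-C` and the column `ENSG1` are made up for illustration and do not come from the GTEx files). Only `GTEX-B` appears in both tables.

```r
library(dplyr)

# Hypothetical toy tables: only "GTEX-B" is present in both
counts <- data.frame(SAMPID = c("GTEX-A", "GTEX-B"), ENSG1 = c(10, 20))
attrs  <- data.frame(SAMPID = c("GTEX-B", "GTEX-C"),
                     SMTSD  = c("Brain - Cortex", "Brain - Cerebellum"))

inner <- inner_join(counts, attrs, by = "SAMPID")  # only IDs found in both
left  <- left_join(counts, attrs, by = "SAMPID")   # every row of counts
full  <- full_join(counts, attrs, by = "SAMPID")   # every ID from either table

print(c(inner = nrow(inner), left = nrow(left), full = nrow(full)))
```

With real data where most IDs match, `inner_join()` keeps only samples that have both counts and attributes, which is often what a downstream analysis needs.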
