Merge/match two data frames

Question

I would like to merge two data frames, y$genes and symbol_annotations, by the row names of y and the second column, "hgnc_symbol", of symbol_annotations, and create a column labeled "Symbol", y$genes$Symbol, listing all of the matches. If a row name has no match in "hgnc_symbol", I would like the cell to contain NA instead of being empty. I keep getting an error because the two data frames don't have the same dimensions and contain NAs, and I'm not sure how to correct it.

>read.counts <- read.table("gene_counts.txt", header=TRUE) 
>row.names(read.counts) <- read.counts$Geneid 
>treatment <- factor(treatment)
> head(treatment)
[1] T0          IL2         IL2.ZA      IL2.OKT3    IL2.OKT3.ZA T0         
Levels: T0 IL2 IL2.OKT3 IL2.OKT3.ZA IL2.ZA
>y <- DGEList(read.counts, group=treatment, genes=read.counts)
>head(y$genes)
                SM01 SM02 SM03 SM04 SM05 SM06 SM07 SM08 SM09 SM10 SM11 SM12 SM13 SM14 SM15 SM16 SM17 SM18 SM19
ENSG00000223972    0    1    1    1    0    0    1    0    0    3    0    0    1    2    0    0    0    0    1
ENSG00000227232   33   31   13   15   20   43   36   32   43   43   61   42   92   73   80   64   33   25   28
ENSG00000278267    1    0    1    0    0    5    3    1    1    2    1    0    2    4    6    0    2    2    1
ENSG00000243485    0    0    0    0    0    0    0    0    0    0    0    0    0    0    2    0    0    0    0
ENSG00000237613    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
ENSG00000268020    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
                SM20 SM21 SM22 SM23 SM24 SM25 SM26 SM27 SM28 SM29 SM30
ENSG00000223972    0    0    0    0    1    0    0    0    0    0    0
ENSG00000227232   15   60   13   29   22   28   87   42   61   67   74
ENSG00000278267    2    3    5    1    3    4    4    3    2    4    3
ENSG00000243485    0    0    0    0    0    1    0    0    0    0    1
ENSG00000237613    0    0    0    0    0    0    0    0    0    0    0
ENSG00000268020    0    0    0    0    0    0    0    0    0    0    0
>head(symbol_annotations, n=10)
   ensembl_gene_id hgnc_symbol
1  ENSG00000210049       MT-TF
2  ENSG00000211459     MT-RNR1
3  ENSG00000210077       MT-TV
4  ENSG00000210082     MT-RNR2
5  ENSG00000209082      MT-TL1
6  ENSG00000198888      MT-ND1
7  ENSG00000210100       MT-TI
8  ENSG00000223795        <NA>
9  ENSG00000210107       MT-TQ
10 ENSG00000210112       MT-TM
>dim(symbol_annotations)
[1] 58069     2
>dim(y$genes)
[1] 58051    30
>y$genes$Symbol <- merge((rownames(y)), symbol_annotations[,c(2)])
Error in if (n > 0) c(NA_integer_, -n) else integer() : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx : NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d))) :
  NAs introduced by coercion to integer range

Extract `rownames(y$genes)` and make it a column, then use `merge(df1,df2,by.x,by.y,all.x=TRUE)` — joel.wilson, Dec 12 '16 at 19:24
Please share your data in a [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). There are tons of merging questions on this site and I don't understand exactly how yours is any different. Perhaps make that more clear. Check out this question: http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right — MrFlick, Dec 12 '16 at 19:25
If I create `y$genes$ENSEMBL_ID <- rownames(y$genes)`, and try to merge to the data frame by `y$genes$Symbol <- merge(y$genes$ENSEMBL_ID, symbol_annotations, by.x=1,by.y=1, all.x=TRUE)`, I get the error: **Error in `$<-.data.frame`(`*tmp*`, "Symbol", value = list(x = c(1L, 2L, : replacement has 58069 rows, data has 58051** @MrFlick, `y` is a DGEList using edgeR. — emblake, Dec 12 '16 at 19:37
I would suggest not placing the result of the merge into a new slot of `y`, as DGEList is an S4 class and that could cause problems now or down the road. `y.merge <- merge(y$genes, symbol_annotations, by.x="row.names", by.y="ensembl_gene_id")` — emilliman5, Dec 12 '16 at 19:48
@emilliman5, I am trying to make a heatmap of my DE results. If I place the merged annotations in a separate object, then try to rename the row names of data frame logCPM with my annotations, I get the following error: ***Error in `row.names<-.data.frame`(`*tmp*`, value = value) : invalid 'row.names' length***. Again, there is an error in the dimension size, but I still don't know how to fix it. Thanks. — emblake, Dec 12 '16 at 20:54
The `merge` command should generate the matrix you need for the heatmap.... please check the dimensions of `y.merge` — emilliman5, Dec 12 '16 at 21:21
@emilliman5, I needed to make the row names a column in the matrix in order to merge it properly to generate a labeled heatmap...rookie mistake! Thanks for your help! — emblake, Dec 13 '16 at 13:37
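
For reference, here is a minimal sketch of the fix the comments point toward, run on toy data rather than the asker's files. The two small data frames are hypothetical stand-ins for `y$genes` and `symbol_annotations`, and the gene symbols in them are illustrative only. The sketch assumes the row names (Ensembl IDs) should be looked up in `ensembl_gene_id` rather than in `hgnc_symbol`. It also shows `match()`, which nobody in the thread suggested: it returns exactly one value per row of the left table, with `NA` where nothing matches, so the result can be assigned as a column without a length mismatch.

```r
# Toy stand-in for y$genes: counts with Ensembl IDs as row names (hypothetical values).
genes <- data.frame(
  SM01 = c(0, 33, 1),
  SM02 = c(1, 31, 0),
  row.names = c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267")
)

# Toy stand-in for symbol_annotations (illustrative symbols, different row order,
# one ID absent from genes and one ID in genes absent from here).
symbol_annotations <- data.frame(
  ensembl_gene_id = c("ENSG00000227232", "ENSG00000223972", "ENSG00000210049"),
  hgnc_symbol     = c("WASH7P", "DDX11L1", "MT-TF"),
  stringsAsFactors = FALSE
)

# Option 1: match() looks up each row name in the annotation table.
# It returns one index per row of genes (NA if not found) and uses the
# first hit when an ID is duplicated, so the lengths always agree.
idx <- match(rownames(genes), symbol_annotations$ensembl_gene_id)
genes$Symbol <- symbol_annotations$hgnc_symbol[idx]

# Option 2: the left join from the comments. The row names become a real
# column first; all.x = TRUE keeps unmatched genes and fills them with NA.
with_id <- genes[, c("SM01", "SM02")]
with_id$ENSEMBL_ID <- rownames(with_id)
merged <- merge(with_id, symbol_annotations,
                by.x = "ENSEMBL_ID", by.y = "ensembl_gene_id",
                all.x = TRUE)

print(genes)
print(merged)
```

Note that `merge()` re-sorts the rows by the key and drops the original row names, and it can return more rows than the left table when an ID is duplicated in the annotation table, which is one plausible source of the "replacement has 58069 rows, data has 58051" error. The original call likely failed for a different reason: `merge()` was given two plain vectors with no shared column names, in which case it builds the full cross product (58051 × 58069 rows, more than a 32-bit integer can index).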
