0

I have two different dataframes as shown in figures attached. Dataframe1 and Dataframe2.

This is what I tried.

#First dataframe
structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", 
"Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10", 
"Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16", 
"Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22", 
"Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28", 
"Gene 29", "Gene 30"), Count = c(1500, 1600, 1700, 1800, 1900, 
2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000, 
3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900, 4000, 4100, 
4200, 4300, 4400)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-30L))

df_1 <- read_excel("Demo_data.xlsx", sheet = "Dataframe1")
str(df_1)
View(df_1)

df_1.1 <- column_to_rownames(df_1, 'Label')
View(df_1.1)

df_1.2 <- t(df_1.1)
View(df_1.2)

df_1.2 <- as.data.frame(df_1.2)
str(df_1.2)


typeof(dff1)
str(dff1)


#Second dataframe
structure(list(Label = c("Control1", "Control2", "Control3", 
"Control4", "Control5", "Control6", "Control7", "Control8", "Control9", 
"Control10", "Control11", "Control12", "Control13", "Control14", 
"Control15", "Control16", "Control17", "Control18", "Control19", 
"Control20", "Control21", "Control22", "Control23", "Control24"
), Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 
2300, 700, 1400, 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 
4300, 2120, 4800, 4300)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -24L))


df_2 <- read_excel("Demo_data.xlsx", sheet = "Dataframe2")

df_2.1 <- column_to_rownames(df_2, 'Label')
View(df_2.1)

df_2.1 <- t(df_2.1)
View(df_2.1)

df_2.1 <- as.data.frame(df_2.1)
str(df_2.1)

correlation <- cor(df_1.2, df_2.1)
View(correlation)

This is my desired output but I am getting NA for every correlation. Any help is highly appreciated.

Desired output (without NA)

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92
Abhi
  • 1
  • 1
  • Can you provide your dataframes using `dput()` instead of images? [See here on how to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – jrcalabrese Dec 10 '22 at 19:13
  • Thank you for suggesting. I have included the dataframes using dput(). – Abhi Dec 10 '22 at 19:44
  • As far as I can tell, you have two numeric variables, `df_1$Count` and `df_2$Count`. First, are you trying to find the correlation between those two columns? If yes, you can't make a [heatmap](https://r-graph-gallery.com/79-levelplot-with-ggplot2.html) with only one Pearson's r. Alternatively, are you transposing your dfs to maximize columns in order to make a heatmap? I don't think that'll work since you can't generate Pearson's r with only two numeric values. For example, the first cell in `correlation` is equivalent to `cor(1800, 1500)`. If you type that into your R console, you get `NA`. – jrcalabrese Dec 10 '22 at 20:23
  • Okay. Thank you for suggesting that. My only objective is plotting a heatmap between all rows in both dataframes. In the heatmap, I want to see what fold increase or decrease is there between entries in two dataframes. The heatmap should have x-axis as all rows in dataframe1 and y-axis as dataframe2. Do you have any suggestions? – Abhi Dec 10 '22 at 20:38
  • By increase or decrease, do you mean the value when you subtract `df_1$Count` from `df_2$Count`, or do you mean an increase/decrease compared to the previous row (for each variable)? Also, assuming you can match `Control1` with `Gene 1` (and so on), would a scatterplot accomplish the same goal? – jrcalabrese Dec 10 '22 at 20:45
  • By increase or increase, I mean fold change. For example, Gene1 has count 1500, I want to know its fold change with respect to every control gene in df2 (on y-axis). And I want that fold change for every gene with respect to every other control in df2. A scatterplot might work. But if there's a way to make a heatmap, it'll help a lot. Please let me know if you have any other questions. – Abhi Dec 10 '22 at 20:55
  • Can you add details on how to calculate fold change to your original question? – jrcalabrese Dec 11 '22 at 02:44

1 Answers1

0

As it is written in the comments, what you are trying to achieve is rather unclear.

If you want to compute the correlation between the Count column in each dataframe and visualize it using a scatterplot, you can use the following code:

library(tidyverse)

df_1 <- structure(list(Label = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", 
                                 "Gene 5", "Gene 6", "Gene 7", "Gene 8", "Gene 9", "Gene 10", 
                                 "Gene 11", "Gene 12", "Gene 13", "Gene 14", "Gene 15", "Gene 16", 
                                 "Gene 17", "Gene 18", "Gene 19", "Gene 20", "Gene 21", "Gene 22", 
                                 "Gene 23", "Gene 24", "Gene 25", "Gene 26", "Gene 27", "Gene 28", 
                                 "Gene 29", "Gene 30"), 
                       Count = c(1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 
                                 2600, 2700, 2800, 2900, 3000, 3100, 3200, 3300, 3400, 3500, 3600, 
                                 3700, 3800, 3900, 4000, 4100, 4200, 4300, 4400)), 
                  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))


df_2 <- structure(list(Label = c("Control1", "Control2", "Control3", 
                                 "Control4", "Control5", "Control6", "Control7", "Control8", "Control9", 
                                 "Control10", "Control11", "Control12", "Control13", "Control14", 
                                 "Control15", "Control16", "Control17", "Control18", "Control19", 
                                 "Control20", "Control21", "Control22", "Control23", "Control24"), 
                       Count = c(1800, 1400, 1110, 1900, 2500, 2900, 2100, 900, 5000, 2300, 700, 1400, 
                                 3400, 2310, 3322, 2200, 4400, 2100, 1000, 6700, 4300, 2120, 4800, 4300)), 
                  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -24L))

dat = left_join(
  df_1 %>% mutate(id=str_extract(Label, "\\d+")),
  df_2 %>% mutate(id=str_extract(Label, "\\d+")), 
  by="id", suffix=c("_gene", "_ctl")
)

dat
#> # A tibble: 30 x 5
#>    Label_gene Count_gene id    Label_ctl Count_ctl
#>    <chr>           <dbl> <chr> <chr>         <dbl>
#>  1 Gene 1           1500 1     Control1       1800
#>  2 Gene 2           1600 2     Control2       1400
#>  3 Gene 3           1700 3     Control3       1110
#>  4 Gene 4           1800 4     Control4       1900
#>  5 Gene 5           1900 5     Control5       2500
#>  6 Gene 6           2000 6     Control6       2900
#>  7 Gene 7           2100 7     Control7       2100
#>  8 Gene 8           2200 8     Control8        900
#>  9 Gene 9           2300 9     Control9       5000
#> 10 Gene 10          2400 10    Control10      2300
#> # ... with 20 more rows

cor(dat$Count_gene, dat$Count_ctl, use="pairwise.complete.obs")
#> [1] 0.5047392

ggplot(dat, aes(x=Count_gene, y=Count_ctl)) + 
  geom_point()
#> Warning: Removed 6 rows containing missing values (`geom_point()`).

Created on 2022-12-12 with reprex v2.0.2

Basically, I extracted the id as the last digits of the label, then used left_join() to merge the dataframes.

This might look overly complicated but it is always a good idea to keep your data tidy in one dataframe.

Note that in your example, df_2 stops at id==24 so the correlation is computed on the 24 complete observations only.

However, a correlation is computed across 2 vectors, so in order to have a heatmap you would need a set of many vectors, which you don't seem to have.

For your next questions, it would be great if you use the reprex package as I did in this answer.

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92