
I have two data tables, each with three columns: a pair of diseases (d1, d2) and a value measured for that pair. Below is sample data for the first one, disease_table1:

  **d1**   **d2** **Value**

Disease1 Disease2  3.5
Disease3 Disease4  5
Disease5 Disease6  1.1
Disease1 Disease3  2.4
Disease6 Disease2  6.7

The real dataset 1 (disease_table1) looks like this:

 Bladder cancer                         X-linked ichthyosis (XLI)        3.5
 Leukocyte adhesion deficiency (LAD)    Aldosterone synthase Deficiency  1.8
 Leukocyte adhesion deficiency (LAD)    Brain Cancer                     1.5
 Tangier disease                        Pancreatic cancer                0.66

I want to show the difference between these two data tables by plotting the disease pairs and their values for both tables. I used the plot and lines functions, but the result is too simple and does not differentiate the tables much. I would also like the names of the disease pairs to appear in the plot.

   plot(density(disease_table1$value))
   lines(density(disease_table2$value))

Thanks

Rgeek
  • Could you provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Jaap Jan 28 '14 at 18:55
  • I have added the real dataset and the code as an example. – Rgeek Jan 28 '14 at 19:39
  • With 400,000+ disease pairs you probably need a clustering approach. Can you post a link to your data, or a more representative subset, say a few thousand records? – jlhoward Jan 28 '14 at 21:09

1 Answer


Some sample code:

# create the data frames (I made up the second one)
df1 <- read.table(text = "d1   d2 x
Disease1 Disease2  3.5
Disease3 Disease4  5
Disease5 Disease6  1.1
Disease1 Disease3  2.4
Disease6 Disease2  6.7", header = TRUE, strip.white = TRUE)

df2 <- read.table(text = "d1   d2 y
Disease1 Disease2  4.5
Disease3 Disease4  2
Disease5 Disease6  3.1
Disease1 Disease3  1.4
Disease6 Disease2  5.7", header = TRUE, strip.white = TRUE)

# needed libraries
library(reshape2)
library(ggplot2)

# merging dataframes & creating unique identifier variable
data <- merge(df1, df2, by = c("d1","d2"))
data$diseasepair <- paste0(data$d1,"-",data$d2)

# reshape to long format: one row per disease pair and group
data.long <- melt(data, id.vars = "diseasepair", measure.vars = c("x", "y"),
                  variable.name = "group")

# make the plot
ggplot(data.long) +
  geom_bar(aes(x = diseasepair, y = value, fill = group), 
           stat="identity", position = "dodge", width = 0.7) +
  scale_fill_manual("Group\n", values = c("red","blue"), 
                    labels = c(" X", " Y")) +
  labs(x="\nDisease pair",y="Value\n") +
  theme_bw()

The result:

[Image: dodged bar chart with one pair of bars (X in red, Y in blue) per disease pair]

Is this what you're looking for?
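One thing to watch for with `merge(..., by = c("d1","d2"))`: it only matches rows where the two diseases are listed in the same order in both tables. If the second table may list a pair in reverse order (an assumption, the sample data does not show this), a rough sketch is to build an order-independent key first. The helper name `pair_key` and the reversed row in `df2` are made up for illustration:

```r
# Made-up tables; in df2 the first pair is deliberately listed in reverse order
df1 <- data.frame(d1 = c("Disease1", "Disease3", "Disease5"),
                  d2 = c("Disease2", "Disease4", "Disease6"),
                  x  = c(3.5, 5, 1.1), stringsAsFactors = FALSE)
df2 <- data.frame(d1 = c("Disease2", "Disease3", "Disease5"),
                  d2 = c("Disease1", "Disease4", "Disease6"),
                  y  = c(4.5, 2, 3.1), stringsAsFactors = FALSE)

# pair_key() is a hypothetical helper: it puts the two names in alphabetical
# order, so (A, B) and (B, A) produce the same key
pair_key <- function(a, b) {
  paste(pmin(a, b), pmax(a, b), sep = "-")
}

df1$diseasepair <- pair_key(df1$d1, df1$d2)
df2$diseasepair <- pair_key(df2$d1, df2$d2)

# merge on the key only, then compute the per-pair difference
merged <- merge(df1[, c("diseasepair", "x")],
                df2[, c("diseasepair", "y")],
                by = "diseasepair")
merged$diff <- merged$y - merged$x

# largest absolute differences first
merged <- merged[order(-abs(merged$diff)), ]
print(merged)
```

The `merged` data frame can then be reshaped and plotted in the same way as `data` above.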

Jaap
  • I have 400k pairs of this kind, so I don't think this would work. It would have worked great for a smaller dataset, though. I believe a curve or heat map could work? – Rgeek Jan 28 '14 at 19:45
  • For 400k pairs a heat map won't work either IMHO. Do you want to compare the values for each pair? Or just for specific pairs? – Jaap Jan 28 '14 at 19:55
  • Basically, I want to show enrichment of disease pairs using the values in one dataset vs. the other. So I want to compare the values for each pair. – Rgeek Jan 28 '14 at 19:59
  • It's possibly a better solution to make subsets of your dataset for groups or for specific combinations. All those 400k pairs in one plot won't produce a plot of any value (at least that's what I think). First decide what you're looking for, then create subsets & create some plots. – Jaap Jan 28 '14 at 20:25
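
The subsetting idea from the comments could look roughly like the sketch below. It uses simulated values in place of the real 400k pairs, and it assumes that "enrichment" can be approximated by the log2 ratio of the two values, which may not be the measure actually wanted. All pairs go into one overview scatter plot, and only the most extreme pairs get a name label:

```r
set.seed(1)  # simulated data, only to make the sketch runnable
n <- 10000
data <- data.frame(diseasepair = paste0("Pair", seq_len(n)),
                   x = rexp(n),   # stand-in for the values of table 1
                   y = rexp(n),   # stand-in for the values of table 2
                   stringsAsFactors = FALSE)

# log2 ratio per pair (assumes all values are strictly positive)
data$log2ratio <- log2(data$y / data$x)

# overview: one dot per pair on log axes; the red line marks equal values
plot(data$x, data$y, pch = ".", log = "xy",
     xlab = "Value in table 1", ylab = "Value in table 2")
abline(0, 1, col = "red")

# subset: the 20 pairs with the largest absolute log2 ratio
top <- head(data[order(-abs(data$log2ratio)), ], 20)

# label only those pairs so that the names stay readable
text(top$x, top$y, labels = top$diseasepair, cex = 0.6, pos = 3)
```

The `top` subset is small enough to be passed to the bar chart code from the answer, and the cutoff of 20 is arbitrary.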