
I have a tab-separated file. Note: the confidence_interval column holds two values (the lower and upper bounds), which are separated by a space.

Clusters vs regions
#Cluster        total_cluster_markers   total_gene_list overlap confidence_interval     p-value odds_ratio
cluster0_fisher 512     5840    209     1.7182382746801 2.47748441988708        1.481417e-14    2.064883
cluster1_fisher 425     5840    151     1.32891483798265 2.00728239610416       2.765814e-06    1.635712
cluster2_fisher 2339    5840    778     1.39944144525363 1.68467086400037       2.828628e-19    1.535827
cluster3_fisher 745     5840    294     1.68442891430663 2.28721533485484       1.176685e-17    1.96402
cluster4_fisher 960     5840    359     1.57334775537734 2.06762614937187       4.882507e-17    1.804544
cluster5_fisher 1038    5840    401     1.6771709720448 2.17858158866511        7.620883e-22    1.91242
cluster6_fisher 601     5840    258     1.91504755165867 2.67872178784458       3.390846e-21    2.266055
cluster7_fisher 914     5840    365     1.75583321881608 2.31518016906748       8.043077e-23    2.017119
cluster8_fisher 1144    5840    435     1.6468354153442 2.11561394952237        4.589145e-22    1.867252
cluster9_fisher 2390    5840    870     1.64540029553634 1.97088684681019       1.797583e-36    1.801035
cluster10_fisher        2564    5840    952     1.72111579504647 2.04939671061079       2.037412e-44    1.87854
cluster11_fisher        510     5840    212     1.77408302607108 2.55721196987857       7.082665e-16    2.131412
cluster12_fisher        1692    5840    654     1.76451430192618 2.17296718053884       1.0919e-35      1.958701
cluster13_fisher        3083    5840    1134    1.73401721793267 2.03895922438595       2.043825e-51    1.880658
cluster14_fisher        733     5840    276     1.55077148161245 2.11642986931938       1.094299e-13    1.812898
cluster15_fisher        1373    5840    377     0.988223149703785 1.26730522269035      0.07333463      1.119935
cluster16_fisher        703     5840    273     1.62824034204187 2.23248738568041       1.911082e-15    1.908004

I need to make an error bar plot using the odds ratio and CI. Here is the code I am using:

library('ggplot2')
args <- commandArgs(TRUE)
cluster_file <- args[1]
headers=read.csv(cluster_file, skip = 1, sep='\t', header = F, nrows = 1, as.is = T)
boxLabels = c("C0","C1", "C2", "C3",
          "C4", "C5", "C6",
          "C7", "C8", "C9",
          "C10", "C11","C12",
          "C13", "C14","C15",
          "C16")

dat <- read.csv(cluster_file, header=FALSE, sep='\t' ,skip=2, 
                 stringsAsFactors = FALSE)
colnames(dat)=headers

dat <- cbind(dat, do.call("rbind", strsplit(dat[, 5], " ")))
pdf("rplot.pdf")
boxCILow=c(dat$"1")
boxCIHigh=c(dat$"2")
(p <- ggplot(dat, aes(x = odds_ratio, y = boxLabels)) +
    geom_vline(aes(xintercept = 1), size = .25, linetype = "dashed") +
    geom_errorbarh(aes(xmax = boxCIHigh, xmin = boxCILow), size = .5,
                   height = .2, color = "gray50") +
    geom_point(size = 2, color = "orange") +
    scale_x_continuous(breaks = seq(1.0, 3.0, 0.1), labels = seq(1.0, 3.0, 0.1),
                       limits = c(0.9, 3.0)) +
    theme_bw() +
    theme(panel.grid.minor = element_blank()) +
    ylab("") +
    xlab("Odds ratio") +
    ggtitle(tools::file_path_sans_ext(cluster_file))
)
dev.off()

Here is the warning I get: Removed 15 rows containing missing values (geom_errorbarh).

The graph does not look right. When I manually enter the same data as three vectors in the ggplot2 code, the graph looks right. So I am wondering whether I have parsed the information from the CSV file correctly.

user2998764
  • Have you printed the columns you got to check whether they were parsed correctly? I mean `print(dat$C4)`, `print(dat$C5)`, and so on. – ZiGaelle Jul 23 '19 at 07:32
  • There is no C4, C5 in the CSV file. boxLabels is only a vector where I define what labels I want on the y-axis. I have however printed dat$"#Cluster" and that looks ok – user2998764 Jul 23 '19 at 07:42
  • OK, sorry for being unclear. I meant `dat[,4]`, `dat[,5]`, and so on, to verify what the column vectors look like: whether each holds the correct list of numbers and whether it is detected as a numeric vector rather than a list of strings. Check every column, not only the first one. – ZiGaelle Jul 23 '19 at 07:53
  • Yes, CI gets detected as strings - "1.7182382746801 2.47748441988708 " "1.32891483798265 2.00728239610416 ". I converted it after cbind line like this: boxCILow=c(as.double(dat$"1")), but I get the same result. – user2998764 Jul 23 '19 at 08:24
  • Could be related to the axis scale then: https://stackoverflow.com/questions/32505298/explain-ggplot2-warning-removed-k-rows-containing-missing-values – ZiGaelle Jul 23 '19 at 08:30
  • I used the same dataset but entered it as vectors, df <- data.frame(yAxis = length(boxLabels):1, boxOdds = (c(2.064883,1.635712,1.535827,1.96402,1.804544,1.91242,2.266055,2.017119,1.867252,1.801035,1.87854,2.131412,1.958701,1.880658,1.812898,1.119935,1.908004)),).. and the correct plot was generated. So I think the problem has to do with how the data is formatted when it goes into ggplot2 – user2998764 Jul 23 '19 at 08:38
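
One possible explanation for why the `as.double()` conversion mentioned in the comments changed nothing, sketched under an assumption: with the pre-4.0 R defaults, character columns bound onto a data frame become factors, and `as.double()` on a factor returns the level codes rather than the numbers. Codes above 3 would then fall outside the `limits` of the x scale and be dropped. The two-row `ci` vector and the names `parts` and `low_factor` are made up for illustration; whether this is what happened in the original script was not confirmed.

```r
# Hypothetical two-row stand-in for the confidence_interval column
ci <- c("1.7182382746801 2.47748441988708",
        "0.988223149703785 1.26730522269035")

# Same split as in the question: one row per record, one column per bound
parts <- do.call("rbind", strsplit(ci, " "))

# Assumption: the lower-bound column ended up as a factor
low_factor <- factor(parts[, 1])

# Converting the factor directly yields the level codes, not the values
codes <- as.double(low_factor)

# Going through character first recovers the actual numbers
values <- as.double(as.character(low_factor))

print(codes)
print(values)
```

If this is the cause, `str(dat)` would show the two new columns as `Factor` rather than `num` or `chr`.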

1 Answer


Making use of library('tidyverse') solves the formatting issues for me.
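
The answer does not say which tidyverse functions were used, so the following is only a sketch of one way it could look, not the exact code that was run. It assumes `readr::read_tsv()` to read the file and `tidyr::separate()` to split the interval into numeric columns; the column names `ci_low` and `ci_high` and the two-row temporary file are made up for illustration.

```r
library(readr)
library(tidyr)
library(ggplot2)

# Write a small stand-in for the input file (title line, header, two data rows)
tsv <- tempfile(fileext = ".tsv")
writeLines(c(
  "Clusters vs regions",
  "#Cluster\ttotal_cluster_markers\ttotal_gene_list\toverlap\tconfidence_interval\tp-value\todds_ratio",
  "cluster0_fisher\t512\t5840\t209\t1.7182382746801 2.47748441988708\t1.481417e-14\t2.064883",
  "cluster15_fisher\t1373\t5840\t377\t0.988223149703785 1.26730522269035\t0.07333463\t1.119935"
), tsv)

# Skip the title line; the second line supplies the column names
dat <- read_tsv(tsv, skip = 1)

# Split "lower upper" into two columns; convert = TRUE makes them numeric
dat <- separate(dat, confidence_interval,
                into = c("ci_low", "ci_high"), sep = " ", convert = TRUE)

# Map every aesthetic to a column of dat instead of to external vectors
p <- ggplot(dat, aes(x = odds_ratio, y = `#Cluster`)) +
  geom_vline(xintercept = 1, linetype = "dashed") +
  geom_errorbarh(aes(xmin = ci_low, xmax = ci_high)) +
  geom_point()
```

Keeping the bounds as numeric columns of `dat` means the plot no longer depends on the separate `boxCILow` and `boxCIHigh` vectors.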

user2998764