I have the following data frame, df, for which I want to plot a histogram.

     x
1   -28313937
2   -218616099
3   -18406124
4   20307666
5   31985283
6   41429217
7   46488567
8   47690792
9   51127321
10  53168291
11  55247883
12  -49200409
13  33398814
14  36198419
15  42765257
16  45857195
17  43870899
18  50557988
19  49574516
20  52317786
21  50769743

I use the following code to plot the histogram:

R_hist <- ggplot(df, aes(x=x)) + 
  geom_histogram(binwidth=.5, colour="black", fill="white") + 
  geom_vline(aes(xintercept=mean(x, na.rm=T)), color="violet", linetype="dashed", size=1)

When I try to print the object R_hist, I get the following error:

Error: cannot allocate vector of size 4.1 Gb
In addition: Warning messages:
1: In seq.default(round_any(range[1], size, floor), round_any(range[2], :
  Reached total allocation of 4021Mb: see help(memory.size)

Could someone please explain why the histogram is not being plotted as expected?

Thanks.

Amm
  • Can you make your problem reproducible? – Roman Luštrik Jan 29 '14 at 14:56
  • You're trying to plot a bar for every value between `-218616099` and `55247883` in 0.5 increments... do you want 21 bars with a height indicated in `x`? ... FWIW, that is a vector of more than 500 million values, which winds up being too large to allocate. – Justin Jan 29 '14 at 14:57
  • @RomanLuštrik Reproducible in what sense? I tried using a different name for the graph object but still got the same error. – Amm Jan 29 '14 at 14:58
  • @Justin Thanks for your comment. Yes, indeed I want 21 bars with height indicated in x – Amm Jan 29 '14 at 14:59
  • Give us the data and the code you use to plot. Here are some tips on how to do that: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Jan 29 '14 at 15:00
  • @RomanLuštrik see my answer for a method to grab the data provided. – Justin Jan 29 '14 at 15:05
  • @Justin I wanted it to be pedagogical. :) – Roman Luštrik Jan 29 '14 at 16:54

2 Answers

As indicated in the comments, with binwidth=.5 you're asking for a histogram with a bin every 0.5 units from the min to the max value in df$x.

Instead, use geom_bar and stat='identity':

# grab the data provided
df <- read.table('clipboard')

# switch the names cause it'll bug me
df$y <- df$x
df$x <- row.names(df)

# plot using some identifier (row.names in this case)
ggplot(df, aes(x=x, y=y)) + geom_bar(stat='identity')
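
If an actual histogram of the values is wanted, the size of the failed allocation can be checked with a little arithmetic, and a much wider bin avoids it. The sketch below retypes the values from the question; the bin width of 2.5e7 is only an illustrative assumption, not a recommendation, and the plot is skipped if ggplot2 is not installed.

```r
# Values retyped from the question
x <- c(-28313937, -218616099, -18406124, 20307666, 31985283, 41429217,
       46488567, 47690792, 51127321, 53168291, 55247883, -49200409,
       33398814, 36198419, 42765257, 45857195, 43870899, 50557988,
       49574516, 52317786, 50769743)

# Number of 0.5-wide bins needed to span min(x) to max(x)
n_bins <- diff(range(x)) / 0.5
n_bins    # 547727964

# Approximate size of one double vector of that length, in GiB (8 bytes each)
gib <- n_bins * 8 / 1024^3
gib       # about 4.08, in line with the "4.1 Gb" in the error message

# A histogram with a bin width on the scale of the data (assumed value)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x = x), aes(x = x)) +
    geom_histogram(binwidth = 2.5e7, colour = "black", fill = "white")
  print(p)
}
```

With a bin width of 2.5e7 the same range needs only about a dozen bins instead of hundreds of millions.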
Roman Luštrik
Justin
  • How can I make a boxplot for this data in ggplot, excluding the negative values? `boxplot(df)` plots the entire data. – Amm Jan 30 '14 at 10:14
  • @Amm I strongly encourage you to read a few intro to R guides. Specifically you want to look into subsetting. However, in this instance, you would use `boxplot(df[df$x>0,])` – Justin Jan 30 '14 at 14:25
  • Thanks for the tip. I will look into subsetting. – Amm Jan 30 '14 at 14:28
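
The subsetting mentioned in the comments can be sketched as follows, assuming df still has its original single numeric column x (that is, before the renaming done in the answer). The values are retyped from the question, `drop = FALSE` keeps the result a data frame, and the ggplot2 part is skipped if the package is not installed.

```r
# Data frame rebuilt from the values in the question
df <- data.frame(x = c(-28313937, -218616099, -18406124, 20307666, 31985283,
                       41429217, 46488567, 47690792, 51127321, 53168291,
                       55247883, -49200409, 33398814, 36198419, 42765257,
                       45857195, 43870899, 50557988, 49574516, 52317786,
                       50769743))

# Keep only the rows with positive values; drop = FALSE keeps a data frame
df_pos <- df[df$x > 0, , drop = FALSE]
nrow(df_pos)   # 17 of the 21 rows remain

# Boxplot of the positive values with ggplot2 (single group, so x is a dummy)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df_pos, aes(x = "", y = x)) + geom_boxplot()
  print(p)
}
```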
R loads data frames fully into memory, so working with large datasets this way is usually inefficient; I'd recommend looking at DuckDB (it has an R connector).

DuckDB allows you to query large datasets in several formats (e.g., CSV, Parquet) using SQL, so you don't have memory issues. You can use it to compute the height of the histogram bins pretty efficiently, then use R to plot it (as opposed to loading the entire dataset into R).

Here's a snippet you can use in DuckDB to compute bins and bin heights, given a bin size:

select
  floor(COLUMN/BIN_SIZE)*BIN_SIZE,
  count(*) as count
from "path/to/file.parquet"
group by 1
order by 1;
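
To see what that query computes, here is the same binning arithmetic in base R on the small vector from the question. This only illustrates the logic of `floor(COLUMN/BIN_SIZE)*BIN_SIZE`; it is not a replacement for running the query in DuckDB on a large file. The bin size of 5e7 and the names `bin_start` and `bins` are assumptions made for the example.

```r
# Values retyped from the question
x <- c(-28313937, -218616099, -18406124, 20307666, 31985283, 41429217,
       46488567, 47690792, 51127321, 53168291, 55247883, -49200409,
       33398814, 36198419, 42765257, 45857195, 43870899, 50557988,
       49574516, 52317786, 50769743)

bin_size <- 5e7  # assumed bin size, chosen only for illustration

# Lower edge of the bin each value falls into (same formula as the SQL)
bin_start <- floor(x / bin_size) * bin_size

# Count rows per bin; table() returns the bins in ascending order
counts <- table(bin_start)
bins <- data.frame(bin_start = as.numeric(names(counts)),
                   count = as.vector(counts))
bins

# Plot the precomputed bin heights, if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bins, aes(x = bin_start, y = count)) + geom_col()
  print(p)
}
```

A result returned by the DuckDB query would have the same two-column shape, so the plotting step would stay the same.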
Eduardo