I am trying to use fread() to read a table with 2 columns (x, y) and ~3 00 million rows (62 GB), and then plot x against y in a scatter plot. It works fine if I only use a small portion of the data, like 30000 rows.

But if I run it on the whole data set, I get:

Warning message:
In setattr(ans, "row.names", .set_row_names(nr)) :
  NAs introduced by coercion to integer range
/var/spool/torque/mom_priv/jobs/11244921.cri16sc001.SC: line 14: 70765 Killed Rscript 10_plotZ0Z1.R

What could be the reason?

Phoenix Mu
  • Well, regardless of the `fread()` error, if you try to plot that without using something like a scatterplot-smooth chart, it's likely going to fail miserably. That's far too many points for even base R graphics (it's too many in general, really). How much system RAM do you have? You will need a substantial buffer above 62GB to plot the entirety of the data even with base R graphics and especially if you try it with ggplot2. – hrbrmstr Nov 28 '18 at 20:49
  • @hrbrmstr I am using a cluster and currently have 65 GB of memory. How do I do a scatterplot smooth? – Phoenix Mu Nov 29 '18 at 15:58
  • That is not enough memory. https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/scatter.smooth & https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html (you should really consider sampling) – hrbrmstr Nov 29 '18 at 20:24
  • Thanks, so I am actually using `readLines(con, n=1)` to read the table line by line and use `points()` to plot it point by point. It seems to be working now. – Phoenix Mu Dec 01 '18 at 00:05
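
The smoothing idea from the comments can be sketched roughly as follows. This is only an illustration on simulated data (the variables `n`, `x`, `y` and `fit` are made up here and are not the asker's file); it uses `lowess()` from the stats package, one of the two functions linked above, and draws the fitted curve over a modest number of points.

```r
set.seed(1)

# Simulated stand-in for a sample of the real data (assumption)
n <- 5000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# lowess() returns the smoothed curve as a list with components x and y
fit <- lowess(x, y)

# Draw the points with a small plotting symbol, then overlay the smooth curve
plot(x, y, pch = ".", col = "grey40")
lines(fit, col = "red", lwd = 2)
```

On a full-size table this would presumably be run on a random sample of rows rather than on every point, in line with the sampling advice above.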

1 Answer

You could sample your big file, as already suggested in the comments. Unfortunately, fread does not seem to have such a feature implemented yet; see this open issue (upvoting the feature could motivate the developers to work on it). But as mentioned here, if you are on Linux, you can try the `shuf -n` shell command:

library(data.table)

# Generate some random data
dt <- data.table(x = rnorm(10^6), y = rnorm(10^6))
# write to csv file
fwrite(dt, "test-dt.csv")

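# Read a random sample of 10^5 rows; tail -n +2 drops the header line
# so that shuf cannot shuffle it in among the data rows
dt2 <- fread(cmd = "tail -n +2 test-dt.csv | shuf -n 100000",
             col.names = c("x", "y"))
# Plot the sample, not the full table
dt2[, plot(x, y)]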

Alternatively, you could read blocks of rows from your file with multiple calls to fread, as shown here.
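
A minimal sketch of that block-wise approach, assuming a small demo file in place of the real one: `csv_path`, `chunk_size`, `n_rows` and `rows_read` are names invented for this example, and the axis limits are guesses that suit the simulated data. It relies only on the `skip`, `nrows`, `header` and `col.names` arguments of fread.

```r
library(data.table)

# Hypothetical demo file; it stands in for the real 62 GB table
csv_path <- tempfile(fileext = ".csv")
fwrite(data.table(x = rnorm(1000), y = rnorm(1000)), csv_path)

chunk_size <- 250L   # rows per fread() call (assumed; tune to available memory)
n_rows <- 1000L      # known here; for a real file it could come from `wc -l`

# Empty plot first, then add each block of points as it is read
plot(NULL, xlim = c(-4, 4), ylim = c(-4, 4), xlab = "x", ylab = "y")
rows_read <- 0L
for (start in seq(0L, n_rows - 1L, by = chunk_size)) {
  # skip = start + 1 jumps over the header line plus the rows already read
  chunk <- fread(csv_path, skip = start + 1L, nrows = chunk_size,
                 header = FALSE, col.names = c("x", "y"))
  points(chunk$x, chunk$y, pch = ".")
  rows_read <- rows_read + nrow(chunk)
}
```

Only one block is held in memory at a time, but each call still has to pass over the skipped lines, so later blocks are likely to take longer to reach on a very large file.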

Valentin_Ștefan
  • Thank you! I think the NAs error occurs because I have a header line that was also read into the table, but I did not notice it at first. – Phoenix Mu Dec 01 '18 at 00:04