Warnings of "NAs introduced by coercion" in fread function

Question

I am trying to use `fread()` to read a table with 2 columns (x, y) and ~300 million rows (62 GB), and then plot x against y in a scatter plot. `fread()` works fine if I only read a small portion of the data, such as 30000 rows.

But if I run it on the whole data set, I got:

Warning message:
In setattr(ans, "row.names", .set_row_names(nr)) :
  NAs introduced by coercion to integer range
/var/spool/torque/mom_priv/jobs/11244921.cri16sc001.SC: line 14: 70765 Killed Rscript 10_plotZ0Z1.R

What could be the reason?

Well, regardless of the `fread()` error, if you try to plot that without using something like a scatterplot-smooth chart, it's likely going to fail miserably. That's far too many points for even base R graphics (it's too many in general, really). How much system RAM do you have? You will need a substantial buffer above 62GB to plot the entirety of the data even with base R graphics and especially if you try it with ggplot2. — hrbrmstr, Nov 28 '18 at 20:49
@hrbrmstr I am running this on a cluster and currently have 65 GB of memory. How do I do a scatterplot smooth? — Phoenix Mu, Nov 29 '18 at 15:58
That is not enough memory. https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/scatter.smooth & https://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html (you should really consider sampling) — hrbrmstr, Nov 29 '18 at 20:24
Thanks, so I am actually using `readLines(con, n=1)` to read the table line by line and use `points()` to plot it point by point. It seems to be working now. — Phoenix Mu, Dec 01 '18 at 00:05
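
The "sample, then smooth" suggestion from the comments above could look roughly like the sketch below. It is only an illustration: the data are simulated stand-ins for the real file, and the sample size of 10^4 and the plotting parameters are arbitrary assumptions. Only base R functions (`sample.int()`, `lowess()`, `plot()`, `lines()`) are used.

```r
# Simulated stand-in for the real data (assumption: two numeric columns x, y)
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Work on a random subset instead of plotting every point
idx <- sample.int(n, 1e4)

# Locally weighted smoother from the stats package, fitted on the subset
fit <- lowess(x[idx], y[idx])

# Draw the sampled points as small dots and overlay the smoothed curve
plot(x[idx], y[idx], pch = ".", col = "grey40", xlab = "x", ylab = "y")
lines(fit, col = "red", lwd = 2)
```

With a sample this small, memory use stays far below what plotting the full 62 GB table would need.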

score 2 · Answer 1 · answered Nov 30 '18 at 21:51

You could sample your big file, as already suggested in the comments. Unfortunately, `fread` does not yet seem to have such a feature; see this open issue (upvoting it could motivate the developers to work on it). But as mentioned here, if you are on Linux, you can try the `shuf -n` shell command:

library(data.table)

# Generate some random data
dt <- data.table(x = rnorm(10^6), y = rnorm(10^6))
# write to csv file
fwrite(dt, "test-dt.csv")

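# Read a random sample of 10^5 rows; drop the header line first so that
# shuf cannot shuffle it into the data rows
dt2 <- fread(cmd = "tail -n +2 test-dt.csv | shuf -n 100000",
             header = FALSE, col.names = c("x", "y"))
# Plot the sample (dt2), not the full table
dt2[, plot(x, y)]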

Alternatively, you could read blocks of rows from your file with multiple calls to `fread`, as shown here.
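
A minimal sketch of that block-wise approach, using the `skip` and `nrows` arguments of `fread()`. The temporary file, the block size, and the column names are assumptions made for illustration and are not taken from the linked post:

```r
library(data.table)

# Small stand-in file (assumption: a header line plus two numeric columns)
path <- tempfile(fileext = ".csv")
fwrite(data.table(x = rnorm(10^4), y = rnorm(10^4)), path)

block_size <- 3000L   # rows per fread() call; hypothetical value
skip <- 1L            # start after the header line
rows_read <- 0L

repeat {
  # Once `skip` points past the end of the file, fread() may raise an error
  # or return no rows; treat either case as "no more data"
  block <- tryCatch(
    fread(path, skip = skip, nrows = block_size, header = FALSE,
          col.names = c("x", "y")),
    error = function(e) NULL
  )
  if (is.null(block) || nrow(block) == 0L) break

  # ... process the block here, e.g. points(block$x, block$y, pch = ".") ...
  rows_read <- rows_read + nrow(block)

  if (nrow(block) < block_size) break   # short block: reached the end
  skip <- skip + nrow(block)
}
```

Each call has to locate line `skip` again from the start of the file, so this is likely to get slower as `skip` grows. For a file of this size, a larger block size is probably the better trade-off, as long as one block fits comfortably in memory.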

Thank you! I think the NAs warning occurs because I have a header line that was also read into the table, but I did not notice it at first. — Phoenix Mu, Dec 01 '18 at 00:04
