Reading Column Names and Column Values for Extremely Large File R

Question

I need to load a CSV file in RStudio 3 (on Mac) which is ~11GB; this makes it impossible to load using regular commands. I need a subset of this data based on certain column values which should make it manageable. How can I:

get the column names without loading the entire file as a dataframe
read a single column from this file as a list

Thank you.

Do you know how many columns the data have? And the type of data, character, numeric, what? — Rui Barradas, Oct 31 '21 at 05:15
See function `read.csv.sql` in package `sqldf` to read just one column. — Rui Barradas, Oct 31 '21 at 05:31
Check out `vroom` or `data.table` packages mentioned in my answer https://stackoverflow.com/a/48105838/786542. I would read only the first line to parse column name, then read selected columns. — Tung, Oct 31 '21 at 05:46

score 2 · Accepted Answer · answered Oct 31 '21 at 05:43

Here is a way.

1. Column names

The column names are read with reaLines, setting n = 1, in order to read just the columns headers line. Then scan with sep = "," will break the line into column names.

library(sqldf)

col_names <- readLines(tmpfile, n = 1)
col_names
#[1] "mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb"

tc <- textConnection(col_names)
col_names <- scan(tc, sep = ",", what = character())
close(tc)

2. Data

The 4th column is "hp. Read only that one with read.csv.sql. The SQL statement is put together with sprintf.

col_names[4]
#[1] "hp"

SQL <- sprintf("select %s from file", col_names[4])
SQL
#[1] "select hp from file"

hp <- read.csv.sql(tmpfile, sql = SQL)
str(hp)
#'data.frame':  6718464 obs. of  1 variable:
#  $ hp: int  110 110 93 110 175 105 245 62 95 123 ...

would this also work for https://stackoverflow.com/questions/69663085/extract-data-from-large-xml-using-r ? — Donald Seinen, Oct 31 '21 at 05:47

Reading Column Names and Column Values for Extremely Large File R

1 Answers1

1. Column names

2. Data