
I have a data set that is 4GB compressed and more than 20GB uncompressed.

The file can be downloaded here.

I have tried several ways to load it, but none of them have worked. There are similar questions on Stack Overflow (question1, question2).

I tried what they suggest and ran into the same problems as the questioners.

I have also tried manually changing the file's extension from .rar to .gz and reading it in a couple of ways, taking only a few rows, but it doesn't work:

Code:

# First attempt: base-R readers on a gzfile() connection
data <- read.table(gzfile("./data_in/song_log.gz"), header = FALSE, sep = ",", nrows = 10)
data <- read.csv(gzfile("./data_in/song_log.gz"), header = FALSE, sep = ",", nrows = 10)
data <- read.csv2(gzfile("./data_in/song_log.gz"), header = FALSE, sep = ",", nrows = 10)


# Trying with the "ff" package

library("ff")
data <- ff::read.csv.ffdf(gzfile("./data_in/song_log.gz"),header = F,sep=",",nrow=10)
Error in read.table.ffdf(FUN = "read.csv", ...) : 
  only ffdf objects can be used for appending (and skipping the first.row chunk)
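
From the docs, read.csv.ffdf's first positional argument is x, an ffdf object to append to, which seems to explain that error. But even passing the connection by name, the renamed file is still RAR data rather than real gzip, so it fails too:

data <- ff::read.csv.ffdf(file = gzfile("./data_in/song_log.gz"),
                          header = FALSE, sep = ",", nrows = 10)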

Any suggestions for this case?

Thanks in advance

Henry Navarro

1 Answer

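Renaming the file doesn't change its contents: gzfile() still sees RAR data, which is why those reads fail. The archive package can open RAR archives directly and hand you a regular R connection: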
devtools::install_github("jimhester/archive") # mind the install guidelines at https://github.com/jimhester/archive/blob/master/configure#L64-L72
library(archive)

con <- archive_read("~/Data/song_log.rar")

readLines(con, 3)
## [1] "hora;userId;songId;generoId;deviceId;trendingSong" "18-12-2016 00:00:25;27103;231990117;23;1_27103;0" 
## [3] "18-12-2016 00:02:00;74637;241781021;24;1_74637;0" 

From there you can use anything that accepts an R connection object.
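
For example, to pull the first few rows into a data frame (a minimal sketch; the sep = ";" and the header row come from the sample output above, so adjust if your file differs):

library(archive)

con <- archive_read("~/Data/song_log.rar")
data <- read.csv(con, sep = ";", nrows = 10)  # peek at the first 10 rows only
head(data)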

I'm not about to read in 20GB for this example, but those lines worked, and for data this size I'd suggest using Apache Drill with the sergeant package and converting the CSV to Parquet.
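
Something like this could do the conversion (a hedged sketch, not tested here: it assumes a Drill instance running on localhost with the default dfs.tmp workspace enabled, and the CSV path is illustrative):

library(sergeant)

dc <- drill_connection("localhost")

# CTAS writes Parquet by default (store.format = 'parquet'); the table()
# function parameters tell Drill the source is semicolon-delimited text
# with a header row.
drill_query(dc, "
CREATE TABLE dfs.tmp.`song_log_parquet` AS
SELECT *
FROM table(dfs.`/home/user/Data/song_log.csv`(
  type => 'text', fieldDelimiter => ';', extractHeader => true))
")

After that the Parquet copy can be queried from R without loading all 20GB into memory.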

hrbrmstr
  • I don't know what happened, but I tried to execute this code and immediately got the bomb that says "R Session Aborted" :-( – Henry Navarro Dec 20 '17 at 20:14