14

I'm trying to write a data frame to a gzip file but am having problems.

Here's my code example:

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))

gz1 <- gzfile("df1.gz","w" )
writeLines(df1)

Error in writeLines(df1) : invalid 'text' argument

Any suggestions?

EDIT: an example line of the character vector I'm trying to write is:

0 | var1:1.5 var2:.55 var7:1250

The class label (y-variable) is separated from the x-variables by " | ", each variable name is separated from its value by ":", and variables are separated from one another by spaces.

EDIT2: I apologize for the wording / format of the question, but here are the results.

Old method:

system.time(write(out1, file="out1.txt"))
#    user  system elapsed 
#   9.772  17.205  86.860 

New Method:

writeGzFile <- function() {
  gz1 <- gzfile("df1.gz", "w")
  write(out1, gz1)
  close(gz1)
}

system.time( writeGzFile())
#    user  system elapsed 
#   2.312   0.000   2.478 
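For anyone who wants to reproduce this comparison, here is a minimal sketch. The question never shows `out1`, so the synthetic character vector below is an assumption, as are the temporary file paths. Timings depend on whether the CPU or the disk is the bottleneck, so no expected numbers are given.

```r
# Synthetic stand-in for `out1` (assumption): VW-style lines as a character vector.
set.seed(1)
n <- 1e4
out1 <- paste0(sample(0:1, n, replace = TRUE),
               " | var1:", runif(n), " var2:", runif(n))

txt_path <- tempfile(fileext = ".txt")
gz_path  <- tempfile(fileext = ".gz")

# Time a plain-text write.
t_txt <- system.time(writeLines(out1, txt_path))

# Time a gzip-compressed write through a connection.
t_gz <- system.time({
  con <- gzfile(gz_path, "w")
  writeLines(out1, con)
  close(con)
})

cat("txt elapsed:", t_txt[["elapsed"]], "s, size:", file.size(txt_path), "bytes\n")
cat("gz  elapsed:", t_gz[["elapsed"]], "s, size:", file.size(gz_path), "bytes\n")
```

Both files hold the same lines; `readLines()` decompresses the gzip file transparently when reading it back.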

Thank you all very much for helping me figure this out.

MichaelChirico
screechOwl
  • As is often asked on Rhelp: "What problem are you trying to solve". – IRTFM Jan 08 '13 at 23:12
  • Hint: the answer to @DWin's comment is not "How do I write a data frame to a gzip file?" – Spacedman Jan 08 '13 at 23:14
  • The longer question would be "Is it faster to write a .txt file or a .gz file from R?" – screechOwl Jan 08 '13 at 23:16
  • That depends on how long your piece of string is. In computer terms, whether your CPU or I/O is the bottleneck. Writing a big file to a fast disk is quicker than computing a compressed form on a slow CPU. – Spacedman Jan 08 '13 at 23:17
  • I was hoping to get an answer to the question "what purpose might there be in processing the R data object in a manner other than achieved by `save`"? Do you need it to be read by a program other than R? – IRTFM Jan 08 '13 at 23:25
  • Yes. Please see comment stream in Spacedman's answer. – screechOwl Jan 08 '13 at 23:28
  • The examples in `?readRDS` helped me understand the compression and serialization that R does in `readRDS` and `saveRDS`. – geneorama Dec 02 '19 at 20:54

6 Answers

26

writeLines expects a character vector, not a data frame. The simplest way to write the data frame to a gzip file would be

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))
gz1 <- gzfile("df1.gz", "w")
write.csv(df1, gz1)
close(gz1)

This will write it as a gzipped csv. Also see write.table and write.csv2 for alternate ways of writing the file out.
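As a quick check that the gzipped CSV is usable, the sketch below writes the data frame and reads it back. The output path in `tempdir()` and the `row.names = FALSE` choice are assumptions made for illustration, not part of the answer above.

```r
df1 <- data.frame(id = seq(1, 10, 1), var1 = runif(10), var2 = runif(10))

gz_path <- file.path(tempdir(), "df1.csv.gz")  # hypothetical output path

# Write the data frame as gzipped CSV through a connection.
# row.names = FALSE avoids an extra unnamed index column in the file.
con <- gzfile(gz_path, "w")
write.csv(df1, con, row.names = FALSE)
close(con)

# read.csv() detects the gzip compression when given the file path.
df1_back <- read.csv(gz_path)
head(df1_back)
```

Swapping `write.csv` for `write.table(df1, con, sep = "\t", row.names = FALSE)` would produce a gzipped tab-separated file in the same way.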

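EDIT: Based on the updates to the post about the desired format, I made the following helper (quickly thrown together, so it can probably be simplified a lot):

# Format each row as "<row index> | name : value name : value ..."
# Note that the row index stands in for the class label here.
myser <- function(df) {
    rowCount <- nrow(df)
    dfNames <- names(df)
    dfNamesCount <- length(dfNames)
    sapply(seq_len(rowCount), function(rowIndex) {
        paste(rowIndex, '|',
            paste(sapply(seq_len(dfNamesCount), function(element) {
                # name, separator, and value for one column of this row
                c(dfNames[element], ':', df[rowIndex, element])
            }), collapse=' ')
        )
    })
}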

So the output looks like

a <- data.frame(x=1:10,y=rnorm(10))
writeLines(myser(a))
# 1 | x : 1 y : -0.231340933021948
# 2 | x : 2 y : 0.896777389870928
# 3 | x : 3 y : -0.434875004781075
# 4 | x : 4 y : -0.0269824962632977
# 5 | x : 5 y : 0.67654540494899
# 6 | x : 6 y : -1.96965253674725
# 7 | x : 7 y : 0.0863177759402661
# 8 | x : 8 y : -0.130116466571162
# 9 | x : 9 y : 0.418337557610229
# 10 | x : 10 y : -1.22890714891874

All that remains is to pass the gzfile connection to writeLines as its second argument to get the desired output.
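Putting the pieces together might look like the sketch below. It uses a vectorized formatter rather than the row-by-row helper above. `to_vw_lines`, its `label` argument, the sample data, and the output path are all hypothetical names introduced for illustration. It assumes one label column and treats every other column as a feature, with no spaces around the colon, as in the example line from the question.

```r
# Hypothetical helper: format each row as "<label> | name:value name:value ...".
to_vw_lines <- function(df, label) {
  features <- setdiff(names(df), label)
  # One "name:value" character vector per feature column, vectorized over rows.
  pairs <- lapply(features, function(nm) paste0(nm, ":", df[[nm]]))
  # Join the feature strings with spaces, then prepend "<label> |".
  paste(df[[label]], "|", do.call(paste, pairs))
}

a <- data.frame(y = c(0, 1, 0), var1 = c(1.5, 2, 0.25), var2 = c(0.55, 3, 4))
vw_lines <- to_vw_lines(a, label = "y")

gz_path <- file.path(tempdir(), "a.vw.gz")  # hypothetical output path

# Pass the gzfile connection as the second argument of writeLines().
con <- gzfile(gz_path, "w")
writeLines(vw_lines, con)
close(con)

readLines(gz_path)[1]
# "0 | var1:1.5 var2:0.55"
```

Building whole columns of strings at once avoids indexing the data frame one cell at a time, which tends to matter at a few hundred thousand rows.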

MichaelChirico
user295691
  • For people using VW, see also this answer for faster options than `writeLines`: http://stackoverflow.com/a/41215573/3576984 – MichaelChirico Feb 05 '17 at 20:20
5

To write something to a gzip file you need to "serialize" it to text. For R objects you can have a stab at that by using dput:

gz1 = gzfile("df1.gz","w")
dput(df1, gz1)
close(gz1)

However you've just written a text representation of the data frame to the file. This will quite probably be less efficient than using save(df1,file="df1.RData") to save it to a native R data file. Ask yourself: why am I saving it as a .gz file?

In a quick test with some random numbers, the gz file was 54k, the .RData file was 34k
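A size comparison along these lines can be rerun with the sketch below. The data, seed, and temporary paths are assumptions, and the sizes will vary with the data, so no expected values are shown.

```r
set.seed(42)
df1 <- data.frame(id = 1:1000, var1 = runif(1000), var2 = runif(1000))

gz_path    <- tempfile(fileext = ".gz")
rdata_path <- tempfile(fileext = ".RData")

# Text representation via dput(), compressed with gzip.
con <- gzfile(gz_path, "w")
dput(df1, con)
close(con)

# Native R serialization (compressed by default).
save(df1, file = rdata_path)

# Sizes in bytes, in the order given.
file.size(gz_path, rdata_path)

# dget() parses the text representation back into an R object.
df1_back <- dget(gz_path)
```

Because `dput` writes numbers as decimal text, the round trip through `dget` is only accurate to the printed precision, whereas `save` and `load` restore the object exactly.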

MichaelChirico
Spacedman
  • Thank you. The reason I'm writing to .gz is that the output is an input file for another program that reads .gz files. In other words it's leaving the R ecosystem. Otherwise I'd use .RData. – screechOwl Jan 08 '13 at 23:14
  • So just gzip the .RData file? No, that won't work, because gzip is a compression that tells you nothing about the format of the data in the file when uncompressed. Is it a gzipped CSV file, a gzipped NetCDF file, a gzipped RData file? You haven't told us. – Spacedman Jan 08 '13 at 23:16
  • Sorry, I'm using it as an input file for a program called vowpal wabbit. It has some weird delimiting using '|', ':' and ' '. – screechOwl Jan 08 '13 at 23:18
  • We're getting closer to the real question. Want to edit yours to say more of what it is you are wanting to do? It seems the other answer (write.csv) could be better. But that's guesswork. – Spacedman Jan 08 '13 at 23:20
  • I'm currently using 'write(df1, file = "df1.txt")', but it's taking a long time to run (it's ~200k rows). I was curious whether using .gz would be faster, but couldn't get R to write a .gz file, which is the reason for the question. – screechOwl Jan 08 '13 at 23:36
5

Another very simple way to do it is:

# Create the .csv file
write.csv(df1, "df1.csv")

# Compress it; gzip replaces df1.csv with df1.csv.gz
# (requires the gzip command-line tool on the PATH)
system("gzip df1.csv")

Got the idea from: http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html
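On systems without a `gzip` command-line tool (a stock Windows install, for example), the same effect can be had with base R connections alone. The helper below is a sketch. `gzip_file` and its arguments are hypothetical names, not an existing function.

```r
# Hypothetical helper: gzip an existing file using only base R connections.
gzip_file <- function(src, dest = paste0(src, ".gz"), remove = TRUE) {
  input  <- file(src, "rb")
  output <- gzfile(dest, "wb")
  tryCatch({
    repeat {
      # Copy the file in 1 MB chunks so large files need not fit in memory.
      chunk <- readBin(input, what = "raw", n = 1e6)
      if (length(chunk) == 0) break
      writeBin(chunk, output)
    }
  }, finally = {
    close(input)
    close(output)
  })
  # Mimic the command-line tool, which deletes the original after compressing.
  if (remove) file.remove(src)
  invisible(dest)
}

df1 <- data.frame(id = seq(1, 10, 1), var1 = runif(10), var2 = runif(10))
csv_path <- file.path(tempdir(), "df1.csv")  # hypothetical path
write.csv(df1, csv_path, row.names = FALSE)

gz_path <- gzip_file(csv_path)
```

Unlike writing straight to a `gzfile` connection, this route still writes the uncompressed file to disk first, so it mainly helps when the CSV already exists.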

Gorka
1

You can use the gzip function in R.utils:

library(R.utils)
library(data.table)

# Write gzip file
df <- data.table(var1='Compress me',var2=', please!')
fwrite(df,'filename.csv',sep=',')
gzip('filename.csv',destname='filename.csv.gz')

# Read gzip file (cmd= runs the shell command and reads its output)
fread(cmd = 'gzip -dc filename.csv.gz')
#           var1      var2
# 1: Compress me , please!
user3055034
1

For tidyverse methods, adding the compression extension to the file name is enough to trigger the compression. From https://readr.tidyverse.org/reference/write_delim.html:

The write_*() functions will automatically compress outputs if an appropriate extension is given. At present, three extensions are supported, .gz for gzip compression, .bz2 for bzip2 compression and .xz for lzma compression.

library(tidyverse)
# tibble() comes with the tidyverse; data.table() would need library(data.table)
df <- tibble(var1='Compress me',var2=', please!')
write_csv(df, "filename.csv.gz")
jameshowison
1

This works out of the box with data.table's fwrite function, which infers gzip compression from the .gz extension:

df1 <- data.frame(id = seq(1,10,1), var1 = runif(10), var2 = runif(10))
data.table::fwrite(df1, file = "df1.csv.gz")
fc9.30