
I have code which uses write.csv to save a large number of files in bzip2 format. Here's a small reproducible example:

df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))
write.csv(df, file = bzfile('df.csv.bzip2'))

I want to speed up the code. I know data.table::fwrite is much faster than write.csv, but I don't know how to get fwrite to save to csv.bzip2. I optimistically tried the code below, but the output doesn't appear to be compressed: the file is 5.4 MB, versus 2.5 MB for the write.csv version saved above.

data.table::fwrite(df, 'df2.csv.bzip2') 

Can anyone advise if it's possible to use fwrite to save a compressed csv in bzip2 format? If not, can anyone advise on an alternative way to save a csv via fwrite and then convert to bzip2 format? E.g. something like the below. It's not essential to do the compression within fwrite, I just want to use fwrite to speed up the saving process and for the end product to be a properly-compressed csv.bzip2 file.

data.table::fwrite(df, 'df2.csv') #saves a normal csv
# (add code here which converts the output of fwrite to a properly-compressed csv.bzip2 file)

NB I'm aware I can save as gzip through fwrite, but I want the file to be in bzip2 format.

jruf003
  • Not supported with `fwrite()` afaik. You can use `readr::write_csv()` with file extension `.bz2` for bzip2 compression. – Ritchie Sacramento Jan 31 '23 at 07:51
  • Thanks @RuiBarradas! That's worked -- if you want to post as an answer I'll accept. – jruf003 Jan 31 '23 at 08:06
  • @jruf003 - Not to state the obvious, but giving it a `bz2` file extension and using `gzip` compression doesn't mean the file is in `bz2` format, it's a `gzip` file with the wrong extension. – Ritchie Sacramento Jan 31 '23 at 08:12
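
The point in the last comment can be checked directly, because the compression format is recorded in the first bytes of the file rather than in its name. Below is a minimal sketch in base R, assuming the standard magic numbers (`BZh` for bzip2, `0x1f 0x8b` for gzip); `detect_compression` is a hypothetical helper written for this illustration, not part of any package.

```r
# Hypothetical helper: guess the compression format from the leading bytes.
detect_compression <- function(path) {
  con <- file(path, open = "rb")       # binary mode, so bytes are read as stored
  on.exit(close(con), add = TRUE)
  magic <- readBin(con, what = "raw", n = 3L)
  if (length(magic) == 3L && identical(magic, charToRaw("BZh"))) {
    "bzip2"
  } else if (length(magic) >= 2L &&
             identical(magic[1:2], as.raw(c(0x1f, 0x8b)))) {
    "gzip"
  } else {
    "unknown"
  }
}

tmp_bz <- tempfile(fileext = ".csv.bz2")
tmp_gz <- tempfile(fileext = ".csv.bz2")   # deliberately misleading extension

write.csv(head(iris), file = bzfile(tmp_bz))   # real bzip2
write.csv(head(iris), file = gzfile(tmp_gz))   # gzip despite the .bz2 name

detect_compression(tmp_bz)
#> [1] "bzip2"
detect_compression(tmp_gz)
#> [1] "gzip"
```

Running a check like this on the output of any candidate approach shows whether the file really is bzip2, whatever extension it carries.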

2 Answers


If gzip is an acceptable substitute for bzip2, just set the argument compress = "gzip".

data.table::fwrite(iris, '~/Temp/df2.gz')
file.size('~/Temp/df2.gz')
#> [1] 3867

data.table::fwrite(iris, '~/Temp/df2.gz', compress = 'gzip')
file.size('~/Temp/df2.gz')
#> [1] 874

Created on 2023-01-31 with reprex v2.0.2

Rui Barradas
  • @KonradRudolph You're right, thanks. Sorry for the wrong file extension, I copied&pasted from the question. As for the file format, the OP seems to have solved the problem with gzip and asked me to post my (also wrong) comment as an answer. – Rui Barradas Jan 31 '23 at 08:31
  • Apologies, I know I said I'd accept your answer. However, I didn't realise that your suggestion wasn't identical to compressing via bzip2. This has since been pointed out. – jruf003 Jan 31 '23 at 23:15
2

You can use R.utils::bzip2 to compress the file afterwards.

df <- data.frame(A = rnorm(100000), B = rnorm(100000), C = rnorm(100000))

system.time(write.csv(df, file = bzfile("df.csv.bz2")))
#       user      system     elapsed 
#      0.912       0.005       0.917 

system.time({data.table::fwrite(df, "df2.csv"); R.utils::bzip2("df2.csv")})
#       user      system     elapsed 
#      0.487       0.011       0.473 

system.time(readr::write_csv(df, "df3.csv.bz2")) #Comment from @Ritchie Sacramento
#       user      system     elapsed 
#      0.743       0.042       0.988 

file.size("df.csv.bz2")
#[1] 2511607

file.size("df2.csv.bz2")
#[1] 2232901

file.size("df3.csv.bz2")
#[1] 2431997
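
If adding R.utils as a dependency is undesirable, the same write-then-compress idea can be sketched with base R connections alone. This is only an illustration under stated assumptions: `compress_bzip2` and its `chunk_size` argument are hypothetical names invented here, and the fallback to write.csv exists only so the example runs when data.table is not installed. The destination name is a free choice, so the `.csv.bzip2` naming from the question works as well.

```r
# Hypothetical helper (not part of data.table or R.utils): stream `src` into
# a bzip2-compressed copy at `dest`, reading `chunk_size` bytes at a time.
compress_bzip2 <- function(src, dest = paste0(src, ".bz2"), chunk_size = 1e6) {
  input <- file(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- bzfile(dest, open = "wb")
  on.exit(close(output), add = TRUE)
  repeat {
    chunk <- readBin(input, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break     # end of input reached
    writeBin(chunk, output)
  }
  invisible(dest)
}

df <- data.frame(A = rnorm(1000), B = rnorm(1000), C = rnorm(1000))
csv_path <- tempfile(fileext = ".csv")

# Fast uncompressed write; falls back to base R if data.table is unavailable.
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fwrite(df, csv_path)
} else {
  write.csv(df, csv_path, row.names = FALSE)
}

# Compress to a file with the .csv.bzip2 naming, then drop the plain CSV.
bz_path <- compress_bzip2(csv_path, dest = sub("\\.csv$", ".csv.bzip2", csv_path))
unlink(csv_path)

# read.csv detects the compression from the file contents, not the extension.
round_trip <- read.csv(bz_path)
dim(round_trip)
```

Reading in fixed-size chunks keeps memory use bounded for large files; whether this is faster than R.utils::bzip2 in practice is untested here.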
GKi
  • Thanks @GKi, very helpful! I'm trying to write to .bzip2 rather than .bz2... but only because this is the format of the existing code. Do you know if the two extensions are equivalent? And if not, is there any reason I couldn't write to .bzip2 by specifying ext = 'bzip2' in R.utils::bzip2() (which I see is an option)? This looks to work but I'm no expert. Thanks again – jruf003 Jan 31 '23 at 23:09
  • At least the [Wikipedia article on bzip2](https://en.wikipedia.org/wiki/Bzip2) says that `.bz2` is the filename extension of a file compressed with bzip2. – GKi Feb 01 '23 at 07:30