
I tried to compress data using Arrow.jl, but the test run with the code below didn’t show any size reduction (or compression). Is there something wrong with my implementation? Code:

using CSV, DataFrames, Arrow

df = CSV.read("input_data.csv", DataFrame)

function compress_data(data::DataFrame)
    io = Arrow.tobuffer(data)           # serialize to an in-memory Arrow buffer
    d = Arrow.Table(io; convert=false)  # read it back as an Arrow.Table
    Arrow.write("output_data.lz4", d; compress=:lz4)
end

compress_data(df)

Look forward to the suggestions. Thanks!

Mohammad Saad
  • Is your data compressible? – Oscar Smith Jul 29 '21 at 21:49
  • Only you know your data, so you have to think about why it might be compressible. Are values restricted to a small range? Is it a time series where subsequent values are close to or correlated with previous values? Is your data from measurements where the precision of your numbers far exceeds the accuracy of the measurements, so you can safely throw away least significant digits that are just noise? Only then can you try to rearrange, transform, or truncate your data in ways that facilitate compression by standard tools such as lz4, zlib, etc. – Mark Adler Aug 12 '21 at 17:13
  • Thanks @OscarSmith for the response, and apologies for the late reply! There was a mixture of datatypes involved, which I guess could be leading to inefficient compression. – Mohammad Saad Aug 31 '21 at 05:13
  • Thanks for the response @MarkAdler, I highly appreciate the insight and suggestion. I will surely try to examine the data using the suggested approach! – Mohammad Saad Aug 31 '21 at 05:16

1 Answer


The code looks fine; testing it with an input CSV containing all zero values, the compression ratio is high.
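As a quick sanity check, you can repeat that experiment yourself (a sketch, assuming Arrow.jl and DataFrames.jl are installed; the file names are arbitrary):

```julia
using Arrow, DataFrames

# Highly repetitive data (all zeros) should compress very well.
df = DataFrame(x = zeros(Float64, 100_000))

Arrow.write("plain.arrow", df)                      # no compression
Arrow.write("compressed.arrow", df; compress=:lz4)  # LZ4 frame compression

# The compressed file should be a small fraction of the plain one.
println(filesize("plain.arrow"), " vs ", filesize("compressed.arrow"))
```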

I suspect the issue here is floating-point data, and there are two potentially tricky things to keep in mind:

  1. When the floats fall within a small range, e.g. 0. < x < 1., we might expect potential for compression, but we will likely be disappointed: the byte patterns of floats don't lend themselves to common compression techniques.
  2. A text representation of a Float64 might truncate decimals and store much less than 8 bytes per value, so saving the binary representation can actually increase the file size.
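Both effects can be seen by comparing random floats against constant ones (a sketch; the column and file names are arbitrary):

```julia
using Arrow, DataFrames

# Random Float64s in (0, 1) have essentially incompressible byte patterns,
# while constant values compress almost entirely away.
rand_df  = DataFrame(x = rand(100_000))
const_df = DataFrame(x = fill(0.5, 100_000))

Arrow.write("rand.arrow",  rand_df;  compress=:lz4)
Arrow.write("const.arrow", const_df; compress=:lz4)

# rand.arrow stays near 8 bytes per value; const.arrow is far smaller.
println(filesize("rand.arrow"), " vs ", filesize("const.arrow"))
```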

Compression techniques for floats do exist, e.g. Blosc, but results are likely to be disappointing unless you are lucky with your data. Lossy compression techniques, e.g. zfp, can achieve high compression ratios. You can find more information on the topic here on SO: Compressing floating point data
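If the accuracy of your measurements allows it (as Mark Adler suggests in the comments), a simple lossy option is to round the values before compression; a sketch, where digits=3 is an arbitrary choice you would tune to your data:

```julia
using Arrow, DataFrames

# Rounding to 3 decimal digits leaves only ~1001 distinct 8-byte
# patterns, which a general-purpose codec like zstd can exploit.
noisy   = DataFrame(x = rand(100_000))
rounded = DataFrame(x = round.(noisy.x; digits=3))

Arrow.write("noisy.arrow",   noisy;   compress=:zstd)
Arrow.write("rounded.arrow", rounded; compress=:zstd)

println(filesize("noisy.arrow"), " vs ", filesize("rounded.arrow"))
```

Note this throws information away permanently, so it is only appropriate when the discarded digits are below the noise floor of your measurements.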

Mikael Öhman