Questions tagged [snappy]

Snappy is a compression algorithm for byte streams and a library implementing this algorithm. The standard distribution includes bindings for C and C++; there are third-party bindings for many other languages.

Snappy does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Snappy is widely used inside Google, in everything from BigTable and MapReduce to internal RPC systems.
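For example, a round trip with the third-party python-snappy bindings looks like this (a minimal sketch, assuming the `python-snappy` package and the native libsnappy are installed):

    import snappy

    data = b"Snappy trades compression ratio for speed." * 1000

    compressed = snappy.compress(data)        # raw snappy block format
    restored = snappy.decompress(compressed)  # raises on corrupt input

    assert restored == data
    print(len(data), "->", len(compressed), "bytes")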

366 questions
95 votes • 6 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet to be better in…
Rahul • 2,354 • 3 • 21 • 30
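A minimal PySpark sketch of the kind of comparison the question describes (the table name and output paths are illustrative, and a running SparkSession is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-comparison").getOrCreate()
    df = spark.read.table("source_table")  # hypothetical Hive table

    # Same data, three on-disk variants to compare for size and scan speed
    df.write.mode("overwrite").parquet("/tmp/bench/parquet_snappy")  # Parquet (snappy is the default codec in recent Spark)
    df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/bench/orc_default")
    df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/bench/orc_snappy")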
42 votes • 7 answers

Methods for writing Parquet files using Python?

I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it. Thus far the only method I have found is using Spark with the…
octagonC • 635 • 1 • 6 • 11
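One way to do this purely from Python is via pyarrow (a sketch, assuming `pyarrow` is installed; the file name is illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    pq.write_table(table, "example.snappy.parquet", compression="snappy")

    print(pq.read_table("example.snappy.parquet").to_pandas())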
40 votes • 5 answers

Spark SQL - difference between gzip vs snappy vs lzo compression formats

I am trying to use Spark SQL to write a parquet file. By default Spark SQL supports gzip, but it also supports other compression formats like snappy and lzo. What is the difference between these compression formats?
Shankar • 8,529 • 26 • 90 • 159
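For reference, a hedged sketch of how the Parquet codec is usually selected in Spark SQL (the session and output paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Session-wide default for Parquet output
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")  # or "gzip", "lzo", "uncompressed"

    df = spark.range(1_000_000)
    df.write.mode("overwrite").parquet("/tmp/range_snappy")

    # Or per write, which overrides the session setting
    df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/range_gzip")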
37 votes • 8 answers

UnsatisfiedLinkError: /tmp/snappy-1.1.4-libsnappyjava.so Error loading shared library ld-linux-x86-64.so.2: No such file or directory

I am trying to run a Kafka Streams application in kubernetes. When I launch the pod I get the following exception: Exception in thread "streams-pipe-e19c2d9a-d403-4944-8d26-0ef27ed5c057-StreamThread-1" java.lang.UnsatisfiedLinkError:…
el323 • 2,760 • 10 • 45 • 80
28 votes • 3 answers

Decompression 'SNAPPY' not available with fastparquet

I am trying to use fastparquet to open a file, but I get the error: RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED'] I have the following installed and have rebooted my interpreter: python …
B. Sharp • 281 • 1 • 3 • 6
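This error usually means the snappy bindings are missing from the Python environment; a minimal check along those lines (the package name `python-snappy` is the commonly used one, and installing it happens outside Python, e.g. via conda-forge):

    import pandas as pd

    try:
        import snappy  # provided by the python-snappy package (needs the native libsnappy)
    except ImportError:
        raise SystemExit("install python-snappy before using compression='snappy'")

    df = pd.DataFrame({"x": range(10)})
    df.to_parquet("check.snappy.parquet", engine="fastparquet", compression="snappy")
    print("fastparquet wrote a snappy-compressed file")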
22 votes • 4 answers

Comparison between lz4 vs lz4_hc vs blosc vs snappy vs fastlz

I have a large file of size 500 MB to compress within a minute with the best possible compression ratio. I have found these algorithms to be suitable for my use: lz4, lz4_hc, snappy, quicklz, blosc. Can someone give a comparison of speed and…
Sayantan Ghosh • 998 • 2 • 9 • 29
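A rough way to run such a comparison on your own data (a sketch, assuming the `python-snappy` and `lz4` packages are installed; `input.bin` stands in for the ~500 MB file, and only snappy and the lz4 frame API are shown):

    import time
    import lz4.frame
    import snappy

    data = open("input.bin", "rb").read()  # stand-in for the file from the question

    def bench(name, compress, decompress):
        t0 = time.perf_counter()
        blob = compress(data)
        t1 = time.perf_counter()
        decompress(blob)
        t2 = time.perf_counter()
        print(f"{name}: ratio {len(data) / len(blob):.2f}, "
              f"compress {t1 - t0:.2f}s, decompress {t2 - t1:.2f}s")

    bench("snappy", snappy.compress, snappy.decompress)
    bench("lz4 (frame)", lz4.frame.compress, lz4.frame.decompress)
    # higher compression_level values switch to the slower HC match finder
    bench("lz4 (high level)",
          lambda d: lz4.frame.compress(d, compression_level=9),
          lz4.frame.decompress)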
19 votes • 4 answers

Is Snappy splittable or not splittable?

According to this Cloudera post, Snappy IS splittable. For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data. But from the…
moon • 1,702 • 3 • 19 • 35
18 votes • 3 answers

How to install snappy C libraries on Windows 10 for use with python-snappy in Anaconda?

I want to install parquet for Python using pip within an Anaconda 2 installation on Windows 10. While installing I ran into the error described here: the installer can't find snappy-c.h. There is no mention of how to install this on Windows…
Khris • 3,132 • 3 • 34 • 54
15 votes • 2 answers

Spark + Parquet + Snappy: Overall compression ratio loses after spark shuffles data

Community! Please help me understand how to get a better compression ratio with Spark. Let me describe the case: I have a dataset, let's call it product, on HDFS, which was imported using Sqoop ImportTool as-parquet-file with the snappy codec. As a result of…
Mikhail Dubkov • 1,223 • 1 • 12 • 16
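One approach often suggested for this situation: a shuffle destroys the row ordering that made run-length and dictionary encoding effective, so re-sorting by low-cardinality columns before writing can recover much of the ratio. A hedged PySpark sketch (paths and column names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/data/product_after_join")  # hypothetical shuffled result

    (df
     .repartition("product_type")                      # group like values into the same files
     .sortWithinPartitions("product_type", "region")   # restore locality so RLE/dictionary encoding works again
     .write.mode("overwrite")
     .option("compression", "snappy")
     .parquet("/data/product_sorted"))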
14 votes • 5 answers

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow',…
Austin • 6,921 • 12 • 73 • 138
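to_parquet itself writes a single file per call, so one common workaround is to slice the frame and write one file per chunk (a sketch; the chunk size, column names, and file names are illustrative):

    import numpy as np
    import pandas as pd

    # stand-in for the large frame from the question
    df = pd.DataFrame(np.random.rand(1_000_000, 10),
                      columns=[f"c{i}" for i in range(10)])

    rows_per_file = 100_000
    n_chunks = -(-len(df) // rows_per_file)  # ceiling division
    for i in range(n_chunks):
        chunk = df.iloc[i * rows_per_file:(i + 1) * rows_per_file]
        chunk.to_parquet(f"data_part{i:04d}.snappy.parquet",
                         engine="pyarrow", compression="snappy")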
13 votes • 1 answer

How to decompress the hadoop reduce output file end with snappy?

Our Hadoop cluster uses snappy as the default codec. The Hadoop job's reduce output file name looks like part-r-00000.snappy. JSnappy fails to decompress the file because JSnappy requires the file to start with SNZ. The reduce output file starts with some bytes 0…
DeepNightTwo • 4,809 • 8 • 46 • 60
12 votes • 5 answers

unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /var/lib/snapd/void/Dockerfile: no such file or directory

I installed docker on Ubuntu with snap (snappy?), and then I ran this: ln -sf /usr/bin/snap /usr/local/bin/docker. When I run docker build I get: unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat…
user11612258
11 votes • 1 answer

How do I decide between LZ4 and Snappy compression?

I need to select a compression algorithm when configuring a "well-known application". Also, as part of my day job, my company is developing a distributed application that deals with a fair amount of data. We've been looking into compressing data to try…
user5994461 • 5,301 • 1 • 36 • 57
11 votes • 2 answers

LZ4 library decompressed data upper bound size estimation

I'm using the LZ4 library and, when decompressing data with int LZ4_decompress_safe(const char* source, char* dest, int compressedSize, int maxDecompressedSize); I want to estimate the maximum decompressed data size. But I cannot find a reverse function…
bobeff • 3,543 • 3 • 34 • 62
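There is no such reverse function in the LZ4 block format itself: a compressed block does not record the original length, so callers are expected to carry it out of band. The same idea shows up in the Python `lz4` bindings, whose block API can prepend the size for you (a sketch, assuming the `lz4` package is installed):

    import lz4.block

    original = b"some payload" * 10_000

    # store_size=True (the default) prepends the uncompressed length to the block,
    # so decompress() knows how big a buffer it needs
    blob = lz4.block.compress(original)
    assert lz4.block.decompress(blob) == original

    # Without the stored header you must supply the size yourself, just as with
    # LZ4_decompress_safe() in C
    raw = lz4.block.compress(original, store_size=False)
    restored = lz4.block.decompress(raw, uncompressed_size=len(original))
    assert restored == original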
11 votes • 4 answers

How do I read Snappy compressed files on HDFS without using Hadoop?

I'm storing files on HDFS in Snappy compression format. I'd like to be able to examine these files on my local Linux file system to make sure that the Hadoop process that created them has performed correctly. When I copy them locally and attempt to…
Robert Rapplean • 672 • 1 • 9 • 30