
I'm trying to choose the right format for exchanging files with my Spark application. I use Spark 2.4.7 + Hadoop 2.10 on Kubernetes. My app downloads a CSV file from S3 and processes it. The file is provided by a third-party company.

I was thinking about asking them to use lz4, lzo or another splittable compression. However, as far as I can see, the file formats produced by the command-line tools are not compatible with the Hadoop lz4 or lzo codecs (I tried the lzop and lz4 CLIs).

Do you know of any CLI tools that can produce lz4- or lzo-compressed files in a format that the Hadoop codecs will understand?

Matzz
  • Did you ever find a solution for this? – Aidan Steele Sep 29 '21 at 06:41
  • No. Unfortunately, those formats are not standardized. The compression is the same but, for example, the frame format is different. So I assume they are useful inside the platform, but not for integrating with the outside world. – Matzz Sep 30 '21 at 09:26
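
To illustrate the framing difference mentioned in the comment above, here is a minimal Python sketch that inspects the first bytes of a file. It rests on assumptions that are not from the original post: the lz4 CLI writes the LZ4 frame format (magic number 0x184D2204, stored little-endian), lzop writes its own 9-byte magic, and Hadoop's block-based codecs write no magic at all, only big-endian 4-byte length fields. The function names `sniff_container` and `read_hadoop_block_lengths` are hypothetical helpers, and the Hadoop layout shown should be checked against the Hadoop version in use.

```python
import struct

# Assumed magic numbers of the CLI container formats.
LZ4_FRAME_MAGIC = bytes([0x04, 0x22, 0x4D, 0x18])  # 0x184D2204, little-endian
LZOP_MAGIC = bytes([0x89, 0x4C, 0x5A, 0x4F, 0x00, 0x0D, 0x0A, 0x1A, 0x0A])


def sniff_container(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(LZ4_FRAME_MAGIC):
        return "lz4-frame"  # typical output of the lz4 CLI
    if header.startswith(LZOP_MAGIC):
        return "lzop"  # typical output of the lzop CLI
    # No known magic: possibly a Hadoop block stream, possibly something else.
    return "unknown"


def read_hadoop_block_lengths(header: bytes) -> tuple:
    """Read the two leading length fields of an assumed Hadoop block stream.

    Assumed layout: 4-byte big-endian uncompressed length of the block,
    then 4-byte big-endian compressed length of the first chunk.
    """
    if len(header) < 8:
        raise ValueError("need at least 8 bytes")
    return struct.unpack(">II", header[:8])


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(sniff_container(f.read(16)))
```

If a file reports `lz4-frame` or `lzop`, it carries the CLI container format rather than the headerless block layout assumed here for the Hadoop codecs, which would be consistent with the incompatibility described in the question.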

0 Answers