
I'm using the AWS PHP SDK (aws-php-sdk) to read csv.gz files from an S3 bucket. All the files are gzip-compressed CSVs that I plan to read and then import into my database.

I have looked at many stack overflow questions but can't seem to get it working.

Here is the code I have written so far.

$s3 = new S3Client([
    'version' => 'latest',
    'region' => 'us-east-2',
    'credentials' => [
        'key' => '',
        'secret' => ''
    ]
]);
$s3->registerStreamWrapper();

if ($stream = fopen('s3://bucket/file.csv.gz', 'r')) {
    // While the stream is still open
    while (!feof($stream)) {
        // Read 1024 bytes from the stream
        $d = gzread($stream, 1024);
        var_dump($d);
    }
    // Be sure to close the stream resource when you're done with it
    fclose($stream);
}

The code above just returns loads of random characters, which must be the compressed contents of the files.

If someone could share a code example of how to decompress the csv.gz file and then read from it, so I can import it into a database, I would really appreciate it.

Lukerayner

2 Answers


For anyone entering this thread in a future search:

I've tried to use Hamlet's answer regarding the use of stream_filter_append, but noticed that some of my machines had issues with that solution (fread returned 0 bytes even though the stream was still open and not all of the data had been read).

I came across this thread - it turns out that there's a known bug in PHP's stream buffering. Using the code from that thread I wrote this piece; it might help the next traveller crossing these paths:

// init $client from S3Client
$client->registerStreamWrapper();

// full s3 path to the file
$gzipUrl = "s3://{$bucketName}/{$filePath}";

// download the compressed file to a local temp file first,
// to work around the stream-buffering bug mentioned above
$sourceFile = fopen($gzipUrl, 'rb');
$targetFile = fopen(sys_get_temp_dir() . '/user_input_import.csv.gz', 'wb');
stream_copy_to_stream($sourceFile, $targetFile);
fclose($sourceFile);
fclose($targetFile);

// decompress the local copy; $tmpFileName is the destination
// path for the plain CSV
copy('compress.zlib://' . sys_get_temp_dir() . '/user_input_import.csv.gz', $tmpFileName);

For my project that did the trick.
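Once the file is decompressed, the import itself is just a matter of reading it row by row. Below is a minimal, self-contained sketch of that step (file names are made up, and a sample file is generated in place of the S3 download; in a real import you would bind each row to a prepared INSERT statement instead of collecting it in an array):

```php
<?php
// Sketch only: read a gzip-compressed CSV row by row via the
// compress.zlib:// wrapper, the same way you would read the
// temp file produced above before inserting rows into a database.
$tmp = sys_get_temp_dir() . '/example_import.csv.gz';
file_put_contents($tmp, gzencode("id,name\n1,Alice\n2,Bob\n"));

$rows = [];
// compress.zlib:// transparently decompresses while reading
if ($fh = fopen('compress.zlib://' . $tmp, 'r')) {
    $header = fgetcsv($fh);
    while (($row = fgetcsv($fh)) !== false) {
        // In a real import: $stmt->execute($row); with a prepared statement
        $rows[] = array_combine($header, $row);
    }
    fclose($fh);
}
var_dump($rows);
unlink($tmp);
```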

Erez

In order to get gzread working correctly you have to open the file with gzopen. But gzopen doesn't support stream wrappers, so it can only open files located on the server's filesystem by providing the corresponding file path.

In your case, the solution is to use a compression filter: open the file stream with fopen, then append a compression stream filter so the gzip-encoded data is decoded as you read it.

stream_filter_append(
        $stream,
        'zlib.inflate',
        STREAM_FILTER_READ,
        ["window" => 32]
    );

See documentation for more info: https://www.php.net/manual/en/filters.compression.php

Unfortunately there is not much info about the window parameter in the documentation above. Here is some more useful info from the zlib.h inflateInit2 documentation:

The windowBits parameter is the base two logarithm of the maximum window size (the size of the history buffer). It should be in the range 8..15 for this version of the library. The default value is 15 if inflateInit is used instead.

...

windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format (the zlib format will return a Z_DATA_ERROR). If a gzip stream is being decoded, strm->adler is a CRC-32 instead of an Adler-32. Unlike the gunzip utility and gzread() (see below), inflate() will not automatically decode concatenated gzip streams.
inflate() will return Z_STREAM_END at the end of the gzip stream. The state would need to be reset to continue decoding a subsequent gzip stream.

Based on this info I would suggest using window size 32, because that size supports both zlib and gzip decoding with automatic header detection.
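As a quick sanity check of that claim, here is a standalone sketch (using in-memory streams rather than S3) showing that "window" => 32 inflates both a gzip-wrapped and a zlib-wrapped payload:

```php
<?php
// Helper (for illustration only): push compressed bytes through
// a zlib.inflate read filter with automatic header detection.
function inflate_via_filter(string $compressed): string {
    $fh = fopen('php://memory', 'r+b');
    fwrite($fh, $compressed);
    rewind($fh);
    stream_filter_append($fh, 'zlib.inflate', STREAM_FILTER_READ, ['window' => 32]);
    $out = stream_get_contents($fh);
    fclose($fh);
    return $out;
}

$payload = "hello,csv\n";
var_dump(inflate_via_filter(gzencode($payload)) === $payload);   // gzip header
var_dump(inflate_via_filter(gzcompress($payload)) === $payload); // zlib header
```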

So the final code should look like this:

$s3 = new S3Client([
    'version' => 'latest',
    'region' => 'us-east-2',
    'credentials' => [
        'key' => '',
        'secret' => '',
    ],
]);
$s3->registerStreamWrapper();

if ($stream = fopen('s3://bucket/file.csv.gz', 'r')) {
    stream_filter_append(
        $stream,
        'zlib.inflate',
        STREAM_FILTER_READ,
        ["window" => 32]
    );
    // While the stream is still open
    while (!feof($stream)) {
        // Read 1024 bytes from the stream; fread is used here because
        // the zlib.inflate filter already decompresses the data
        $d = fread($stream, 1024);
        var_dump($d);
    }
    // Be sure to close the stream resource when you're done with it
    fclose($stream);
}
Hamlet