I'm trying to store a relatively large amount of data in MongoDB, using BulkWrite for multiple upserts. The data comes from an external resource, so I cannot fully control its content. A single bulk operation can contain 15k records or more. Here is part of my code:

$bulk = new BulkWrite();
foreach ($data as $id => $item) {
    $bulk->update(
        ['id' => $item['id']],
        $item,
        ['upsert' => true]
    );
}

try {
    $result = $mongo->getManager()->executeBulkWrite(
        'db.collection',
        $bulk,
        new WriteConcern(WriteConcern::MAJORITY)
    );
} catch (\MongoDB\Driver\Exception\UnexpectedValueException $e) {
    // we have a problem here
}

From time to time I face exceptions like this:

[MongoDB\Driver\Exception\UnexpectedValueException] Got invalid UTF-8 value serializing '�8'

I don't want to filter all of this data up front, because that would hurt performance. Is it possible to find the exact record on which this exception occurred? As far as I know, this information is not available from the UnexpectedValueException object or from its previous exception, which is also an UnexpectedValueException.

Bushikot
  • Hmm I do not know much about MongoDB, but maybe it is a collation problem? There are characters which only exist in utf8mb4 (4 bytes per char) e.g. the beer mug. – Blackbam Mar 30 '18 at 14:42
  • @Blackbam thanks for your attention, but the problem is that I cannot tell exactly which record in the whole inserted list contains this symbol. I never knew about the beer mug, though, that's funny :) – Bushikot Mar 30 '18 at 15:28
  • [Try reading this question and **all** its answers.](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through) – Martin Apr 02 '18 at 12:50
  • Thanks @Martin, I'll check it. – Bushikot Apr 10 '18 at 13:46

1 Answer

The only solution I found is to wrap a manual BSON encoding call in a try/catch block. Sadly, a BSON value cannot be used as input for a bulk update/insert operation, so the \MongoDB\BSON\fromPHP call is redundant work; however, it makes it possible to detect the "bad" record. Here is the code:

$bulk = new BulkWrite();
foreach ($data as $id => $item) {
    try {
        // Encode only to validate the record; the resulting BSON is discarded.
        \MongoDB\BSON\fromPHP($item);
    } catch (\MongoDB\Driver\Exception\UnexpectedValueException $e) {
        // $item is the "bad" record. Converting from UTF-8 to UTF-8
        // substitutes the invalid byte sequences in top-level strings.
        foreach ($item as $key => $value) {
            if (is_string($value)) {
                $item[$key] = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
            }
        }
    }

    $bulk->update(
        ['id' => $item['id']],
        $item,
        ['upsert' => true]
    );
}
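
The loop above only repairs top-level strings and does not say which field was at fault. As a rough sketch of how the offending field could be located, including inside nested arrays, something like the following might help. The helper name `findInvalidUtf8Paths` and the sample data are hypothetical, and the check relies on PCRE's `/u` modifier rather than on the driver's own encoder, so it only approximates what the driver rejects. To limit the performance cost, it could be called only for records on which `fromPHP` has already thrown.

```php
<?php
// Hypothetical helper (not part of the MongoDB driver): returns the key paths
// of all string values in $item that are not valid UTF-8, descending into
// nested arrays. Array keys themselves are not checked.
function findInvalidUtf8Paths(array $item, string $prefix = ''): array
{
    $paths = [];
    foreach ($item as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            $paths = array_merge($paths, findInvalidUtf8Paths($value, $path));
        } elseif (is_string($value) && preg_match('//u', $value) !== 1) {
            // With the /u modifier, preg_match() returns false when the
            // subject is not valid UTF-8, and 1 for any valid string.
            $paths[] = $path;
        }
    }
    return $paths;
}

// Made-up sample data: the second record contains stray 0xE9 and 0xFF bytes.
$data = [
    ['id' => 1, 'name' => 'ok'],
    ['id' => 2, 'name' => "caf\xE9", 'tags' => ['fine', "bad\xFF"]],
];

// Map of record index => list of key paths holding invalid UTF-8.
$badRecords = [];
foreach ($data as $index => $item) {
    $paths = findInvalidUtf8Paths($item);
    if ($paths !== []) {
        $badRecords[$index] = $paths;
    }
}

print_r($badRecords);
```

For the sample data, only the record at index 1 is reported, with the paths `name` and `tags.1`. When the mbstring extension is available, `mb_check_encoding($value, 'UTF-8')` is an alternative way to perform the same validity check.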
Bushikot