
I have an application where I have to store people's names and make them searchable. The technologies I am using are Python (v2.7.6), Django (v1.9.5), and Django REST Framework. The DBMS is PostgreSQL (v9.2). Since the user names can be Arabic, we are using UTF-8 as the database encoding. For search we are using Haystack (v2.4.1) with Amazon Elasticsearch for indexing. The index was building fine a few days ago, but now when I try to rebuild it with

python manage.py rebuild_index

it fails with the following error

'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)

The full error trace is

  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 188, in handle_label
    self.update_backend(label, using)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 233, in update_backend
    do_update(backend, index, qs, start, end, total, verbosity=self.verbosity, commit=self.commit)
  File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/update_index.py", line 96, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python2.7/dist-packages/haystack/backends/elasticsearch_backend.py", line 193, in update
    bulk(self.conn, prepped_docs, index=self.index_name, doc_type='modelresult')
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 188, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 85, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 795, in bulk
    doc_type, '_bulk'), params=params, body=self._bulk_body(body))
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 68, in perform_request
    response = self.session.request(method, url, data=body, timeout=timeout or self.timeout)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 330, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 558, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 353, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python2.7/httplib.py", line 979, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1013, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 833, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)

My guess is that before we didn't have Arabic characters in our database, so the index was building fine, but now that users have entered Arabic characters the index fails to build.

taimur

3 Answers


If you are using the requests-aws4auth package, you can use the following wrapper class in place of the AWS4Auth class. It encodes the headers created by AWS4Auth into byte strings, avoiding the UnicodeDecodeError downstream.

from requests_aws4auth import AWS4Auth

class AWS4AuthEncodingFix(AWS4Auth):
    """AWS4Auth subclass that re-encodes unicode request headers as UTF-8 byte strings."""

    def __call__(self, request):
        request = super(AWS4AuthEncodingFix, self).__call__(request)

        # Re-encode every header the parent class produced
        for header_name in request.headers:
            self._encode_header_to_utf8(request, header_name)

        return request

    def _encode_header_to_utf8(self, request, header_name):
        value = request.headers[header_name]

        # Encode the header value if it is unicode
        if isinstance(value, unicode):
            value = value.encode('utf-8')

        # Encode the header name itself if it is unicode
        if isinstance(header_name, unicode):
            del request.headers[header_name]
            header_name = header_name.encode('utf-8')

        request.headers[header_name] = value
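For completeness, here is a sketch of how the wrapper could be wired into Haystack's Elasticsearch backend via the connection `KWARGS`. The endpoint URL, region, credential values, and the `yourapp.auth` module path are all placeholders, not values from the question:

```python
# settings.py (configuration sketch; endpoint, region, and credentials are placeholders)
from elasticsearch import RequestsHttpConnection

from yourapp.auth import AWS4AuthEncodingFix  # wherever you placed the wrapper class

AWS_ACCESS_KEY = 'your-access-key'
AWS_SECRET_KEY = 'your-secret-key'

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'https://search-mydomain.us-east-1.es.amazonaws.com:443/',
        'INDEX_NAME': 'haystack',
        'KWARGS': {
            # The wrapper is a drop-in replacement for AWS4Auth here
            'http_auth': AWS4AuthEncodingFix(
                AWS_ACCESS_KEY, AWS_SECRET_KEY, 'us-east-1', 'es'),
            'connection_class': RequestsHttpConnection,
        },
    },
}
```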
Sean
  • Thanks, this worked for me! I guess this is more of a Python 2 issue, because for another python 3 project of mine, it works without your modification. – Özer Oct 30 '19 at 22:48

I suspect you're correct about the Arabic characters now showing up in the DB.

There are other reported issues that are possibly related to this one. The first seems to have some kind of workaround for it, but doesn't give a lot of detail. I suspect what the author meant with

The proper fix is to use unicode type instead of str or set the default encoding properly to (I assume) utf-8.

is that you need to check that the machine it's running on has LANG=en_US.UTF-8, or at least some UTF-8 LANG, set.
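One way to check and work around this, assuming a typical Linux host (the exact locale name available may differ on your system):

```shell
# Show the locale settings in effect for the current shell
locale

# Rebuild the index under an explicit UTF-8 locale for just this invocation
LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 python manage.py rebuild_index
```

Exporting LANG in the shell profile (or the service's environment) makes the setting persistent rather than per-invocation.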

Paul

Elasticsearch supports different encodings, so having Arabic characters shouldn't be the problem.

Since you are using AWS, I will assume you also use an authorization library such as requests-aws4auth. If that is the case, notice that during authorization some unicode headers are added, like u'x-amz-date'. That is a problem, since Python's httplib performs the following during _send_output(): msg = "\r\n".join(self._buffer), where _buffer is the list of HTTP headers. A single unicode header makes msg of <type 'unicode'> when it really should be of type str (here is a similar issue with a different auth library).

The line that raises the exception, msg += message_body, fails because Python must decode message_body to unicode so that it matches the type of msg, and it does so with the default ASCII codec. Since py-elasticsearch already took care of the encoding, the body contains UTF-8 bytes that ASCII cannot decode, which causes the exception (as explained here).
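The failure mode can be reproduced in isolation. The payload below is a hypothetical example, not data from the question; the explicit decode is equivalent to what Python 2 does implicitly when concatenating a unicode string with a str:

```python
# py-elasticsearch has already serialized the bulk body to UTF-8 bytes
# (hypothetical payload containing one non-ASCII character):
body = u'{"name": "r\xe9"}'.encode('utf-8')

# When msg is unicode, `msg += message_body` implicitly decodes the bytes
# with the default ASCII codec -- equivalent to:
try:
    body.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xc3 ...
```

Keeping all headers as byte strings (or all data as unicode) avoids this implicit mixed-type decode entirely.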

You may want to try replacing the auth library (for example with DavidMuller/aws-requests-auth) and see if that fixes the problem.

avikam