Non-ASCII characters are not correctly displayed in PDF when served via HttpResponse and AJAX

Question

I have generated a PDF file which contains Cyrillic (non-ASCII) characters with ReportLab. For this purpose I have used the "Montserrat" font, which supports such characters. When I look at the generated PDF file inside the media folder of Django, the characters are correctly displayed:

I have embedded the font by using the following code in the function generating the PDF:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('Montserrat', 'apps/Generic/static/Generic/tff/Montserrat-Regular.ttf'))
canvas_test = canvas.Canvas("media/"+filename, pagesize=A4)
canvas_test.setFont('Montserrat', 18)
canvas_test.drawString(10, 150, "Some text encoded in UTF-8")
canvas_test.drawString(10, 100, "как поживаешь")
canvas_test.save()

However, when I try to serve this PDF via HttpResponse, the Cyrillic characters are not properly displayed, despite being displayed in the Montserrat font:

The code that serves the PDF is the following:

# Return the pdf as a response
fs = FileSystemStorage()
if fs.exists(filename):
    with fs.open(filename) as pdf:
        response = HttpResponse(
            pdf, content_type='application/pdf; encoding=utf-8; charset=utf-8')
        response['Content-Disposition'] = 'inline; filename="'+filename+'"'
        return response

I have tried nearly everything (using FileResponse, opening the PDF with with open(fs.location + "/" + filename, 'rb') as pdf...) without success. Actually, I do not understand why, if ReportLab embeds the font correctly (local file inside the media folder), the file provided to the browser does not embed the font.

It is also interesting to note that I have used Foxit Reader via Chrome or Edge to read the PDF. When I use the default PDF viewer of Firefox, different erroneous characters are displayed. Actually the font seems to be also erroneous in that case:

Edit

Thanks to @Melvyn, I have realized that the error did not lie in the response sent directly from the Python view, but in the success handler of the AJAX call, which is shown below:

$.ajax({
    method: "POST",
    url: window.location.href,
    data: { trigger: 'print_pdf', orientation: orientation, size: size},
    success: function (data) {
        if (data.error === undefined) {
            var blob = new Blob([data]);
            var link = document.createElement('a');
            link.href = window.URL.createObjectURL(blob);
            link.download = filename + '.pdf';
            link.click();
        }
    }
 });

This is the part of the code that is somehow changing the encoding.

Solution with the ideas from comments

I finally came up with a solution thanks to all the comments I have received, especially from @Melvyn. Instead of creating a Blob object, I have just set the responseType of the AJAX request to blob. This is possible since jQuery 3:

$.ajax({
    method: "POST",
    url: window.location.href,
    xhrFields:{
        responseType: 'blob'
    },
    data: { trigger: 'print_pdf', orientation: orientation, size: size},
    success: function (data) {
        if (data.error === undefined) {
            var link = document.createElement('a');
            link.href = window.URL.createObjectURL(data);
            link.download = filename + '.pdf';
            link.click();
        }
    }
 });

Handling an error when returning response

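You can return an error from Python (e.g. when catching an exception) as follows:

except Exception as err:
    response = JsonResponse({'msg': "Error"})
    # err.args may be empty, so fall back to None instead of raising IndexError
    error = err.args[0] if err.args else None
    if error is not None:
        response.status_code = 403 # To announce that the user isn't allowed to publish
        if error == 13:  # errno 13: permission denied
            error = "Access denied to the PDF file."
        # reason_phrase must be a string; the client sees it as statusText
        response.reason_phrase = str(error)
        return response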
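The mapping from exception to message can also be factored out into a small helper. The following is only a sketch under assumptions: `describe_error` is a hypothetical name that does not appear in the original code, and it uses `errno.EACCES` in place of the literal 13 used in the handler above.

```python
import errno


def describe_error(err: Exception) -> str:
    """Turn an exception into a short, single-line message (hypothetical helper)."""
    # err.args can be empty, e.g. for a bare Exception()
    code = err.args[0] if err.args else None
    if code == errno.EACCES:  # errno 13: permission denied
        return "Access denied to the PDF file."
    if code is None:
        return "Error"
    # Collapse whitespace: an HTTP status line cannot contain line breaks
    return " ".join(str(code).split())
```

The returned string could then be assigned to `response.reason_phrase` in the `except` block.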

Then, you just have to use the native error handling from AJAX (after the success section):

error: function(data){
    $("#message_rows2").text(data.statusText);
    $('#errorPrinting').modal();
}

See further details in https://stackoverflow.com/questions/377644/jquery-ajax-error-handling-show-custom-exception-messages.

I hope this post helps people with the same problem when generating PDFs with non-ASCII (Cyrillic) characters. It took me several days...

Make sure the font is embedded in the PDF and not just assuming the client will have the font. Please show the code that generates the PDF. — Antoine Pinsard, Nov 04 '20 at 13:18
Hi @AntoinePinsard. I have added the lines that I have used with reportlab to embed the font. I guess that is what you mean, right? The problem is in the httpresponse, in the produced file inside media everything is fine... — David Duran, Nov 04 '20 at 13:29
I have checked the PDF file from media in a computer without the font and it is also correctly displayed. — David Duran, Nov 04 '20 at 13:29
What happens if you omit `; encoding=utf-8; charset=utf-8`? A PDF file is binary so charset is not relevant. — Antoine Pinsard, Nov 04 '20 at 13:46
If I omit the `encoding` or the `charset`, the same result is obtained. Actually I added these ones because I thought it had to do with not having the PDF in UTF-8... — David Duran, Nov 04 '20 at 13:53
That is indeed weird. Did you try with another browser, and also with downloading the file with `Content-Disposition: attachment` ? — Antoine Pinsard, Nov 04 '20 at 13:56
`Content-Disposition: attachment` results in the same output. But what is interesting is that Firefox is giving other symbols (not the correct ones though). So maybe it has to do with the browser. — David Duran, Nov 04 '20 at 14:02
I have inspected the `pdf` variable (specifically the `encoding` attribute) and it says the following: `'io.BufferedReader' object has no attribute 'encoding'`. So it may seem that it is actually an encoding error... – David Duran, Nov 04 '20 at 15:35
Try using FileResponse instead of HttpResponse : https://docs.djangoproject.com/en/stable/ref/request-response/#django.http.FileResponse — Antoine Pinsard, Nov 04 '20 at 15:51
No luck. Same result. What is strange is that the file in media has a size of 22kB, while the one downloaded is 32kB... — David Duran, Nov 04 '20 at 16:01
Could it be that the `file_to_stream` has encoding `cp1252` and the browser expects `utf-8`? — David Duran, Nov 04 '20 at 16:09
Does anything change if you set the Cyrillic text with the same font, but *in bold face*? Also, just to be sure -- your Firefox does pass PDFJS tests ( http://mozilla.github.io/pdf.js/features/ ), does it not? — LSerni, Nov 06 '20 at 17:35
Hi @LSerni, I have set the font to bold before the Cyrillic text, but the result is the same. Regarding the test you are mentioning, Firefox passes all tests, but Chrome fails in "@font-face loading completion detection". — David Duran, Nov 06 '20 at 17:47
Could you upload the sample PDF shown above, somewhere? I'll try and have a look at it. — LSerni, Nov 06 '20 at 19:44
You can find the correct PDF in the following link https://wetransfer.com/downloads/825ea2649b4227316d9d4c4665755a7220201106222943/e16d45. The wrong PDF can be accessed in https://wetransfer.com/downloads/bd68b2827e0407685b31174ed14cadd920201106223045/ec2521. — David Duran, Nov 06 '20 at 22:32
The solution works for downloading the file, but if we want to return an error message as the response, it fails because the response type is set to blob. Can anyone please help in this case? – Sunil, Mar 06 '21 at 06:24
@Sunil, I solved this issue by following the following link https://stackoverflow.com/questions/377644/jquery-ajax-error-handling-show-custom-exception-messages. Basically, you have to raise an error in Python and then use the AJAX native error handling. — David Duran, Mar 07 '21 at 11:06
It was still failing because return type is expected as blob. I solved it by adding below instead of just return type. `xhr: function() { var xhr = new XMLHttpRequest(); xhr.onreadystatechange = function() { if (xhr.readyState == 2) { if (xhr.status == 200) { xhr.responseType = "blob"; } } }; return xhr; }` — Sunil, Mar 08 '21 at 12:11

score 1 · Accepted Answer · 2020-11-09T19:04:33.250

You are doing some encoding/recoding somewhere, because if you look at the diff between the files, it's littered with Unicode replacement characters (the bytes `ef bf bd`):

% diff -ua Cyrillic_good.pdf Cyrillic_wrong.pdf > out.diff

% hexdump out.diff|grep 'ef bf bd'|wc -l
    2659
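The same check can be sketched in Python. This is a minimal illustration rather than part of the original diagnosis: `count_replacement_chars` and `count_in_file` are hypothetical helper names, and they count occurrences in a single file instead of in the diff. The byte sequence `EF BF BD` is the UTF-8 encoding of U+FFFD, the Unicode replacement character.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the Unicode replacement character
REPLACEMENT = "\ufffd".encode("utf-8")  # b"\xef\xbf\xbd"


def count_replacement_chars(data: bytes) -> int:
    """Count non-overlapping occurrences of EF BF BD in raw bytes."""
    return data.count(REPLACEMENT)


def count_in_file(path: str) -> int:
    """Read a file in binary mode and count the replacement sequences in it."""
    return count_replacement_chars(Path(path).read_bytes())
```

A PDF that never went through a text decoder would be expected to contain very few of these sequences, if any, so a count in the thousands (for example from `count_in_file("Cyrillic_wrong.pdf")`) suggests a lossy decode somewhere along the way.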

You said you tried without setting the encoding and charset, but I don't think that was tested properly - most likely you saw an aggressively browser-cached version.

The proper way to do this is to use FileResponse, pass in the filename and let Django figure out the right content type.
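As a rough illustration of what guessing the content type from a filename looks like, here is a standard-library sketch. The comments on this answer mention `mimetypes.guess_type` as the mechanism; `guess_content_type` itself is a hypothetical helper, and the fallback value is an assumption chosen for this example.

```python
import mimetypes


def guess_content_type(filename: str) -> str:
    """Guess a Content-Type from the file extension alone."""
    content_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown
    return content_type or "application/octet-stream"


print(guess_content_type("Cyrillic_good.pdf"))
```

Note that no charset is involved at any point: the type is derived from the `.pdf` extension, and the file content is never interpreted as text.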

The following is a reproducible test of a working situation:

First of all, put Cyrillic_good.pdf (not Cyrillic_wrong.pdf) in your media root.

Add the following to urls.py:

#urls.py
from django.urls import path
from .views import pdf_serve

urlpatterns = [
    path("pdf/<str:filename>", pdf_serve),
]

And views.py in the same directory:

#views.py
from pathlib import Path

from django.conf import settings
from django.http import (
    HttpResponseNotFound, HttpResponseServerError, FileResponse
)

def pdf_serve(request, filename: str):
    pdf = Path(settings.MEDIA_ROOT) / filename
    if pdf.exists():
        response = FileResponse(open(pdf, "rb"), filename=filename)
        filesize = pdf.stat().st_size
        cl = int(response["Content-Length"])
        if cl != filesize:
            return HttpResponseServerError(
                f"Expected {filesize} bytes but response is {cl} bytes"
            )
        return response

    return HttpResponseNotFound(f"No such file: {filename}")

Now start runserver and request http://localhost:8000/pdf/Cyrillic_good.pdf.

If this doesn't reproduce a valid pdf, it is a local problem and you should look at middleware or your OS or little green men, but not the code. I have this working locally with your file and no mangling is happening.

In fact, the only way to get a mangled PDF now is the browser cache or the response being modified after Django sends it, since the content length check would prevent sending a file that has a different size than the one on disk.
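To confirm on the client side that nothing was modified in transit, the downloaded file can be compared with the one in the media root. The sketch below is one possible way to do that with the standard library; `sha256_of` and `same_bytes` are hypothetical helper names, and any file paths passed to them are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in binary mode."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_bytes(original: str, downloaded: str) -> bool:
    """True if both files have byte-for-byte identical content."""
    return sha256_of(original) == sha256_of(downloaded)
```

If `same_bytes` returns False for a file that Django served with a matching Content-Length, the change happened after the response left the view, for example in client-side code.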

JS Part

I would expect the conversion to happen in the blob constructor as it's possible to hand a blob a type. I'm not sure the default is binary-safe. It's also weird your data has an error property and you pass the entire thing to the blob, but we can't see what promise you're reacting on.

success: function (data) {
    if (data.error === undefined) {
        console.log(data) // This will be informative
        var blob = new Blob([data]);
        var link = document.createElement('a');
        link.href = window.URL.createObjectURL(blob);
        link.download = filename + '.pdf';
        link.click();
    }
}

edited Nov 09 '20 at 19:04

answered Nov 08 '20 at 11:59

Using `FileResponse` this way leads to the following error `ValueError: read of closed file`. Apparently, `FileResponse` cannot be used with a context manager (see https://code.djangoproject.com/ticket/29278). Anyway, I have cleared all my cache and I have directly used `return FileResponse(open(filename))`. This raises the following error in line 23 of `lib/encodings/cp1252.py`: `'charmap' codec can't decode byte 0x8d in position 561: character maps to <undefined>`. So it seems that it is not able to decode the file properly... – David Duran Nov 08 '20 at 13:13
After this I have also tried `FileResponse(open(filename, encoding="utf-8"))`, which leads to the following error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 10: invalid start byte`. Anyway, I think this encoding argument should not work with binary files like PDFs. – David Duran Nov 08 '20 at 13:18
By the way, you were probably right that the browser was using its cache, because before I could download the file when using `FileResponse`. – David Duran Nov 08 '20 at 13:40
You need to *not* encode. I'm not sure why it is trying to. Also pass filename=filename, so that mimetypes.guess_type sets the content type correctly. Do **not** set content_type or encoding in a way that makes it try to read the file as text. That is the problem. These should be binary responses, using binary file input. Unless you have a custom FileSystemStorage() that defaults to opening as text, I don't see why it would do this. – Nov 08 '20 at 14:00
Ok, by using `return FileResponse(open(filename, "rb"))`, I do not get the aforementioned error. However, the output PDF has still wrong characters. – David Duran Nov 08 '20 at 14:21
pass `filename=filename` to FileResponse(). – Nov 08 '20 at 14:22
No difference when using `return FileResponse(open(filename, "rb"), filename=filename)` – David Duran Nov 08 '20 at 14:30
Your answer has been really helpful @Melvyn. It made me realize that my response is passed to JavaScript through an AJAX call and the download happens there. I have placed my code in a normal view response and everything worked. Hence, the error must lie in the JS. I will edit my question to include this JS code. – David Duran Nov 08 '20 at 20:21
I will investigate on that part of the code. Anyway, I think that you deserve to be awarded the answer and the bounty. If you find the solution, I will kindly accept the answer. Otherwise, if I find it, I will tell you and accept anyway your answer. – David Duran Nov 08 '20 at 20:27
All file operations require a *data check* at every step: `read data > write somewhere` to verify the *read*, then `send data (bytes object, not file) > to_viewer_OR_File`. We have no idea *which part introduced Unicode and/or another encoding*. In my old projects, I compressed everything (e.g. as *.bz2*) and only decompressed it at the web page stage. – dsgdfg Nov 09 '20 at 22:11
@Melvyn, the `data.error` is only available when I cannot open the file and I send a `JsonResponse` with the `error` keyword. In this case, `data` is just the `FileResponse`. Hence, `data.error` is undefined. – David Duran Nov 10 '20 at 07:03
I have tried defining `let BOM = new Uint8Array([0xEF,0xBB,0xBF]);` and then `var blob = new Blob([BOM, data], {encoding:"UTF-8",type:"application/pdf;charset=UTF-8"});`, but the PDF is still wrong. – David Duran Nov 10 '20 at 07:09
Again, it's a binary file. Do **not** encode anything. Encoding is a text mapping that defines what byte values correspond to what character using a lookup table (such as Unicode). PDF, while some parts are readable, is a binary file and should not be touched. So don't add a BOM, don't modify it in any way, but set type to "application/pdf". Nothing else. No charset, because there is no "set of characters" that it should lookup the bytes in. – Nov 10 '20 at 07:40
Yes, I was just testing things. `var blob = new Blob([data], {type:"application/pdf"});` doesn't work either. – David Duran Nov 10 '20 at 11:24
@Melvyn, it finally worked. See my edited question. Thanks a lot for your time. If you want, you can add the final solution to your answer too, so that it is more useful for future users. Best regards. – David Duran Nov 10 '20 at 11:46
Figures that jQuery doesn't respect the headers but assumes text responses, no matter the MIME type the server sends. I wonder if Axios does a better job. For your understanding: a PDF **file** is a binary blob container. Even though its **document** content can be text encoded in UTF-8 and the container has PDF-reader instructions in ASCII, it's woven with embedded images and fonts. – Nov 10 '20 at 13:25
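To make the effect of treating a binary response as text concrete, here is a small Python sketch of such a round trip. It assumes the client decodes the body as UTF-8 and substitutes U+FFFD for every invalid byte, which would match the replacement characters found in the diff. `simulate_text_round_trip` and the sample bytes are made up for illustration and are not taken from the actual PDF.

```python
def simulate_text_round_trip(raw: bytes) -> bytes:
    """Decode bytes as UTF-8 text (lossily), then encode the text again."""
    # Invalid byte sequences are replaced by U+FFFD; the original bytes are lost
    text = raw.decode("utf-8", errors="replace")
    # Encoding the text again writes each U+FFFD as three bytes: EF BF BD
    return text.encode("utf-8")


# 9 ASCII bytes, 3 bytes that are not valid UTF-8, and a final newline
sample = b"%PDF-1.4\n\x8d\x93\xff\n"
mangled = simulate_text_round_trip(sample)

print(len(sample), len(mangled))      # 13 19
print(mangled.count(b"\xef\xbf\xbd"))  # 3
```

The ASCII parts survive unchanged, which is why the mangled file still opens as a PDF, while each invalid byte is replaced irreversibly and grows from one byte to three. That growth would also be consistent with the downloaded file being larger than the one on disk.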

Sunil · Answer 2 · 2021-03-08T12:24:01.533

1

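For those who are doing form validation in views: if the response type is always set to blob, error responses cannot be read properly. Add the code below to your JS file so that the response type is set to blob only for successful (200) responses.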

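xhr: function() {
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function() {
        // readyState 2 (HEADERS_RECEIVED): the status is known, the body has not been read yet
        if (xhr.readyState == 2) {
            if (xhr.status == 200) {
                // Only successful responses are read as binary;
                // error responses keep the default text type
                xhr.responseType = "blob";
            }
        }
    };
    return xhr;
},
success: function (response, textStatus, jqXHR) {
    var blob = new Blob([response]);
    var link = document.createElement('a');
    link.href = window.URL.createObjectURL(blob);
    link.download = "contract.pdf";
    link.click();
},
// jQuery calls error handlers with (jqXHR, textStatus, errorThrown)
error: function (jqXHR, textStatus, errorThrown) {
    $('#my_form').click();
}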

edited Mar 08 '21 at 12:24

answered Mar 08 '21 at 12:18

Hi @Sunil. Nice way to handle errors. Another way is by returning the error directly from Python (see my last edit of the question). – David Duran Mar 09 '21 at 11:20
@DavidDuran, I was facing issues with JsonResponse. As in error function expected datatype of data is blob since we are initializing it as blob. – Sunil Mar 10 '21 at 17:38
