5

I have set up 3 Google Cloud Storage buckets and 3 functions (one for each bucket) that trigger when a PDF file is uploaded to a bucket. The functions convert the PDF to a PNG image and do further processing.

When I try to create a 4th bucket and a similar function, strangely, it does not work. Even if I copy one of the existing 3 functions, it still does not work, and I get this error:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 333, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 199, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 196, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 27, in pdf_to_img
    with Image(filename=tmp_pdf, resolution=300) as image:
  File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2874, in __init__
    self.read(filename=filename, resolution=resolution)
  File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2952, in read
    self.raise_exception()
  File "/env/local/lib/python3.7/site-packages/wand/resource.py", line 222, in raise_exception
    raise e
wand.exceptions.PolicyError: not authorized `/tmp/tmphm3hiezy' @ error/constitute.c/ReadImage/412

It baffles me why the same functions work on the existing buckets but not on a new one.

UPDATE: Even this does not work (I get a "cache resources exhausted" error):

In requirements.txt:

google-cloud-storage
wand

In main.py:

import tempfile

from google.cloud import storage
from wand.image import Image

storage_client = storage.Client()

def pdf_to_img(data, context):
    file_data = data
    pdf = file_data['name']

    if pdf.startswith('v-'):
        return 

    bucket_name = file_data['bucket']

    blob = storage_client.bucket(bucket_name).get_blob(pdf)

    _, tmp_pdf = tempfile.mkstemp()
    _, tmp_png = tempfile.mkstemp()

    tmp_png = tmp_png+".png"

    blob.download_to_filename(tmp_pdf)
    with Image(filename=tmp_pdf) as image:
        image.save(filename=tmp_png)

    print("Image created")
    new_file_name = "v-"+pdf.split('.')[0]+".png"
    blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)

The code above is supposed to just create a copy of the image file that is uploaded to the bucket.

Dustin Ingram
Naveed
  • None of the Wand (ImageMagick) functionality is working. I tried cropping an image and I got this error: wand.exceptions.CacheError: cache resources exhausted `/tmp/tmpt7_1dq6i' @ error/cache.c/OpenPixelCache/3984 – Naveed Nov 14 '18 at 15:39
  • I do not know if this is related, but if the server was updated for imagemagick, it could have added the policy restriction on PDF files for security, due to a bug in Ghostscript that has now been fixed. If you relax the policy restriction, it might work again. See https://stackoverflow.com/questions/52861946/imagemagick-not-authorized-to-convert-pdf-to-an-image/52863413#52863413 – fmw42 Nov 20 '18 at 17:45
  • @fmw42 What you said is true, but if you observe the code I posted above, Wand module is not even creating a copy of a PNG file. Also I tried editing `policy.xml` from within the cloud function but it didn't work. – Naveed Nov 20 '18 at 18:28
  • @Naveed Did you manage to get this working? I'm trying to write a very similar function (convert each page of a pdf to jpeg) and I'm getting the same `wand.exceptions.PolicyError: not authorized` – RogB Jan 17 '19 at 16:40
  • @RogB No, it's still not working. I am doing the PDF to PNG (you can do JPEG as well) conversion on my own computer using pdf2image (with concurrency set to 3 for faster processing) and then sending the images to the cloud bucket for further processing. – Naveed Jan 17 '19 at 20:20
  • @Naveed Given DustinIngram's answer below, I'm building a small VM that will run a Flask API with code to call Wand. I will then call this API from App Engine. My initial tests are promising. I'll update here once I get it working. – RogB Jan 17 '19 at 23:11
  • Sorry, I do not know your software, but this line in the error message may be incorrect, or at least it looks odd to me: `Image(filename=tmp_pdf, resolution=300) as image:`. Should the filename be `tmp.pdf` and not `tmp_pdf`? – fmw42 Jan 21 '19 at 21:10
  • Does this answer your question? [convert:not authorized \`aaaa\` @ error/constitute.c/ReadImage/453](https://stackoverflow.com/questions/42928765/convertnot-authorized-aaaa-error-constitute-c-readimage-453) – kenorb Jul 14 '20 at 16:14

4 Answers

4

The vulnerability has been fixed in Ghostscript, but the restriction has not been lifted in ImageMagick. The workaround for converting PDFs to images in Google Cloud Functions is therefore to use this ghostscript wrapper and request the PDF-to-PNG conversion directly from Ghostscript (bypassing ImageMagick).

requirements.txt

google-cloud-storage
ghostscript==0.6

main.py

import locale
import tempfile
import ghostscript

from google.cloud import storage

storage_client = storage.Client()

def pdf_to_img(data, context):
    file_data = data
    pdf = file_data['name']

    if pdf.startswith('v-'):
        return 

    bucket_name = file_data['bucket']

    blob = storage_client.bucket(bucket_name).get_blob(pdf)

    _, tmp_pdf = tempfile.mkstemp()
    _, tmp_png = tempfile.mkstemp()

    tmp_png = tmp_png+".png"

    blob.download_to_filename(tmp_pdf)

    # use ghostscript to render the pdf to a png at 300 dpi
    args = [
        "pdf2png", # actual value doesn't matter
        "-dSAFER",
        "-sDEVICE=pngalpha",
        "-o", tmp_png,
        "-r300", tmp_pdf
        ]
    # the above arguments have to be bytes, encode them
    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]
    #run the request through ghostscript
    ghostscript.Ghostscript(*args)

    print("Image created")
    new_file_name = "v-"+pdf.split('.')[0]+".png"
    blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)

This gets you around the issue and keeps all the processing in GCF. Note that the code above only handles single-page PDFs. My use case was multipage PDF conversion; the Ghostscript code and solution for that are in this question.
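For multipage PDFs, Ghostscript can write one file per page when the output name contains a `%d`-style page counter. The following is only a sketch of how the argument list above might be adapted; the helper names `build_gs_args` and `list_page_images` and the `page-%03d.png` naming are assumptions for illustration, not part of the linked solution.

```python
import glob
import os


def build_gs_args(pdf_path, out_dir, dpi=300):
    """Build Ghostscript arguments that write one PNG per PDF page."""
    # Ghostscript expands %03d to the page number: page-001.png, page-002.png, ...
    out_pattern = os.path.join(out_dir, "page-%03d.png")
    return [
        "pdf2png",  # program name placeholder; the actual value doesn't matter
        "-dSAFER",
        "-sDEVICE=pngalpha",
        "-r{}".format(dpi),
        "-o", out_pattern,
        pdf_path,
    ]


def list_page_images(out_dir):
    """Return the generated page images in page order."""
    return sorted(glob.glob(os.path.join(out_dir, "page-*.png")))
```

The list still has to be encoded to bytes and passed to `ghostscript.Ghostscript(*args)` as in the answer; each path returned by `list_page_images` can then be uploaded with `upload_from_filename`.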

timhj
1

This actually seems to be a showstopper for ImageMagick functionality involving the PDF format. Similar code that we deployed on Google App Engine via a custom Docker image is failing with the same authorization error.

I am not sure how to edit the policy.xml file on GAE or GCF, but a line there has to be changed to:

<policy domain="coder" rights="read|write" pattern="PDF" />

@Dustin: Do you have a bug link where we can see the progress?

Update:

I fixed it on my Google App Engine container by adding a line to the Dockerfile. It changes the content of policy.xml directly after ImageMagick is installed.

RUN sed -i 's/rights="none"/rights="read|write"/g' /etc/ImageMagick-6/policy.xml
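Note that the `sed` line above rewrites every `rights="none"` entry in policy.xml, not only the PDF one. As a narrower alternative, here is a minimal sketch that relaxes only the PDF coder entry. It assumes the entry is written with the attributes in the order `domain`, `rights`, `pattern`, as in the line quoted above; the function name `relax_pdf_policy` is hypothetical.

```python
import re


def relax_pdf_policy(policy_xml):
    """Grant read|write only on the coder policy entry whose pattern is PDF."""
    # Assumption: attributes appear as domain, rights, pattern (in that order).
    pdf_entry = re.compile(
        r'(<policy\s+domain="coder"\s+rights=")none("\s+pattern="PDF"\s*/>)'
    )
    return pdf_entry.sub(r'\g<1>read|write\g<2>', policy_xml)
```

Entries for other coders keep their `rights="none"` setting. The result would be written back to the same policy.xml path during the image build.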
  • Thanks for your inputs. Unfortunately I can't use app engine as it is not suitable for long running background processes. I am processing thousands of PDFs. I tried AWS lambda function but the complexity of AWS turned me off. – Naveed Nov 20 '18 at 10:35
  • I am using Cloud Functions with App engine for huge volume of data, works perfectly :) – Hasan Rafiq Nov 21 '18 at 12:22
  • @Naveed: As of August 2018 you can use docker images on serverless containers( Cloud functions ) - https://cloud.google.com/blog/products/gcp/cloud-functions-serverless-platform-is-generally-available. Try signing up for the alpha program at https://services.google.com/fb/forms/serverlesscontainers/ – Hasan Rafiq Nov 21 '18 at 14:52
0

This is an upstream bug in Ubuntu; we are working on a workaround for App Engine and Cloud Functions.

Dustin Ingram
  • I am getting the same error if I create a new bucket on a new Google Cloud account and use one of my 3 functions (which are working fine on their respective older buckets). Also I have tried allotting 2GB RAM (which is the highest) to my GC function. All in vain. – Naveed Nov 14 '18 at 20:32
  • Thanks, I can reproduce it. Looking into it. – Dustin Ingram Nov 14 '18 at 22:30
  • If we get such errors on a local machine, we have to edit the policy.xml file in /etc/ImageMagick, but we can't do that in a cloud function. It looks like there is some issue in the current GC function deployment, while functions that were deployed a few weeks ago are working fine. – Naveed Nov 15 '18 at 05:14
  • Dustin: Waiting for your response. Wand is unable to convert pdf to png, getting "policy error, not authorized". I believe this is related to policy.xml file of ImageMagick. – Naveed Nov 15 '18 at 21:26
  • I've filed an issue internally, will update here when this is resolved. – Dustin Ingram Nov 16 '18 at 20:34
  • Thanks a lot Dustin. I am in the middle of a project. It is nightmarish for me. – Naveed Nov 16 '18 at 20:39
  • Dustin: Is there any chance that this will be rectified any time soon? – Naveed Dec 02 '18 at 23:51
  • @Naveed: [This is an upstream bug in Ubuntu](https://bugs.launchpad.net/ubuntu/+source/imagemagick/+bug/1796563), we are determining how best to resolve it for Cloud Functions / App Engine. – Dustin Ingram Dec 04 '18 at 12:02
  • Dustin, I appreciate your efforts, but let me tell you this is affecting our business badly. It's been almost one month now and your team has not fixed the bug. Could we expect a speedy resolution if we opt for paid support packages? We already have a paid Google Cloud account. – Naveed Dec 13 '18 at 18:03
  • @DustinIngram, any update on this? Is there any alternative to convert PDF to JPG in either GCF or GAE? – RogB Jan 17 '19 at 17:14
  • @RogB We are still working to find a workaround for this, I will update this thread when I have new information. An alternative would be to use [App Engine Flex](https://cloud.google.com/appengine/docs/flexible/) or [Compute Engine](https://cloud.google.com/compute/), which would allow you to modify the image or VM with the appropriate policy. – Dustin Ingram Jan 17 '19 at 19:19
  • @DustinIngram Thank you. – RogB Jan 17 '19 at 23:08
0

While we wait for the issue to be resolved in Ubuntu, I followed @DustinIngram's suggestion and created a virtual machine in Compute Engine with an ImageMagick installation. The downside is that I now have a second API that my API in App Engine has to call, just to generate the images. Having said that, it's working fine for me. This is my setup:

Main API:

When a pdf file is uploaded to Cloud Storage, I call the following:

response = requests.post('http://xx.xxx.xxx.xxx:5000/makeimages', data=data)

Where data is a dict of the form {"file_name": file_name}, which requests sends as form data (the handler below reads it from request.form).
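Because the handler on the VM reads `request.form['file_name']`, the payload has to arrive form-encoded. A minimal sketch of the same call built with the standard library instead of `requests` (a swapped-in technique, shown only to make the encoding explicit) might look like this; `build_makeimages_request` and the base URL passed to it are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_makeimages_request(base_url, file_name):
    """Build a form-encoded POST request for the /makeimages endpoint."""
    # urlencode produces "file_name=..." which Flask exposes via request.form.
    body = urlencode({"file_name": file_name}).encode("ascii")
    return Request(
        base_url.rstrip("/") + "/makeimages",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

The request would then be sent with `urllib.request.urlopen(req, timeout=...)`; a timeout is worth setting because converting a large PDF can take a while.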

On the API that is running on the VM, the POST request gets processed as follows:

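import os
import tempfile

from flask import Flask, jsonify, request
from google.cloud import storage
from wand.image import Image

app = Flask(__name__)
storage_client = storage.Client()
# bucket_name is not shown here; it must be defined before this handler runs.

@app.route('/makeimages', methods=['POST'])
def pdf_to_jpg():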
    file_name = request.form['file_name']

    blob = storage_client.bucket(bucket_name).get_blob(file_name)

    _, temp_local_filename = tempfile.mkstemp()
    temp_local_filename_jpeg = temp_local_filename + '.jpg'

    # Download file from bucket.
    blob.download_to_filename(temp_local_filename)
    print('Image ' + file_name + ' was downloaded to ' + temp_local_filename)

    with Image(filename=temp_local_filename, resolution=300) as img:
        pg_num = 0
        image_files = {}
        image_files['pages'] = []

        for img_page in img.sequence:
            # Use a context manager so each page image is released after saving.
            with Image(image=img_page) as img_page_2:
                img_page_2.format = 'jpeg'
                img_page_2.compression_quality = 70
                img_page_2.save(filename=temp_local_filename_jpeg)

            new_file_name = file_name.replace('.pdf', 'p') + str(pg_num) + '.jpg'
            new_blob = blob.bucket.blob(new_file_name)
            new_blob.upload_from_filename(temp_local_filename_jpeg)
            print('Page ' + str(pg_num) + ' was saved as ' + new_file_name)

            image_files['pages'].append({'page': pg_num, 'file_name': new_file_name})

            pg_num += 1

    try:
        os.remove(temp_local_filename)
    except OSError:
        print('Could not delete the temp file!')

    return jsonify(image_files)

This downloads the PDF from Cloud Storage, creates an image for each page, and saves the images back to Cloud Storage. The API then returns a JSON response listing the image files it created.
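To make the naming scheme and the response shape concrete, here is a small sketch that reproduces them without touching Cloud Storage or Wand. The helper names `page_image_name` and `build_response` are hypothetical. Unlike the `replace('.pdf', 'p')` call in the handler, this version strips only the final extension, which is an assumption about the intended behavior.

```python
import os


def page_image_name(pdf_name, page_number):
    """Name the JPEG for one page: 'report.pdf', 0 -> 'reportp0.jpg'."""
    stem, _ext = os.path.splitext(pdf_name)
    return "{}p{}.jpg".format(stem, page_number)


def build_response(pdf_name, page_count):
    """Build the same structure the handler passes to jsonify()."""
    return {
        "pages": [
            {"page": n, "file_name": page_image_name(pdf_name, n)}
            for n in range(page_count)
        ]
    }
```

For a two-page `report.pdf`, the caller would receive entries for `reportp0.jpg` and `reportp1.jpg`, with page numbers starting at 0 as in the handler.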

So, not the most elegant solution, but at least I don't need to convert the files manually.

RogB