My Cloud Function is triggered whenever a file is uploaded to the storage bucket. My file name: PubSubMessage. Text inside: Hi, this this the first message

from google.cloud import storage
storage_client = storage.Client()

def hello_gcs(event, context):
    # 'event' carries the metadata of the uploaded object
    file = event

    bucket = storage_client.get_bucket(file['bucket'])

    blob = bucket.blob(file['name'])

    # Download the object as bytes
    contents = blob.download_as_string()
    print('contents: {}'.format(contents))

    decodedstring = contents.decode(encoding="utf-8", errors="ignore")
    print('decodedstring: \n{}'.format(decodedstring))

Output of the last print statement:

------WebKitFormBoundaryAWAKqDaYZB3fJBhx
Content-Disposition: form-data; name="file"; filename="PubSubMessage.txt"
Content-Type: text/plain

Hi, this this the first line.
Hi ,this is the second line. 

hi this is the space after.
------WebKitFormBoundaryAWAKqDaYZB3fJBhx--

My requirements.txt file:

google-cloud-storage
requests==2.20.0
requests-toolbelt==0.9.1

How do I get the actual string inside the file ("Hi, I am the first message.....")?

What is the best possible way to get the text from a file? TIA

Kamal Garg
  • I see you've edited your post to include further things you want to do once you read the string inside the file, but I think it would be better to [split them into one or more separate questions](https://meta.stackexchange.com/questions/39223/one-post-with-multiple-questions-or-multiple-posts). – Rafael Almeida Jun 24 '20 at 14:27
  • @RafaelAlmeida I haven't tried those two things yet, as I am stuck on getting the text part. – Kamal Garg Jun 24 '20 at 14:48
  • @RafaelAlmeida I tried that code, but it's failing. I have updated my question with the code. Please help; I'm not sure why it's not working. – Kamal Garg Jun 24 '20 at 14:54
  • From a quick look, it appears an indent problem, you're missing three spaces before the `print` in the last two lines. If this does not solve the problem, please include the error message you're getting. – Rafael Almeida Jun 24 '20 at 15:00
  • @RafaelAlmeida If it were an indent problem, Cloud Functions would throw a compilation error. The bigger problem is that no proper error appears in the logs; it just says "crashed" with no other info. But I am sure it's not an indent problem. – Kamal Garg Jun 24 '20 at 15:15
  • That's a bit harder to debug then. You might try increasing the memory allocation if it's currently small enough that the addition and use of the extra lib caused you to run out of memory. Otherwise, try to [debug locally](https://stackoverflow.com/a/53700497/89303) and/or "debug by bisection": try removing (or "commenting out") large parts of the code until you find the exact line causing the crash. Since your code is small, you may even just create a new function and test "line-by-line" with successive deployments. – Rafael Almeida Jun 24 '20 at 15:29
  • @RafaelAlmeida It prints 'contents' if I remove the code below it, but it stops working after multipart_data. – Kamal Garg Jun 24 '20 at 15:38
  • It might be related to the encoding, since the default encoding in `from_response` is `utf-8`, and you seem to be passing a byte string. Maybe try converting it into a UTF-8 string with `contents.decode('utf-8')` first? I'd still recommend you try to increase the memory and debug locally as well, this way you might at least get a better error message you can add into the question. – Rafael Almeida Jun 24 '20 at 16:04
  • @RafaelAlmeida I tried; it's not working. The from_response method requires a response object so it can access response.content, but ours is just a string. The error says the str object has no attribute 'content'. Here is the link I am looking at, but it's for AWS Lambda; they hardcode some parts or get things from the AWS Lambda context object, which is not part of GCP Functions: https://stackoverflow.com/questions/50925083/parse-multipart-request-string-in-python/50928156 – Kamal Garg Jun 25 '20 at 10:42
  • Can you tell us more about the context? The function is being triggered from an upload in Storage, alright, but do you have any control over the uploads themselves? It appears the body of a multipart form was uploaded without the headers, you should save both if you can. If this is not an option, you could assume the structure remains the same and the first line always contains the boundary, but then it's getting further into "handmade parser" territory. – Rafael Almeida Jun 25 '20 at 14:52
  • Actually, we have different types of files coming from different sources, and we have to process them and push them into a database. What is the most suitable solution for that? What extra metadata or info should I pass to keep things clean and easy, and how can I parse it? I tried a lot but can't get it working. Please help. – Kamal Garg Jun 25 '20 at 19:10

2 Answers

The string you read from Google Storage is a string representation of a multipart form. It contains not only the uploaded file contents but also some metadata. The same kind of request may be used to represent more than one file and/or form fields along with a file.

To access the file contents you want, you can use a library that supports multipart parsing, such as requests-toolbelt. Check out this SO answer for an example. You'll need the Content-Type header, which includes the boundary; if you absolutely must, you can instead parse the boundary manually from the content itself.
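As a rough illustration of both options (a known Content-Type header, or a boundary guessed from the content), here is a minimal sketch that uses Python's standard `email` package instead of requests-toolbelt, so it runs without extra dependencies. The function names `boundary_from_body` and `extract_parts` are made up for this example, and the fallback assumes the first line of the body is the `--<boundary>` delimiter, as in the output shown in the question.

```python
from email import message_from_bytes
from typing import List, Optional, Tuple


def boundary_from_body(raw: bytes) -> str:
    """Guess the boundary from the first line of the body ("--<boundary>").

    Assumption: the body starts with the boundary delimiter line.
    """
    lines = raw.lstrip().splitlines()
    if not lines or not lines[0].startswith(b"--"):
        raise ValueError("body does not start with a multipart boundary line")
    return lines[0][2:].strip().decode("ascii")


def extract_parts(
    raw: bytes, content_type: Optional[str] = None
) -> List[Tuple[Optional[str], str]]:
    """Return (filename, text) for every part of a multipart body."""
    if content_type is None:
        # Fallback: no header available, so rebuild one from the body itself.
        content_type = 'multipart/form-data; boundary="{}"'.format(
            boundary_from_body(raw)
        )
    # The email parser expects headers followed by a blank line and the body.
    header = "Content-Type: {}\r\n\r\n".format(content_type).encode("ascii")
    message = message_from_bytes(header + raw)
    if not message.is_multipart():
        raise ValueError("content could not be parsed as multipart data")

    parts = []
    for part in message.get_payload():
        payload = part.get_payload(decode=True) or b""
        parts.append((part.get_filename(), payload.decode("utf-8", errors="replace")))
    return parts
```

Inside a Cloud Function, the downloaded bytes would be passed as `raw`, and `content_type` could be whatever header value you have stored alongside the object, if any; leaving it out triggers the first-line fallback.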

EDIT: from your answer, it seems that the Content-Type header was available in the Storage Metadata in Google Storage, which is a common scenario. For future readers of this answer, the specifics of where to read this header from will depend on your particular case.

Since this library is present in PyPI (the Python Package Index), you can use it even in Cloud Functions by specifying it as a dependency in the requirements.txt file.

Rafael Almeida
  • I want to achieve this via Cloud Functions, so I am not sure whether this toolbelt is supported or how I can get it to work there. Is there any other way I can access the file data and process it within the same function? – Kamal Garg Jun 24 '20 at 13:22
  • You can use it in Cloud Functions, I added an edit to the answer with the link explaining the process. – Rafael Almeida Jun 24 '20 at 13:26
  • I tried; it's not working. The from_response method requires a response object so it can access response.content, but ours is just a string. The error says the str object has no attribute 'content'. Here is the link I am looking at, but it's for AWS Lambda; they hardcode some parts or get things from the AWS Lambda context object, which is not part of GCP Functions: https://stackoverflow.com/questions/50925083/parse-multipart-request-string-in-python/50928156 – Kamal Garg Jun 25 '20 at 10:46

The code below prints the actual text inside the file.

from requests_toolbelt.multipart import decoder
from google.cloud import storage
storage_client = storage.Client()

def hello_gcs(event, context):
    file = event
    
    bucket = storage_client.bucket(file['bucket'])
    #print('Bucket Name :  {}'.format(file['bucket']))
    #print('Object Name :  {}'.format(file['name']))
    #print('Bucket Object :  {}'.format(bucket))
    
    blob = bucket.get_blob(file['name'])
    #print('Blob Object :  {}'.format(blob))
    
    contentType = blob.content_type
    print('Blob ContentType: {}'.format(contentType))

    #To download the file as byte object
    content = blob.download_as_string()
    print('content: {}'.format(content))

    for part in decoder.MultipartDecoder(content, contentType).parts:
        print(part.text)

Kamal Garg