I'm trying to insert PDF files into a MongoDB database. The files are small (under 16 MB, the BSON document size limit), so I don't think I need the added complexity of GridFS, even though it looks pretty easy to use based on the tutorials I've seen. How can I do this using flask_pymongo? Even a basic example using plain pymongo would be great.

Here's what I have so far but I'm getting the following error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8

flask_app.py:

from flask import Flask, render_template, request
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config['MONGO_DBNAME'] = 'records'
app.config['MONGO_URI'] = 'mongodb://localhost:27017/records'
mongo = PyMongo(app)

@app.route('/', methods=['GET', 'POST'])
def upload():
    if request.method == 'POST':
        files_collection = mongo.db.files_collection  # connect to mongodb collection
        input_file = request.files['input_file']  # get file from front-end
        files_collection.insert_one({'data': input_file.read() })  # error occurs here
        return 'File uploaded'
    return render_template('index.html')

index.html:

<form method='POST' action="{{ url_for('upload') }}" enctype='multipart/form-data'>
    <input type='file' name='input_file'>
    <input type='submit' value='Upload'>
</form>

It seems like I just need to convert the data to the proper data type before inserting it into MongoDB, which appears to be the binData type based on this answer here.

Johnny Metz

1 Answer

Use the bson.Binary class to store untyped data:

from bson import Binary
my_pdf_data = b'xxx'  # bytes, can be anything, not just UTF-8

db.collection.insert_one({'data': Binary(my_pdf_data)})  # insert() was removed in PyMongo 4
document = db.collection.find_one()
print(repr(document['data']))
print(type(document['data']))

The Binary type inherits from Python's builtin "bytes" type, so you can use it wherever you use bytes - e.g., save it to a file, pass it to a PDF parser. In Python 2 this code prints:

Binary('xxx', 0)
<class 'bson.binary.Binary'>

In Python 3, Binary values with the default subtype 0 are decoded directly to "bytes", so this prints:

b'xxx'
<class 'bytes'>
A. Jesse Jiryu Davis