1

This is going to be a duplicate of store numpy array in mysql . The reason is that the actual question is not answered there, and 5 years later I (another random person) am struggling with the same problem.

Goal

I have a lot of numpy vectors. I want to store each of them in a BLOB column in a MySQL database. I have already read that this is likely not a good idea, but I'd like to do it anyway. For this, I am trying to create a query. The code looks something like:

query = 'INSERT INTO vectors_table (word_id, vector) VALUES '

for i in vectors:
    query += ({}, {}).format(i.index, i.vector.dumps())

(notice the call to .dumps())

Alternatives

Numpy arrays have a few options of functions to "serialize" its data:

I could also use Pickle. There is a function called pickle.dumps that also gives me a bytes sequence.

Problem

When I try creating the query, it inserts that goddamn b before the string. E.g., if I print one of my vectors, it looks like:

b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\x00bq\x04X\x06\x00\x00\x00latin1q\x05\x86q\x06Rq\x07\x87q\x08Rq\t(K\x01M\x00\x0c\x85q\ncnumpy\ndtype\nq\x0bX\x02\x00\x00\x00f4q\x0cK\x00K\x01\x87q\rRq\x0e(K\x03X\x01\x00\ ...

Unfortunately, as can be seen in the "Alternatives" above, this is true for all alternatives I have found so far.

Attempted solutions

In trying to solve the problem, I find lots of people telling me to use decode, or to cast the bytes into a string.

Calling either .decode('ascii') or .decode('utf-8') has led to UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128) and UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte, respectively.

Trying to convert the bytes into a string creates a string that includes the "b".

Is there any better way of doing this?

I am noob with MySQL and totally believe that I am missing some obvious solution. I am getting to the point where I'm planning to just create a string with

for i in vectors:
    old_str = str(i.vector.dumps())
    new_str = old_str[1:]
    query += ({}, {}).format(i.index, new_str)

And just insert that into my query string. Stil... I can't believe that this is the only way to get this to work.

vaulttech
  • 493
  • 1
  • 5
  • 15

1 Answers1

1

In the end, the answer to my question was in this other question. I'll answer it here in case someone else has a similar problem.

I was doing

query = 'INSERT INTO vectors_table (word_id, vector) VALUES '

for i in vectors:
    query += "({}, {})".format(i.index, i.vector.dumps())

and then running cursor.execute(query).

Instead, what I should do was to write my query with %s:

query = 'INSERT INTO vectors_table (word_id, vector) VALUES (%s, %s)'

then get all the elements I want to insert into the table:

tuples = [(i.index, i.vector.dumps()) for i in vectors]

and let the MySQL library handle the escaping of the bad bytes:

cursor.executemany(query, tuples)
vaulttech
  • 493
  • 1
  • 5
  • 15