I have data stored as key-value pairs in a LevelDB database. The values are LASER vector embeddings of sentences, and the keys are the intents of those sentences. When a new sentence comes in, I compare its embedding against every value in the database to identify the intent. I used a nested for loop for this, and it takes more than 5 seconds to execute. Can someone suggest a way to optimize this loop/code segment?

expose.py

import plyvel
from flask import Flask
from flask_restful import Api
from laserembeddings import Laser
from getters.getIntents import *
from getters.getEntities import *

app = Flask(__name__)
api = Api(app)

si_data_vec = plyvel.DB('levelDB/si_data_vec', create_if_missing=False)

path_to_bpe_codes = 'data/laser_models/93langs.fcodes'
path_to_bpe_vocab = 'data/laser_models/93langs.fvocab'
path_to_encoder = 'data/laser_models/bilstm.93langs.2018-12-26.pt'

laser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)


@app.route('/lang/si/<keylist>', methods=['GET'])
def get_si(keylist):

    intent = get_intents(keylist, si_data_vec, laser)

    return intent


# Initialize and start the web application
if __name__ == "__main__":
    app.run()

getIntents.py

This contains the loop to be optimized

import io
from itertools import combinations
import numpy as np


def get_intents(key_list, si_data_vec, laser):

    avg = laser.embed_sentences([key_list], lang='si')[0]

    minimum_dist = 1
    intent = ''

    ### LOOP TO BE OPTIMIZED
    for key, value in si_data_vec:
        bio = io.BytesIO(value)
        vec = np.load(bio)

        for pair in combinations([avg, vec], 2):
            dist = distance(list(pair[0]), list(pair[1]))
            if dist < minimum_dist:
                minimum_dist = dist
                intent = key.decode()
    return intent


def distance(list1, list2):
    """Distance between two vectors."""
    squares = [(p-q) ** 2 for p, q in zip(list1, list2)]
    return sum(squares) ** .5

Updated getIntents.py as per the comment

import io
import numpy as np


def get_intents(key_list, si_data_vec, laser):

    avg = laser.embed_sentences([key_list], lang='si')[0]

    minimum_dist = 1
    intent = ''
    for key, value in si_data_vec:
        bio = io.BytesIO(value)
        vec = np.load(bio)

        dist = distance(avg, vec)

        if dist < minimum_dist:
            minimum_dist = dist
            intent = key.decode()

    return intent


def distance(list1, list2):
    """Distance between two vectors."""
    squares = [(p-q) ** 2 for p, q in zip(list1, list2)]
    return sum(squares) ** .5
Kabilesh
  • If I am not mistaken, `combinations([avg, vec], 2)` will return exactly one value: `(avg, vec)`. So what is this inner loop for? – Błotosmętek Feb 20 '20 at 09:44
  • Yes, that was unnecessary. Thank you for pointing it out. I got rid of the inner loop; see the update. Still, calculating the distance between the input vector and 1000+ other vectors takes time. Are there any suggestions for optimizing this? – Kabilesh Feb 20 '20 at 10:52
  • I'd say your best bet is to look into special data structures optimized for storage/retrieval of spatial info. – Mad Physicist Feb 20 '20 at 11:57
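
One common example of such a spatial structure is a k-d tree. The sketch below uses SciPy's `cKDTree`, which is an assumption on my part: the comment does not name a library. The names `build_tree`, `nearest_intent_tree`, `keys`, `matrix`, and `max_dist` are hypothetical. `matrix` stands for an (n, d) array holding the stored vectors, and `keys` for the matching intent strings. The `max_dist=1.0` default mirrors the `minimum_dist = 1` starting value in the question.

```python
import numpy as np
from scipy.spatial import cKDTree


def build_tree(matrix):
    """Build a k-d tree over the rows of an (n, d) array of stored vectors."""
    return cKDTree(matrix)


def nearest_intent_tree(query, keys, tree, max_dist=1.0):
    """Return the key of the nearest stored vector, or '' if it is too far."""
    # query() uses Euclidean distance by default and returns (distance, index)
    dist, idx = tree.query(query, k=1)
    return keys[idx] if dist < max_dist else ''
```

The tree would be built once at startup and reused for every request. Note that k-d trees tend to lose their advantage over brute force as dimensionality grows, and LASER embeddings have 1024 dimensions, so this is worth measuring rather than assuming.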

1 Answer

The only thing I can think of is using NumPy for the distance calculation (you already import it anyway), though I am not sure it will give you much of a speedup.

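avg = np.asarray(laser.embed_sentences([key_list], lang='si')[0])
minimum_dist = 1
intent = ''
for key, value in si_data_vec:
    vec = np.load(io.BytesIO(value))
    # Vectorized Euclidean distance instead of a Python-level loop
    dist = np.linalg.norm(avg - vec)
    if dist < minimum_dist:
        minimum_dist = dist
        intent = key.decode()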

See also How can the Euclidean distance be calculated with NumPy?
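
Going one step further than the answer itself does, the per-request cost can probably be cut more by decoding the stored vectors only once and computing all distances in a single NumPy call. This is a minimal sketch under some assumptions: every stored value is a `np.save`-serialized vector of the same length, and the database fits in memory. The helper names `load_index` and `nearest_intent` and the `max_dist` parameter are hypothetical. The `max_dist=1.0` default mirrors the `minimum_dist = 1` starting value in the question.

```python
import io

import numpy as np


def load_index(db):
    """Decode every stored vector once and stack them into one matrix."""
    keys, vecs = [], []
    for key, value in db:  # iterating the database yields (key, value) byte pairs
        keys.append(key.decode())
        vecs.append(np.load(io.BytesIO(value)))
    return keys, np.vstack(vecs)


def nearest_intent(query, keys, matrix, max_dist=1.0):
    """Return the key of the closest row, or '' if none is within max_dist."""
    # Broadcasting subtracts the query from every row; norm runs per row
    dists = np.linalg.norm(matrix - query, axis=1)
    best = int(np.argmin(dists))
    return keys[best] if dists[best] < max_dist else ''
```

In `expose.py`, `load_index(si_data_vec)` would be called once at startup, and `get_intents` would then only embed the sentence and call `nearest_intent`. If the embedding step itself dominates the 5 seconds, this will not help much, so it is worth timing both parts separately.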

Błotosmętek