As mentioned in the title, I have a BigQuery table with 18 million rows, nearly half of which are useless. I need to assign a topic/niche to each row based on one important column, which holds details about a product or a website. I tested the Natural Language API on a sample of 10,000 rows and it did wonders. My current approach, though, iterates over newarr (the important-details column, obtained by querying my BigQuery table), sends only one cell at a time, waits for the response from the API, and appends it to the results array.
Ideally, I want to run this on all 18 million rows in the minimum time. My per-minute quota has been increased to 3,000 API requests, so that is the most I can make, but I can't figure out how to send batches of 3,000 rows, one batch after another, each minute.
results = []
i = 0  # counts the rows sent so far
for x in newarr:
    i += 1
    results.append(sample_classify_text(x))
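Since the function's docstring says the text must include at least 20 words, one cheap step before calling the API might be to drop rows that are too short to classify. This is only a sketch: has_enough_words and filter_classifiable are hypothetical helper names, splitting on whitespace is an assumed approximation of how the API counts words, and it does not decide which rows are actually useless (that depends on the data).

```python
def has_enough_words(text, min_words=20):
    """Return True if text is a string with at least min_words whitespace-separated words."""
    # Assumption: counting whitespace-separated tokens approximates the API's word count.
    return isinstance(text, str) and len(text.split()) >= min_words


def filter_classifiable(texts, min_words=20):
    """Keep only the texts long enough to be sent for classification."""
    return [t for t in texts if has_enough_words(t, min_words)]
```

It could be applied as newarr = filter_classifiable(newarr) before the loop, so that no quota is spent on rows the API would likely reject.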
sample_classify_text is a function taken straight from the documentation:
# This function returns the category for the given text.
from google.cloud import language_v1

def sample_classify_text(text_content):
    """
    Classifying Content in a String

    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """
    client = language_v1.LanguageServiceClient()

    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    response = client.classify_text(request={"document": document})
    # return response.categories
    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        # category.confidence is a number representing how certain the
        # classifier is that this category represents the provided text.
        # Only the first category's name is returned.
        return format(category.name)
    # No category was returned for this text.
    return None
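One possible way to stay within the 3,000 requests per minute quota is to send the rows in chunks of 3,000 through a thread pool and wait out the rest of each minute before starting the next chunk. The sketch below is an illustration under assumptions, not a tested recipe for the API: classify_in_batches and its parameters (batch_size, interval_s, max_workers, sleep, clock) are hypothetical names, it assumes the classify function is safe to call from several threads, and it assumes the quota is counted in fixed one-minute windows. At 3,000 requests per minute, 18 million rows would need at least 6,000 minutes (100 hours), or roughly half that if the useless rows are dropped first.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify_in_batches(texts, classify_fn, batch_size=3000, interval_s=60.0,
                        max_workers=32, sleep=time.sleep, clock=time.monotonic):
    """Call classify_fn on every text, starting at most batch_size calls per interval_s."""
    texts = list(texts)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            window_start = clock()
            # pool.map keeps the input order and, once consumed, blocks until
            # the whole batch has finished (or re-raises the first exception).
            results.extend(pool.map(classify_fn, batch))
            remaining = interval_s - (clock() - window_start)
            # Wait out the rest of the window, except after the final batch.
            if remaining > 0 and start + batch_size < len(texts):
                sleep(remaining)
    return results
```

It could be called as results = classify_in_batches(newarr, sample_classify_text). Creating one LanguageServiceClient outside the function and reusing it would likely reduce per-call overhead. Retries and error handling for failed requests are left out here, so a single failing row would abort the run.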