
As mentioned in the title, I have a BigQuery table with 18 million rows, nearly half of which are useless, and I need to assign a topic/niche to each row based on an important column (which holds details about a product on a website). I have tested the Natural Language API on a sample of 10,000 rows and it did wonders, but my current approach iterates over newarr (the important-details column I obtain by querying the BigQuery table), sending only one cell at a time, awaiting the response from the API, and appending it to the results array.

Ideally I want to run this operation on all 18 million rows in the minimum possible time. My per-minute quota has been increased to 3,000 API requests, so that is the maximum I can make, but I can't figure out how to send a batch of 3,000 rows each minute, one batch after another, instead of this one-at-a-time loop:

for x in newarr:
    results.append(sample_classify_text(x))
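
To make the question concrete, something like the following is roughly what I have in mind (an untested sketch: CHUNK_SIZE reflects my 3,000-requests-per-minute quota, MAX_WORKERS is just a guess, and newarr, results and sample_classify_text are the same as in the rest of the question), but I don't know whether this is the right approach or whether the API lets me send an actual batch instead:

import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 3000   # per-minute request quota
MAX_WORKERS = 50    # arbitrary guess at a reasonable level of concurrency

results = []
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    for start in range(0, len(newarr), CHUNK_SIZE):
        chunk = newarr[start:start + CHUNK_SIZE]
        chunk_started = time.monotonic()

        # executor.map keeps the results in the same order as the input rows
        results.extend(executor.map(sample_classify_text, chunk))

        # If the chunk finished in under a minute, wait out the rest of the
        # minute so the next chunk stays inside the per-minute quota.
        elapsed = time.monotonic() - chunk_started
        if elapsed < 60:
            time.sleep(60 - elapsed)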

sample_classify_text is a function taken straight from the documentation:

# This function returns the category the API assigns to the given text
from google.cloud import language_v1

def sample_classify_text(text_content):
    """
    Classifying Content in a String

    Args:
      text_content: The text content to analyze. Must include at least 20 words.
    """

    client = language_v1.LanguageServiceClient()
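    # (Note: this creates a new client on every call; for millions of rows the
    # client could probably be created once outside the function and reused.)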

    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    response = client.classify_text(request={"document": document})

    #return response.categories
    # Return the name of the first category the API assigns to the document.
    # See the predefined taxonomy of categories:
    # https://cloud.google.com/natural-language/docs/categories
    # Each category also carries a confidence score (category.confidence)
    # indicating how certain the classifier is that it fits the text.
    for category in response.categories:
        return category.name
    return None
  • Looks like you need something like [How to limit rate of requests to web services in Python?](https://stackoverflow.com/questions/401215/how-to-limit-rate-of-requests-to-web-services-in-python) – fsimonjetz Jul 10 '21 at 17:56
  • Thanks, this is helpful, but I want to batch-dispatch 3,000 documents each minute and get batch results from the Google Natural Language API so I can utilise the maximum capacity. – Aavesh T Jul 10 '21 at 18:14
  • Did you consider using Spark to parallelise requests over the BigQuery data? [Example parallelising requests](https://stackoverflow.com/questions/61319178/how-can-i-send-a-batch-of-strings-to-the-google-cloud-natural-language-api) [Example using PySpark with BigQuery data](https://medium.com/@amanmittal1990/reading-bigquery-table-in-pyspark-cb79de236908) [Using the BigQuery connector with Spark](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#pyspark) – davidmesalpz Jul 13 '21 at 11:53
