
I have been using the script below to upload data for load testing of my module:

import json
import ast
import pandas as pd
import sys
import cloudant_connection as cloud

df = pd.read_csv("./deviceId_data/device_starts_" + sys.argv[1] + ".csv")
print(" checkpoint 1 cleared ")
def push_data_to_cloudant(ID, Key, Value, Database):
    # The 'value' column holds a dict literal stored as a string, so parse it
    Value = ast.literal_eval(Value)
    temp_doc = {}
    temp_doc["_id"] = ID
    temp_doc["value"] = Value["value"]
    temp_doc["devId"] = Value["devId"]
    temp_doc["eDateTime"] = Key[0]
    temp_doc["eDate"] = Value["eDate"]
    temp_doc["cDateTime"] = Key[0]
    temp_doc["cDate"] = Value["cDate"]

    # One HTTP request per document
    new_doc = Database.create_document(temp_doc)

    if new_doc.exists():
        return "Success"
    else:
        print("Failed in pushing document")
        return "Failure"

with open("./connection_config_source.json") as f:
    connect_conf = json.load(f)
print(" checkpoint 2 cleared ")
API_KEY = connect_conf['cloudant_api_key']
ACC_NAME = connect_conf['cloudant_account_name']
print(" checkpoint 3 cleared ")
try:
    client = cloud.connecting_to_cloudant_via_api(ACC_NAME,API_KEY)

    database_name = 'DB_NAME'
    
    Database = client[database_name]
    print(" checkpoint 4 cleared ")
    if Database.exists():
        print("Connected")
      
    status = [push_data_to_cloudant(ID, Key, Value, Database) for (ID, Key, Value) in zip(df['id'], df['key'], df['value'])]
    print(" last checkpoint cleared ")
except Exception as e:
    print("Failed:" + str(e))

I know there are faster ways than using a list comprehension, but I don't know how to use them in this scenario.

I know that df.apply() is faster than this, but I want to know whether I could use pandas or NumPy vectorization for this use case.
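For reference, here is a sketch of the df.apply() version I have in mind: it builds the whole list of documents in one pass and keeps the upload as a separate step. The column names mirror the script above, and the value column still has to go through ast.literal_eval, so I don't see how to fully vectorize that part:

import ast
import sys
import pandas as pd

df = pd.read_csv("./deviceId_data/device_starts_" + sys.argv[1] + ".csv")

def row_to_doc(row):
    # The 'value' column holds a dict literal stored as a string
    value = ast.literal_eval(row["value"])
    return {
        "_id": row["id"],
        "value": value["value"],
        "devId": value["devId"],
        "eDateTime": row["key"][0],
        "eDate": value["eDate"],
        "cDateTime": row["key"][0],
        "cDate": value["cDate"],
    }

# One pass over the rows; the upload itself is still the slow part
docs = df.apply(row_to_doc, axis=1).tolist()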

  • NumPy vectorization and list comprehensions are plenty fast for CPU-bound processes. However, you are writing to a database, and it's arguable whether you even use the resulting list. This is an I/O-bound process, and the list creation itself isn't the bottleneck. I'd say this is using comprehension syntax for its side effects. – C.Nivs Jan 10 '22 at 17:53
  • So how do I transfer 23 GB worth of files to Cloudant? I'd be grateful for any ideas. By the way, I have already thought of multithreading and multiprocessing. This code is taking 2 days to transfer 1.5 GB to Cloudant, and I think that's pretty slow. – tblaze Jan 10 '22 at 20:47
  • I agree, that's pretty slow. Maybe [this answer](https://stackoverflow.com/a/61379119/7867968) is what you're looking for? – C.Nivs Jan 11 '22 at 14:53

1 Answer


From the python-cloudant documentation:

bulk_docs(docs)
Performs multiple document inserts and/or updates through a single request. Each document must either be or extend a dict, as is the case with Document and DesignDocument objects. A document must contain the _id and _rev fields if the document is meant to be updated.

Parameters: docs (list) – List of Documents to be created/updated.
Returns: Bulk document creation/update status in JSON format.

Just use:

Database = client['DB_name']

docs = [...]  # your documents, as a list of dicts
Database.bulk_docs(docs)

bulk_docs() takes a single list of dictionaries (or Document objects), performs all of the inserts in one request, and returns a per-document status list.
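Given the 23 GB mentioned in the comments, don't send everything in a single call; push the documents in fixed-size batches instead. A minimal sketch, assuming docs is a list of plain dicts like the ones built in the question (the batch size of 1000 is an assumption to tune, not a Cloudant requirement):

BATCH_SIZE = 1000  # assumption: tune to your document size and rate limits

for start in range(0, len(docs), BATCH_SIZE):
    batch = docs[start:start + BATCH_SIZE]
    # One HTTP request per batch instead of one per document
    results = Database.bulk_docs(batch)
    # Each result has "id" plus "rev" on success, or "error"/"reason" on failure
    failures = [r for r in results if "error" in r]
    if failures:
        print("Batch starting at", start, "had", len(failures), "failures")

This replaces one round trip per document with one per batch, which is usually what matters for an I/O-bound load like this.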
