
I am writing a Python script that performs the steps below:

  • Query a MongoDB database
  • Parse and aggregate the results
  • Upload the data to a ServiceNow table via a REST API

The script works; however, the data set is too large and the REST transaction times out after 60 seconds (the connection is closed by the ServiceNow server at the destination).

I will need to segment the data into chunks and send a separate REST transaction for each chunk so that the full data set is POSTed without hitting the timeout limit.

How can I achieve that goal by modifying the script below?

#!/usr/bin/env python

from config import *

import os, sys

mypath = os.path.dirname(os.path.realpath(__file__))
sys.path.append(os.path.join(mypath, "api-python-client"))

from apiclient.mongo import *

from pymongo import MongoClient
import json
import requests
from bson.json_util import dumps

client = MongoClient(mongo_uri)

#Create ServiceNow URL
svcnow_url = create_svcnow_url('u_imp_cmps')

#BITSDB Nmap Collection
db = client[mongo_db]

#Aggregate - RDBMS equivalent to Alias select x as y
#Rename fields to match ServiceNow field names
computers = db['computer'].aggregate([
        {"$unwind": "$hostnames"},
        {"$project" : {
                "_id":0,
                "u_hostname": "$hostnames.name",
                "u_ipv4": "$addresses.ipv4",
                "u_status": "$status.state",
                "u_updated_timestamp": "$last_seen"
        }}

])

j = dumps({"records":computers})
#print(j)


#Set proper headers
headers = {"Content-Type":"application/json","Accept":"application/json"}

#Build HTTP Request
response = requests.post(url=svcnow_url, auth=(svcnow_user, svcnow_pwd), headers=headers, data=j)

#Check for HTTP codes other than 200
if response.status_code != 200:
        print('Status:', response.status_code, 'Headers:', response.headers, 'Response Text', response.text, 'Error Response:',response.json())
        exit()

#Decode the JSON response into a dictionary and use the data
print('Status:',response.status_code,'Headers:',response.headers,'Response:',response.json())

UPDATE: I have a plan but I am not sure how to implement this exactly.

  • Set the cursor to a fixed batch size of 1000 records
  • When a batch is full, create the JSON output and send the data via requests
  • In a loop: keep grabbing new batches and sending each one to the destination until the entire data set has been sent

https://docs.mongodb.com/v3.0/reference/method/cursor.batchSize/

Basically, I think I can solve this by creating batches and looping through them, making a new API call for each batch; a rough sketch of what I have in mind is below. Please let me know whether this is a good plan and how to implement it. Thanks.
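Something along these lines is what I'm imagining (an untested sketch; it reuses svcnow_url, svcnow_user, svcnow_pwd, headers, and the computers aggregation cursor from the script above, and I'm not sure it is the right way to batch the cursor):

#Rough, untested sketch of the batching plan above.
#Reuses svcnow_url, svcnow_user, svcnow_pwd, headers and the
#"computers" aggregation cursor from the script.
batch_size = 1000
batch = []

for record in computers:
    batch.append(record)
    if len(batch) == batch_size:
        #Batch is full: encode it and send one POST for this chunk
        response = requests.post(url=svcnow_url, auth=(svcnow_user, svcnow_pwd),
                                 headers=headers, data=dumps({"records": batch}))
        print('Status:', response.status_code)
        batch = []

#Send the final partial batch, if any records are left over
if batch:
    response = requests.post(url=svcnow_url, auth=(svcnow_user, svcnow_pwd),
                             headers=headers, data=dumps({"records": batch}))
    print('Status:', response.status_code)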

pengz

1 Answer


j = dumps({"records":computers}) will return a list, so you can easily point to a single data entry by calling j[x], or iterating through a for loop. Each of these entries should be acceptable to ServiceNow.

# Set proper headers (these are always the same, so this
# can be assigned outside of the for loop)
headers = {"Content-Type":"application/json","Accept":"application/json"}

for data_point in j:

    #Build HTTP Request (Note we are using data_point instead of j)
    response = requests.post(url=svcnow_url, auth=(svcnow_user, svcnow_pwd), headers=headers, data=data_point)

    #Check for HTTP codes other than 200
    if response.status_code != 200:
        print('Status:', response.status_code, 'Headers:', response.headers, 'Response Text', response.text, 'Error Response:',response.json())
    else:
        # This is a response of success for a single record
        print('Status:',response.status_code,'Headers:',response.headers,'Response:',response.json())

exit()

If you have 100 new entries in MongoDB, this will make 100 POST calls to ServiceNow. Your ServiceNow instance should be able to handle the load, and you can very easily identify records that failed to load.

However, if you need to reduce the number of calls for any reason, I'd suggest splitting your list into 'sublists', like the one-liner featured in this answer:

# Set proper headers (these are always the same, so this
# can be assigned outside of the for loop)
headers = {"Content-Type":"application/json","Accept":"application/json"}

# Each POST will send up to 10 records of data
split_size = 10

# Note the two places where our split_size variable is used
for data_point in [j[x:x+split_size] for x in xrange(0, len(j), split_size)]:

    #Build HTTP Request (Note we are using data_point instead of j)
    response = requests.post(url=svcnow_url, auth=(svcnow_user, svcnow_pwd), headers=headers, data=data_point)

    #Check for HTTP codes other than 200
    if response.status_code != 200:
        print('Status:', response.status_code, 'Headers:', response.headers, 'Response Text', response.text, 'Error Response:',response.json())
    else:
        # This is a success response for a single batch of records
        print('Status:',response.status_code,'Headers:',response.headers,'Response:',response.json())

exit()
Steve Socha
  • Hi Steve, thanks for the answer. I am having an issue where the loop iterates over the data character-by-character (not record-by-record). In other words, if I do "for data_point in j: print(data_point)", it returns every single character one by one rather than each record. Any idea how I can fix this? Thanks! – pengz Jun 30 '16 at 19:28
  • Ah, I think I know why it is doing that: the data type of the variable "j" is a string, not a list ("print(type(j))"). – pengz Jun 30 '16 at 19:36
  • I was able to fix it by converting the cursor data into a list and iterating the list. Thanks so much again for your help. – pengz Jun 30 '16 at 19:57
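
For reference, a sketch of the fix described in the last comment (convert the cursor to a list, then chunk that list) might look like the following; it assumes the same config variables, URL, headers, and "computers" aggregation cursor as in the question, and borrows the split_size idea from the answer above:

#Untested sketch of the fix from the comments: materialize the cursor
#into a list, then slice the list into chunks and JSON-encode each
#chunk separately before POSTing it.
records = list(computers)
split_size = 10

for chunk in [records[x:x+split_size] for x in range(0, len(records), split_size)]:
    payload = dumps({"records": chunk})
    response = requests.post(url=svcnow_url, auth=(svcnow_user, svcnow_pwd),
                             headers=headers, data=payload)

    #Check for HTTP codes other than 200
    if response.status_code != 200:
        print('Status:', response.status_code, 'Response Text:', response.text)
    else:
        print('Status:', response.status_code, 'Response:', response.json())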