
I have created a script that:

Imports a list of IPs from a .txt file (around 5K)

Connects to a REST API and performs a query based on each IP (web logs for each IP)

Data is returned from the API and some calculations are done on the data

Results of calculations are written to a .csv

At the moment it's really slow: it takes one IP at a time, does everything, and then moves on to the next IP. I may be wrong, but from my understanding, with threading or multiprocessing I could have 3-4 threads each handling an IP, which would increase the speed of the tool by a huge margin. Is my understanding correct, and if it is, should I be looking at threading or multiprocessing for my task?

Any help would be amazing.

Random info: running Python 2.7.5 on Win7 with plenty of resources.

NickDa

2 Answers


With multiprocessing, a primitive way to do this would be to chunk the file into 5 equal pieces, give each piece to one of 5 different processes that write their results to 5 different files, and merge the results when all processes are done.
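A minimal sketch of that chunking approach, assuming Python 2.7 as in the question; `query_api`, the file names, and the chunk count are placeholder assumptions rather than code from the question:

```python
import csv
import multiprocessing


def query_api(ip):
    """Placeholder for the REST API query and calculations for one IP."""
    return [ip, 0]  # replace with the real request + calculations


def worker(chunk, out_path):
    """Process one chunk of IPs and write results to this process's own CSV."""
    with open(out_path, "wb") as f:  # "wb" for the csv module on Python 2/Windows
        writer = csv.writer(f)
        for ip in chunk:
            writer.writerow(query_api(ip))


if __name__ == "__main__":
    with open("ips.txt") as f:
        ips = [line.strip() for line in f if line.strip()]

    n = 5
    chunks = [ips[i::n] for i in range(n)]  # 5 roughly equal pieces

    procs = []
    for i, chunk in enumerate(chunks):
        p = multiprocessing.Process(target=worker, args=(chunk, "results_%d.csv" % i))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()

    # merge the partial files once every process is done
    with open("results.csv", "wb") as out:
        for i in range(n):
            with open("results_%d.csv" % i, "rb") as part:
                out.write(part.read())
```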

You can implement the same logic with Python threads without much complication, and it probably won't make any difference, since the bottleneck is most likely the API. So in the end it does not really matter which approach you choose here.
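For comparison, a thread-based sketch of the same chunking idea, again with a placeholder `query_api`; because threads share memory, they can append to one shared list instead of writing separate files that need merging:

```python
import csv
import threading


def query_api(ip):
    """Placeholder for the REST API query and calculations for one IP."""
    return [ip, 0]


def worker(chunk, results, lock):
    """Process one chunk of IPs, appending results to a shared list."""
    for ip in chunk:
        row = query_api(ip)
        with lock:
            results.append(row)


if __name__ == "__main__":
    with open("ips.txt") as f:
        ips = [line.strip() for line in f if line.strip()]

    n = 5
    chunks = [ips[i::n] for i in range(n)]
    results, lock = [], threading.Lock()

    threads = [threading.Thread(target=worker, args=(chunk, results, lock))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with open("results.csv", "wb") as f:  # "wb" for csv on Python 2/Windows
        csv.writer(f).writerows(results)
```

The lock only guards the shared list; the API calls themselves still run concurrently, which is where the time would be saved.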

There are two things to consider though:

  • Using threads, you are not really using multiple CPUs, hence you have "wasted resources".
  • Using multiprocessing will use multiple processors, but it is heavier on start-up. So you will benefit from never stopping the script and keeping the processes alive if the script needs to run very often.

Since the information you gave about the scenario where you use this script (or rather, program) is limited, it is really hard to say which approach is better.

oz123
  • Indeed, it doesn't make sense to spawn a separate process for each query. But grouping and splitting them beforehand, so that each process has several queries in a row to work through, could relieve the bottleneck. @user118704: Can you give some more information on which module you use or how you process the query? – HelloWorld Aug 29 '15 at 14:08
  • Indeed, the bottleneck would be the API; however, I think I will be able to increase the speed by doing multiple queries. Basically, I am querying a Splunk instance for logs. As for modules, I'm assuming you are asking which module I used to connect to the API? It's urllib and urllib2, and then I do some calculations on the returned data using numpy. – NickDa Aug 29 '15 at 14:20

multiprocessing is definitely the way to go forward. You could start a process that reads the IPs and puts them in a multiprocessing.Queue, and then create a few processes (depending on available resources) that read from this queue, connect to the API, and make the requests. These requests will then be made in parallel, and if the API can handle them, your program should finish faster. If the calculations are complex and time-consuming, the output from the API can be put into another Queue, from which other processes that you start can read, perform the calculations, and store the results. You may have to start a collector process to collect the final outputs.
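A rough sketch of that queue-based pipeline, under the assumption of Python 2.7 from the question; `query_api`, the file names, and the number of workers are placeholders, not part of the original answer:

```python
import csv
import multiprocessing


def query_api(ip):
    """Placeholder for the API request and calculations for one IP."""
    return [ip, 0]


def reader(ip_queue, num_workers):
    """Read IPs from the file and feed them to the workers."""
    with open("ips.txt") as f:
        for line in f:
            if line.strip():
                ip_queue.put(line.strip())
    for _ in range(num_workers):
        ip_queue.put(None)               # one sentinel per worker


def worker(ip_queue, result_queue):
    """Pull IPs, query the API, push results until the sentinel arrives."""
    while True:
        ip = ip_queue.get()
        if ip is None:
            result_queue.put(None)       # tell the collector this worker is done
            break
        result_queue.put(query_api(ip))


def collector(result_queue, num_workers):
    """Write results to the CSV until every worker has reported it is done."""
    finished = 0
    with open("results.csv", "wb") as f:  # "wb" for csv on Python 2/Windows
        writer = csv.writer(f)
        while finished < num_workers:
            row = result_queue.get()
            if row is None:
                finished += 1
            else:
                writer.writerow(row)


if __name__ == "__main__":
    num_workers = 4
    ip_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    procs = [multiprocessing.Process(target=reader, args=(ip_queue, num_workers)),
             multiprocessing.Process(target=collector, args=(result_queue, num_workers))]
    procs += [multiprocessing.Process(target=worker, args=(ip_queue, result_queue))
              for _ in range(num_workers)]

    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each worker forwards a `None` sentinel to the collector when it finishes, so the collector knows when every worker is done and the CSV can be closed.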

You can find some sample code for such problems in this Stack Overflow question. If you require further explanation or sample code, let me know in the comments.

gnub
  • There is a simple example of using Process and Queue at http://stdioe.blogspot.com.tr/2015/09/parallel-numeric-integration-using.html – jbytecode Sep 02 '15 at 14:23