
I'm having trouble working out how to design a Python worker + API setup that processes data gathered from the Internet and lets other (external) services access this data through an API. I should say up front that I don't have a formal Computer Science background, just a lot of trial and error that got me this far.

The Question

How can I avoid finding my API busy because the worker is fetching data? Threading and queues seem to be the solution, but I'm having trouble adapting my project to them. Could someone suggest which approach should be used in this case, and point me to projects that may be similar to this one?

I've already posted a question about this on Stack Overflow without getting any answer; you can find the code there (my first question + code).

This problem can also be framed at a bigger scale in this question (multiple workers + Flask APIs).

This is my situation, more or less: [script structure diagram]


  • Perhaps you can look up Flask-Celery – OneCricketeer Jan 23 '18 at 15:12
  • I've just checked how Celery works. Is it also capable of helping me handle when I can or can't access my global var? – entalpia Jan 23 '18 at 15:23
  • I'm not sure I understand the global var, but serializing an object across tasks (for example, using `pickle`) should be possible – OneCricketeer Jan 23 '18 at 16:29
  • By "global var" I mean a variable that is shared between both processes. It wasn't clear, I'm sorry. Pickle sounds interesting, but I don't actually understand whether it's suited to this case. – entalpia Jan 23 '18 at 17:06
  • Pickle is a native Python serialization format. It can be shared between processes, or over a network socket. If you want a more robust persistence mechanism, then you can try a database. For example, this uses Redis. http://allynh.com/blog/flask-asynchronous-background-tasks-with-celery-and-redis/ – OneCricketeer Jan 23 '18 at 20:38
  • Thank you for pickle, I didn't know about it. It could be really useful in certain situations. I also saw the example you gave me, thank you. It seems that next time I step into a project like this, I'll need to do it this way. It doesn't sound that straightforward at first, but maybe it isn't that bad. Thank you!! – entalpia Jan 24 '18 at 09:42

1 Answer


Use the `threading` library. Keep the main thread free for handling responses, and spin off 'job' threads that are `join()`ed to each other to form a queue.

You'll need to provide the API user with a job id (best to persist these, and perhaps progress and status information, outside the app in a database), and then allow them to query their job's status or download its results from another endpoint. You could keep another queue of threads handling anything compute-intensive related to collecting/downloading.
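A minimal sketch of that pattern, using a standard-library `queue.Queue` drained by one daemon thread rather than chained `join()` calls; the endpoint names, the `jobs` dict, and `fetch_data` are all invented for illustration, and the dict stands in for the database mentioned above:

```python
import queue
import threading
import uuid

from flask import Flask, jsonify

app = Flask(__name__)
job_queue = queue.Queue()
jobs = {}  # job_id -> {"status", "result"}; use a database in production


def fetch_data():
    """Placeholder for the actual Internet-fetching work."""
    return {"example": 42}


def worker():
    # Drain jobs one at a time so fetching never blocks the request handlers.
    while True:
        job_id = job_queue.get()
        jobs[job_id]["status"] = "running"
        jobs[job_id]["result"] = fetch_data()
        jobs[job_id]["status"] = "done"
        job_queue.task_done()


@app.route("/submit", methods=["POST"])
def submit():
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    job_queue.put(job_id)
    return jsonify({"job_id": job_id}), 202


@app.route("/status/<job_id>")
def status(job_id):
    job = jobs.get(job_id)
    if job is None:
        return jsonify({"error": "unknown job"}), 404
    return jsonify(job)


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    app.run()
```

Because the worker thread is the only one doing the slow fetching, `/submit` and `/status` return immediately instead of blocking while data is collected.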

All that said, this can also be accomplished with a microservice architecture, in which you have one app scheduling jobs, one app retrieving/processing data, and one app handling status/download requests. These would be joined via HTTP interfaces (RESTful would be great) and a database for common persistence of data.

The benefit of this last approach is that each app can be scaled independently, from both an availability and a resources perspective, within a framework like Kubernetes.

UPDATE:

I just read your original post, and your main issue seems to be that you're persisting your data in a global variable rather than a database. Keep your data in a database, and provide it to clients either through a separate application or through a set of threads set aside for that purpose in your current app.
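As a rough illustration of that split, assuming SQLite via the standard-library `sqlite3` module (the table, columns, and `data.db` path are made up for the example): the worker calls `save_result`, the request handlers call `load_result`, and no global variable is shared between them.

```python
import sqlite3

DB_PATH = "data.db"  # hypothetical path


def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT)"
        )


def save_result(key, value):
    # Called by the worker: replaces the global-variable write.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT OR REPLACE INTO results (key, value) VALUES (?, ?)",
            (key, value),
        )


def load_result(key):
    # Called by the API handler: replaces the global-variable read.
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT value FROM results WHERE key = ?", (key,)
        ).fetchone()
    return row[0] if row else None
```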

UPDATE (response to OP's comment):

Stefano, in the use case you're describing, there is no need for any of the components to be connected to each other. They all only need to be connected to the database.

The data collection service should collect the data, and then submit it to the database for storage, where the "request data" component can find and retrieve it.

If there is a need for user input to this process, then the "submit request for data" component should accept that request, provide the user with an id, and then store that job's requirements in the database for the data collector component to discover. You would then need one more component for serving status/progress on the job from the database to the user.
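One possible shape for that flow, again purely a sketch with invented table, column, and status names, and assuming a `jobs` table already exists in SQLite:

```python
import sqlite3
import time
import uuid

DB_PATH = "jobs.db"  # hypothetical path


def submit_request(params):
    """'Submit request for data' component: store the job, hand back an id."""
    job_id = str(uuid.uuid4())
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO jobs (id, params, status) VALUES (?, ?, 'pending')",
            (job_id, params),
        )
    return job_id


def collector_loop():
    """Data collector component: discover pending jobs and work them."""
    while True:
        with sqlite3.connect(DB_PATH) as conn:
            row = conn.execute(
                "SELECT id, params FROM jobs WHERE status = 'pending' LIMIT 1"
            ).fetchone()
        if row is None:
            time.sleep(1)  # nothing pending; poll again shortly
            continue
        job_id, params = row
        # ... fetch and process the data described by `params` here ...
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))


def job_status(job_id):
    """Status component: read the job's progress straight from the database."""
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT status FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
    return row[0] if row else None
```

Note that the three functions never talk to each other, only to the database, which is the whole point.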

What DB are you using? If it's slow/busy, you can scale the resources available to it (e.g. RAM), or you can look at batching the updates from your data collector, which is the most likely culprit for unnecessary DB overhead. How many transactions are you submitting per second, and of what size?
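Batching here just means accumulating rows and committing them in one transaction instead of one transaction per row. A sketch using `sqlite3.executemany`; the buffer size and table are arbitrary assumptions:

```python
import sqlite3

BATCH_SIZE = 100  # arbitrary; tune against your actual write rate
_buffer = []


def record(row, db_path="data.db"):
    """Accumulate rows and flush them to the DB as a single transaction."""
    _buffer.append(row)
    if len(_buffer) >= BATCH_SIZE:
        with sqlite3.connect(db_path) as conn:
            # One executemany inside one transaction instead of N commits.
            conn.executemany(
                "INSERT INTO results (key, value) VALUES (?, ?)", _buffer
            )
        _buffer.clear()
```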

Also, if you're Italian, you can ask me in your own language if that makes it easier to communicate these technical details.

John R
  • thank you for your answer, I really appreciate it. I've been working on a similar project for a while and have gone through a lot of attempts and tests. I also tried a DB (it was my first attempt), but I had to move away from it because I was having issues accessing the DB when it was busy serving other processes. Now I've divided the tasks over multiple mini services (like you suggested, and it works), but the central service that collects all the information has to deal with accessing each mini service on one side, and with a mini service with a web interface on the other, and sometimes it's busy. :( – entalpia Jan 24 '18 at 09:38