
I have various services in my application that make external requests to the web to scrape data; for example, Service A makes requests to imdb.com/query and Service B to reddit.com/query. I want to add a service between these services and the web for these outgoing requests, so that:

  • The service can cache responses, with a configurable caching period.

  • The service is inspectable: it can log requests, response times, and various request metadata, possibly with an option to choose the cache backend (in-memory DB, RDBMS, files?).

  • The service should not care about the schema of the requests beyond their being outbound HTTP/HTTPS requests (the client interface should not change, except for the target it sends requests to).

I can centralize caching and logging this way.
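To make the last requirement concrete, here is a rough Node.js sketch of what I have in mind: the service code stays the same and only the target changes, pointing at a hypothetical intermediary at cache-proxy.internal:3128 (this covers plain HTTP; HTTPS through a forward proxy would also need CONNECT tunneling):

// Rough sketch: requests go to the intermediary, which forwards, caches,
// and logs them. cache-proxy.internal:3128 is a hypothetical proxy address.
const http = require('http');

function fetchViaProxy(targetUrl, callback) {
    const req = http.request({
        host: 'cache-proxy.internal',   // the intermediary service
        port: 3128,
        path: targetUrl,                // forward proxies take the absolute URL as the path
        headers: { Host: new URL(targetUrl).host },
    }, function (res) {
        let body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { callback(null, body); });
    });
    req.on('error', function (err) { callback(err); });
    req.end();
}

// Service A and Service B keep their own request logic; only the target changes.
fetchViaProxy('http://imdb.com/query', function (err, body) {
    if (!err) console.log(body.length + ' bytes (possibly served from cache)');
});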

I could not find anything useful after searching, even though this feels like a very common scenario. (I thought of using a forward proxy in the first place, but they seem painful to set up and extend; tell me if I'm wrong.) I'm not sure there is a better term for such a scenario (hence the made-up title :))

Is there a tool, SaaS, or OSS out there somewhere that can fulfill these needs? Or am I looking at the problem from completely the wrong perspective?

woryzower

2 Answers


This answer recommends a tool called Squid:

Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently-requested web pages. Squid has extensive access controls and makes a great server accelerator. It runs on most available operating systems, including Windows, and is licensed under the GNU GPL.
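For example, the "configurable caching period" requirement maps to Squid's refresh_pattern directive, and request logging to access_log. A minimal squid.conf sketch (the patterns, sizes, and lifetimes below are illustrative, not a recommended configuration):

# Listen for outbound requests from the services
http_port 3128

# On-disk cache backend (100 MB)
cache_dir ufs /var/spool/squid 100 16 256

# Log every request, with response time and status
access_log /var/log/squid/access.log squid

# Cache matching responses for at least 60 and at most 1440 minutes
refresh_pattern -i imdb\.com/query   60 100% 1440
refresh_pattern -i reddit\.com/query 60 100% 1440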

cntlm seems to be another option (source). Also see this answer for a great list of various HTTP proxy tools.

Does either of these fulfill your particular use case?

Aleksi

Not sure why you need a separate service for this. You can pretty much write your own caching service if you back the data with Redis.

Redis is an in-memory database with excellent response times. The only thing is, you need a permanent repository of the same data, just in case Redis goes down and you still need access to the data.

Here is a Node.js sample; hope it helps. Yes, you can also configure the caching period there if you want to.

// Look up a URL in the Redis cache; on a miss, fall back to the main
// database and repopulate the cache with a 24-hour expiry.
module.exports.findUrlDataCached = function (db, redis, url, callback) {
    redis.get(url, function (err, reply) {
        if (err) callback(null);
        else if (reply) {
            // URL and response exist in cache
            callback(JSON.parse(reply));
        } else {
            // URL doesn't exist in cache - we need to query the main database
            db.collection('text').findOne({
                url: url
            }, function (err, doc) {
                if (err || !doc) callback(null);
                else {
                    // URL found in database: save to cache and return to client
                    redis.set(url, JSON.stringify(doc), function () {
                        // Set the expiry time to 24 hours from now
                        redis.expireat(url, parseInt((+new Date) / 1000, 10) + 86400);
                        callback(doc);
                    });
                }
            });
        }
    });
};
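A hypothetical call site, assuming db is an already-connected MongoDB handle and redis a node_redis client (neither shown here), and that the module above is saved as cache.js:

var cache = require('./cache');

cache.findUrlDataCached(db, redis, 'http://imdb.com/query', function (doc) {
    if (doc) console.log('Got data (possibly from cache)');
    else console.log('Lookup failed or URL not stored yet');
});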
Umashankar Das