0

A PHP or Python script periodically fetches a large dataset of IP addresses (/32 netmask) from a remote database. In-between fetches the dataset will be temporarily stored in APC or Memcached key store.

The main job of the script is to check if a given ip-address exists in the database/cache (think: "blacklist").

What would be the most efficient (performance wise) way to:

  1. Store the IP addresses in APC/Memcache
  2. Compare a given IP to the stored IP list.

What i have come up with so far:

Alternative 1 Store all IP addresses as a large array-list as the value of a single key in APC, then do a

if (in_array("192.168.0.1", $ip_list_from_cache))

Alternative 2 Store each IP as key-name in APC, then do a

if (apc_exists('192.168.0.1')

This is a large list and i want the compare check to be very fast.

Thanks in advance for any comments!

Thomas Orozco
  • 53,284
  • 11
  • 113
  • 116
Leo Houer
  • 347
  • 2
  • 4
  • 8

1 Answers1

0

The solution to performance dilemmas is usually to benchmark both solutions.

In this case though, I'd say the cache approach makes a lot more sense: the time complexity of in_array is O(N), that is, a linear sweep. On the other hand, Caches are usually implemented as hash tables, where lookup is O(1).

Also, if you aggregate the records in Memcached, you'll avoid wasting a lot of RAM duplicating the list in memory once per web worker process.

It would also arguably be a much cleaner solution.


On a side note, did you consider doing this at another level? With some light scripting, you could do your checks at the LB (e.g. Nginx) level.

Community
  • 1
  • 1
Thomas Orozco
  • 53,284
  • 11
  • 113
  • 116
  • By second thought, would it speed up things if IP addresses were stored as hashes (e.g. converted to unsigned int) vs string? – Leo Houer Sep 20 '13 at 20:11
  • @LeoHouer That question does have more detail: http://stackoverflow.com/questions/2473989/list-of-big-o-for-php-functions – Thomas Orozco Sep 20 '13 at 20:15