2

I have written a class, ProxyFinder which connects to random ips and first pings them, and if they respond, attempts to create a http proxy connection through common proxy ports.

Currently, it is set up just connecting to random ips. This is relatively fast, discovering a few proxys an hour. However, I would like to somehow check if I have already previously connected to an ip. First I tried keeping them in a list, but that was using over 10GB of ram.. I included a method that I tried in the code below which writes the data to a cache using a RandomAccessFile, but this is incredibly slow to search through the entire file for each connection as it gets larger.

I am storing the data in as small of format as possible, simply four bytes for each ip. Even though, this is 4 * 256 * 256 *256 * 256 bytes.. = 16gb of raw ram.. or a 16gb file to search each time you want to test another ip.

I also tried creating a separate thread to generate ips, check them against the file, and then add them to a queue that the probe threads could pull from. It could not keep up with the probe threads either.

How can I quickly check if I have already connected to an IP or not, without being incredibly slow or using ridiculous amounts of memory?

package net;

import java.io.File;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/**
 *
 * @author Colby
 */
public class ProxyFinder {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {

        int[] ports = {
            1080, 3128, 3128, 8080
        };

        System.out.println("Starting network probe");

        AtomicInteger counter = new AtomicInteger();
        for (int i = 0; i < 500; i++) {
            new Thread(() -> {

                do {
                    try {
                        byte[] addrBytes = randomAddress();//could be getNextAddress also
                        if (addrBytes == null) {
                            break;
                        }

                        InetAddress addr = InetAddress.getByAddress(addrBytes);
                        if (ping(addr)) {
                            float percent = (float) ((counter.get() / (256f * 256f * 256f * 256f)) * 100F);
                            if (counter.incrementAndGet() % 10000 == 0) {
                                System.out.println("Searching " + percent + "% network search");
                            }

                            for (int port : ports) {
                                try {
                                    Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(addr, port));

                                    HttpURLConnection con = (HttpURLConnection) new URL("http://google.com").openConnection(proxy);

                                    con.setConnectTimeout(1000);
                                    con.setReadTimeout(1000);
                                    con.setRequestMethod("GET");
                                    con.setRequestProperty("User-Agent", "Mozilla/5.0");

                                    con.getContent();
                                    con.disconnect();

                                    System.out.println("Proxy found!" + addr.getHostAddress() + ":" + port + "  Found at " + percent + "% network search");

                                } catch (Exception e) {
                                }
                            }

                            //
                            //System.out.println("Ping response: --" + addr.getHostAddress() + "-- Attempt: " + counter.get() + " Percent: " + percent + "%");
                        } else {
                            //System.out.println("Ping response failed: " + addr.getHostAddress() + " attempt " + counter.incrementAndGet());
                        }

                    } catch (Exception e) {
                        //e.printStackTrace();
                    }

                } while (true);

            }).start();
        }
    }

    private static RandomAccessFile cache;

    private static byte[] getNextAddress() throws Exception {
        if (cache == null) {
            cache = new RandomAccessFile(File.createTempFile("abc", ".tmp"), "rw");
        }

        byte[] check;
        checkFile:
        {
            byte[] addr = new byte[4];
            do {
                check = randomAddress();
                inner:
                {
                    cache.seek(0);
                    while (cache.length() - cache.getFilePointer() > 0) {
                        cache.readFully(addr);
                        if (Arrays.equals(check, addr)) {
                            break inner;
                        }
                    }
                    cache.write(check);
                    break checkFile;
                }

            } while (true);
        }
        return check;
    }

    private static byte[] randomAddress() {
        return new byte[]{(byte) (Math.random() * 256), (byte) (Math.random() * 256), (byte) (Math.random() * 256), (byte) (Math.random() * 256)};
    }

    private static boolean ping(InetAddress addr) throws Exception {
        return addr.isReachable(500);
    }
}

Also in case anyone is wondering, I've had this running for 12 hours now and it's discovered about 50 proxys, and pinged about 2.09664E-4% of the ip range which is about 1.2 million ips. not bad for the bandwidth allocated (0.5Mbps)

EDIT: I am starting to think that maybe the overhead of storing and checking all of these IPs would be even greater than simply connecting to many duplicates near the end of searching the ip range..

Colby
  • 452
  • 4
  • 19
  • 1
    why don't you use a database? – JohnnyAW Mar 12 '15 at 12:38
  • @JohnnyAW Surely a database couldn't be faster than raw file access? If it could, would it be able to search through potentially 16GB of records 1000 times a second? – Colby Mar 12 '15 at 12:40
  • it is surely faster than a raw file. However it is obviously slower than RAM, the overall speed will depend on the db-server and how you design your queries. e.g. you could search for multiple ip's in 1 query... – JohnnyAW Mar 12 '15 at 12:44
  • @JohnnyAW That's a good idea. So maybe generate 1000 ips, check to make sure there are no duplicates in those, then send one massive query? I will run some tests on this. – Colby Mar 12 '15 at 12:49
  • @Colby Please see my answer in the chat and the corrected code in my answer below. – SubOptimal Mar 12 '15 at 20:49
  • @SubOptimal Thank you for the help. You've definitely gotten me closer to an answer. However I can still see a possibility where the code would fail. Imagine you have address 1.2.3.4 and 4.3.2.1. Both would map to bit 24 which would cause a false duplicate! – Colby Mar 12 '15 at 23:50
  • @SubOptimal Please take a look at this thread to see the issue I'm speaking of. http://stackoverflow.com/questions/2151084/map-a-2d-array-onto-a-1d-array-c That logic would make your code work completely, however I am having trouble comprehending how to extend it. For example, "array[width * row + col] = value; " works correctly when changing a 2d array to 1d, but this requires changing a 4d array into a 1d so would it be like so: "array[256 * 256 * row + col] = value; "? Or how exactly. Maybe someone else can wrap their head around this and explain it in easier terms. – Colby Mar 13 '15 at 00:05
  • @SubOptimal I have started another question in relation to how to convert multi dimensional arrays to a single dimensional one here http://stackoverflow.com/questions/29022714/java-mapping-multi-dimensional-arrays-to-single – Colby Mar 13 '15 at 00:20

3 Answers3

2

I would not store the whole IP address because of the amount of data. To store them in a array of BitSet would consume less memory.

edit previous code version removed, it was not correct

The version below generates random addresses and persist them in a file. If the persistence file of a previous run is found, the information of the seen addresses is restored from that file.

Following case was not handled correctly in the initial version:

assuming that no address was already seen
   1.0.0.1 - seen false
   2.0.0.2 - seen false
   2.0.0.1 - seen true, which was wrong and is correctly handled by code below

See the comments in the code for further information.

public class KeepSeenAddresses {

    static final int FILE_BUFFER_SIZE = 81_920;
    static final int RANGES_SIZE = 256;

    // to store 256 ranges of 255*255*255+1 addresses
    static BitSet[] ranges;

    // Random(1) is taken only for demonstration purpose, so the second
    // application run will find the same seen addresses from previous run
    static Random random = new Random(1);
    // for normal use it's better to have better randomness
    //static Random random = new Random(System.currentTimeMillis());

    public static void main(String[] args)
            throws IOException, ClassNotFoundException {

        if (!readRanges()) {
            initRanges();
        }

        // this case was failing in the initial solution
        // uncomment this block to see how all edge cases
        // which where mentioned in other comments are handled
        /*
         byte[][] addresses = {
             {1, 0, 0, 1}, 
             {2, 0, 0, 2}, 
             {2, 0, 0, 1},
             {1, 2, 3, 4}, 
             {4, 3, 2, 1}, 
             {(byte)128, 0, 0, 0},
             {(byte)255, (byte)255, (byte)255, (byte)255}
         };
         seenAddress(addresses[0]);
         seenAddress(addresses[1]);
         seenAddress(addresses[3]);
         seenAddress(addresses[5]);
         seenAddress(addresses[6]);
         for (byte[] addressBytes : addresses) {
         System.out.printf("seen %s before: %s%n",
         prettyAddress(addressBytes),
         seenBefore(addressBytes)
         );
         }
         */
        processAddresses();

        persistRanges();
    }

    /**
     * Read the seen addresses from a file.
     *
     * @return <code>true</code> if the file was found and has the expected
     * number of ranges, otherwise <code>false</code>
     * @throws IOException
     * @throws ClassNotFoundException
     */
    private static boolean readRanges() throws IOException, ClassNotFoundException {
        File rangesStore = new File("addresses.bin");
        if (!rangesStore.exists()) {
            return false;
        }
        System.out.print("found previous rangesStore... ");
        try (ObjectInputStream ois = new ObjectInputStream(
                new BufferedInputStream(
                        new FileInputStream(rangesStore), FILE_BUFFER_SIZE
                )
        )) {
            ranges = (BitSet[]) ois.readObject();
        }
        if (ranges.length != RANGES_SIZE) {
            System.out.printf("wrong size of rangesStore: expected %d"
                    + "  found: %d%n", RANGES_SIZE, ranges.length);
            return false;
        } else {
            System.out.printf("restored ranges: %d%n", ranges.length);
            return true;
        }
    }

    /**
     * Initialize the address ranges array. All address flags will be set to
     * <code>false</code>.
     */
    private static void initRanges() {
        System.out.print("initialize new rangesStore... ");
        ranges = new BitSet[RANGES_SIZE];
        for (int i = 0; i < RANGES_SIZE; i++) {
            BitSet bitSet = new BitSet(255 * 255 * 255 + 1);
            for (int j = 0; j < 255 * 255 * 255 + 1; j++) {
                bitSet.clear(j);
            }
            ranges[i] = bitSet;
        }
        System.out.printf("initialized ranges: %d%n", RANGES_SIZE);
    }

    /**
     * For demonstration purpose.<br>
     * Generates some random IPv4 addresses. If the address was not seen before
     * the flag for this address will be set to <code>true</code>.
     */
    private static void processAddresses() {
        for (int i = 0; i < 10; i++) {
            byte[] addrBytes = randomAddress();
            boolean seenBefore = seenBefore(addrBytes);
            if (!seenBefore) {
                seenAddress(addrBytes);
                seenBefore = false;
            }
            System.out.printf("seen %s before: %s%n",
                    prettyAddress(addrBytes),
                    seenBefore
            );
        }
    }

    /**
     * Persist the address ranges array. The file size is around 500MB.
     *
     * @throws IOException
     */
    private static void persistRanges() throws IOException {
        System.out.print("persist rangesStore... ");
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new BufferedOutputStream(
                        new FileOutputStream("addresses.bin"), FILE_BUFFER_SIZE)
        )) {
            oos.writeObject(ranges);
        }
        System.out.printf("written ranges: %d%n", ranges.length);
    }

    /**
     * Keep a flag which address has been seen already.
     *
     * @param addrBytes IPv4 address in four bytes
     */
    static void seenAddress(byte[] addrBytes) {
        int rangeIndex = (int) addrBytes[0] & 0xff;
        int rangeOffset = ((int) addrBytes[1] & 0xff * 0xffff)
                + ((int) addrBytes[2] & 0xff * 0xff)
                + ((int) addrBytes[3] & 0xff);
        ranges[rangeIndex].set(rangeOffset);
    }

    /**
     * Check if the passed address was seen before.
     *
     * @param addrBytes IPv4 address in four bytes
     * @return <code>true</code> if the address was seen before, otherwise
     * <code>false</code>
     */
    static boolean seenBefore(byte[] addrBytes) {
        int rangeIndex = (int) addrBytes[0] & 0xff;
        int rangeOffset = ((int) addrBytes[1] & 0xff * 0xffff) + ((int) addrBytes[2] & 0xff * 0xff) + ((int) addrBytes[3] & 0xff);
        return ranges[rangeIndex].get(rangeOffset);
    }

    /**
     * Convert the IPv4 address into pretty string.
     *
     * @param addrBytes IPv4 address in four bytes
     * @return pretty String of the IPv4 address
     */
    static String prettyAddress(byte[] addrBytes) {
        return String.format("%03d.%03d.%03d.%03d",
                (int) addrBytes[0] & 0xff,
                (int) addrBytes[1] & 0xff,
                (int) addrBytes[2] & 0xff,
                (int) addrBytes[3] & 0xff);
    }

    /**
     * Generate a random IPv4 address.
     *
     * @return four bytes of a random generated IPv4 address
     */
    private static byte[] randomAddress() {
        byte[] bytes = new byte[4];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) random.nextInt(256);
        }
        return bytes;
    }
}
SubOptimal
  • 22,518
  • 3
  • 53
  • 69
  • This seems to be throwing false seenBefore's at almost a 100% rate after adding 20-30 addresses. Am I missing something? It seems like this would work. – Colby Mar 12 '15 at 13:26
  • how many IPs are there? how much ram do u have? what happens if u need to pause and re run your program from where it stopped before? – tgkprog Mar 12 '15 at 13:32
  • 1
    Also it seems like you would need initialization code like the following for this to work. Please correct me if im wrong. private static BitSet[][][][] bs = new BitSet[256][256][256][256]; static { for (int i = 0; i < bs.length; i++) { for (int l = 0; l < bs.length; l++) { for (int a = 0; a < bs.length; a++) { for (int b = 0; b < bs.length; b++) { bs[i][l][a][b] = new BitSet(); } } } } } – Colby Mar 12 '15 at 13:34
  • @Colby From jvadoc [BitSet](http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) "A BitSet is not safe for multithreaded use without external synchronization." You missed the synchronization. – SubOptimal Mar 12 '15 at 13:37
  • Consider you have 256.0.0.0, 0.256.0.0, 0.0.256.0, 0.0.0.256. hasSeen would also return true for 256.256.256.256.. which isnt acceptable. Problem remains the same after synchronizing the methods. – Colby Mar 12 '15 at 13:38
  • 1
    @Colby I tought you wanted some suggestions not a fully working example. ;-) The code above has not been tested. So you should invert the meaning of the bits before. – SubOptimal Mar 12 '15 at 13:42
  • @Colby "256.0.0.0, 0.256.0.0, 0.0.256.0, 0.0.0.256." not valid addresses. I fixed my answer. – SubOptimal Mar 12 '15 at 13:46
  • @tgkprog IP4 has 4.294.967.296 addresses (255.255.255.255) but you need only 32bit (8bit + 8bit + 8bit + 8bit) to represent them. For persisiting them you should implement something. `BitSet` is `Serializable` so you could save the object with it's state. – SubOptimal Mar 12 '15 at 13:50
  • But still after adding only 20-30 ips 99% of ips are showing as hasSeen when they are not. Sorry if I am missing how this can be used in the current form. – Colby Mar 12 '15 at 13:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/72842/discussion-between-suboptimal-and-colby). – SubOptimal Mar 12 '15 at 14:02
  • The idea of using `BitSet` is very good, but the algorithms shown are incorrect. Please fix them, so this answer can be upvoted. – fps Mar 12 '15 at 14:40
  • @SubOptimal Maybe I'm wrong, but wouldn't you need to change `bs[0]` to `bs[i]` in both `seenAddress()` and `seenBefore()` methods? – fps Mar 12 '15 at 14:48
  • @Magnamag Thanks. Copy and paste error. Was working on my PC. ;-) – SubOptimal Mar 12 '15 at 15:06
  • 1
    @SubOptimal Even though you've fixed that `bs[i]` bug, your algorithm is still wrong. You need a 4-dimensions matrix instead of an array of 4 elements. This is because you need to store a bit for every possible combination of 4 numbers in the range 0..255. Please edit your answer. – fps Mar 12 '15 at 15:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/72874/discussion-between-suboptimal-and-magnamag). – SubOptimal Mar 12 '15 at 20:27
0

I have ported code from another solution here to fit this problem: Java- Mapping multi-dimensional arrays to single

The answer to the above question gives an in depth explanation of how the following code works. If anyone else would like to post a more in depth answer on this thread I will award it the answer.

static BitSet set;

static int pos(int i, int j, int k, int m) {
    return ((256*256*256) * i) + ((256*256) * j) + (256 * k) + m;
}

static boolean get(byte[] addr) {
    return set.get(pos(addr[0], addr[1], addr[2], addr[3]));
}

static void set(byte[] addr, boolean flag) {
    set.set(pos(addr[0], addr[1], addr[2], addr[3]), flag);
}
Community
  • 1
  • 1
Colby
  • 452
  • 4
  • 19
-1

Use a data base like MySql and hibernarte with level 1 & 2 cache.

It will be faster than RAM if u configure a cache with hibernate and tune ur db to use few gb of cache too. i think they all do. can configure an external cache like ehcahe when an be configured to live on another process + file with limits on size and time. Db knows how to index and seek things faster than even pure RAM - at such sizes as your IP

Plus you can improve by partitioning table data & index by first char, 2nd char etc

tgkprog
  • 4,493
  • 4
  • 41
  • 70
  • just wondering, why would it be faster than RAM if he would use `HashSet` to store the IP's? – JohnnyAW Mar 12 '15 at 13:02
  • Cause to search a small number (say 90,000 or even 999,999) hash set - ram winner, but when you go to IPs Ip4, Ip6 ... need better data organization, indexing and storing. – tgkprog Mar 12 '15 at 13:30
  • 1
    thats weird, because you have O(1) for lookup on a `HashSet` and it can handle Integer.MAX_VALUE entries. Could you explain deeper on this one? – JohnnyAW Mar 12 '15 at 13:36
  • 1
    I doesn't make sense to use a relational database to store key/value pairs, with the value being a `boolean`. Either use RAM or a key/value-based storage. And it does make even less sense to use a heavy-weight ORM such as Hibernate. – fps Mar 12 '15 at 14:31
  • . The length of an IPv6 address is 128 bits, compared with 32 bits in IPv4. The address space therefore has 2128 or approximately 3.4×1038 addresses. – tgkprog Mar 12 '15 at 19:09
  • @tgkprog `2^128` bits is about `3.8x10^25` tera-bytes. I can hardly conceive how much information this is, but maybe there aren't enough datacenters in the world to store such amount of data. So a solution by extension won't work for IPv6. – fps Mar 12 '15 at 19:58
  • 1
    But it will be 1000x more than you can store in RAM. And in large numbers will be faster. Anyway no more responses - got other things to do. Good points Magnamag! – tgkprog Mar 13 '15 at 07:12