
I have the following code, which is about 7 times faster than inet_addr. I was wondering if there is a way to make it even faster, or if a faster alternative exists.

This code requires that a valid, null-terminated IPv4 address with no whitespace is supplied, which in my case is always true, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster, or if a faster alternative exists, I would really appreciate it.

UINT32 GetIP(const char *p)
{
    UINT32 dwIP = 0, dwIP_Part = 0;
    while (true)
    {
        if (p[0] == 0)
        {
            dwIP = (dwIP << 8) | dwIP_Part;   // commit the final octet
            break;
        }
        if (p[0] == '.')
        {
            dwIP = (dwIP << 8) | dwIP_Part;   // commit the completed octet
            dwIP_Part = 0;
            p++;                              // p now points at the first digit of the next octet,
        }                                     // which is consumed by the line below
        dwIP_Part = (dwIP_Part * 10) + (p[0] - '0');
        p++;
    }
    return dwIP;
}
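
A minimal sketch of the kind of timing loop that can be used for this comparison (the iteration count is illustrative; it assumes the GetIP above is in scope, with inet_addr coming from <arpa/inet.h>, or <winsock2.h> on Windows; note inet_addr returns network byte order, so only the timings, not the values, are comparable):

#include <arpa/inet.h>   // inet_addr (use <winsock2.h> on Windows)
#include <chrono>
#include <cstdio>
#include <cstdint>

typedef uint32_t UINT32;             // only needed outside of Windows

// ... GetIP() from above goes here ...

int main()
{
    const char *ip = "192.167.1.3";
    const int N = 10000000;
    volatile UINT32 sink = 0;        // keep the calls from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++) sink += GetIP(ip);
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++) sink += inet_addr(ip);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("GetIP:     %.3f s\n", std::chrono::duration<double>(t1 - t0).count());
    std::printf("inet_addr: %.3f s\n", std::chrono::duration<double>(t2 - t1).count());
    return 0;
}
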
Harry
  • I think this is better suited to codereview.stackexchange.com – M.M Jul 28 '15 at 14:38
  • Why? I want to know what the fastest way to get an IP address from a string is. – Harry Jul 28 '15 at 14:41
  • You should also note that UINT32 might not be suitable for an IP address without adjusting endianness to network byte order. – πάντα ῥεῖ Jul 28 '15 at 14:43
  • @Harry CodeReview specializes in reviewing _working code_ (your code works), and they can suggest algorithmic improvements. That's not to say speed hacks aren't on-topic here. – Iwillnotexist Idonotexist Jul 28 '15 at 14:43
  • The order of the UINT32 for my purpose is irrelevant but I'm aware it's usually important. If it's faster to do it the other way then I would be interested, thanks. – Harry Jul 28 '15 at 14:46
  • @Iwillnotexist Idonotexist I guess if my code is indeed the fastest in the world currently it might be worth going to the codereview to see if it can be faster? ;) I have seen faster integer conversion questions put on here and I think this is related to those types of questions. I couldn't find anything else myself on this topic and it may help others in the future if there are no other alternatives. – Harry Jul 28 '15 at 14:49
  • Do you parse billions of IP addresses daily? – stgatilov Jul 28 '15 at 14:50
  • @stgatilov I don't know the relevance of that? Seeking the most optimal performance solution to a problem seems like an ok thing to do. In my case I have to map string IPs to a UINT32 for IP matching and it has to be as fast as possible to maximize my throughput. – Harry Jul 28 '15 at 14:51
  • s/valid null terminated IP address/valid null terminated IPv4 address/ – Andrew McGuinness Jul 28 '15 at 14:58
  • Nice! But you can't compare your function with inet_addr() as you don't do any error processing. `(in_addr_t)(-1)` / `INADDR_NONE` should be returned for invalid addresses such as "1000.100.100.100". – Christophe Jul 28 '15 at 15:02
  • @Christophe adding that error checking would still make it a lot faster than inet_addr(). It's only adding one more comparison. Even if you also add checks that a character is between '0' and '9' it's still 5 times faster. – Harry Jul 28 '15 at 15:05
  • Well it's a little bit more than that: you have to check that there are 4 components (not 3, not 5), that each component is between 0 and 255, and that no invalid characters (i.e. anything other than a digit, a space or a dot) are used. By the way, IPv4 addresses can contain spaces before and after each dot. – Christophe Jul 28 '15 at 15:12 (a sketch of such a checked variant appears after these comments)
  • Feel free to do all that to my code and submit it for an error checking version. You will find it's still significantly faster than inet_addr() :) – Harry Jul 28 '15 at 15:22
  • @Harry, If you can _firmly_ rely on valid, `decimal.decimal.decimal.decimal\0` stringized IPv4 addresses, add that to your question and make this a speedhack question. It tends to turn on assembler people like me, so do say if you're open to non-standard extensions. – Iwillnotexist Idonotexist Jul 28 '15 at 15:31
  • A fair summary would be 'who cares? Almost any code that works will be fast enough'. – Martin James Jul 28 '15 at 15:47
  • No error checking. No surprise it's faster. – user4581301 Jul 28 '15 at 15:47
  • To do a fair comparison, take http://www.opensource.apple.com/source/Libinfo/Libinfo-406.17/dns.subproj/inet_addr.c and remove all error checking, then compare. – AndersK Jul 29 '15 at 06:17
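
For reference, here is a minimal sketch of the loop from the question with the checks discussed above bolted on. It is an illustrative variant, not the code that was actually benchmarked: it accepts exactly four dot-separated components of 1 to 3 digits each in the 0-255 range and returns INADDR_NONE (0xFFFFFFFF) otherwise; the whitespace cases Christophe mentions are not handled (see the second answer for that).

#include <stdint.h>

#define IP_PARSE_FAILED 0xFFFFFFFFu          // same value as INADDR_NONE

uint32_t GetIPChecked(const char *p)
{
    uint32_t ip = 0;
    for (int part = 0; part < 4; part++)
    {
        uint32_t value = 0;
        int digits = 0;
        while (*p >= '0' && *p <= '9')
        {
            value = value * 10 + (uint32_t)(*p - '0');
            p++;
            digits++;
        }
        if (digits == 0 || digits > 3 || value > 255)
            return IP_PARSE_FAILED;          // empty, too long, or out-of-range component
        ip = (ip << 8) | value;
        if (part < 3)
        {
            if (*p != '.') return IP_PARSE_FAILED;   // exactly three separating dots required
            p++;
        }
    }
    return (*p == 0) ? ip : IP_PARSE_FAILED; // reject trailing characters
}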

2 Answers


Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.

Here is an x86-specific fast solution (it needs SSE4.1, or only SSSE3 if you use the commented-out extract at the end):

__m128i shuffleTable[65536];    //can be reduced 256x times, see @IwillnotexistIdonotexist

UINT32 MyGetIP(const char *str) {
    __m128i input = _mm_lddqu_si128((const __m128i*)str);   //"192.167.1.3"
    input = _mm_sub_epi8(input, _mm_set1_epi8('0'));        //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40 
    __m128i cmp = input;                                    //...X...X.X.XX...  (signs)
    UINT32 mask = _mm_movemask_epi8(cmp);                   //6792 - magic index
    __m128i shuf = shuffleTable[mask];                      //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1 
    __m128i arr = _mm_shuffle_epi8(input, shuf);            //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0 
    __m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
    __m128i prod = _mm_maddubs_epi16(coeffs, arr);          //3 0 | 1 0 | 67 100 | 92 100 
    prod = _mm_hadd_epi16(prod, prod);                      //3 | 1 | 167 | 192 | ? | ? | ? | ?
    __m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
    prod = _mm_shuffle_epi8(prod, imm);                     //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
    return _mm_extract_epi32(prod, 0);
//  return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}

And here is the required precalculation for shuffleTable:

void MyInit() {
    memset(shuffleTable, -1, sizeof(shuffleTable));
    int len[4];
    for (len[0] = 1; len[0] <= 3; len[0]++)
        for (len[1] = 1; len[1] <= 3; len[1]++)
            for (len[2] = 1; len[2] <= 3; len[2]++)
                for (len[3] = 1; len[3] <= 3; len[3]++) {
                    int slen = len[0] + len[1] + len[2] + len[3] + 4;
                    int rem = 16 - slen;
                    for (int rmask = 0; rmask < 1<<rem; rmask++) {
//                    { int rmask = (1<<rem)-1;    //note: only maximal rmask is possible if strings are zero-padded
                        int mask = 0;
                        char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
                        int pos = 0;
                        for (int i = 0; i < 4; i++) {
                            for (int j = 0; j < len[i]; j++) {
                                shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
                                pos++;
                            }
                            mask ^= (1<<pos);
                            pos++;
                        }
                        mask ^= (rmask<<slen);
                        _mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
                    }
                }
}
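
A minimal usage sketch, assuming the two functions above are in scope: the table must be built once before the first call, and each input string should have at least 16 bytes of backing storage (see the explanation below).

#include <cstdio>

int main()
{
    MyInit();                               // build the 1 MB shuffle table once
    char buf[16] = "192.167.1.3";           // at least 16 bytes of storage, zero padded
    std::printf("%08X\n", MyGetIP(buf));    // prints C0A70103
    return 0;
}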

Full code with testing is available here. On an Ivy Bridge processor it prints:

C0A70103
Time = 0.406   (1556701184)
Time = 3.133   (1556701184)

This means that the suggested solution is 7.8 times faster in terms of throughput than the OP's code. It processes 336 million addresses per second (on a single core at 3.4 GHz).

Now I'll try to explain how it works. Note that each line of the listing shows, in a comment, the contents of the value just computed. All the arrays are printed in little-endian order (though the set intrinsics take their arguments highest byte first).

First of all, we load 16 bytes from an unaligned address with the lddqu instruction. Note that in 64-bit mode memory is allocated in 16-byte chunks, so this tends to work automatically; on 32-bit it may in theory cause an out-of-range access, though I do not believe it really can. The subsequent code works properly regardless of the values in the bytes after the terminator. In any case, you had better ensure that each IP address occupies at least 16 bytes of storage.
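
If that guarantee is hard to provide (for example when an address sits near the end of a large buffer, right before an unmapped page), one defensive option is to copy the string into a zero-padded 16-byte local buffer first. The wrapper below is my own sketch, not part of the benchmarked code, and it costs an extra strlen and copy per call:

#include <string.h>

UINT32 MyGetIPSafe(const char *str)
{
    char buf[16] = {0};                // zero padding also keeps the trailing bytes deterministic
    size_t len = strlen(str);          // a valid dotted quad is at most 15 characters
    if (len > 15) len = 15;
    memcpy(buf, str, len);
    return MyGetIP(buf);
}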

Then we subtract '0' from all the chars. After that, '.' turns into -2 and the terminating zero turns into -48, while all the digits remain nonnegative. Now we take the bitmask of the sign bits of all the bytes with _mm_movemask_epi8.

Depending on the value of this mask, we fetch a nontrivial 16-byte shuffle mask from the lookup table shuffleTable. The table is quite large (1 MB in total) and takes quite some time to precompute. However, it does not take up precious space in the CPU cache, because only 81 of its entries are really used: each part of an IP address can be one, two or three digits long, hence 3^4 = 81 variants in total. Note that random trailing bytes after the end of the string may in principle increase the table's memory footprint.

EDIT: you can find a version modified by @IwillnotexistIdonotexist in the comments, which uses a lookup table of only 4 KB (it is a bit slower, though).
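
The idea behind that smaller table, as described in the comments, is to compress the 16-bit mask into an 8-bit index with a perfect hash before the lookup, so the table shrinks to 256 entries. A rough sketch of the index computation follows; the shift A and width N are placeholders that have to be found by search so that the 81 valid masks do not collide (the linked version later replaces the CRC with a magic multiplication, removing the SSE4.2 requirement):

#include <nmmintrin.h>   // _mm_crc32_u32 (SSE4.2)

static inline unsigned CompressMask(unsigned mask)
{
    const unsigned A = 7;                           // placeholder shift, found by search
    const unsigned N = 8;                           // 2^8 = 256 table entries
    return _mm_crc32_u32(0, mask << A) >> (32 - N); // 8-bit index into the small table
}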

The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes with our shuffle mask. As a result, the XMM register contains four 4-byte blocks, and each block holds the digits of one octet in little-endian order. We convert each block into a 16-bit number with _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder the bytes of the register so that the whole IP address occupies the lower 4 bytes.
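
To see the per-block arithmetic in scalar form, here is a small illustration of my own, using the block that holds "192" in the listing's comments (digits stored lowest-first, zero padded):

#include <cstdio>

int main()
{
    unsigned char digits[4] = { 2, 9, 1, 0 };            // the "192" block after the shuffle
    unsigned char coeffs[4] = { 1, 10, 100, 0 };
    int lo = digits[0]*coeffs[0] + digits[1]*coeffs[1];  // _mm_maddubs_epi16, first pair: 2 + 90 = 92
    int hi = digits[2]*coeffs[2] + digits[3]*coeffs[3];  // _mm_maddubs_epi16, second pair: 100 + 0 = 100
    std::printf("%d\n", lo + hi);                        // _mm_hadd_epi16: 92 + 100 = 192
    return 0;
}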

Finally, we extract the lower 4 bytes of the XMM register into a general-purpose register. This is done with an SSE4.1 intrinsic (_mm_extract_epi32); if you don't have it, use the commented-out line with _mm_extract_epi16 instead, but it will run a bit slower.

For reference, here is the generated assembly (MSVC 2013), so that you can check that your compiler does not generate anything suspicious:

lddqu   xmm1, XMMWORD PTR [rcx]
psubb   xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx               //useless, see @PeterCordes and @IwillnotexistIdonotexist
add rcx, rcx               //can be removed, see @EvgenyKluev
pshufb  xmm1, XMMWORD PTR [r13+rcx*8]
movdqa  xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw  xmm0, xmm0
pshufb  xmm0, xmm7
pextrd  eax, xmm0, 0

P.S. If you are still reading this, be sure to check out the comments =)

stgatilov
  • Why `lddqu`? It's the same as `movdqu`, except on P4. But on P4, it's worse in store->load forwarding cases. https://software.intel.com/en-us/blogs/2012/04/16/history-of-one-cpu-instructions-part-1-lddqumovdqu-explained. Nice idea for a vector implementation, though. – Peter Cordes Jul 28 '15 at 20:16
  • If you make sure your lookup table is in the low32 of address space, can you save the `movsxd`, and use `[r13d + ecx*8]` as the address? I forget if an address-size prefix would slow down the decoders on Intel CPUs, like a 16bit operand-size prefix. – Peter Cordes Jul 28 '15 at 20:20
  • @PeterCordes The `pmovmskb` instruction is guaranteed to zero-fill the entire register (`r32` or `r64`) it's asked to blow out. Unfortunately, the `_mm_movemask_epi8()` and underlying (on GCC) `__builtin_ia32_pmovmskb128()` intrinsics both return `int`, whence the compiler's urge to use the `pmovmskb r32, xmm` form rather than the `pmovmskb r64, xmm` form that would have been profitable. The compiler then feels the need to sign-extend, since the return value is nominally an `int`. I noticed that too and was trying to suppress it but realized the problem is deep-rooted. – Iwillnotexist Idonotexist Jul 28 '15 at 20:44
  • Oh, so you don't actually need the index to be possibly-negative? err, I guess it can't be, since the mask is only 16 bits long. Did you try casting it to `unsigned`? That should tell the compiler it doesn't need to sign-extend the value before the address-calculation, and instead can use the zero-extended result it has. `pmovmskb r64` wouldn't be profitable; I assume the compiler is worried about needing to go from zero-extended to sign-extended, not about writing the full 64bit reg. – Peter Cordes Jul 28 '15 at 20:53
  • @PeterCordes I did; It does a zero-extend instead. This makes sense. The problem lies in that GCC will not generate the `pmovmskb r64, xmm` form, instead using the `r32` form and leaving the top bits of the `r32` register undefined. To then use the full 64-bit registers in an address calculation, those bits must be made defined, so an extension of some kind is a must. Try it! `uint64_t mask; asm("pmovmskb %1, %0\n\t" : "=r"(mask) : "x"(cmp));` – Iwillnotexist Idonotexist Jul 28 '15 at 20:59
  • clang 3.5 does `pmovmskb %xmm0, %eax / shlq $4, %rax` to generate the index (which it uses relative to a fixed-location symbol for the table). Unless that's a clang bug, then `pmovmskb r32` is like every single other x86-64 instruction that writes to a 32bit reg: it zeroes the high 32. gcc 4.9.2 doesn't seem to know this, though, and generates a `mov %eax, %eax` which I guess is to zero the upper 32. I think if `pmovmskb` merged the upper 32 of the old value in 64bit-mode, that would be special enough for Intel's insn ref manual to mention. – Peter Cordes Jul 28 '15 at 21:13
  • So to clarify, I think the zero or sign extension instruction that gcc 4.9.2 inserts is a gcc bug. With `int mask`, gcc emits `cltq` (or `CDQE` in intel syntax) to sign-extend `%eax -> %rax`. With `unsigned mask`, it emits `mov %eax, %eax`. clang uses `pmovmskb` directly into `shlq` in both cases. – Peter Cordes Jul 28 '15 at 21:16
  • @PeterCordes I used `(gdb) disas /r MyGetIP` to verify that both my `asm` hack and the normal version emit _the same opcodes_ (`66 0f d7 c0`) for `pmovmskb %xmm0, %eax`, as though there were no difference between the r32 and r64 forms. So yes. GCC bug. I'm on 4.8.1. – Iwillnotexist Idonotexist Jul 28 '15 at 21:19
  • @IwillnotexistIdonotexist: I think with a `REX.W=1` prefix, the 64bit version is encodable. But I guess `gas` optimizes it `pmovmskb %xmm0, %rax` to the 32bit version, just like `0(%rdi)` optimizes to `(%rdi)` (without an `imm8` displacement in the encoding). This is good confirmation that they have identical behaviour, and that it's a gcc optimization failure. – Peter Cordes Jul 28 '15 at 23:22
  • @PeterCordes Turns out you're right: `asm("rex.w pmovmskb %1, %0\n\t" : "=r"(mask) : "x"(cmp));` compiles to `66 48 0f d7 c0 pmovmskb %xmm0,%rax`. A win for Clang! – Iwillnotexist Idonotexist Jul 29 '15 at 00:01
  • First off thanks for writing this code it is great! Your code is over 8 times faster than mine in Visual C++ ! But there is a problem with your initialization code, lens[s++] = t; . Visual studio caught stack corruption because "s" goes to 16 and the array size is 16 not 17. Is it possible to still be much faster without that lookup table? It seems if there was just a minor drop off in performance without the cost of the lookup table your solution would be all around viable. – Harry Jul 29 '15 at 01:08
  • @Harry Must you absolutely make the table disappear, or can you afford to make it much smaller (say, 256 entries x 16 bytes/entry?) If so, stgatilov's code can be modified to distill the 81 possible valid digit masks using a _perfect hash_. A secret trick of mine for hashing is to use the single SSE4.2 instruction _CRC_, which you can use with `_mm_crc32_*()`. For this usecase, find `a` and `n` in `mask = _mm_crc32_u16(0, mask << a) >> (32-n);` such that you have `2^n` bins and none of the valid masks collide after hashing. You probably should be able to get down to n=8 (256 entries). – Iwillnotexist Idonotexist Jul 29 '15 at 05:05
  • @Harry Amend my above statement to `_mm_crc32_u32(0, mask << a) >> (32-n)`. Using `_u16` means that you'll quickly shift out bits that actually are significant, and gives you less freedom of choice for `a`. – Iwillnotexist Idonotexist Jul 29 '15 at 05:25
  • @IwillnotexistIdonotexist Thanks! Every great thing has its roots though. I took the key idea (_mm_shuffle_epi8 with table lookup) from Yale Zhang [here](http://stackoverflow.com/q/6270641/556899). @PeterCordes Regarding `lddqu`, I just put it there because I remember it being better. You link says there is no difference now, so let it be =) – stgatilov Jul 29 '15 at 05:50
  • @PeterCordes I have changed type of `mask` to `UINT32`. MSVC2013 also puts `mov ecx, ecx` there =) Anyway, I feel this version has better chances in future. – stgatilov Jul 29 '15 at 05:59
  • I don't believe that cheap perfect hash allows 256 entry LUT. Random bytes after terminating zero make this even worse. But LUT size could be easily decreased to 512K or even 256K. Since `mask` is always even for any valid IP address, we could just remove `add` instruction from generated assembly. In C/C++ this means either union or type cast for the lookup table. This may result in slight performance improvement. And for 256K table we could mask out most significant bit (but this would make it slower). – Evgeny Kluev Jul 29 '15 at 06:03
  • @Harry I have reimplemented precomputation, it should not do anything bad now. Lookup table cannot be removed, because it makes possible complex shuffling, which would be too slow otherwise. – stgatilov Jul 29 '15 at 06:27
  • @IwillnotexistIdonotexist That crc32 trick is a mighty thing indeed! If we suppose that there are 17 valid values for `a`, and only 81 used masks (with zero padding), then we can get about 1150 entries in the table with success probability 1 - 1/e (according to [this formula](https://en.wikipedia.org/wiki/Birthday_problem#Approximations)). So, I bet that 2K entries is minimum possible. I won't add this trick to my answer (slower, too complex, 1Mb is small anyway). – stgatilov Jul 29 '15 at 06:55
  • Those `random trashy bytes after the end of the string` make this approach dangerous. The problem is that the next byte after terminating zero may belong to some unallocated memory page. Probably the only safe way to use this algorithm is to use it in combination with custom memory allocator (stack memory should be also OK). – Evgeny Kluev Jul 29 '15 at 10:58
  • I'd be really happy, if I just understood half of it... You convinced me, that you know what you're doing :)) – Bence Gedai Jul 29 '15 at 13:10
  • @BenceGedai: some hacks -> a mask with one-bits at positions corresponding to the `.` characters. For every mask, there's a shuffle that will put the bytes where they belong, so shuffle with `table[mask]` as the control. Then more hacks to multiply the bytes by a power of ten and a power of two for their place value, and add the results together into a 32bit IPv4 address. – Peter Cordes Jul 29 '15 at 16:23
  • @EvgenyKluev: yeah, a loop using this will have to make sure not to feed it an address that's within 15bytes of an un-mapped page. For that case, the caller should copy the IP address into a 16byte buffer. – Peter Cordes Jul 29 '15 at 16:25
  • @EvgenyKluev: mask is always even for a valid address? I think that would require the caller to have already parsed the string to find the dotted-quad, and call this routine on a 16B buffer with 0xff bytes (or something) filling the bytes after the IP address. (zero bytes would result in 1-bits in the mask). This is probably going to make for a store->load forwarding problem, if the 16B aren't written with a single write. – Peter Cordes Jul 29 '15 at 16:35
  • Err, I guess the caller already does have to verify that it's a dotted-quad, not something else, because it doesn't error-check. It would be possible to return a byte count, of how many characters long the IPv4 was. (Look up in another table, with the same index as for the shuffle, or possibly some clever bit manipulation on the sign-bit mask.) – Peter Cordes Jul 29 '15 at 16:58
  • @PeterCordes That is actually much easier. `pcmpeqb` with zero, `pmovmskb` the resulting bitmask, `bsf` to count number of characters up until the first NUL byte. – Iwillnotexist Idonotexist Jul 29 '15 at 17:08
  • @IwillnotexistIdonotexist: oh, I forgot we had the input ending with a zero-byte, rather than just being a pointer into a buffer at the point where an IPv4 addr appears in a log file or something. Good call. This is great as long as the extra burden of tokenizing / copying the IPv4 addresses doesn't make the overall caller + converter combination slower. Store-forwarding fails on all current Intel CPUs when you do a 16B read of memory that was written by multiple smaller writes. So the insns for this routine will be stuck in the ROB for an extra 10-12 cycles. Which is fine, actually. – Peter Cordes Jul 29 '15 at 17:19
  • @EvgenyKluev I was successful in crafting a 256-bin perfect hash variant of stgatilov's excellent start. [You may view it here](http://pastebin.com/XJ7w9nq1) (Warning, uses a GCC aligned attribute). It is just over half as fast, but occupies 256x less space for the LUT. – Iwillnotexist Idonotexist Jul 29 '15 at 21:26
  • @PeterCordes You too may be interested in the perfect-hash code I've written about in my previous comment. It happens to use the `pcmpeq+pmovmskb+bsf` method I've described above. – Iwillnotexist Idonotexist Jul 29 '15 at 21:43
  • @IwillnotexistIdonotexist: Very nice. That should help for a real program. Having your LUT use up to 81 TLB entries (if each used value is on a different page) could slow down the rest of the program more than a few extra instructions in the now-very-fast conversion function. Parsing the input to *find* the IPv4 values is going to start to be the slow part, I'd guess. – Peter Cordes Jul 29 '15 at 21:49
  • @IwillnotexistIdonotexist: So you managed to do it by increasing size of search space for hash constants! I have created a [new paste with your solution](http://pastebin.com/YDWg0Lsb) that works on both GCC and MSVC. Note that I have replaced your `bsf` with much simpler bit magic =) I hope it makes no harm. – stgatilov Jul 30 '15 at 04:35
  • @PeterCordes: I have checked my original version, it uses 13 pages of 8kb size on zero-padded strings. To be honest, I feel that it is not worth to mix the routine with lots of other code (if such a fast parsing is wanted). So, anyway, TLB does not seem to be an issue to me. – stgatilov Jul 30 '15 at 04:43
  • @stgatilov: ~20 4k pages isn't too bad. Haswell has a 64 entry L1 TLB and a 1024 entry L2TLB. (http://www.realworldtech.com/haswell-cpu/5/). x86 doesn't support 8kiB pages, only 4k, 2M, and 1G. IDK if this is likely to be significant, or how expensive a TLB miss really is, if the page table is in let's say L2 cache. I haven't spent time profiling larger programs. – Peter Cordes Jul 30 '15 at 05:35
  • @stgatilov Wow, I totally forgot about that alternative for `bsf`! Thanks for getting rid of it, it is extra slow on Atom processors. Also, did you mean _reduce_ the search space? The perfect hash takes the original 16-bit index, which has a huge dynamic range, and hashes it into 32-bit random junk; But I selected my constants in such a way that the top 8 bits of the 32-bit hash are unique for all 81 valid 16-bit constants. I then use only _those_ 8 bits as index, which means there are now 2^8 and not 2^16 entries. The table thus shrinks from 1MB to 4KB, which fits into exactly one page. – Iwillnotexist Idonotexist Jul 30 '15 at 10:35
  • Gentlemen, I've just successfully removed the `crc` instruction by using a different magic multiplication constant in my perfect hash. [This new version](http://pastebin.com/tXtiJKuE) therefore does not even require SSE4.2, and is slightly faster. Currently, on my i7-4700MQ processor, _Time resources:_ stgatilov @ `0.465`, myself @ `0.645` and Harry @ `2.996` seconds. _Space resources:_ stgatilov @ 1MB, myself @ 4KB, Harry @ negligible. I'm therefore 40% slower but have 0.4% of the memory use, and my hash table load factor is 81/256 ~ 31.6% rather than 81/65536 ~ 0.12%. – Iwillnotexist Idonotexist Jul 30 '15 at 11:07
  • @IwillnotexistIdonotexist: your perfect hash is really impressive, thanks! The way to get rid of the random bits is also nice. It is good to have a choice between raw speed and low memory/TLB usage. By the way, in many cases we could solve TLB problem by placing LUT of original version to single 2Mb memory page. – Evgeny Kluev Jul 30 '15 at 11:11
  • @ Iwillnotexist Idonotexist thanks very much. The 4KB table without any initialization is a much cleaner approach I feel. Have you tested without a table at all if it's still significantly faster than my version? – Harry Jul 30 '15 at 11:22
  • @Evgeny Kluev I agree this much faster version has some drawbacks, but in my case it is completely useable because the IP addresses are in a large buffer already. So anything it reads will already be allocated. – Harry Jul 30 '15 at 11:26
  • @Harry The only change I introduce to stgatilov's code is that the mask is not used directly as an index, but is rather hashed first, compressing the index from 16 bits down to 8 bits. But to test without a table, I forced the hash to 0 and ran the speed test, and got no change in performance. So the cost of my hash function is responsible for the 40% time penalty of my method. – Iwillnotexist Idonotexist Jul 30 '15 at 11:31
  • @IwillnotexistIdonotexist Yes I meant no table at all to get the correct result. ie Doing the initialization code in a fast ASM way each time it's called. I think a 4KB table is pretty reasonable for something like this myself but I was just curious about performance if there was no lookup table at all. – Harry Jul 30 '15 at 11:36
  • @Harry My hash table generator is [here](http://pastebin.com/z49HWwWE). I don't think however, that recomputing the table on every call will be better than a statically-computed version. – Iwillnotexist Idonotexist Jul 30 '15 at 11:42
  • @Harry: I didn't test it but I expect that if we completely replace lookup by computations, this will slow up the fastest algorithm by 2 to 3 times. Still it is likely to be much faster than your version because many computations are done in parallel. Skylake architecture coming soon allows to vectorize `lzcnt` instruction (with AVX512 instruction set) so almost all computations may be done in parallel with it; but it is unlikely to outperform both approaches mentioned here. – Evgeny Kluev Jul 30 '15 at 12:28
  • @IwillnotexistIdonotexist I think you should post your pastebin as an answer. – the_drow Feb 02 '16 at 09:38

As for alternatives: this is similar to yours but with some error checking:

#include <iostream>
#include <string>
#include <cstdint>

uint32_t getip(const std::string &sip)
{
    uint32_t r=0, b, p=0, c=0;   // r: result, b: current component, p: dots seen, c: components seen
    const char *s;
    s = sip.c_str();
    while (*s)
    {
        r<<=8;
        b=0;
        while (*s&&((*s==' ')||(*s=='\t'))) s++;   // skip whitespace before the component
        while (*s)
        {
            if ((*s==' ')||(*s=='\t'))             // whitespace inside a component is only
            {                                      // acceptable if a dot follows it
                while (*s&&((*s==' ')||(*s=='\t'))) s++;
                if (*s!='.') break;
            }
            if (*s=='.') { p++; s++; break; }
            if ((*s>='0')&&(*s<='9'))
            {
                b*=10;
                b+=(*s-'0');
                s++;
            }
            else return 0;                         // reject any other character
        }
        if ((b>255)||(*s=='.')) return 0;          // out-of-range component, or an empty component ("..")
        r+=b;
        c++;
    }
    return ((c==4)&&(p==3))?r:0;                   // exactly four components and three dots
}

void testip(const std::string &sip)
{
    uint32_t nIP=0;
    nIP = getip(sip);
    std::cout << "\nsIP = " << sip << " --> " << std::hex << nIP << "\n";
}

int main()
{
    testip("192.167.1.3");
    testip("292.167.1.3");
    testip("192.267.1.3");
    testip("192.167.1000.3");
    testip("192.167.1.300");
    testip("192.167.1.");
    testip("192.167.1");
    testip("192.167..1");
    testip("192.167.1.3.");
    testip("192.1 67.1.3.");
    testip("192 . 167 . 1 . 3");
    testip(" 192 . 167 . 1 . 3 ");
    return 0;
}
slashmais
  • What speed did you get compared to my original? – Harry Jul 30 '15 at 11:23
  • @Harry: I didn't do (and don't have data for) specific bulk speed testing; it would anyway only relate to the specific machine, but just running both with the above data gives about the same speed (meaningless here). – slashmais Jul 31 '15 at 07:20