memset() is very fast because it can use the stosb opcode internally. Is there a comparable function for 16-bit and 32-bit values that benefits just as efficiently from stosb, stosw and/or stosd?

wmemset() is not portable and does not help with 16-bit values.

CoSoCo
  • It works fine for the 256 different 16-bit and 32-bit values in which every component byte has the same value, such as `0` and `-1` (if two's complement). – Weather Vane Apr 19 '19 at 14:09
  • `stosw` and `stosd` are not the same as `stosb`. In the past, all of these instructions were very slow because they were implemented in microcode and retained only for backward compatibility. Later Intel CPUs started to implement `stosb` in hardware to provide a high-speed `memcpy`, but `stosw` and `stosd` remained microcode-based and thus very slow. I don't think anything has changed, but I haven't checked. –  Apr 19 '19 at 14:13
  • C++ has `::std::fill` that compilers will usually optimize in this way. – Omnifarious Apr 19 '19 at 14:48
  • Unfortunately, I cannot use C++ functions in my project. – CoSoCo Apr 21 '19 at 19:40
  • Also, you can read [this](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy) question on the difference between `stosb` and `stosw`/`stosd`. This is what I was talking about in my previous comment. –  Apr 21 '19 at 23:39
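
Weather Vane's observation can be turned into a small dispatch: when both bytes of the 16-bit value are equal, a plain `memset()` produces the same result, and only the remaining values need an element-wise loop. The following is a minimal sketch under that assumption; `fill16` is a hypothetical name, not a standard function, and whether the `memset()` path is actually faster depends on the C library and CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a standard function): fill count 16-bit
 * elements, using memset() when both bytes of val are equal. */
void fill16(uint16_t *dst, uint16_t val, size_t count)
{
    assert(dst != NULL || count == 0);

    if ((val >> 8) == (val & 0xFFu)) {
        /* Both bytes identical (0x0000, 0xFFFF, 0x4141, ...):
         * a byte fill writes exactly the same memory contents. */
        if (count != 0)
            memset(dst, (int)(val & 0xFFu), count * sizeof *dst);
    } else {
        /* General case: ordinary element-wise loop. */
        for (size_t i = 0; i < count; i++)
            dst[i] = val;
    }
}
```

Calling `fill16(buf, 0xFFFF, n)` takes the `memset()` path, while `fill16(buf, 0x1234, n)` falls back to the loop.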

2 Answers

There is no such function in standard C, but depending on what you're trying to do, there may be CPU-specific SIMD "intrinsic functions" that you can use to build one. Your mention of stosb makes me think you're using an x86, so review the documentation for the various *mmintrin.h headers and the functions they provide.
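
As a rough illustration of building such a function from intrinsics, here is a minimal sketch that uses the SSE2 intrinsics `_mm_set1_epi16` and `_mm_storeu_si128` from `<emmintrin.h>`. The name `memset16_simd` is hypothetical, the code falls back to a plain byte-copy loop when `__SSE2__` is not defined, and it has not been benchmarked against `memset()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical name: fill count 16-bit elements at m with val. */
void *memset16_simd(void *m, uint16_t val, size_t count)
{
    unsigned char *p = m;

    assert(m != NULL || count == 0);

#if defined(__SSE2__)
    /* Broadcast val into all eight 16-bit lanes of a 128-bit register. */
    const __m128i pattern = _mm_set1_epi16((short)val);

    /* Store 8 elements (16 bytes) per iteration. The unaligned store
     * means the destination needs no particular alignment. */
    while (count >= 8) {
        _mm_storeu_si128((__m128i *)p, pattern);
        p += 16;
        count -= 8;
    }
#endif

    /* Remaining tail, or the whole buffer when SSE2 is unavailable. */
    while (count--) {
        memcpy(p, &val, sizeof val);
        p += sizeof val;
    }
    return m;
}
```

A 32-bit variant would follow the same shape with `_mm_set1_epi32` and four elements per store.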

zwol
  • Yes, I have such files in `/usr/lib/gcc/x86_64-linux-gnu/7/include`. But where in them can I find what I'm looking for? Do you mean something like this: https://stackoverflow.com/a/25579184/5399598 That one is for 32 bit. I'm primarily looking for something for 16 bit. – CoSoCo Apr 24 '19 at 08:52
  • @CoSoCo I'm sorry, I can't help you any more than I already have. – zwol Apr 24 '19 at 14:39

Yes, such functions can be written in many variants.

For example:

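#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint16_t, uint32_t */
#include <string.h>   /* memcpy */

/* Fill count 16-bit elements; m must be suitably aligned for uint16_t. */
void *memset16(void *m, uint16_t val, size_t count)
{
    uint16_t *buf = m;

    while(count--) *buf++ = val;
    return m;
}

/* Fill count 32-bit elements; m must be suitably aligned for uint32_t. */
void *memset32(void *m, uint32_t val, size_t count)
{
    uint32_t *buf = m;

    while(count--) *buf++ = val;
    return m;
}

/* Generic variant: copy the size-byte object at val into m, count times. */
void *memsetXX(void *m, const void *val, size_t size, size_t count)
{
    char *buf = m;

    while(count--)
    {
        memcpy(buf, val, size);
        buf += size;
    }
    return m;
}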

Safer version:

/* Writes byte by byte, so m needs no particular alignment. */
void *memset16safe(void *m, uint16_t val, size_t count)
{
    unsigned char *buf = m;
    union 
    {
        uint8_t d8[2];
        uint16_t d16;
    }u16 = {.d16 = val};

    while(count--) 
    {
        *buf++ = u16.d8[0];
        *buf++ = u16.d8[1];
    }
    return m;
}

/* 32-bit counterpart; renamed so it does not clash with memset32 above. */
void *memset32safe(void *m, uint32_t val, size_t count)
{
    unsigned char *buf = m;
    union 
    {
        uint8_t d8[4];
        uint32_t d32;
    }u32 = {.d32 = val};

    while(count--) 
    {
        *buf++ = u32.d8[0];
        *buf++ = u32.d8[1];
        *buf++ = u32.d8[2];
        *buf++ = u32.d8[3];
    }
    return m;
}
0___________
  • One has to be careful with these (at least the fixed-width variants) because unlike with regular `memset`, there is the potential for aliasing violations. If the object pointed to by the `m` argument is not compatible with `uint16_t`--or whichever type is being used--undefined behavior will be invoked. – Christian Gibbons Apr 19 '19 at 15:23
  • @ChristianGibbons I do not see the aliasing problems here. Alignment yes - in some cases. – 0___________ Apr 19 '19 at 16:09
  • @P__J__ Supposing a variable declared `float fvec[N]`, then `memset32(&fvec, 0x7fe00000, N)` has undefined behavior, even if `float` and `uint32_t` have identical size and alignment requirements. – zwol Apr 19 '19 at 16:32
  • I don't see why these functions would be faster than simply using a normal loop, e.g. `uint16_t fill = pattern; for (size_t x = count; x-- > 0; ) data[x] = fill;` It is about as slow as the same loop with `uint8_t fill = pattern;` In the 8-bit case I can replace the loop with the very fast memset() function, but in the 16-bit case I don't know of such a fast possibility. – CoSoCo Apr 21 '19 at 19:17
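
One portable way to hand most of the work to the C library's optimized `memcpy()` is to write the pattern once and then repeatedly double the filled prefix. The sketch below is only an illustration under stated assumptions: `fill_pattern` is a hypothetical name, `size * count` must not overflow, and the pattern must not overlap the destination. Because it touches memory only through `memcpy()`, it also sidesteps the aliasing concern raised in the comments above. Whether it beats a simple loop depends on the compiler, the library and the buffer size, so measure before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replicate one size-byte pattern count times. */
void *fill_pattern(void *dst, const void *pattern, size_t size, size_t count)
{
    unsigned char *buf = dst;
    size_t total, filled;

    assert(size > 0);
    if (count == 0)
        return dst;

    total = size * count;          /* caller must ensure no overflow */
    memcpy(buf, pattern, size);    /* seed the buffer with one copy */
    filled = size;

    /* Double the filled prefix until the buffer is full. The source
     * [0, chunk) and destination [filled, filled + chunk) never
     * overlap because chunk <= filled. */
    while (filled < total) {
        size_t chunk = total - filled < filled ? total - filled : filled;
        memcpy(buf + filled, buf, chunk);
        filled += chunk;
    }
    return dst;
}
```

For a 16-bit fill this would be called as `fill_pattern(data, &fill, sizeof fill, count)`; the same call works for 32-bit values or a `float` array.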