mmap slower than ioremap

Question

I am developing for an ARM device running Linux 2.6.37, and I am trying to toggle an IO pin as fast as possible. I wrote a small kernel module and a user space application, and tried two approaches:

Manipulate the GPIO control registers directly from kernel space using ioremap.
mmap() the GPIO control registers without caching and use them from user space.

Both methods work, but the second is about 3 times slower than the first (observed on an oscilloscope). I think I disabled all caching mechanisms.

Of course I'd like to get the best of both worlds: the flexibility and ease of development of user space with the speed of kernel space.

Does anybody know why the mmap() mapping could be slower than the ioremap() one?

Here's my code:

Kernel module code

static int ti81xx_usmap_mmap(struct file* pFile, struct vm_area_struct* pVma)
{
  pVma->vm_flags |= VM_RESERVED;
  pVma->vm_page_prot = pgprot_noncached(pVma->vm_page_prot);

  if (io_remap_pfn_range(pVma, pVma->vm_start, pVma->vm_pgoff,
                          pVma->vm_end - pVma->vm_start, pVma->vm_page_prot))
     return -EAGAIN;

  pVma->vm_ops = &ti81xx_usmap_vm_ops;
  return 0;
}

static void ti81xx_usmap_test_gpio(void)
{
  u32* pGpIoRegisters = ioremap_nocache(TI81XX_GPIO0_BASE, 0x400);
  const u32 pin = 1 << 24;
  int i;

  /* I should use IO read/write functions instead of pointer dereferencing,
   * but portability isn't the issue here */

  pGpIoRegisters[OMAP4_GPIO_OE >> 2] &= ~pin;    /* Set pin as output*/

  for (i = 0; i < 200000000; ++i)
  {
     pGpIoRegisters[OMAP4_GPIO_SETDATAOUT >> 2] = pin;
     pGpIoRegisters[OMAP4_GPIO_CLEARDATAOUT >> 2] = pin;
  }

  pGpIoRegisters[OMAP4_GPIO_OE >> 2] |= pin;    /* Set pin as input*/

  iounmap(pGpIoRegisters);
}

User space application code

int main(int argc, char** argv)
{
   int file, i;
   ulong* pGpIoRegisters = NULL;
   ulong pin = 1 << 24;

   file = open("/dev/ti81xx-usmap", O_RDWR | O_SYNC);

   if (file < 0)
   {
      printf("open failed (%d)\n", errno);
      return 1;
   }


   printf("Toggle from kernel space...");
   fflush(stdout);

   ioctl(file, TI81XX_USMAP_IOCTL_TEST_GPIO);

   printf(" done\n");    

   pGpIoRegisters = mmap(NULL, 0x400, PROT_READ | PROT_WRITE, MAP_SHARED, file, TI81XX_GPIO0_BASE);

   if (pGpIoRegisters == MAP_FAILED)
   {
      printf("mmap failed (%d)\n", errno);
      close(file);
      return 1;
   }

   printf("Toggle from user space...");
   fflush(stdout);

   pGpIoRegisters[OMAP4_GPIO_OE >> 2] &= ~pin;

   for (i = 0; i < 30000000; ++i)
   {
      pGpIoRegisters[OMAP4_GPIO_SETDATAOUT >> 2] = pin;
      pGpIoRegisters[OMAP4_GPIO_CLEARDATAOUT >> 2] = pin;
   }

   pGpIoRegisters[OMAP4_GPIO_OE >> 2] |= pin;

   printf(" done\n");
   fflush(stdout);
   munmap(pGpIoRegisters, 0x400);    

   close(file);    
   return 0;
}
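
A side note on the user-space loop above, as a minimal sketch rather than an explanation of the speed difference: the registers are accessed through a plain ulong*, so the compiler is not required to emit every store, and the >> 2 indexing only matches the hardware where ulong is 4 bytes wide. The sketch below uses volatile uint32_t instead. The helper names (reg_write, reg_read, map_fake_gpio, toggle_pin) are hypothetical, the register offsets are assumed to match the OMAP4_GPIO_* constants from the kernel headers, and an anonymous mapping stands in for the real /dev/ti81xx-usmap mapping so that the code can run on any Linux machine.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumed byte offsets, meant to mirror OMAP4_GPIO_OE,
 * OMAP4_GPIO_CLEARDATAOUT and OMAP4_GPIO_SETDATAOUT. */
#define GPIO_OE           0x134u
#define GPIO_CLEARDATAOUT 0x190u
#define GPIO_SETDATAOUT   0x194u
#define GPIO_MAP_SIZE     0x400u

/* Write one 32-bit register; volatile forces the compiler to emit the store. */
static inline void reg_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    /* Sanity check only; compiled out when built with -DNDEBUG. */
    assert(off % sizeof(uint32_t) == 0 && off < GPIO_MAP_SIZE);
    base[off / sizeof(uint32_t)] = val;
}

/* Read one 32-bit register. */
static inline uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    assert(off % sizeof(uint32_t) == 0 && off < GPIO_MAP_SIZE);
    return base[off / sizeof(uint32_t)];
}

/* Stand-in for mmap() on the device node: plain anonymous memory,
 * so no real GPIO hardware is touched. Returns NULL on failure. */
static volatile uint32_t *map_fake_gpio(void)
{
    void *p = mmap(NULL, GPIO_MAP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Issue `count` set/clear pairs for the given pin mask. */
static void toggle_pin(volatile uint32_t *gpio, uint32_t pin, unsigned long count)
{
    unsigned long i;

    for (i = 0; i < count; ++i) {
        reg_write(gpio, GPIO_SETDATAOUT, pin);
        reg_write(gpio, GPIO_CLEARDATAOUT, pin);
    }
}
```

With the real device, the pointer returned by mmap() on /dev/ti81xx-usmap would be passed to toggle_pin() in place of the result of map_fake_gpio(). This does not change the memory type of the mapping, so it is not expected to close the gap with the kernel version on its own.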

What do you mean by 3x slower? Are the individual oscillations longer, or is it the overall execution time? – shodanex, Jun 08 '12 at 08:02
Both. The oscillations are 3 times slower on the scope: the processor spends more time in the "str" instruction, but I don't know why. Is it waiting for some hardware (a bus transfer?), or for some software such as a kernel page fault exception handler? That's the heart of my question... – Julien, Jun 08 '12 at 08:06
Note that in your sample code the number of loop iterations is very different between the two tests. – ysap, Mar 07 '13 at 19:42
@ysap That is to compensate for the fact that one is faster (in frequency) than the other :-). With 30000000 iterations in kernel space, I barely have time to measure anything on my scope. – Julien, Mar 07 '13 at 21:16

score 8 · Accepted Answer · answered Feb 28 '13 at 21:34

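This is because ioremap_nocache() still enables the CPU write buffer for its mapping, whereas pgprot_noncached() disables both bufferability and cacheability. With the write buffer available, a store can be posted and the CPU can carry on; without it, each "str" has to wait for the write to complete.

An apples-to-apples comparison would use ioremap_strongly_ordered() in the kernel test instead.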

Jesse Off

Thanks for your answer! Can I have the best of both worlds (a write buffer in user space)? Which pgprot should I use? Do you have any documentation about this? – Julien Mar 05 '13 at 09:46

score 3 · Answer 2 · answered Jun 08 '12 at 17:34

My guess would be that mmap has to check that you are writing to memory you are allowed to write to, which would make it slower than the kernel version. I believe (or at least assume) the kernel version does no such checking: with a kernel module, you are responsible for testing until you are very sure you are not breaking things.

Try using do_mmap (I believe that's the one) to use mmap from kernel space, and see how that compares. If it is about as fast as the ioremap version, then I'm right. If it's not, it's something else.

zebediah49

Good idea! I just tested it and it runs at the same speed as from user space. So it has something to do with mmap, and maybe not with kernel versus user space... – Julien Jun 12 '12 at 12:12
The permission check should be a one-time thing done during mmap(). I don't think the kernel can intercept memory accesses: once the region is mapped into user space and the translation is in the TLB, I think every access is handled purely in hardware. – maxy Nov 07 '13 at 13:38
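
The point in the last comment can be checked roughly from user space. The following is a minimal sketch under stated assumptions, not a measurement of the GPIO case: it uses an anonymous mapping as a stand-in for the device mapping and counts minor page faults with getrusage() around the first store and around a long run of further stores. The names count_faults, minor_faults and struct fault_counts are hypothetical. On a typical Linux system I would expect the kernel to be involved at most on the first touch of the page and not at all for the stores that follow. A mapping created with io_remap_pfn_range(), as in the question, has its page table entries set up during mmap(), so even the first-touch fault should not occur there.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

struct fault_counts {
    long first_touch;   /* minor faults caused by the first store */
    long steady_state;  /* minor faults during all later stores   */
};

/* Minor page faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Map one anonymous page, store to it once, then `stores` more times,
 * and report how many minor faults each phase caused. Returns 0 on success. */
static int count_faults(unsigned long stores, struct fault_counts *out)
{
    long page = sysconf(_SC_PAGESIZE);
    long before, after_first, after_loop;
    volatile uint32_t *p;
    unsigned long i;
    void *raw;

    assert(out != NULL);
    if (page <= 0)
        return -1;

    raw = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return -1;
    p = raw;

    before = minor_faults();
    p[0] = 1;                       /* first touch: may fault the page in */
    after_first = minor_faults();

    for (i = 0; i < stores; ++i)    /* later stores hit an existing mapping */
        p[0] = (uint32_t)i;
    after_loop = minor_faults();

    munmap(raw, (size_t)page);

    if (before < 0 || after_first < 0 || after_loop < 0)
        return -1;

    out->first_touch = after_first - before;
    out->steady_state = after_loop - after_first;
    return 0;
}
```

Calling count_faults(1000000, &fc) and printing both fields is enough to try this out. If the steady-state count is zero, the per-store cost cannot come from the kernel re-checking permissions, which is consistent with the memory-type explanation in the accepted answer.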
