4

I want to write a kernel module that receives TCP/IP packets at nearly 8 Mbps. I have to store these packets for a 500 ms duration and later forward them sequentially, and this has to be done for 30 members. What would be the best approach to implement this? Should I do a single kmalloc up front (kmalloc(64000000, GFP_ATOMIC))? If I kmalloc and kfree for every packet it will take time and lead to a performance issue. Also, if I allocate that much kernel memory in one shot, will the Linux kernel allow it?

glglgl
kernelCoder

2 Answers

6

I once wrote a kernel module processing packets on a 10 Gb/s link. I used vmalloc to allocate about 1 GB of contiguous (virtual) memory to hold a statically sized hash table for connection tracking (code).

If you know how much memory you need, I recommend pre-allocating it. This has two advantages:

  • It is fast (no allocating/freeing at runtime).
  • You don't need a fallback strategy for the case where kmalloc(_, GFP_ATOMIC) cannot return memory, which can actually happen quite often under heavy load.

Disadvantage

  • You might allocate more memory than necessary.

So, for writing a special-purpose kernel module, please pre-allocate as much memory as you can get ;)

If you write a kernel module for commodity hardware used by many novice users, it would be nice to allocate memory on demand (and waste less memory).


Where do you allocate the memory? GFP_ATOMIC can only return a fairly small amount of memory and should only be used when the allocation cannot sleep. Use GFP_KERNEL when it is safe to sleep, e.g., not in interrupt context. See this question for more. It is safe to use vmalloc during module initialization to pre-allocate all your memory.
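
A minimal sketch of what this pre-allocation can look like (not taken from the answer above; the module name, size macro, and function names are made up, and the size roughly matches the 64,000,000-byte figure from the question):

#include <linux/module.h>
#include <linux/vmalloc.h>

#define PKT_BUF_SIZE (64UL * 1024 * 1024)  /* example: ~64 MB, adjust to your traffic */

static void *pkt_buf;

static int __init pktbuf_init(void)
{
	/* vmalloc may sleep, which is fine here: module init runs in process context. */
	pkt_buf = vmalloc(PKT_BUF_SIZE);
	if (!pkt_buf)
		return -ENOMEM;
	return 0;
}

static void __exit pktbuf_exit(void)
{
	vfree(pkt_buf);
}

module_init(pktbuf_init);
module_exit(pktbuf_exit);
MODULE_LICENSE("GPL");

Everything the module needs at runtime is then carved out of pkt_buf; no allocation happens on the packet path.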

corny
  • Yes, I agree with you. Run-time kmalloc/kfree is very time consuming. But how do I **pre-allocate** memory? – kernelCoder Mar 26 '13 at 08:30
  • If you know how much memory you need, just allocate it in your `module_init` function. You then need to manage this block of memory yourself. A FIFO queue is a good way to store your packets: keep two pointers into your large pre-allocated memory block, one pointing to the next free 'slot', the other pointing to the next packet that needs to be forwarded after the 500 ms have expired. This is also called a ring buffer (a sketch of one follows these comments). – corny Mar 26 '13 at 08:36
  • Thanks corny, that gives me some hope. But I have one more question: these packets are coming in for 10 such candidates, and I have to store packets for all of them. Can I manage that situation with one large memory block, or would it be better to allocate a large block for each of them? – kernelCoder Mar 26 '13 at 09:37
  • I don't understand the question. What are the 10 'candidates'? You mean you have 10 'things' that packets come from? It depends on what you are doing: if they are 10 independent things that don't need synchronization, use 10 memory chunks; otherwise share one chunk of appropriate size among them all. – corny Mar 26 '13 at 09:40
  • 2
    +1 for the code. And since that code uses round-to-power-of-two, see http://stackoverflow.com/questions/4398711/round-to-the-nearest-power-of-two - the kernel has `ffs()`, see ``. – FrankH. Mar 26 '13 at 16:17
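
A rough sketch of the kind of ring buffer corny describes, with two free-running indices into a pre-allocated slot array (the struct names, slot size, and power-of-two indexing are illustrative assumptions; locking or memory barriers between producer and consumer are omitted):

#include <linux/types.h>

struct pkt_slot {
	unsigned int len;            /* bytes actually used in data[] */
	u64 arrival_ns;              /* used to decide when 500 ms have passed */
	unsigned char data[2048];    /* example maximum packet size */
};

struct pkt_ring {
	struct pkt_slot *slots;      /* carved out of the pre-allocated block */
	unsigned int size;           /* number of slots, a power of two */
	unsigned int head;           /* producer: next free slot */
	unsigned int tail;           /* consumer: oldest stored packet */
};

/* Producer side: reserve the next free slot, or return NULL if the ring is full. */
static struct pkt_slot *ring_reserve(struct pkt_ring *r)
{
	if (r->head - r->tail == r->size)
		return NULL;
	return &r->slots[r->head++ & (r->size - 1)];
}

/* Consumer side: take the oldest packet, or return NULL if the ring is empty. */
static struct pkt_slot *ring_pop(struct pkt_ring *r)
{
	if (r->head == r->tail)
		return NULL;
	return &r->slots[r->tail++ & (r->size - 1)];
}

For the 30 members from the question (or the 10 'candidates' from the comment above), you can either carve one such ring per member out of a single pre-allocated block, or give each member its own allocation, as corny notes.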
1

Using vmalloc as in corny's answer will be faster with Linux kernel 5.2 (released Q3 2019), because of kernel changes.

From Michael Larabel:

The Linux kernel's vmalloc code has the potential of performing much faster on Linux 5.2, particularly with embedded devices.
Vmalloc is used for allocating contiguous memory in the virtual address space and saw a nice optimization merged today on the expected final day of the Linux 5.2 merge window.

As part of a pull (commit cb6f873) merged minutes ago from Andrew Morton are "large changes to vmalloc, yielding large performance benefits."

The principal change to the vmalloc code is keeping track of free blocks for allocation.
Currently an allocation of the new VA area is done over busy list iteration until a suitable hole is found between two busy areas. Therefore each new allocation causes the list being grown. Due to long list and different permissive parameters an allocation can take a long time on embedded devices (milliseconds).

This patch organizes the vmalloc memory layout into free areas of the VMALLOC_START-VMALLOC_END range. It uses a red-black tree that keeps blocks sorted by their offsets in pair with linked list keeping the free space in order of increasing addresses.

With this patch from Uladzislau Rezki, calling vmalloc() can take up to 67% less time compared to the behavior on Linux 5.1 and prior, at least with tests done by the developer under QEMU.

The commit, as mirrored on GitHub, is here:

It introduces a red-black tree:

/*
 * This augment red-black tree represents the free vmap space.
 * All vmap_area objects in this tree are sorted by va->va_start
 * address. It is used for allocation and merging when a vmap
 * object is released.
 *
 * Each vmap_area node contains a maximum available free block
 * of its sub-tree, right or left. Therefore it is possible to
 * find a lowest match of free area.
 */

With the function:

/*
 * Merge de-allocated chunk of VA memory with previous
 * and next free blocks. If coalesce is not done a new
 * free area is inserted. If VA has been merged, it is
 * freed.
 */
static __always_inline void
merge_or_add_vmap_area(struct vmap_area *va,
    struct rb_root *root, struct list_head *head)

/*
 * Find a place in the tree where VA potentially will be
 * inserted, unless it is merged with its sibling/siblings.
 */

/*
 * Get next node of VA to check if merging can be done.
 */

/*
 * start            end
 * |                |
 * |<------VA------>|<-----Next----->|
 *                  |                |
 *                  start            end
 */
...
/*
 * start            end
 * |                |
 * |<-----Prev----->|<------VA------>|
 *                  |                |
 *                  start            end
 */
VonC