
I observe elements being incorrectly evicted from an eBPF LRU hash map (BPF_MAP_TYPE_LRU_HASH). In the following program I insert one element per second into an LRU hash map with max_entries of 8 and print the map's contents after each insert:

package main

import (
    "fmt"
    "github.com/cilium/ebpf"
    "log"
    "time"
)

func main() {
    spec := ebpf.MapSpec{
        Name:       "test_map",
        Type:       ebpf.LRUHash,
        KeySize:    4,
        ValueSize:  8,
        MaxEntries: 8,
    }

    hashMap, err := ebpf.NewMap(&spec)
    if err != nil {
        log.Fatalln("Could not create map:", err)
    }

    var insertKey uint32

    for range time.Tick(time.Second) {
        err = hashMap.Update(insertKey, uint64(insertKey), ebpf.UpdateAny)
        if err != nil {
            log.Printf("Update failed. insertKey=%d|value=%d|err=%s", insertKey, insertKey, err)
        }

        var key uint32
        var value uint64
        count := 0
        elementsStr := ""

        iter := hashMap.Iterate()

        for iter.Next(&key, &value) {
            elementsStr += fmt.Sprintf("(%d, %d) ", key, value)
            count++
        }
        if err := iter.Err(); err != nil {
            log.Printf("Iteration failed: %s", err)
        }

        log.Printf("Total elements: %d, elements: %s", count, elementsStr)

        insertKey++
    }
}

When I run the above program, I see this:

2023/03/29 17:32:29 Total elements: 1, elements: (0, 0) 
2023/03/29 17:32:30 Total elements: 2, elements: (1, 1) (0, 0) 
2023/03/29 17:32:31 Total elements: 3, elements: (1, 1) (0, 0) (2, 2) 
2023/03/29 17:32:32 Total elements: 3, elements: (3, 3) (0, 0) (2, 2) 
...

Since the map holds eight entries, I would expect the fourth line to show four elements, but it shows only three because entry (1, 1) was evicted.

If I change max_entries to 1024, the problem typically appears after inserting roughly the 200th element, but sometimes later; it's not consistent.

This issue is not limited to creating and updating the map from user space: I see the same problem in my XDP program, which creates the map and inserts into it from the BPF side; the program above simply reproduces the issue from my real program. In that real program, which also has 1024 entries, the problem appeared after inserting the 16th element.

I tested this on our production servers that run Linux kernel 5.16.7.

I do my testing on a Linux VM, and after upgrading its kernel to 6.2.8 I observe a different eviction pattern. For example, when max_entries is 8, I see this:

2023/03/29 20:38:02 Total elements: 1, elements: (0, 0)
2023/03/29 20:38:03 Total elements: 2, elements: (0, 0) (1, 1)
2023/03/29 20:38:04 Total elements: 3, elements: (0, 0) (2, 2) (1, 1)
2023/03/29 20:38:05 Total elements: 4, elements: (0, 0) (2, 2) (1, 1) (3, 3)
2023/03/29 20:38:06 Total elements: 5, elements: (4, 4) (0, 0) (2, 2) (1, 1) (3, 3)
2023/03/29 20:38:07 Total elements: 6, elements: (4, 4) (0, 0) (2, 2) (1, 1) (5, 5) (3, 3)
2023/03/29 20:38:08 Total elements: 7, elements: (4, 4) (0, 0) (2, 2) (1, 1) (6, 6) (5, 5) (3, 3)
2023/03/29 20:38:09 Total elements: 8, elements: (7, 7) (4, 4) (0, 0) (2, 2) (1, 1) (6, 6) (5, 5) (3, 3)
2023/03/29 20:38:10 Total elements: 1, elements: (8, 8)
...

When max_entries is 1024, I notice that after the 1025th element is added, the total drops to 897 elements. I can't test with kernel 6.2.8 on our production servers.

user2233706

1 Answer


The LRU hash map doesn't guarantee that it holds exactly the configured maximum number of items, and the implementation is clearly geared towards providing good performance with far more than 8 items. What I see from a fairly quick glance at the code:

  1. The LRU is separated into two parts, an "active list", and an "inactive list", with a task that moves elements from one to the other periodically depending on whether or not they've been accessed recently. It's not a true LRU (items don't get moved to the head every single time they're accessed).

  2. When the map is full, and something needs to be evicted in order to insert a new item, the code will evict up to 128 items from the inactive list in a single pass; only if the inactive list is empty does it evict a single item from the active list.

  3. There is also a per-CPU "local free list" of allocated items waiting to be filled with data; when that runs empty, it attempts to pull from the global free list, and if that is empty it goes to the eviction path. The target size of the local free list is 4 items.

So the behavior in 6.2.8 seems straightforward and consistent: presumably all of your keys are on the "inactive list" (not too surprising for a scanning-type access pattern, or perhaps it's just that none of them had a chance to get promoted yet), and all of them get tossed out. I'm less clear about 5.16, but it probably relates to the local free list and all of the updates running from the same CPU.
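To make the batching concrete, here's a toy Go model (not the kernel code; the list handling is heavily simplified and the 128-entry batch size is taken from my reading above) that reproduces the 6.2.8 pattern of filling up to eight elements and then collapsing to one:

package main

import "fmt"

const (
    maxEntries    = 8
    evictionBatch = 128 // single-pass eviction batch size, per point 2 above
)

func main() {
    // Keys in insertion order; in this toy model nothing is ever promoted to
    // the "active list", mirroring a scan-type workload where no key is
    // re-accessed before the next insert.
    var inactive []uint32
    present := map[uint32]bool{}

    for key := uint32(0); key < 10; key++ {
        if len(present) == maxEntries {
            // Map is full: evict up to evictionBatch entries from the
            // inactive list in one pass.
            n := evictionBatch
            if n > len(inactive) {
                n = len(inactive)
            }
            for _, k := range inactive[:n] {
                delete(present, k)
            }
            inactive = inactive[n:]
        }
        present[key] = true
        inactive = append(inactive, key)
        fmt.Printf("after inserting %d: %d elements\n", key, len(present))
    }
}

Running it prints counts climbing from 1 to 8 and then dropping back to 1 on the ninth insert, which is exactly the shape of your 6.2.8 log.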

Basically I think that the data type wasn't meant to be used in the way you're using it, and the bug is in your expectations. If you disagree, I think you'll have to take it up with the kernel developers.
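If you need the map to reliably hold a certain working set, one workaround to consider is over-provisioning max_entries so that a batched eviction still leaves the working set intact. Here's a rough sketch with cilium/ebpf; the 128-entry batch and the per-CPU multiplier are assumptions based on my reading above, and the right amount of headroom may vary between kernel versions:

package main

import (
    "fmt"
    "runtime"

    "github.com/cilium/ebpf"
)

// lruMapSpec pads max_entries with enough headroom that one batched eviction
// (up to 128 entries, scaled by CPU count here because the free lists are
// per-CPU) should not touch the `needed` entries you actually care about.
// The batch size and the per-CPU multiplier are assumptions, not guarantees.
func lruMapSpec(name string, needed uint32) *ebpf.MapSpec {
    const evictionBatch = 128
    headroom := evictionBatch * uint32(runtime.NumCPU())
    return &ebpf.MapSpec{
        Name:       name,
        Type:       ebpf.LRUHash,
        KeySize:    4,
        ValueSize:  8,
        MaxEntries: needed + headroom,
    }
}

func main() {
    m, err := ebpf.NewMap(lruMapSpec("test_map", 1024))
    if err != nil {
        panic(err)
    }
    defer m.Close()
    fmt.Println("created LRU map with max_entries =", m.MaxEntries())
}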

hobbs
  • In my actual program I observe this problem when `max_entries` is 1024. What is the minimum `max_entries` for an LRU hash map to be useful? In the above program, I still see this problem if `max_entries` is 4096, although to a lesser degree. – user2233706 Mar 30 '23 at 02:01
  • @user2233706 I would think 256. Can you update your question with the actual behavior you see with 1024 max_entries? And your number of CPUs? I'm not sure if that one matters, but it might. – hobbs Mar 30 '23 at 02:07
  • (If it still deletes everything after 4 or 8 inserts, I'd call that a bug. But if it fills up to, say, 800 then I'd say my answer still applies: it's an imprecise LRU and it batches work to improve throughput.) – hobbs Mar 30 '23 at 02:12
  • In my actual program, the max size is 1024, and I observe this problem happens after adding the 16th entry. This is on 5.16.7. Perhaps 6.2.8 behaves better, but I can't test with that kernel on our real servers. – user2233706 Mar 30 '23 at 13:46
  • @user2233706 Based on your edit, I'd say that the behavior on 6.2 is as-expected (in fact, you're left with exactly 1025 - 128 = 897 elements). If that's good enough for you, then you have a case for "this application requires a kernel upgrade". If the 6.2 behavior isn't acceptable to you, then you either need a redesign to not depend on BPF LRU-hash maps, or you need to convince linux upstream to accept a slower "precise LRU" patch... and *then* you need a kernel upgrade :) – hobbs Apr 03 '23 at 16:11
  • Looking at logs, my best guess is that the behavior change between 5.16 and 6.2 wasn't an intentional "fix" but rather a side-effect of https://github.com/torvalds/linux/commit/86fe28f7692d96d20232af0fc6d7632d5cc89a01 which went into 6.1. – hobbs Apr 03 '23 at 16:46