7

I'm using a SoC with a custom Linux on it. I have reserved the upper 512 MB of the 1 GB total RAM by passing the kernel boot parameter mem=512M. I can access that upper memory from a userspace program by opening /dev/mem and mmap'ing the upper 512 MB, which is not used by the kernel. Now I want to copy big chunks of memory inside this area with memcpy(), but the performance is only about 50 MB/s. When I allocate buffers through the kernel and memcpy() between them, I reach about 500 MB/s. I'm fairly sure this is because caching is disabled for my special memory area, but I don't know how to tell the kernel to enable the cache there.
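
This is roughly what I do (sketch only; the base address 0x20000000 and the chunk sizes are from my setup, error handling shortened):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RES_BASE 0x20000000UL        /* physical start of the reserved upper 512 MB */
#define RES_SIZE (512UL << 20)

int main(void)
{
    int fd = open("/dev/mem", O_RDWR);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    char *mem = mmap(NULL, RES_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, RES_BASE);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    /* copy a 32 MB chunk within the reserved area -- this only reaches ~50 MB/s */
    memcpy(mem + (64UL << 20), mem, 32UL << 20);

    munmap(mem, RES_SIZE);
    close(fd);
    return 0;
}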

Does anybody have an idea how to solve this?

Rafael
  • I don't think caching can explain a 10x discrepancy - you only access each byte once (so no temporal-locality advantage), and in terms of spatial-locality, `memcpy` presumably copies at least word-size (4 bytes) at a time, and a cache line is only 8 bytes. – Oliver Charlesworth May 22 '16 at 17:24
  • I have tried several memcpy implementations: byte by byte, 32 bits at a time, alignment, etc. The 10x improvement is with the same memcpy implementation. – Rafael May 22 '16 at 17:24
  • What is your SoC? CPU arch? Clock speed? Memory speed? What speed do you get for memcpy to regular userspace memory? – Craig Estey May 22 '16 at 17:25
  • I'm using a Xilinx Zynq SoC. It has an ARM Cortex-A9 dual-core processor running at about 700 MHz. When I use memcpy on kernel-allocated source and destination buffers I get about 500 MB/s. I have tried this with 1 MB chunks. When I start using my specially allocated memory as source or destination, the performance drops. It doesn't seem odd to me that caching can make this difference. Have a look at [this article](http://www.embedded.com/design/configurable-systems/4024961/Optimizing-Memcpy-improves-speed) – Rafael May 22 '16 at 17:35
  • That's talking about the situation where the data is *already* in the cache. Is that true in your case? – Oliver Charlesworth May 22 '16 at 17:37
  • I don't think so. Right now I'm using 32 MB chunks with one memcpy(). Do you think the situation gets better if I split that up in a for loop? Does the caching system predict data I may use next and prefetch it? – Rafael May 22 '16 at 17:41
  • You may want to try a memcpy purely in the kernel. Also, do short repeated trials [less than cache size]. Also, do single direction access like memset instead [or equiv `for (; ptr < end; ++ptr) junk = *(volatile int *) ptr;`]. Also, what cache policy is set for the pages [vs userspace], writeback or writethrough? I suspect userspace is writeback but /dev/mem is writethrough, or cache is even disabled for /dev/mem because the kernel has to assume it's an I/O device and can't be cached. You told the kernel _not_ to use it [via mem=], so it knows nothing about the area and assumes minimum traits (a fleshed-out sketch of the read-only test follows these comments) – Craig Estey May 22 '16 at 18:12
  • Hello Craig, thanks for the suggestions. How can I set or even check parameters like the cache policy, or whether caching is disabled/enabled, for /dev/mem? Can I set it when opening the /dev/mem file or during the mmap operation? – Rafael May 22 '16 at 18:25
  • Two cmts from the mem driver: _"Architectures vary in how they handle caching for addresses outside of main memory."_ and _"Remap-pfn-range will mark the range VM_IO"_. Begs the question: Why are you doing it this way? So far, you're not doing anything that `malloc` or ordinary anonymous `mmap` couldn't do [and easier/better]. – Craig Estey May 22 '16 at 19:05
  • Hi Craig, I have separated the upper memory from Linux because I have some special devices that have access to this memory. They store a lot of measurement data into this area. So I need large blocks of contiguous memory, up to 512 MB. I cannot allocate such contiguous regions of physical memory with malloc. I also had a look at CMA, which can be enabled in the kernel. The main disadvantage is the complexity. I would like to do it as simply as possible, but I ran into the performance issues. – Rafael May 22 '16 at 19:30
  • I suspected as much (re. devices). The drivers for your device can do the mmap for you: open /dev/mydev, ioctl(fd,REMAP_AREA_FOR_ME,...). That's the more typical way. The driver can set the correct attributes for the memory, whereas just opening /dev/mem can't. Your driver probably already has a method to do this. – Craig Estey May 22 '16 at 19:57
  • So, how do you tell the device the address of the memory to use? And when? I presume it only dumps data into it if the driver has been opened and the device initialized and activated. – Craig Estey May 22 '16 at 20:20
  • Hi Craig. The devices are custom made. They are custom IPs inside the FPGA part of the chip and have some registers. I have implemented "user space" drivers using the UIO framework. So I map the registers of the devices and configure them. Here I set the address they should use for measurement data. So before I do that I allocate memory in the upper range using physical addresses, then write the available address into the appropriate register and start the measurement. So your suggestion is very good, but I have to implement it myself. My device driver cannot know which address is free... – Rafael May 22 '16 at 20:56
  • ...I have implemented a basic memory manager in userspace. Following your suggestion, I would need to give that address to the driver and from there make it available over the file descriptor, for example /dev/uio1. What do you think? – Rafael May 22 '16 at 20:56
  • Maybe I could make a UIO device that maps the 512 MB of DDR memory instead of FPGA registers and access it through /dev/uiox instead of /dev/mem? Do you think this is feasible? Thanks in advance! – Rafael May 22 '16 at 21:08
  • I've got to go out for a while. But, when I get back I'll have lots to say about this because I have 20+ years doing linux kernel/drivers and specific experience with Xilinx FPGAs that do DMA. And, I think I now have enough to do an answer as the comments take too much space. Per your uio1 comment, I need to think about it, but when I've done similar, if userspace manages the addresses, it feeds them [via ioctl] to the driver, which feeds them to the device, with the driver doing all necessary mappings for userspace _and_ device for a given address. While I'm out, anything more I need to know? – Craig Estey May 22 '16 at 21:18
  • Yes, the comments are quite long already. Maybe just one explanation of why I need memcpy(): the data memory will contain several measurements, filled one after another. When a measurement is deleted, I want to defragment the memory, so the gap gets filled and the remaining free memory is in one piece. – Rafael May 23 '16 at 06:08
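
A fleshed-out sketch of the read-only timing test suggested in the comments (buf and len stand in for the mmap'ed area and the number of bytes to touch; keep len below the cache size for short repeated trials):

#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Touch every word once, read-only, and return the bandwidth in MB/s. */
static double read_bw(const void *buf, size_t len)
{
    const volatile uint32_t *p = buf;
    const volatile uint32_t *end = p + len / sizeof(uint32_t);
    uint32_t junk = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (; p < end; ++p)
        junk += *p;                       /* single-direction access only */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)junk;

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)len / sec / (1 << 20);
}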

3 Answers

5

Note: A lot of this is prefaced by my top comments, so I'll try to avoid repeating them verbatim.

About buffers for DMA, kernel access, and userspace access. The buffers can be allocated by any mechanism that is suitable.

As mentioned, using mem=512M and /dev/mem with mmap in userspace, the mem driver may not set optimal caching policy. Also, the mem=512M is more typically used to tell the kernel to just never use the memory (e.g. we want to test with less system memory) and we're not going to use the upper 512M for anything.

A better way may be to leave off mem=512M and use CMA as you mentioned. Another way may be to bind the driver into the kernel and have it reserve the full memory block during system startup [possibly using CMA].

The memory area might be chosen via kernel command line parameters [from grub.cfg] such as mydev.area= and mydev.size=. That is useful for the "bound" driver that must know these values during the "early" phases of system startup.
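
As a sketch, a bound driver could pick those up as module parameters (the names match the hypothetical mydev.area=/mydev.size= above; when the driver is built into the kernel, the parameters are given on the kernel command line with the mydev. prefix):

#include <linux/module.h>
#include <linux/moduleparam.h>

static unsigned long area;   /* physical base, e.g. mydev.area=0x20000000 */
static unsigned long size;   /* length in bytes, e.g. mydev.size=0x20000000 */

module_param(area, ulong, 0444);
MODULE_PARM_DESC(area, "physical base of the reserved buffer area");
module_param(size, ulong, 0444);
MODULE_PARM_DESC(size, "size of the reserved buffer area in bytes");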

So, now we have the "big" area. Now, we need to have a way for the device to get access and the application to get it mapped. The kernel driver can do this. When the device is opened, an ioctl can set up the mappings, with correct kernel policy.

So, depending on the allocation mechanism, the ioctl can be given address/length by the application, or it can pass them back to the application [suitably mapped].

When I had to do this, I created a struct that described a memory area/buffer. It can be the whole area or the large area can be subdivided as needed. Rather than using a variable length, dynamic scheme equivalent to malloc [like what you were writing], I've found that fixed size subpools work better. In the kernel, this is called a "slab" allocator.

The struct had an "id" number for the given area. It also had three addresses: address app could use, address kernel driver could use, and address that would be given to H/W device. Also, in the case of multiple devices, it might have an id for which particular device it is currently associated with.
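
Something like this, as a sketch (field names are illustrative only, not from a real driver):

#include <linux/types.h>

struct mydev_buf_desc {
    __u32 id;          /* id number of this area/buffer */
    __u32 dev_id;      /* which device it is currently associated with */
    __u64 app_addr;    /* address the app can use (its mmap'ed view) */
    __u64 krn_addr;    /* address the kernel driver can use */
    __u64 hw_addr;     /* physical/bus address given to the H/W device */
    __u64 length;      /* size of this buffer in bytes */
};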

So, you take the large area and subdivide it like this: say, 5 devices, where Dev0 needs 10 1K buffers, Dev1 needs 10 20K buffers, Dev2 needs 10 2K buffers, ...

The application and kernel driver would keep lists of these descriptor structs. The application would start DMA with another ioctl that would take a descriptor id number. Repeat this for all devices.

The application could then issue an ioctl that waits for completion. The driver fills in the descriptor of the just completed operation. The app processes the data and loops. It does this "in-place"--See below.
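
Roughly, the application side could look like this (the ioctl numbers and names are made up here; they would come from the custom driver's header, using the descriptor struct sketched above):

#include <linux/types.h>
#include <sys/ioctl.h>

#define MYDEV_IOC_START_DMA  _IOW('M', 1, __u32)                  /* arg: descriptor id */
#define MYDEV_IOC_WAIT_DONE  _IOR('M', 2, struct mydev_buf_desc)  /* filled in by driver */

static void pump(int fd, __u32 ndesc)
{
    struct mydev_buf_desc done;
    __u32 id;

    for (id = 0; id < ndesc; ++id)                /* prime every buffer in the subpool */
        ioctl(fd, MYDEV_IOC_START_DMA, &id);

    for (;;) {
        ioctl(fd, MYDEV_IOC_WAIT_DONE, &done);    /* blocks until one DMA completes */
        /* process the data in place -- no memcpy */
        ioctl(fd, MYDEV_IOC_START_DMA, &done.id); /* hand the buffer back to the driver */
    }
}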

You're concerned about memcpy speed being slow. As we've discussed, that may be due to the way you were using mmap on /dev/mem.

But, if you're DMAing from a device into memory, the CPU cache may become stale, so you have to account for that. A real device driver has plenty of in-kernel support routines to handle this.

Here's a big one: Why do you need to do a memcpy at all? If things are set up properly, the application can operate directly on the data without needing to copy it. That is, the DMA operation puts the data in exactly the place the app needs it.

At a guess, right now, you've got your memcpy "racing" against the device. That is, you've got to copy off the data fast, so you can start the next DMA without losing any data.

The "big" area should be subdivided [as mentioned above] and the kernel driver should know about the sections. So, the driver starts DMA to id 0. When that completes, it immediately [in the ISR] starts DMA to id 1. When that completes, it goes onto the next one in its subpool. This can be done in a similar manner for each device. The application could poll for completion with an ioctl

That way, the driver can keep all devices running at maximum speed and the application can have plenty of time to process a given buffer. And, once again, it doesn't need to copy it.

Another thing to talk about. Are the DMA registers on your devices double buffered or not? I'm assuming that your devices don't support sophisticated scatter/gather lists and are relatively simple.

In my particular case, in rev 1 of the H/W the DMA registers were not double buffered.

So, after starting DMA on buffer 0, the driver had to wait until the completion interrupt for buffer 0 before setting the DMA registers up for the next transfer to buffer 1. Thus, the driver had to "race" to do the setup for the next DMA [and had a very short window of time to do so]. After starting buffer 0, if the driver had changed the DMA registers on the device, it would have disrupted the already active request.

We fixed this in rev 2 with double buffering. When the driver set up the DMA regs, it would hit the "start" port. All the DMA ports were immediately latched by the device. At this point, the driver was free to do the full setup for buffer 1 and the device would automatically switch to it [without driver intervention] when buffer 0 was complete. The driver would get an interrupt, but could take almost the entire transfer time to set up the next request.

So, with a rev 1 style system, a uio approach could not have worked--it would be way too slow. With rev 2, uio might be possible, but I'm not a fan, even if it's possible.

Note: In my case, we did not use read(2) or write(2) (i.e. the device read/write callbacks) at all. Everything was handled through special ioctl calls that took various structs like the one mentioned above. At a point early on, we did use read/write in a manner similar to the way uio uses them. But, we found the mapping to be artificial and limiting [and troublesome], so we converted over to the "only ioctl" approach.

More to the point, what are the requirements? Amount of data transferred per second. Number of devices that do what? Are they all input or are there output ones as well?

In my case [which did R/T processing of broadcast quality hidef H.264 video], we were able to do processing in the driver and application space as well as the custom FPGA logic. But, we used a full [non-uio] driver approach, even though, architecturally it looked like uio in places.

We had stringent requirements for reliability, R/T predictability, guaranteed latency. We had to process 60 video frames / second. If we ran over, even by a fraction, our customers started screaming. uio could not have done this for us.

So, you started this with a simple approach. But, I might take a step back and look at the requirements, device capabilities/restrictions, alternate ways to get contiguous buffers, R/T throughput and latency, and reassess things. Does your current solution really address all the needs? Currently, you're already running into hot spots [data races between app and device] and/or limitations. Or would you be better off with a native driver that gives you more flexibility (i.e. there might be an as-yet-unknown requirement that forces a native driver anyway)?

Xilinx probably provides a suitable skeleton full driver in their SDK that you could hack up pretty quickly.

Craig Estey
4

Thank you a lot for your time. Your answer is very useful for me. I like the idea of managing the buffers (DMA buffers) from the driver itself.

As you clarified from the source code of /dev/mem: when I use mmap on an area that was excluded from the kernel with mem=512M, the kernel treats it as device memory and disables caching.

I found an intermediate solution for that. I removed the kernel boot argument and added a reserved-memory area in my device tree like this:

/ { 
    reserved-memory {
        #address-cells = <1>;
        #size-cells = <1>;
        ranges;

        my_reserved: databuffer@20000000 {
            /* upper 512 MB: base 0x20000000, size 0x20000000; without a
               "no-map" property the kernel keeps a normal cached mapping
               but does not hand the area to its allocator */
            reg = <0x20000000 0x20000000>;
        };
    };
};

Doing so, I get System RAM of 0x00000000 - 0x3fffffff from cat /proc/iomem, and cat /proc/meminfo reports only about 500 MB, so my area is not used by the kernel.

When I now open /dev/mem and mmap this area, I get about 260 MB/s from memcpy() and about 1200 MB/s from memset(). The area is treated as memory and cached. I still don't know why it is only half the performance of a malloc'ed area, but it is much better.

I think the perfect solution for my case would be something like an improved /dev/mem, say a /dev/cma device driver that allocates buffers from a CMA area I define in the bootargs. On that device I could then set things like the cache and coherency policy through ioctl(). That would give me the opportunity to set those preferences myself for that area.

I found interesting posts on how other people have solved this issue: Continous memory on ARM and cache coherency

Rafael
  • Your 2nd to last paragraph is exactly what I was talking about with a native driver and ioctls. Also, if you were to do what the link specifies, you'd [probably] need a custom driver to do that. As to why it's still 1/2, it might be "backing store". That is, do you get direct map or is the mem just a "paging disk"? See my answer: http://stackoverflow.com/questions/37172740/how-does-mmap-improve-file-reading-speed/37173063#37173063 Read it and the two links within it for more info that _may_ help. – Craig Estey May 24 '16 at 19:21
  • If you did have a custom driver, you could create an ioctl and feed both malloc and mmap /dev/mem addresses to it. It could probe deep, using all the kernel functions necessary to discern any differences in config/policy for the given page, etc. When I did my buffer mgmt driver, it was completely separate from the video data driver. That is, _two_ drivers, and the buffer driver had to be loaded first and the video driver linked to it. That's because the buffer driver was useful [and reusable] on its own--it was also simple (ie. a way to get your feet wet) – Craig Estey May 24 '16 at 19:31
2

Let me introduce a device driver that I made for the same purpose as yours. Please refer to:

https://github.com/ikwzm/udmabuf

udmabuf(User space mappable DMA Buffer)

Overview

Introduction of udmabuf

udmabuf is a Linux device driver that allocates contiguous memory blocks in the kernel space as DMA buffers and makes them available from user space. It is intended that these memory blocks are used as DMA buffers when a user application implements a device driver in user space using UIO (User space I/O).

A DMA buffer allocated by udmabuf can be accessed from user space by opening the device file (e.g. /dev/udmabuf0) and mapping it into the user memory space, or by using the read()/write() functions.

The CPU cache for the allocated DMA buffer can be disabled by setting the O_SYNC flag when opening the device file. It is also possible to flush or invalidate the CPU cache while keeping it enabled.

The physical address of a DMA buffer allocated by udmabuf can be obtained by reading /sys/class/udmabuf/udmabuf0/phys_addr.
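
For example (a minimal usage sketch based on the description above; the 1 MB size is only an example and must match the size the driver was loaded with):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    unsigned long phys_addr = 0;
    size_t size = 1024 * 1024;       /* must match the size given at driver load time */

    /* physical address of the buffer, e.g. to hand to the FPGA logic */
    FILE *f = fopen("/sys/class/udmabuf/udmabuf0/phys_addr", "r");
    if (f) { fscanf(f, "%lx", &phys_addr); fclose(f); }

    /* omit O_SYNC here so the CPU cache stays enabled for the buffer */
    int fd = open("/dev/udmabuf0", O_RDWR);
    if (fd < 0) { perror("open /dev/udmabuf0"); return 1; }

    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    printf("udmabuf0: phys 0x%08lx mapped at %p\n", phys_addr, (void *)buf);

    munmap(buf, size);
    close(fd);
    return 0;
}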

The size of a DMA buffer and the device minor number can be specified when the device driver is loaded (e.g. when loaded via the insmod command). Some platforms allow specifying them in the device tree.

Architecture of udmabuf

Figure 1. Architecture


Supported platforms

  • OS : Linux Kernel Version 3.6 - 3.8, 3.18, 4.4
    (the author tested on 3.18 and 4.4).
  • CPU: ARM Cortex-A9 (Xilinx ZYNQ / Altera CycloneV SoC)
ikwzm