Is it possible to start reading a file from a specific line or byte? Currently I use this code to read the first 4 bytes of a file:

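; Linux x86-64 system call numbers and flags
SYS_READ  equ 0
SYS_OPEN  equ 2
SYS_CLOSE equ 3
SYS_EXIT  equ 60
O_RDONLY  equ 0

section .data
    filename db "file.txt", 0

section .bss
    read_data resb 4

section .text
    global _start

_start:
    mov rax, SYS_OPEN
    mov rdi, filename
    mov rsi, O_RDONLY
    mov rdx, 0              ; mode is ignored without O_CREAT
    syscall                 ; rax = fd, or -errno on failure

    push rax                ; save the fd for close
    mov rdi, rax
    mov rax, SYS_READ
    mov rsi, read_data
    mov rdx, 4
    syscall

    mov rax, SYS_CLOSE
    pop rdi
    syscall

    mov rax, SYS_EXIT       ; exit instead of falling off the end of _start
    xor edi, edi
    syscall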

This code always reads the first 4 bytes, but I want to start reading from other parts of the file, like the middle for example. What do I need to add or change?

Peter Cordes
JJHH

2 Answers

A freshly opened file descriptor starts at position 0. If you keep reading from the same fd in a loop, you'll get successive chunks. (Use a larger buffer like 8 KiB, though, and loop over dwords in user-space, using the value that read returned as an upper limit! A system call is very expensive in CPU time.)
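In C terms, the "big buffer, loop in user-space" pattern might look roughly like the sketch below. The helper name `sum_dwords`, the 8 KiB buffer size, and the choice to sum the dwords are just assumptions for illustration; the point is that the return value of `read` bounds the inner loop, and that a short tail is carried over to the next chunk.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

enum { BUF_SIZE = 8192 };  /* assumed buffer size; anything reasonably large works */

/* Hypothetical helper: add up every complete 4-byte chunk of a file.
   Returns 0 on success, -1 on error. A trailing partial dword is ignored. */
int sum_dwords(const char *path, uint32_t *result)
{
    unsigned char buf[BUF_SIZE];
    size_t have = 0;            /* bytes carried over from the previous read */
    uint32_t acc = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf + have, sizeof buf - have);
        if (n < 0) { close(fd); return -1; }
        if (n == 0)
            break;              /* EOF */
        have += (size_t)n;

        size_t i = 0;
        for (; i + 4 <= have; i += 4) {   /* only touch bytes read() returned */
            uint32_t v;
            memcpy(&v, buf + i, 4);       /* native-endian load */
            acc += v;
        }
        have -= i;                        /* 0..3 leftover bytes */
        memmove(buf, buf + i, have);
    }
    close(fd);
    *result = acc;
    return 0;
}
```

Each `read` here picks up where the previous one stopped, because the kernel advances the fd's file position by the number of bytes returned.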

Is it possible to start reading a file from a specific line or byte?

  • Byte: yes
  • Line: no. In Unix/Linux, the kernel doesn't have an index of line-start byte offsets or any other line-oriented API. The line handling in stdio fgets, for example, is done purely in user-space. There have been some historical OSes with record-based files, but Unix files are flat arrays of bytes. (They can have holes, unwritten extents, and extended attributes... But the kernel APIs for the main file contents only operate on byte offsets.)

If you want to do lines, read a big block and loop forward until you've seen some number of newlines. If you're not there yet, read another block; repeat until you find the start and end of the line number you want, or you hit EOF. x86-64 can efficiently search 16 bytes at a time with pcmpeqb / pmovmskb / popcnt (popcnt requires SSE4.2 or the specific popcnt feature bit).

Or with just SSE2, or when optimizing for large blocks, with pcmpeqb / psadbw (against all-zero) to hsum bytes to qwords / paddd. Then, every so often, check with some scalar code how many lines you've passed. Or keep it simple and branch on finding the first newline in a SIMD vector.

Obviously the slow and simple option is a byte-at-a-time loop that counts '\n' characters - if you know how to do strchr with SSE2 it should be straightforward to vectorize that search using the above suggestions.
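As a C-level sketch of the block-at-a-time newline search (not the SIMD version), something like the following could find the byte offset where a given line starts. `find_line_start` is a hypothetical helper name, lines are assumed to be 1-based, and `memchr` stands in for a hand-written vectorized search, since libc implementations of it are typically vectorized already.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset at which 1-based line `line` starts.
   Returns -1 on error or if the file has too few newlines.
   Note: a line "starting" exactly at EOF (after a final '\n') is reported. */
off_t find_line_start(const char *path, long line)
{
    char buf[8192];
    off_t base = 0;                  /* file offset of buf[0] */
    long newlines_left = line - 1;   /* newlines to skip before the target line */

    if (line < 1)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (newlines_left == 0) {        /* line 1 always starts at offset 0 */
        close(fd);
        return 0;
    }

    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        char *p = buf;
        char *end = buf + n;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            p++;                     /* first byte after the newline */
            if (--newlines_left == 0) {
                close(fd);
                return base + (p - buf);
            }
        }
        base += n;                   /* not there yet: read another block */
    }
    close(fd);
    return -1;
}
```

Once you have the offset, reading the line itself is an ordinary byte-offset read, which is what the rest of this answer covers.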


But if you only want some specific byte positions, you have two main options:

  • seek with lseek(2) before read(2) (see @Nicolae Natea's answer)

  • Use POSIX/Linux pread(2) to read from a specified offset, without moving the fd's file offset for future read calls. The Linux system call name is pread64 (__NR_pread64 equ 17 from asm/unistd_64.h)

    ssize_t pread(int fd, void *buf, size_t count, off_t offset);
    The only difference from read is the offset arg, which is the 4th arg and is thus passed in R10 (not RCX as in the user-space function calling convention). off_t is a 64-bit type, simply passed in a single register in 64-bit code.

Other than the pread64 name in the .h, there's nothing special about the asm interface compared to the C interface; it follows the standard system-calling convention. (It has existed since Linux 2.1.60; before that, glibc's wrapper emulated it with lseek.)
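For reference, here is a minimal C sketch of the same call through the libc wrapper. `read_at` is a hypothetical helper name; the comment shows how the arguments would map onto registers in the asm version, following the convention described above.

```c
#define _XOPEN_SOURCE 700   /* make sure pread is declared under strict -std= modes */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `count` bytes starting at byte `offset`.
   Returns the number of bytes read (0 at or past EOF), or -1 on error.

   Raw x86-64 Linux system call equivalent of the pread line:
     rax = 17 (__NR_pread64), rdi = fd, rsi = buf, rdx = count, r10 = offset */
ssize_t read_at(const char *path, void *buf, size_t count, off_t offset)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, count, offset);  /* fd's own position is untouched */
    close(fd);
    return n;
}
```

For example, `read_at("file.txt", buf, 4, 100)` would fetch bytes 100..103 if the file is at least 104 bytes long, and return a smaller count otherwise.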


There are other things you can do like mmap, or a preadv system call, but pread is most exactly what you're looking for if you have a known position you want to read from.
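If you're curious about the mmap route, a rough C sketch might look like this. `copy_via_mmap` is a hypothetical helper, and mapping the whole file just to copy a few bytes is only for illustration; mmap starts to pay off when you access many positions in the same file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to `count` bytes starting at `offset`
   out of a read-only mapping of the whole file.
   Returns the number of bytes copied, or -1 on error / offset past EOF. */
ssize_t copy_via_mmap(const char *path, void *dst, size_t count, off_t offset)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || offset < 0 || offset >= st.st_size) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    size_t avail = len - (size_t)offset;
    if (count > avail)
        count = avail;              /* never read past the end of the file */
    memcpy(dst, (const char *)map + offset, count);
    munmap(map, len);
    return (ssize_t)count;
}
```

With a mapping, "reading from the middle" is just pointer arithmetic on the mapped address, with no further system calls per access.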

Community
Peter Cordes

Before performing the read, you should call lseek so that the file position is updated.

Something along these lines:

mov     rdi, rax        ; fd
mov     rax, SYS_LSEEK
mov     rsi, <whatever offset you want>
mov     rdx, 0  ; whence = SEEK_SET (0): the offset is measured from the beginning of the file
syscall

Note: RDI will still hold the same fd value after a syscall, so you don't need an extra save/restore for the fd across lseek / read / close.

Tip: It might be easier to write the code in C, compile it with gcc -g -S -fverbose-asm -Og -c main.c, and then look at main.s. (How to remove "noise" from GCC/clang assembly output?). But that will only show the compiler making calls to libc wrapper functions, unless you use inline system call macros like the ones musl libc provides.

Peter Cordes
Nicolae Natea
  • `syscall` doesn't destroy `rdi` so it will still hold the FD after an `lseek` system call. syscalls only destroy RAX (with the return value), and RCX, R11 (used by syscall itself to save RIP/RFLAGS for the kernel to `sysret`) – Peter Cordes Mar 14 '20 at 23:36
  • Ok, you're right. So the 'note' is not useful in this particular case(as long as 'mov rdi, rax' is removed also from the read operation), but otherwise the answer is still valid. – Nicolae Natea Mar 14 '20 at 23:44
  • It's still useful to mention the issue instead of removing it entirely. I improved that for you with an edit. I still prefer my answer which suggests `pread` because it's simpler and more efficient to do that than lseek + read, but this is now a good answer; have an upvote. – Peter Cordes Mar 14 '20 at 23:54