
I apologise if comparisons are not supposed to work this way. I'm new to programming and just curious as to why this is the case.

I have a large binary file containing word embeddings (4.5 GB). Each line has a word followed by its embedding, which consists of 300 float values. I'm simply finding the total number of lines.

For C, I use mmap:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Report the failing call and exit (as in the mmap(2) man page example). */
#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(void)
{
    int fd;
    struct stat sb;
    off_t offset = 0, pa_offset;
    size_t length, i;
    char *addr;
    int count = 0;

    fd = open("processed_data/crawl-300d-2M.vec", O_RDONLY);
    if (fd == -1)
        handle_error("open");

    if (fstat(fd, &sb) < 0)
        handle_error("fstat");

    /* mmap() requires a page-aligned offset. */
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }

    length = sb.st_size - offset;
    addr = mmap(NULL, length + offset - pa_offset, PROT_READ, MAP_SHARED, fd, pa_offset);
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // Timing only this loop
    clock_t begin = clock();
    for (i = 0; i < length; i++) {
        if (*(addr + i) == '\n')
            count++;
    }
    clock_t end = clock();  // stop the clock before printing
    printf("%d\n", count);

    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
    return 0;
}

This takes 11.283060 seconds.

Python:

import timeit

file = open('processed_data/crawl-300d-2M.vec', 'r')
count = 0
start_time = timeit.default_timer()
for line in file:
    count += 1
elapsed = timeit.default_timer() - start_time  # stop the clock before printing
file.close()
print(count)
print(elapsed)

This takes 3.0633065439997154 seconds.

Doesn't the Python code have to examine each character to find newlines? If so, why is my C code so inefficient?

Ujan
  • How is it counting the lines? Doesn't it compare each character to the newline character? If so, how is it different from what I've done in C? – Ujan Jan 28 '18 at 07:56
  • @OlegButuzov yes, and to count the lines it needs to find line endings; not sure what your point is. The difference, I presume, is just coarser buffering in the Python code – PeterT Jan 28 '18 at 07:56
  • Hmm, what happens if you try something like `fscanf` instead? – cs95 Jan 28 '18 at 07:58
  • try `MAP_HUGETLB` or just fread into a larger buffer (I'd suggest a couple of megs) – PeterT Jan 28 '18 at 08:04
  • Since you read the file just once, using `mmap()` may be less efficient than using `fgets()` or POSIX `getline()` to read lines and count them (a sketch of the `getline()` version follows this thread). – Jonathan Leffler Jan 28 '18 at 08:12
  • @cᴏʟᴅsᴘᴇᴇᴅ it's slower still. Naively reading character by character in any way turns out to be a lot slower than the simple Python for loop – Ujan Jan 28 '18 at 08:15
  • You may be interested in this [link](https://stackoverflow.com/questions/3501338/c-read-file-line-by-line). – cs95 Jan 28 '18 at 08:17
  • I highly suspect you haven't enabled optimization in your C program. I tested your code on my machine with a 1 GB file; without optimization, C is slightly slower, and the `time` command shows the user time is huge; with -O2, C is much faster than Python. – llllllllll Jan 28 '18 at 08:34
  • @PeterT `MAP_HUGETLB` gives an "unable to map memory" error for now. Any suggestions? I will eventually try it out anyway. – Ujan Jan 28 '18 at 08:40
  • @cᴏʟᴅsᴘᴇᴇᴅ Thanks, that worked. I'll now have to read up on how getline works – Ujan Jan 28 '18 at 08:42
  • @JonathanLeffler Thank you. getline() works. Time is down to 1 second now. But could you explain why mmap is less efficient if I read the file just once? – Ujan Jan 28 '18 at 08:43
  • @liliscent yes, you're right. Thank you. It does speed up substantially. However, ditching mmap and using getline is faster still. – Ujan Jan 28 '18 at 08:48
  • You could also try using `memchr` to find newlines in the variant with mmap (a sketch appears at the end of this page). – algrid Jan 28 '18 at 08:52
  • FYI, `*(addr+i)` is conventionally spelled `addr[i]`. – n. m. could be an AI Jan 28 '18 at 09:26
  • [Apropos post by one Linus Torvalds on `mmap()`](https://marc.info/?l=linux-kernel&m=95496636207616&w=2) Summary: in some cases `mmap()` can be ***slow***. This is one of those cases. See also [When should I use mmap for file access?](https://stackoverflow.com/questions/258091/when-should-i-use-mmap-for-file-access) – Andrew Henle Jan 28 '18 at 17:25

1 Answer


Hard to say, because I assume it will be heavily implementation-dependent. But at first glance, the main difference between your Python and C programs is that the C program uses mmap. It is a very powerful tool (which you do not really need here...) and as such can come with some overhead. As the reference Python implementation is written in C, it is likely that the loop

for line in file:
    count += 1

will end up as a loop around a tiny function calling fgets. I would bet a coin that a naive C program using fgets will be slightly faster than the Python equivalent, because it will avoid all the Python overhead. But IMHO it is no surprise that using mmap in C is less efficient than fgets in Python.
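
As a rough illustration, here is a minimal sketch of such an fgets-based counter (an assumption about the approach, not code from the question or answer; the filename comes from the question and the 64 KiB buffer size is an arbitrary choice). Because an embedding line (a word plus 300 floats) can run to a few kilobytes, a chunk is counted only when it actually contains the terminating newline:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("processed_data/crawl-300d-2M.vec", "r");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    /* fgets() splits lines longer than the buffer across calls, so count
       a chunk only when it ends a line (i.e. contains '\n'). */
    static char buf[65536];
    long count = 0;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        if (strchr(buf, '\n') != NULL)
            count++;
    }

    fclose(fp);
    printf("%ld\n", count);
    return 0;
}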

Serge Ballesta
  • Furthermore, the use of `mmap()` vs `fgets()` may play merry with the OS's built-in file caching. I'd like to think that any OS worth a dime that's servicing a single `fgets()` would take the trouble to cache ahead whilst doing so. That would speed up the next `fgets()`. The OS may not bother, or may have a different cache strategy, with `mmap`ed files; a file opened that way is much more of a random-access thing than functions like `fgets()` that operate from the current file pointer. – bazza Jan 28 '18 at 13:00
  • In fact, the way `mmap()` is used in the posted code is probably the *worst* way to use `mmap()`. `mmap()` is *expensive* - instead of just copying the data from disk, the kernel also has to create virtual mappings in the process's address space. That's not free. Paying that cost is usually worthwhile only if the data is reused multiple times. Just passing through the file data once is the worst possible way to use `mmap()`. `mmap()` is good if you're reusing the data, or if you want simple code and don't really care about performance. – Andrew Henle Jan 28 '18 at 17:30
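
For completeness, here is a sketch of what algrid's `memchr` suggestion from the question's comments might look like. `count_newlines` is a hypothetical helper name, intended to be called with the `addr` and `length` set up by the question's `mmap()` code; `memchr()` is typically vectorised by the C library, so it scans many bytes per iteration instead of comparing one character at a time.

#include <string.h>

/* Count newlines in a mapped region [addr, addr + length) with memchr(). */
long count_newlines(const char *addr, size_t length)
{
    long count = 0;
    const char *p = addr;
    const char *end = addr + length;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        if (nl == NULL)
            break;          /* no more newlines in the region */
        count++;
        p = nl + 1;         /* resume just past this newline */
    }
    return count;
}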