Different execution orders cause differences in performance of a Pthread program

Question

This is my first post on stackoverflow and my native language is not English. Please excuse me for any inconvenience this post brings to you. Maybe it's a little long, so I am looking forward to your patience. Thanks in advance!

I have a C language code snippet. The job is counting the number of words in two files. I use pthreads to solve this problem. But I find the order of these two statements

count_words(argv[1]);

pthread_create(&t1, NULL, count_words, (void *)argv[2]);

affects the program performance, which it's opposite to what I expected. Here is the code:

#include <stdio.h>
#include <pthread.h>
#include <ctype.h>
#include <stdlib.h>

int total_words;

int main(int argc, char *argv[]) {
    pthread_t t1;
    void *count_words(void *);

    if (argc != 3) {
        printf("usage: %s file1 file2\n", argv[0]);
        exit(1);
    }
    total_words = 0;

    count_words(argv[1]); // program runs faster when executing this first
    pthread_create(&t1, NULL, count_words, (void *)argv[2]);

    pthread_join(t1, NULL);
    printf("%5d: total words\n", total_words);
    return 0;
}

void *count_words(void *f) {
    char *filename = (char *)f;
    FILE *fp;
    int c, prevc = '\0';

    if ((fp = fopen(filename, "r")) == NULL) {
        perror(filename);
        exit(1);
    }
    while ((c = getc(fp)) != EOF) {
        if (!isalnum(c) && isalnum(prevc))
            total_words++;
        prevc = c;
    }
    fclose(fp);
    return NULL;
}

Performance:

I run program using "test program_name" on command line to test the running speed. The output is:

If the order like this:

count_words(argv[1]);

pthread_create(&t1, NULL, count_words, (void *)argv[2]);

program runs fast: real 0.014s

If like this:

pthread_create(&t1, NULL, count_words, (void *)argv[2]);

count_words(argv[1]);

program runs slow: real 0.026s

What I expected:

On case 1, program runs count_word() first. After completing the counting job will it continue to run pthread_create(). At that time, the new thread will help do the counting job. So, the new thread does the job after the origin thread completes the job, which is sequential running instead of parallel running. On case 2, program runs pthread_create() first before any counting, so after that there are two threads parallel do the counting. So I expect the case 2 is faster than case 1. But I am wrong. Case 2 is slower. Could anybody give me some useful info on this?

Note

Please ignore that I don't put a mutex lock on the global variable total_words. This is not the part I am concerned about. And the program is just for testing. Please forgive its imperfections.

Edit 1

Below is the supplement and improvement after I read some suggestions.

a) Supplement: The processor is Intel® Celeron(R) CPU 420 @ 1.60GHz. One core.

b) Improvement: I have improved my example, two changes:

1) I enlarged the files. file1 is 2080651 bytes (about 2M), file2 is the copy of file1.

2) I modified count_words(). When reaching the file end, use fseek() to set the fp to the beginning and count again. Repeatedly counts COUNT times. Define COUNT 20. Below is the changed code:

#define COUNT 20

// other unchanged codes ...

void *count_words(void *f) {
    // other unchanged codes ...
  int i;
  for (i = 0; i < COUNT; i++) {
      while ((c = getc(fp)) != EOF) {
          if (!isalnum(c) && isalnum(prevc))
              total_words++;
          prevc = c;
      }
      fseek(fp, 0, SEEK_SET);
  }
  fclose(fp);
  return NULL;
}

Output of fast_version (count_word() first) and slow_version (pthread_create() first):

administrator@ubuntu:~$ time ./fast_version file1 file2

12241560: total words

real 0m5.057s

user 0m4.960s

sys 0m0.048s

administrator@ubuntu:~$ time ./slow_version file1 file2

12241560: total words

real 0m7.636s

user 0m7.280s

sys 0m0.048s

I tried the "time progname file1 file2" command a few times. Maybe there is some difference on tenth or hundredth of a second on each run. But the differences are not much.

Edit 2

This part is added after I have done some experiments according to some hints --

When you launch the second thread after the first thread completes it's execution, there is no context switching overhead.

--by user315052.

The experiment is that I improved the count_word():

void *count_word(void *f) {
// unchanged codes
// ...
    for (i = 0; i < COUNT; i++) {
        while ((c = getc(fp)) != EOF) {
            if (!isalnum(c) && isalnum(prevc))
                total_words++;
            prevc = c;
        }
        fseek(fp, 0, SEEK_SET);
        printf("from %s\n", filename); // This statement is newly added.
    }
// unchanged codes
// ...
}

Add statement " printf("from %s\n", filename); " , so I can tell which file (or thread) is running at that time. The output of fast version is 20 times " from file1 ", then 20 times " from file2 ", and the slow version is " from file1 " and " from file2 " mixed printed.

It looks like fast version is faster because there is no context switching. But fact is that after count_word() finished, the original thread was not dead, but created a new thread and waited for it to terminate. Is there no context switching when new thread is running? I watched the screen closely and found the printing speed of " from file2 " is apparently slower than " from file1 ". Why? Is it because context switching happened when counting from file2?

For the slow version, we can see from the output the printing speed of " from file1 " and " from file2 " is even slower than the printing speed of " from file2 " in fast version, because its context switching costs more time on parallel counting, while in fast version the context switching is not so heavy as one of the threads has finished its job and just waiting.

So I think the main reason is fast version has a light and easy context switching against the slow version. But the "printing speed" is from my observation, and may be not so strict. So I am not sure about it.

What kind of machine are you running on? Do you actually have two cores? — jxh, Apr 06 '13 at 09:52
What if you comment out the `total_words++`? Do the relative timings remain the same? — NPE, Apr 06 '13 at 09:52
The time is quite short - how do you measure it accurately? Better to run with a much large input. — Roger Rowland, Apr 06 '13 at 09:53
Did you activate optimization option when compiling? If you did not, it is -O2 by default for gcc. I disabled the optimization (-O0) for both example and I had almost the same timing. — Bechir, Apr 06 '13 at 10:18
@user315052 The processor is Intel® Celeron(R) CPU 420 @ 1.60GHz. One core. — Yang Wenhao, Apr 06 '13 at 10:37
@NPE If I comment out the 'total_words++', the both timeings are the same. Fast one is 0.014s and slow one is 0.026s. — Yang Wenhao, Apr 06 '13 at 10:44
@roger_rowland Now my file1 is 2080651 bytes (about 2.0M), my file2 is the copy of file1. I run "test progname" on command line 10 times. The average time of the fast one is about 0.266s and the slow one is about 0.387s. — Yang Wenhao, Apr 06 '13 at 11:05
@ Bechir I ran gcc use default setting. Now I tried gcc -O0, and have both the times nearly the same. — Yang Wenhao, Apr 06 '13 at 11:13
Running pthreads on a single core system is not the way to learn (or test) parallel code. It will hide the majority of the issues, and lead to incorrect assumptions. Any timing results you get will be wildly different from a multicore or SMP system. — Randy Howard, Apr 06 '13 at 14:43
@RandyHoward Yes, you are right. But the good part is we can learn something from it. — Yang Wenhao, Apr 07 '13 at 07:40
@WenhaoYang I'm forced to disagree with you there. You are really doing yourself a disservice by learning threaded programming on a single core system. You will not learn some things that really matter about race conditions, locking, etc., and probably mis-learn some other things because they work when they shouldn't. — Randy Howard, Apr 07 '13 at 16:50

score 10 · Accepted Answer · answered Apr 06 '13 at 10:40

10

In a comment, you wrote:

The processor is Intel® Celeron(R) CPU 420 @ 1.60GHz. One core.

Since you only have one core, your thread executions are serialized anyway. Your program with two threads running concurrently pays the overhead of thread context switching as each performs blocking I/O.

When you launch the second thread after the first thread completes it's execution, there is no context switching overhead.

answered Apr 06 '13 at 10:40

jxh

69,070
8
110
193

Excellent answer - this almost certainly explains the OP's confusion of the timings. +1 – Roger Rowland Apr 06 '13 at 12:14
Thanks for the provided information! Following your hint, I did an experiment and updated my post. Now I think your answer may be not so precise. There seems to be context switching after the first count_word() (not first thread!) completed. And I think the reason is the cost of context switching in slow version is more expensive than the fast version. But I am not sure. What's your opinion? – Yang Wenhao Apr 07 '13 at 07:31
@WenhaoYang: What I meant by "no context switching overhead" is that each thread is allowed to run to completion without switching back and forth between them. Yes, there would be two in the fast version. Switching to the second thread, and switching back after the join. Besides the cost of the context switch itself, there are associated costs to a context switch, such as disrupting the data cache. The slow version is most certainly context switching more than twice, and each time also thrashes the data cache. – jxh Apr 07 '13 at 14:42
@WenhaoYang: To test out my explanation, you can create two more versions of your code. One is to move the call to `count_words()` in `main()` to be after the `pthread_join()`. That should still be fast. The second is to get rid of your `pthread...()` calls, and just call `count_words()` twice in `main()`. This should give you almost the same time as your fast version. – jxh Apr 07 '13 at 15:04
@user315052: You are right. No context switching back and forth. But the experiment you suggested did not behave as you expected. I also can not understand it. So I improved my experiment by using timing function to make it stricter and send a new [post](http://stackoverflow.com/questions/15877403/strange-execution-time-of-the-routine-in-a-pthread-program). You can see my new post for my another question. Thank you for your help! – Yang Wenhao Apr 08 '13 at 11:14

Jens · Answer 2 · 2013-04-06T13:56:53.743

2

How did you measure?

Real time is not an indication of how long your program ran. You have to measure user+system time. What's more, meaningful timing at the millisecond level depends very much on the granularity of your timing clock. If it runs, say, at 60Hz, then you have a problem. Coming up with meaningful benchmarks is an art...

As a start, you should find a way to run your threads in a loop, say, 10.000 times and add up numbers. That'll at least get you out of the millisecond timing problem.

edited Apr 06 '13 at 13:56

answered Apr 06 '13 at 09:52

Jens

69,818
15
125
179

Follow your advice, I changed my code , enlarged files and edited my post. Like I said in the edited post, when count_words() reached file end, return to beginning, do counting again. This procedure runs 20 times. And I also posted both user and system time. Maybe the measurement is not very strict nor precise. But we can conclude the fast one is still the fast one, right? – Yang Wenhao Apr 06 '13 at 12:26

score 2 · Answer 3 · answered Apr 06 '13 at 09:56

2

Try to do the same measure but run your program a 100 times and calculate the average time, with such a short time the effect of caching is far from neglictible for an example.

answered Apr 06 '13 at 09:56

Étienne

4,773
2
33
58

Yes. I edited code. Let count_word() do the counting job repeatedly on the same file 20 times. Now I have longer times. But sorry, I just entered "time progname file1 file2" command several times and observed the result. I found the difference is not much and is enough for me. So I did not calculate the average time. – Yang Wenhao Apr 06 '13 at 12:32

Different execution orders cause differences in performance of a Pthread program

3 Answers3