How to do static code logical analysis with AST tree or other tool?

Question

void f1(char *s)
{
 s[20] = 0;
}
void f2()
{
 char a[10];
 if (x + y == 2) {
 f1(a);
 }
}

Cppcheck will report this message: Array 'a[10]' index 20 out of bounds

How could Cppcheck get the connection between ‘a’ in f2 and ‘s’ in f1?

I have built AST tree, But It only supplies information of each symbol, and give little information to me on the logical relationship of symbols. How could computer know ‘a’ in f2 and ‘s’ in f1 are the same thing? As I know, we have to take so many situations into consideration, such as:

void f1(char *s)
{
 char str_arry[30];
 s= str_arry;
 s[20] = 0;
}

In this case 's' and 'a' are not the same things.

You probably need a data flow graph or something similar in order to conclude that `s` points to `a` for some function call in your code. Have a look at https://stackoverflow.com/a/15755160/5265292 — grek40, Jul 17 '19 at 06:34

score 3 · Answer 1 · answered Jul 17 '19 at 16:45

I don't know how exactly Cppcheck works but I'll tell you how to solve this problem in general. There are two main approaches to the analysis of interrelated functions.

In the first case, when an analyzer meets function call it starts analyzing its body considering value of factual arguments transmitted through the function. This happens naturally only if it is known which values are transmitted to the function. This refers to: an exact value, a range, a set of values, null/non-null pointer, etc. The complexity of the transmitted information depends on the analyzer sophistication. For example, it can start analyzing the function body knowing that two of the transmitted pointers refer to the same array.

It's an excellent accurate approach. But there's a serious problem. The analyzers based on this concept are very slow. They have to analyze functions bodies with different input data sets over and over again. The functions in turn call other ones and so on. And at some point the "inside" analysis has to be stopped which, in practice, makes this approach not that accurate and excellent as it might seem in theory.

There's a second approach. It's based on automatic function annotations. The thing is, when analyzing functions the information on how its arguments are used and which values they can't take is being gazed. Let's consider the simple example that I gave in the article called 'Technologies used in the PVS-Studio code analyzer for finding bugs and potential vulnerabilities'.

int Div(int X)
{
  return 10 / X;
}
void Foo()
{
  for (int i = 0; i < 5; ++i)
    Div(i);
}

An analyzer recognizes that X variable is used in Div function as a divider. Based on it, a special Div function annotation is created automatically. Then it takes into account the fact that a range of [0..4] values is transmitted to the function as the X argument. The analyzer concludes that the division by zero should appear.

This approach is more crude and not that accurate as the first one. But it is very fast and allows to create strong correlations between big amount of functions with no loss of productivity.

It can be much more complicated in practice. For example, the PVS-Studio analyzer uses the second approach as the main one but not always. Sometimes when dealing with template functions we analyze them once more (the first approach). In other words, we use a combined approach to maintain the balance between the depth and speed of analysis.

score 2 · Answer 2 · answered Jul 18 '19 at 16:16

In order to analyze the possible sources of some value, it's a good idea to turn all variables into immutables by introducing a new symbol whenever the original was changed and using the new symbol for all following occurences (the original symbol won't be used after the point where it was re-assigned in the original code).

Consider the following code:

// control flow block 1
int i = 1;
if (some_condition()) {
    // control flow block 2
    i = 2;
}
// control flow block 3
int j = i;

With the control flow graph

[1]
 | \     <- if (some_condition())
 |  [2]
 | /     <- join of control flow after the if block ends
[3]

You could write a list of all symbols that are alive (have a value that is used anywhere later in the control flow graph) at the entry and exit point of a block in the control flow graph:

[1] entry: nothing; exit: i
[2] entry: nothing; exit: i
[3] entry: i; exit: i, j (I assume i, j are re-used after the end of this example)

Notice that [2] entry is empty, since i is never read and always written within block [2]. The problem with this representation is, that i is in the exit list of all blocks but it has different possible values for each block.

So, lets introduce the immutable symbols in pseudo-code:

// control flow block 1
i = 1;
if (some_condition()) {
    // control flow block 2
    i_1 = 2;
}
// control flow block 3
// join-logic of predecessor [1] and [2]
i_2 = one_of(i, i_1);
j = i_2;

Now every variable is coupled exactly to its first (and only) assignment. Meaning, a dependency graph can be constructed by analyzing the symbols that are involved in an assignment

i   -> i_2
i_1 -> i_2
i_2 -> j

Now in case there is any constraint on the allowed value of j, a static checker could require that all predecessors of j (namely i_2, in turn originating from i and i_1), satisfy this requirement.

In case of function calls, the dependency graph would contain an edge from every calling argument to the corresponding parameter in the function definition.

Applying this to your example is straight forward if we only focus on the array variable and ignore changes to the array content (I'm not quite sure to what extent a static checker would track the content of individual array items in order to find danger down the road):

Example 1:

void f1(char *s)
{
    s[20] = 0;
}

void f2()
{
    char a[10];
    if (x + y == 2) {
        f1(a);
    }
}

Transforms to

f1(s)
{
    s[20] = 0;
}

f2()
{
    a = char[10];
    if (x + y == 2) {
        call f1(a);
    }
}

With dependency graph including the passed arguments via function call

a -> s

So it's immediately clear that a has to be considered for the static analysis of the safety of s[20].

Example 2:

void f1(char *s)
{
    char str_arry[30];
    s= str_arry;
    s[20] = 0;
}

Transforms to

f1(s)
{
    // control flow block 1
    str_arry = char[30];
    s_1 = str_arry;
    s_1[20] = 0;
}

With dependency graph

str_arry -> s_1

So it's immediately clear that the only value to be considered for the static analysis of the safety of s_1[20] is str_arry.

score 1 · Answer 3 · answered Jul 17 '19 at 06:17

1

How could Cppcheck get the connection between ‘a’ in f2 and ‘s’ in f1?

They are definitely not the same. One of the following can happen:

You pass a to the function, and CPPcheck continues to remember the size of a, even though you access it with the formal parameter s.

You have to keep in mind that static analysis tools and compilers work differently, with different purposes in mind. Static analysis tools were crated EXACTLY for the purpose of catching things like you presented in your question.

In your second example you have:

s= str_arry;

which removes the connection between s and a.

answered Jul 17 '19 at 06:17

virolino

2,073
5
21

As we all know,It's quite easy to understand s= str_arry; will removes the connection between s and a for human, But How can computer know that? Could you please give me some hints ? – White Wang Jul 17 '19 at 09:22
The address pointed by `s` is not the same as the address pointed by `a`. – virolino Jul 17 '19 at 09:41

How to do static code logical analysis with AST tree or other tool?

3 Answers3