Sunday, January 28, 2018

Static code analysis: Traversing the AST (Abstract Syntax Tree) provided by Clang through its Python-bindings and building a CFG (Control Flow Graph) and a CG (Call Graph) for the C programming language

About three years ago, in 2015, when I was in my computer engineering career, I began to be curious about the static analysis of source code. As I did not know much about the subject, I decided to investigate the main parts of which an analyzer is composed. After searching through the internet, I finished reading from start to finish the Dragon Book, a book from which I learned a lot. After reading this book, I realized that the construction of a static analyzer is very similar to the first half of construction of a compiler, mainly because it is much simpler and efficient to traverse the code in an intermediate representation like an AST (Abstract Syntax Tree) or a CFG (Control Flow Graph) that traverse it in its high-level representation. Taking advantage of the effort that I invested in understanding the construction phases of an analyzer and a compiler, I decided to dedicate my career final project to the construction of a static source code analyzer for the C programming language with the aim of finding vulnerabilities in the code. I managed to complete the tool written in Python, and for those who are interested, here I will talk a little about source code static analysis and the intermediate representations that facilitate it.

The first thing I did before starting to program any line of code, was to define very well the main components that I had to develop to get an appropriate representation of the source code on which to perform the analysis. The components that I defined were the following:
  1. Construction of a lexical analyzer that reads the string of characters that forms the source code and groups them into lexemes.
  2. Construction of a syntactic analyzer or parser that uses the lexemes produced in the previous phase to build a parse-tree.
  3. Construction of a semantic analyzer that uses the previously produced representation to analyze the program in search of inconsistencies and perform type resolution.
  4. Construction of an intermediate representation. In this case construction of an Abstract Syntax Tree (AST).
  5. Construction of a suitable representation for the analysis. In this case construction of a Control Flow Graph (CFG).
  6. Construction of a representation of calls between functions. In this case construction of a Call Graph (CG).
  7. Construction of an analysis module.
  8. Presentation of results.
As you can see, they are a lot of components, and when we talk about developing these components for a programming language that has a grammar like that of C, it becomes anything but easy. 

In a general way, we could group the aforementioned components into two main groups,
  • The construction of the intermediate representation that the static analysis algorithms will use as input.
  • The analysis module itself that will implement these algorithms and will provide the results.
In this post, I will focus only on the first group, the construction of an intermediate representation from the source code. 

Intermediate representations of the source code for the C programming language

An intermediate representation of the source code, such as the AST, CFG or CG, can be used for many more things than for the construction of a static source code analyzer. It can be used to make a compiler, a code generator or, for example, a source code syntax checker.

As I mentioned before, the construction of all these components for a programming language like C, which has a relatively complex grammar, is not trivial, and since we have a finite time in our lives, I decided to investigate to see what components had already been made in Python that I could reuse for my system.

One of the first things I discovered after a lot of searching the web, is that I could skip the construction of the first four components of the list (lexical analyzer, parser, semantic analyzer and AST) thanks to Clang, libclang and their Python bindings. At this point, we will stop a few minutes and explain what each of these concepts are. Some are complex to explain, so I accompany the explanation with some images or diagrams that I found online or in different books.

Lexical Analyzer

It consists of reading the source code provided by the file and transforming it into a sequence of characters called lexemes. For each lexeme the lexical analyzer produces an output of the form: 

                                                 <token-name, attribute-value>

That will be passed to the next module, the parser. In a very general way, the process of recognizing tokens is usually done through the construction of Transition Diagrams. A transition diagram is composed of a set of nodes called states and a set of vertices. Each state represents a condition that can occur during the process of scanning a lexeme that matches a certain pattern. The vertices are labeled with a symbol or a set of symbols.

Figure 1: Transition diagram and tokens

Syntactic analyzer

The parser uses the first component of the token produced by the lexical analyzer to construct an intermediate representation in the form of a tree that represents the grammatical structure of the token. A typical representation is a syntax tree in which the inner node represents an operation, and the children the arguments of the operation. 

Figure 2: Syntax tree

Abstract Syntax Tree (AST)

It is possible to perform certain analyzes in the syntax tree because it contains a direct representation of the source code as written by the programmer. However, to perform complex analysis tasks is not convenient due to: 
  • The nodes of the tree derive directly from the rules of grammar production, and therefore can introduce symbols that exist only for the purpose of making the parsing process easier or to eliminate ambiguities.
  • Includes all the syntactic sugar from the source code of the program
Due to all this, it is convenient to build an Abstract Syntax Tree, which provides a standard version of the program that can be used for analysis.

Figure 3: Abstract Syntax Tree


Semantic Analyzer

The semantic analyzer uses the information in the AST and in the symbol table to check semantic consistency in the source code and complete the content of the AST. A very important part of almost all the tools that implement this type of representations is type checking. For example, in a static analysis tool it is not necessary to report errors of type as a compiler does, but the type information remains of vital importance to determine diverse types of vulnerabilities in the code, such as integer overflows.

Building these representations: Clang, libclang and the Python-bindings

Well, we have seen enough theory to explain all the components which are necessary for the construction of an abstract syntax tree, an intermediate representation of the source code. Now let's see in a practical way what tools we can use to carry it out.

First of all, what is Clang?

As it indicates in its web page
The goal of the Clang project is to create a new C based language front-end: C, C++, Objective C/C++, OpenCL C and others for the LLVM compiler. 
His Code is extremely well written, and I do not know if it will continue to be like that, but at least a few years ago, Apple used Clang for autocomplete functions in development tools such as Xcode.

Second, what is the relationship between libclang and Clang?

According to its documentation
The C Interface to Clang provides a small API that exposes facilities for parsing source code in an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical sources locations with elements within the AST, and other facilities that support Clang-based development tools.

And finally, what are the python bindings?

According to the source code
This module provides an interface to the Clang indexing library. It is a low-level interface to the indexing library which attempts to match the Clang API directly while also being "pythonic". 
What this means is that we can use the interface that Clang provides to generate the Abstract Syntax Tree and traverse it through Python thanks to these Python bindings.

A few years ago, and I think that's still the case, the documentation for libclang and the Python bindings was practically non-existent, apart from what you could find in the source code and the self-generated Doxygen HTML. So I hope that the rest of the article will help someone to clarify a little more with this.

Well, it's worth so much theory, let's get down to work and see what an AST looks like and how we can go through it with Python.

Generating the AST of a C program with libclang and the python-bindings

First, we will define a C program which we will use to extract the different intermediate representations:

Now let's see how to configure the entire environment.

The installation is relatively simple and the only step that can lead to complications is locating libclang.

That´s all!

Before starting to go through the AST, we have to understand how Clang constructs this intermediate representation, what notation it uses and the form it has. For this, it is best to consult the documentation about the Abstract Syntax Tree. In it we can find the following video, which illustrates very well how it is:

Another thing that I like to do is to use the console utility offered by Clang to show the AST of a C file, taking the code shown above as an example, we execute the following sentence:

Clang -Xclang -ast-dump example.c

We will obtain a representation similar to the following:

Figure 4: (Note that the code fragment #include <stdio.h> has been removed to simplify the resulting AST)

The resulting representation has many interesting things, on the one hand, we can see how the result includes the four phases that we talked about at the beginning of the article, from lexical analysis to AST through semantic analysis, since we can see that it has taken out type resolution. On the other hand, the AST generated by Clang has the following characteristics:

The first statement is always a TranslationUnitDecl:

Figure 5: TranslationUnitDecl in the Clang AST

Next, we see some internal statements of Clang until we come to the declaration of our function:

Figure 6: Function Declaration in the Clang AST

If we continue looking at the AST, we can see the declaration of the Main function followed by a CompoundStmt, which indicates the definition of the function:

Figure 8: Function definition in the Clang AST

Inside we see the definition of the rest of the elements of the function.

Going through the AST is quite simple, we just have to start traversing it through the root node, the TranslationUnitDecl, and continue recursively until the end, for this we will use the Python bindings.

Let's start with a simple example, we are going to write a code that simply traverses the entire AST and returns all the existing declarations and the line in the source code where they are:

The first thing we see in the code is the creation of an index. What an index represents is a set of TranslationUnit. Next, we execute the parse method, which invokes the function clang_parseTranslationUnit, and that according to what it says in its comments:

This routine is the main entry point for the Clang C API, providing the ability to parse a source file that can be queried by other functions in the API.

Finally, we obtain an object called a cursor. This object is very important, since it is a libclang abstraction that represents all types of AST nodes and provides a set of common operations. It is the way to walk the tree.

The result of executing the script is as follows:

Figure 9: Result of the execution of the script

We can see how we have obtained all the declarations of variables that were in the AST, some of them from other files that have been included in our example.c. In the end, we see the variables that we have defined and the position they occupy in the source code.

It is interesting to observe all the methods that a cursor object provides us, to see what possibilities we have when going through the AST, for this, we can execute the dir() function of Python on the object and analyze the results:

Figure 10: Methods of a cursor

Some of the relevant methods are:

  • Kind: get the node type
  • Get_children: get the child nodes
  • Get_tokens: provides the tokens previously obtained by the lexical analyzer.
  • Location: provides the position in the source code
  • ...



Taking it to the next level, building the CFG and CG from the AST

At this point, we have built the first four parts of the analyzer and we are able to obtain an intermediate representation, specifically the AST, from the source code. We have seen with a very simple example how the fact of having the source code represented in this way facilitates the analysis of program. From here, we will use the Python bindings of libclang in a much more advanced way, and we will build from the AST that they provide, two intermediate representations that can facilitate the analysis of the source code depending on what we want to perform , the Control Flow Graph and the Call Graph.

First of all, we are going to reintroduce a bit of theory and we will see what these two source code intermediate representations exactly are.

Control Flow Graph

Many static analysis algorithms explore the different branches that the program flow can take when it is executed. To do this efficiently, a Control Flow Graph is built from the Abstract Syntax Tree. The CFG consists of:
  • Nodes: these are basic blocks and consist of a sequence of instructions that will always be executed starting with the first instruction until reaching the last, without the possibility of skipping any of the intermediate instructions.
  • Arcs: Are directed arcs and represent the possible paths that the flow of the program between basic blocks can follow.

Figure 11: Control Flow Graph

Call graph

The call graph represents the possible control flows between functions. It consists of the following components:
  • Nodes: Each node represents a function.
  • Arcs: Are directed arcs. Represents the invocation of a function from another function. 


  Building the Control Flow Graph and the Call Graph

Here is where the thing gets really interesting. After a lot of searching, I did not find any program made in Python that would allow us to obtain a CFG from a C source file, therefore, and as a way to learn how the Clang internals work, I decided to program it myself. For this, I was very influenced by the way in which Clang builds its own CFG, a file that can be found here.

To start programming the CFG and the CG from the AST, we must bear in mind that you must be able to recognize all the constructions of the grammar of the language, because the Python bindings (at the moment in which I made this program) they still did not implement all the constructions that libclang had, I applied the Decorator Pattern in the following way:

In this way, I can interpret those generic cursors that the Python bindings had not yet implemented through the tokens obtained from the lexical analysis and I can assign them a concrete decorator. In this way I can represent all ANSI C although there were some constructs that the Python bindings had not implemented.

Here you have the code that will travel through the AST and generate those concrete decorators for each of the constructions of the language.

(Disclaimer: when I wrote this code, I was a newcomer in Python, therefore I am sure that it can be improved a lot, I have now recovered the code and I will be applying modifications until I have it in the best possible way)

Once we have all the localized language structures we can use them to build the CFG. The construction of the CFG is not trivial, and therefore, it would not make much sense to explain all the code here since there are several thousand lines, so I leave the repository with the code that I programmed to build the CFG in my github repository:

There you can see more advanced uses of python bindings as well as using the CFG built for the C programming language for your own projects.

Finally, the call graph is built from the same base and in a much simpler way, soon I will create the repository where you can see the code.

Last words

As I said at the beginning of the article, I created these programs a while ago when I was in the career as a way to learn, therefore, there are probably numerous architectural or style errors in them. Although I am going to work actively to solve them, any suggestion about it is more than well received.