Wednesday, April 18, 2018

Building a Proxy Fuzzer for the MQTT protocol in a nutshell with Polymorph framework

@santiagohramos

https://github.com/shramos/polymorph


This post shows how easily you can build a fuzzer for the MQTT protocol using the Polymorph framework.

I will start by assuming that the reader knows the MQTT protocol. For those who do not, you can find more information here. The first thing we will do is prepare the environment in which we will perform the fuzzing. In this case it will be very simple: a Kali Linux machine on which we will install the following dependencies:

Polymorph framework
apt-get install build-essential python-dev libnetfilter-queue-dev tshark tcpdump python3-pip wireshark
pip3 install --process-dependency-links polymorph

Mosquitto
apt-get install mosquitto mosquitto-clients

Radamsa
sudo apt-get install gcc make git wget
git clone https://github.com/aoh/radamsa.git && cd radamsa && make && sudo make install

With all this installed, we are ready to start!

Before starting the construction of the fuzzer, we are going to test our Mosquitto installation on localhost. To do this, we open two terminals: in one we run a client that publishes to a certain topic, and in the other we run the mosquitto broker. In the following image you can see the commands and the result.
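For reference, the test amounts to something like the following, with the broker in one terminal and the publishing client in the other (the topic name is arbitrary):

mosquitto -v
mosquitto_pub -h localhost -t "test/topic" -m "hello world"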


Well, now that we have tested the communication between both clients, we are going to open Polymorph and begin with the capture and modification of MQTT packets in real time. 

In our particular case, we are going to fuzz the topic field of the MQTTPublish packets; note that modifying any other field would be done in exactly the same way. Also, for simplicity, we are going to modify MQTT packets carried over IPv4. Sometimes you will see that Polymorph captures the MQTT protocol over IPv6; to temporarily disable IPv6 on the loopback interface you can use the following command:

sudo sh -c 'echo 1 > /proc/sys/net/ipv6/conf/lo/disable_ipv6'

Having said that, let's access the Polymorph main interface; to do that, we only have to run the polymorph command from a Linux terminal.


Once here, let's start with the construction of the fuzzer. Since in this case the client and the broker are on localhost, we do not need to intercept the communication between two machines, so no spoofing technique is required. We can simply use the capture command to start the packet sniffing process.


Our goal at this point is to capture one of the packets that we want to modify, so that the framework converts it into a template and we can work on it. Therefore, while the tool is sniffing packets, we go to our MQTT client and send an MQTTPublish message to the broker.

Once this is done, we press Ctrl-C to finish the sniffing process in Polymorph and use the show command to display the captured packets on the screen.


As you can see, most of the packets include a final Raw layer, which means that, at first glance, they have not been fully interpreted/dissected by the default dissectors. With the dissect command, we use more advanced dissectors that give us a representation of the part of the packets that had not been dissected.


Now that we have a more concrete representation of all the bytes of the packets we have captured, we must choose the template that corresponds to the packet we want to modify. We can use the wireshark command to open that application and perform more detailed filtering. Once the template is selected, we access it using the template command.


Right now, we are in the context of the selected template. With the show command we can see its layers and fields, as well as their types. The template concept is the most important abstraction of the framework: it is what allows users to access the captured packets in real time with a simple syntax in the code they write, and to perform complex processing on them. Furthermore, it is the container in which all the conditional functions and structures of the framework are stored when we save a session.

At this point, the conditional functions come into play (preconditions, postconditions and executions). When the user enters the intercept command in the template interface, the machine that hosts Polymorph will stop forwarding packets at the kernel level and start sending them to the tool to be processed before being forwarded. The conditional functions defined by the user will be executed on each of the intercepted packets.

Let's see a simple example of how these functions work. We are going to add the following precondition to our current template using the command precs -a test_condition.

If you are using the default editor, pico, remember not to mix tabs and spaces; it is better to indent the code using only spaces. (You can specify another editor that is in your PATH using the -e option):
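A Polymorph conditional function is just a Python function that receives the intercepted packet and returns it (or None). A minimal test precondition could be the following sketch; the printed message is arbitrary:

def test_condition(packet):
    # Just log that a packet has passed through the tool
    print("Packet intercepted!")
    # Returning the packet lets the remaining conditional functions run
    return packet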


Enter "y" to keep the code on disk and enter precs -s to visualize the added precondition.


Now, enter the intercept command with the following iptables rule (by default the framework sets a FORWARD rule):
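Since both clients live on localhost, the traffic never traverses the FORWARD chain, so we queue it from the INPUT chain instead. This is the same invocation that we will use again later in the post:

intercept -ipt "iptables -A INPUT -j NFQUEUE --queue-num 1"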


Notice how all the packets that flow through the machine are processed by the tool in real time and the precondition we added is executed on each of them. We can test it just by sending an MQTTPublish message from one MQTT client to the other.

The conditional functions are another important abstraction of the framework, and they work as follows. When a packet is intercepted in real time, the conditional functions defined by the user are executed on it in a fixed order: first the preconditions, in the order in which the user added them to the template, then the executions and finally the postconditions. If at any point one of these functions returns the value None, the execution chain is broken and the packet is forwarded as it is at that moment. On the other hand, if the function returns the packet it received as an argument, the chain of execution continues. Remember that the packet received as an argument is the packet that has been intercepted in real time at that moment.

Once this is understood, we exit the intercept mode in Polymorph by pressing Ctrl-C (this way, the machine that hosts Polymorph forwards the packets without passing them through the tool). After that, we are going to add the following preconditions, executions and postconditions, which, I insist, will be executed on each of the intercepted packets once we start intercepting. To delete the test precondition that we added before, use precs -d test_condition.

Preconditions

Two preconditions have been added using the commands:

precs -a global_vars -e editor
precs -a filter_mqttpublish -e editor

The first precondition, global_vars, creates a global variable that will persist across all intercepted packets. It will be used to store the test cases with which we will fuzz the MQTTPublish packets.
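A minimal sketch of that precondition; I assume here that an ordinary Python module-level global survives between invocations, which is all the fuzzer needs to keep its list of test cases alive:

def global_vars(packet):
    # Create the (initially empty) list of Radamsa test cases once and
    # keep it for every intercepted packet
    if 'fuzz_cases' not in globals():
        globals()['fuzz_cases'] = []
    return packet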


The second precondition, filter_mqttpublish, filters the incoming packets so that only those whose msgtype field equals 48 continue executing the rest of the conditional functions. Notice that thanks to the template abstraction, Polymorph knows the position that the msgtype field occupies within the bytes of the intercepted packet, and therefore the user can access it much more easily.
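Sketched in code, and assuming the template exposes fields with dict-style access (packet['LAYER']['field']; adapt this to the accessor syntax of your Polymorph version):

def filter_mqttpublish(packet):
    # 48 (0x30) is the msgtype of an MQTT Publish control packet
    try:
        if packet['MQTT']['msgtype'] == 48:
            return packet
    except Exception:
        pass
    # Returning None forwards the packet without executing anything else
    return None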


Executions
            
We are going to add one execution function. It is a bit longer than the preconditions, but it remains simple. The piece of code shown below performs the following tasks:
  1. It transforms the intercepted packet into a Scapy representation. We do this to interact more easily with the fields of the MQTT layer, especially the lengths, which are encoded. I wrote the MQTT specification for Scapy a while ago; you can find it here.
  2. It checks whether fuzzing values remain in our list of test cases. The list is stored in the global variable created above. If the list is empty, we invoke Radamsa to generate more test cases and store them in the global variable.
  3. Finally, we use Scapy to insert the fuzzing value into the topic field of the packet and remove that value from our list, so that it is not inserted twice. In addition, we recalculate the control fields, such as lengths and checksums.
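The Radamsa part of that execution can be sketched as follows. The valid_cases and fuzz_cases directory names are the ones created below, and -n/-o are standard Radamsa options; how the resulting bytes end up in the topic field depends on the Scapy MQTT layer and is omitted here:

import os
import subprocess

def generate_fuzz_cases(n=100):
    # Mutate the samples in valid_cases/ into n new test cases under
    # fuzz_cases/, then load them back as a list of byte strings
    samples = [os.path.join('valid_cases', f) for f in os.listdir('valid_cases')]
    subprocess.run(['radamsa', '-n', str(n), '-o', 'fuzz_cases/case-%n'] + samples,
                   check=True)
    cases = []
    for name in sorted(os.listdir('fuzz_cases')):
        with open(os.path.join('fuzz_cases', name), 'rb') as fh:
            cases.append(fh.read())
    return cases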

That's all we need to build a Proxy Fuzzer for the MQTT protocol using Polymorph!

To put it into operation, we create two directories in the path from which we run Polymorph, one called valid_cases and another called fuzz_cases. These directories will be used by Radamsa to read valid test cases and mutate them into cases that may lead to a possible vulnerability. We can add some valid cases like the following ones.
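For example (the topic strings are arbitrary):

mkdir valid_cases fuzz_cases
echo -n "sensors/temperature" > valid_cases/case1
echo -n "home/livingroom/light" > valid_cases/case2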


Once this is done, we simply go to the template interface in Polymorph and enter the intercept -ipt "iptables -A INPUT -j NFQUEUE --queue-num 1" command. After that, we go to our MQTT client and publish a message, preferably with a long value so that there are no problems with the sequence numbers of the TCP/IP session.



We can observe how the packet is modified in real time and the value produced by Radamsa is introduced. What we could do now is write a simple Bash loop and let it run through a significant number of test values. Also, most commonly we would run the application under test inside a debugger, so we can capture the exceptions that occur and analyze them.

Finally, we can use the save command from the template interface to export the template and import it into Polymorph on another machine, so you can share it with your colleagues!

Friday, April 13, 2018

Polymorph: Modifying ICMP packets in real time

@santiagohramos

https://github.com/shramos/polymorph


In this post we are going to see how simple it is to modify the content of ICMP packets using Polymorph.

First of all, let's install the tool and set up the test environment. So that everyone starts from the same point, we will use the Docker deployment that ships with Polymorph:

cd polymorph
docker-compose up -d

Once these commands have been executed, the environment will start to build. When it finishes, we access the three machines with:

docker exec -ti polymorph bash
docker exec -ti bob bash
docker exec -ti alice bash

The three machines are on the same network, which has a very simple structure:
  • Polymorph: 10.24.0.2
  • Alice: 10.24.0.3
  • Bob: 10.24.0.4 

We are going to install the traceroute utility on the bob machine and run it to see the route that packets follow to alice:

apt-get install traceroute
traceroute 10.24.0.3

As you can see, the first and last hop is alice:


Now we move to the polymorph machine and, from any directory, run the polymorph command, which opens the tool. Once it is open, we will use the ARP spoofing module to intercept the communication between alice and bob:


If we now go back to bob and run the traceroute command again, we will see that the communication passes through the polymorph machine:


We now have the communication between the two legitimate machines intercepted; the packets are flowing through the polymorph machine and being forwarded. Let's put the tool into sniffing mode with the goal of capturing an ICMP packet. To do this we use the capture command and, after running it, we ping from bob to alice or vice versa.


We press Ctrl-C to finish the sniffing process, and the tool automatically generates a template for each of the captured packets. We can see them with the show command.


As an aside, note that this first dissection and template generation is done with the Scapy dissectors, which will often not be able to dissect some complex protocols. To use more advanced dissectors on the templates, we would use the dissect command.


Returning to the example, let's look through the templates for one that corresponds to the packet we want to modify and access it with the template command:


We are now in the context of a template, and we can make modifications to it. The template concept is important: it is what allows the user to access packets of different protocols in real time with a simple syntax. Let's see it with the example.
If we run the show command, we can see the content of the template generated from the packet. In this case, we are going to modify the load field of the RAW layer, which is where the data is.


This is where the conditional functions come into play (preconditions, postconditions and executions). When the user enters the intercept command in the context of the template, the polymorph machine will stop forwarding packets at the kernel level and start sending them to the tool to be processed before they are forwarded; the conditional functions we have defined will be executed on each packet that passes through the machine. Let's see a simple example: we add the following precondition to our current template with the command precs -a condición_prueba.

If you use the default editor, pico, remember not to mix tabs and spaces; it is better to indent the code with spaces. (You can specify another editor that is in your PATH with the -e option):


We enter "y" to keep it on disk and enter precs -s to see the added precondition.


Now we enter the intercept command:


Notice how all the packets flowing through the machine are processed by the tool and the precondition we added is executed. Since we are intercepting the communication between the alice and bob machines, if we ping from one to the other, the packets pass through the polymorph machine and the tool executes the precondition on them.

Once this is understood, we exit intercept mode with Ctrl-C (so that the polymorph machine only forwards the packets without passing them through the tool) and we add the following preconditions, executions and postconditions, which, I insist, will be executed on every intercepted packet while we are intercepting (to delete the test precondition: precs -d condición_prueba).

  • Precondition
  • Execution
  • Postcondition


The conditional functions work as follows: first the preconditions are executed in the order in which we added them, then the executions and finally the postconditions. If at any point in the execution of any of the three types the value None is returned, the execution chain is broken and the packet is forwarded as it is at that moment; if, on the contrary, the packet received as an argument is returned, the chain of execution continues. The packet received as an argument is the packet that has been intercepted in real time at that moment.

In this case we have added a precondition that says:
"if the incoming packet has an IP layer with a proto field whose value is 1 (which indicates an ICMP packet), return the packet and keep executing conditional functions; otherwise return None and forward the packet without executing anything else."

The execution says:
"in the RAW layer, in the load field of the packet that passed the precondition, insert this new value. If you look at the template when we run the show command, the type of the load field is hex; it is very important that the values you assign or compare have the same type, otherwise a type error will occur."

Finally, the postcondition says:
"take the packet that reaches me after the precondition and the execution have run on it, transform it into a Scapy packet, recalculate the control fields so that it stays consistent, assign it back to the polymorph packet variable and return it so that it is forwarded."

As there are no more conditions after the postcondition, the packet is forwarded.

Once the functions have been added, we go to the bob machine, install tcpdump with apt-get install tcpdump, and run the following command.
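Something like the following prints the content of every incoming ICMP packet (the interface name may differ inside the container):

tcpdump -i eth0 -n -X icmp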


Once tcpdump is listening, we go to alice and ping bob (we have not yet run the intercept command on the polymorph machine); on bob we should see the content of the packets.


Now we go to the polymorph machine and run the intercept command. We return to alice and ping bob; notice how no packet is lost, but the ICMP content that bob receives is the one introduced by the attacker.



Once finished, we can press Ctrl-C to exit intercept mode in the tool, and with the save command store the template, which we can share or carry to another PC. To import it into the tool we use the command polymorph -t template.json from the command line.



Sunday, January 28, 2018

Static code analysis: Traversing the AST (Abstract Syntax Tree) provided by Clang through its Python-bindings and building a CFG (Control Flow Graph) and a CG (Call Graph) for the C programming language


About three years ago, in 2015, while I was studying for my computer engineering degree, I became curious about the static analysis of source code. As I did not know much about the subject, I decided to investigate the main parts of which an analyzer is composed. After searching the internet, I ended up reading the Dragon Book from start to finish, a book from which I learned a lot. After reading it, I realized that the construction of a static analyzer is very similar to the first half of the construction of a compiler, mainly because it is much simpler and more efficient to traverse the code in an intermediate representation such as an AST (Abstract Syntax Tree) or a CFG (Control Flow Graph) than to traverse it in its high-level representation. Taking advantage of the effort I had invested in understanding the construction phases of an analyzer and a compiler, I decided to dedicate my final degree project to building a static source code analyzer for the C programming language, with the aim of finding vulnerabilities in the code. I managed to complete the tool, written in Python, and for those who are interested, here I will talk a little about static source code analysis and the intermediate representations that facilitate it.

The first thing I did before writing a single line of code was to define very well the main components I had to develop to get an appropriate representation of the source code on which to perform the analysis. The components I defined were the following:
  1. Construction of a lexical analyzer that reads the string of characters that forms the source code and groups them into lexemes.
  2. Construction of a syntactic analyzer or parser that uses the lexemes produced in the previous phase to build a parse-tree.
  3. Construction of a semantic analyzer that uses the previously produced representation to analyze the program in search of inconsistencies and perform type resolution.
  4. Construction of an intermediate representation. In this case construction of an Abstract Syntax Tree (AST).
  5. Construction of a suitable representation for the analysis. In this case construction of a Control Flow Graph (CFG).
  6. Construction of a representation of calls between functions. In this case construction of a Call Graph (CG).
  7. Construction of an analysis module.
  8. Presentation of results.
As you can see, that is a lot of components, and when we talk about developing them for a programming language with a grammar like that of C, it becomes anything but easy.

In a general way, we could group the aforementioned components into two main groups:
  • The construction of the intermediate representation that the static analysis algorithms will use as input.
  • The analysis module itself that will implement these algorithms and will provide the results.
In this post, I will focus only on the first group, the construction of an intermediate representation from the source code. 


Intermediate representations of the source code for the C programming language


An intermediate representation of the source code, such as the AST, the CFG or the CG, can be used for many more things than building a static source code analyzer. It can be used to write a compiler, a code generator or, for example, a source code syntax checker.

As I mentioned before, the construction of all these components for a programming language like C, which has a relatively complex grammar, is not trivial, and since our time is finite, I decided to investigate which components had already been written in Python that I could reuse for my system.

One of the first things I discovered after a lot of searching the web is that I could skip the construction of the first four components of the list (lexical analyzer, parser, semantic analyzer and AST) thanks to Clang, libclang and their Python bindings. At this point, we will stop for a few minutes and explain what each of these concepts is. Some are complex to explain, so I accompany the explanation with some images or diagrams that I found online or in different books.

Lexical Analyzer


It consists of reading the source code from the file and grouping its characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces an output of the form:

                                                 <token-name, attribute-value>

That pair will be passed to the next module, the parser. In a very general way, the process of recognizing tokens is usually done through the construction of transition diagrams. A transition diagram is composed of a set of nodes called states and a set of directed edges. Each state represents a condition that can occur during the process of scanning a lexeme that matches a certain pattern. The edges are labeled with a symbol or a set of symbols.
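For example, for the C statement position = initial + rate * 60, the lexical analyzer would emit a token stream along the lines of <id, 1> <=> <id, 2> <+> <id, 3> <*> <num, 60>, where the attribute of each id token points to its entry in the symbol table.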

Figure 1: Transition diagram and tokens

Syntactic analyzer


The parser uses the first component of the tokens produced by the lexical analyzer to construct an intermediate representation in the form of a tree that reflects the grammatical structure of the token stream. A typical representation is a syntax tree in which each interior node represents an operation and its children represent the arguments of the operation.

Figure 2: Syntax tree

Abstract Syntax Tree (AST)


It is possible to perform certain analyses on the syntax tree because it contains a direct representation of the source code as written by the programmer. However, it is not convenient for complex analysis tasks because:
  • The nodes of the tree derive directly from the grammar's production rules, and can therefore introduce symbols that exist only to make parsing easier or to eliminate ambiguities.
  • It includes all the syntactic sugar from the source code of the program.
Because of all this, it is convenient to build an Abstract Syntax Tree, which provides a standardized version of the program that can be used for analysis.


Figure 3: Abstract Syntax Tree

 

Semantic Analyzer


The semantic analyzer uses the information in the AST and in the symbol table to check the semantic consistency of the source code and to complete the content of the AST. A very important part of almost all tools that use this type of representation is type checking. For example, a static analysis tool does not need to report type errors the way a compiler does, but the type information remains of vital importance for detecting various kinds of vulnerabilities in the code, such as integer overflows.


Building these representations: Clang, libclang and the Python-bindings


Well, we have seen enough theory to understand all the components that are necessary for the construction of an abstract syntax tree, an intermediate representation of the source code. Now let's see, in a practical way, what tools we can use to build it.

First of all, what is Clang?

As its web page indicates:
The goal of the Clang project is to create a new C based language front-end: C, C++, Objective C/C++, OpenCL C and others for the LLVM compiler.
Its code is extremely well written, and I do not know whether this is still the case, but at least a few years ago Apple used Clang for the autocomplete features of development tools such as Xcode.

Second, what is the relationship between libclang and Clang?

According to its documentation
The C Interface to Clang provides a small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools.

And finally, what are the python bindings?

According to the source code
This module provides an interface to the Clang indexing library. It is a low-level interface to the indexing library which attempts to match the Clang API directly while also being "pythonic". 
What this means is that we can use the interface that Clang provides to generate the Abstract Syntax Tree and traverse it from Python, thanks to these bindings.

A few years ago, and I think this is still the case, the documentation for libclang and the Python bindings was practically non-existent, apart from what you could find in the source code and the auto-generated Doxygen HTML. So I hope the rest of the article helps clarify things a little.

Enough theory; let's get down to work and see what an AST looks like and how we can traverse it with Python.


Generating the AST of a C program with libclang and the python-bindings


First, we will define a C program which we will use to extract the different intermediate representations:


Now let's see how to configure the entire environment.

The installation is relatively simple, and the only step that can lead to complications is locating libclang.
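On a Debian-based system, something along these lines usually works (package names and versions vary by distribution; the set_library_file call shown in the script further below is the escape hatch if the bindings cannot find libclang by themselves):

pip3 install clang
apt-get install libclang-dev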


That's all!

Before starting to traverse the AST, we have to understand how Clang constructs this intermediate representation, what notation it uses and what shape it has. For this, it is best to consult the documentation about the Abstract Syntax Tree. In it we can find the following video, which illustrates this very well:


Another thing that I like to do is use the console utility offered by Clang to show the AST of a C file. Taking the code shown above as an example, we execute the following command:

clang -Xclang -ast-dump example.c

We will obtain a representation similar to the following:

Figure 4: (Note that the code fragment #include <stdio.h> has been removed to simplify the resulting AST)

The resulting representation has many interesting things. On the one hand, we can see how the result reflects the four phases we discussed at the beginning of the article, from lexical analysis to the AST, passing through semantic analysis, since we can see that type resolution has been performed. On the other hand, the AST generated by Clang has the following characteristics:

The first declaration is always a TranslationUnitDecl:

Figure 5: TranslationUnitDecl in the Clang AST

Next, we see some internal declarations of Clang until we reach the declaration of our function:

Figure 6: Function Declaration in the Clang AST

If we continue looking at the AST, we can see the declaration of the main function followed by a CompoundStmt, which indicates the definition of the function body:

Figure 8: Function definition in the Clang AST

Inside we see the definition of the rest of the elements of the function.

Traversing the AST is quite simple: we just have to start at the root node, the TranslationUnitDecl, and continue recursively until the end. For this we will use the Python bindings.

Let's start with a simple example: we are going to write a script that simply traverses the entire AST and prints every declaration together with the line in the source code where it appears:
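A minimal sketch of such a script, assuming the clang Python bindings are installed (uncomment and adjust the set_library_file line if libclang is not found automatically):

import clang.cindex

# If the bindings cannot locate libclang on their own, point them at it:
# clang.cindex.Config.set_library_file('/usr/lib/llvm-6.0/lib/libclang.so')

def print_decls(cursor):
    # Print every variable declaration with the line where it appears
    if cursor.kind == clang.cindex.CursorKind.VAR_DECL:
        print('%s declared at line %s' % (cursor.spelling, cursor.location.line))
    # Recurse into the children of this node
    for child in cursor.get_children():
        print_decls(child)

index = clang.cindex.Index.create()   # an index is a set of translation units
tu = index.parse('example.c')         # invokes clang_parseTranslationUnit
print_decls(tu.cursor)                # root cursor: the TranslationUnitDecl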


The first thing we see in the code is the creation of an index. An index represents a set of translation units. Next, we execute the parse method, which invokes the function clang_parseTranslationUnit and which, according to the comments in its source code:

This routine is the main entry point for the Clang C API, providing the ability to parse a source file that can be queried by other functions in the API.

Finally, we obtain an object called a cursor. This object is very important, since it is a libclang abstraction that represents any kind of AST node and provides a set of common operations. It is the way to walk the tree.

The result of executing the script is as follows:

Figure 9: Result of the execution of the script

We can see how we have obtained all the variable declarations that were in the AST, some of them from other files that have been included in our example.c. At the end, we see the variables that we have defined and the position they occupy in the source code.

It is interesting to look at all the methods that a cursor object provides, to see what possibilities we have when traversing the AST. For this, we can call Python's dir() function on the object and analyze the results:

Figure 10: Methods of a cursor

Some of the relevant members are:

  • kind: the type of the node
  • get_children: yields the child nodes
  • get_tokens: provides the tokens previously obtained by the lexical analyzer
  • location: provides the position in the source code
  • ...

 

 

Taking it to the next level, building the CFG and CG from the AST


At this point, we have built the first four parts of the analyzer and we are able to obtain an intermediate representation, specifically the AST, from the source code. We have seen with a very simple example how having the source code represented in this way facilitates the analysis of the program. From here on, we will use the Python bindings of libclang in a much more advanced way, and from the AST they provide we will build two intermediate representations that can facilitate the analysis of the source code, depending on the analysis we want to perform: the Control Flow Graph and the Call Graph.

First of all, let's reintroduce a bit of theory and see exactly what these two intermediate representations of the source code are.

Control Flow Graph


Many static analysis algorithms explore the different branches that the program flow can take when it is executed. To do this efficiently, a Control Flow Graph is built from the Abstract Syntax Tree; a minimal sketch of the data structure follows the figure below. The CFG consists of:
  • Nodes: these are basic blocks, sequences of instructions that are always executed from the first instruction through to the last, with no possibility of skipping any of the intermediate instructions.
  • Arcs: directed arcs that represent the possible paths the program flow can follow between basic blocks.

Figure 11: Control Flow Graph
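In Python, the skeleton of such a graph can be as small as the following sketch (the field names are mine, not the ones used in the project):

class BasicBlock:
    def __init__(self, statements):
        self.statements = statements  # instructions executed straight through
        self.successors = []          # directed arcs to the blocks that may run next

    def add_successor(self, block):
        self.successors.append(block)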

Call graph


The call graph represents the possible control flows between functions. It consists of the following components:
  • Nodes: each node represents a function.
  • Arcs: directed arcs; each one represents the invocation of one function from another.

 

Building the Control Flow Graph and the Call Graph


Here is where things get really interesting. After a lot of searching, I did not find any Python program that would allow us to obtain a CFG from a C source file. Therefore, and as a way to learn how the Clang internals work, I decided to program it myself. For this, I was heavily influenced by the way Clang builds its own CFG, in a file that can be found here.

To build the CFG and the CG from the AST, you must be able to recognize all the constructs of the language's grammar. Because the Python bindings (at the time I wrote this program) still did not implement all the constructs that libclang had, I applied the Decorator pattern in the following way:

This way, I can interpret the generic cursors that the Python bindings had not yet implemented through the tokens obtained from the lexical analysis, and assign each of them a concrete decorator. As a result, I can represent all of ANSI C even though there were some constructs that the Python bindings had not implemented.
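A simplified sketch of the idea (the class and method names are illustrative, not the ones used in the project):

class CursorDecorator:
    # Wrap a libclang cursor and delegate whatever we do not override
    def __init__(self, cursor):
        self._cursor = cursor

    def __getattr__(self, name):
        return getattr(self._cursor, name)

    def tokens(self):
        # The token spellings let us recognize constructs that the bindings
        # expose only as generic cursors
        return [t.spelling for t in self._cursor.get_tokens()]

class WhileDecorator(CursorDecorator):
    # Concrete decorator: identify a while statement by its leading token
    def matches(self):
        toks = self.tokens()
        return bool(toks) and toks[0] == 'while'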

Here you have the code that walks through the AST and generates those concrete decorators for each of the constructs of the language.

(Disclaimer: when I wrote this code I was a newcomer to Python, so I am sure it can be improved a lot. I have now recovered the code and will keep applying modifications until it is in the best possible shape.)

Once we have located all the language constructs, we can use them to build the CFG. The construction of the CFG is not trivial, and it would not make much sense to explain all the code here, since it runs to several thousand lines, so I leave the code I wrote to build the CFG in my GitHub repository:


There you can see more advanced uses of the Python bindings, and you can use the CFG built for the C programming language in your own projects.

Finally, the call graph is built from the same base in a much simpler way; soon I will create the repository where you can see the code.

Last words


As I said at the beginning of the article, I created these programs a while ago, when I was still at university, as a way to learn; therefore, there are probably numerous architectural or style mistakes in them. Although I am going to work actively to fix them, any suggestions are more than welcome.