Source Code Obfuscation
By Romain Tuesday, September 4 2007 - 16:31 UTC - Vulnerabilities - Permalink
Source code obfuscation is actually a powerful tool for testers, whether you use it to obfuscate your bytecode (Java, .NET, etc.) or to increase the complexity of your existing source code.
Working at SAMATE, we also play with, tweak, test and stress source code analyzers, and that is where the connection lies. I'm writing a source code obfuscator in order to increase the complexity of our test cases and see whether the tools still do well.
With good documentation and yaxx, I was able to create one. It currently only adds control flow complexity (and of course renames classes, functions and variables).
You may have heard about obfuscation in the sense of making code unreadable for users. This is not what I'm interested in. I want to modify the actual source code, adding some information to it, some tests... I need the outputs of the original program and the obfuscated one to be the same, otherwise we cannot consider the two source codes equivalent.
So for example if I do:
if ($var == 0) {
    echo 0;
}
I will have the same behavior with this source code:
$x = some_value;
if ($var == 0 or $x*$x < 0) {
    echo 0;
}
Even though they have the same output, the second one is more complicated since it adds another test. That test can never be true (a square is never negative), so it never changes which branch is taken.
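A quick way to convince yourself that the extra test is harmless is to compare both versions over a range of inputs. The sketch below is written in Python rather than PHP, and the names `original`, `obfuscated` and `behave_identically` are made up for illustration; the functions return the output as a string instead of echoing it.

```python
# Sketch only: `original` and `obfuscated` are hypothetical stand-ins for the
# two snippets above. They return what the snippet would print.

def original(var):
    out = []
    if var == 0:
        out.append("0")
    return "".join(out)


def obfuscated(var, x):
    out = []
    # x * x < 0 is never true for a real number x, so the extra
    # test cannot change which branch is taken.
    if var == 0 or x * x < 0:
        out.append("0")
    return "".join(out)


def behave_identically(values, extras):
    """Compare the outputs for every combination of var and the extra value x."""
    return all(original(v) == obfuscated(v, x) for v in values for x in extras)
```

Calling `behave_identically(range(-3, 4), [-7, 0, 2.5])` compares the two versions on a handful of values. This is only a spot check, not a proof of equivalence.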
To see exactly what it does, here is an original source file:
<?php
$b = 0;
$c = "Salut";
$len = strlen($c);
function fct($a) {
    return $a . "_1\n";
}
class T {
    function foo() {
        echo "test\n";
    }
}
echo fct(0);
echo T::foo();
?>
...and here is one of my control flow obfuscating patterns:
<?php
class rand_class_name {
    function rand_func_name_2($rand_name_1) {
        return $rand_name_1 + 1;
    }
}
function rand_func_name_1($rand_name_2) {
    return $rand_name_2 + 1;
}
if (rand_func_name_1(0) > 0 && rand_class_name::rand_func_name_2(0)) {
    $enter_the_new_statement;
}
?>
Together they give this result:
<?php
function HXvE5Plwxp0RSoQM ( $ZMfP98Az96Rq67j6 ) {
return $ZMfP98Az96Rq67j6 + 1 ;
}
class TF03COvMuzXRQcCK {
function Ltghf3a0McCI8RaZ ( $V309os5vQo15ak9b ) {
return $V309os5vQo15ak9b + 1 ;
}
}
$b = 0 ;
$c = "Salut" ;
$len = strlen ( $c ) ;
function fct ( $a ) {
return $a . "_1\n" ;
}
class T {
function foo ( ) {
echo "test\n" ;
}
}
if ( HXvE5Plwxp0RSoQM ( 0 ) > 0 && TF03COvMuzXRQcCK :: Ltghf3a0McCI8RaZ ( 0 ) ) {
echo fct ( 0 ) ;
}
if ( HXvE5Plwxp0RSoQM ( 0 ) > 0 && TF03COvMuzXRQcCK :: Ltghf3a0McCI8RaZ ( 0 ) ) {
echo T :: foo ( ) ;
}
?>
First of all, the engine works only on the Abstract Syntax Tree (AST), which allows powerful manipulation and code refactoring. The idea is to take a couple of transformation patterns (the second source code above is in fact a complicated one) and fit these patterns to the original source code.
The patterns are meta code. You can see that they are written in PHP and use names such as $rand_name_1. This means that the engine generates one unique name for each of them and substitutes it before the actual refactoring.
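The renaming step can be sketched as follows. This is not the tool's code (the tool is written in C++ and works on the AST); it is a small Python illustration that works on the pattern text. It assumes the placeholder spellings seen in the pattern above, and the names `fresh_name`, `instantiate` and `taken` are invented for the example.

```python
import random
import re
import string

# Placeholder spellings taken from the pattern above: rand_name_N,
# rand_func_name_N and rand_class_name (the number is optional).
PLACEHOLDER = re.compile(r"\brand_(?:func_|class_)?name(?:_\d+)?\b")


def fresh_name(taken, rng, length=16):
    """Return a random identifier that is not already in `taken`."""
    while True:
        # The first character is a letter so the result is a valid identifier.
        name = rng.choice(string.ascii_letters) + "".join(
            rng.choice(string.ascii_letters + string.digits)
            for _ in range(length - 1)
        )
        if name not in taken:
            taken.add(name)
            return name


def instantiate(pattern, taken, rng=None):
    """Replace every placeholder with one unique name, consistently."""
    rng = rng or random.Random()
    mapping = {}

    def substitute(match):
        key = match.group(0)
        if key not in mapping:
            mapping[key] = fresh_name(taken, rng)
        return mapping[key]

    return PLACEHOLDER.sub(substitute, pattern), mapping
```

Passing the identifiers already used by the original program as `taken` (for the example above, `{"fct", "T", "foo", "b", "c", "len", "a"}`) keeps the generated names from colliding with them.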
Selecting what I want to obfuscate is not a real problem, but for now I only select the top-level statements and apply the whole modification to each of them.
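Fitting a pattern to the top-level statements can be pictured with a toy AST. The sketch below is in Python and uses plain tuples as nodes, which is an assumption made for brevity and not the tool's real data structure. `HOLE` plays the role of $enter_the_new_statement, and selecting only the echo statements is a guess that happens to reproduce the result shown above.

```python
# Toy AST: every node is a tuple whose first item is the node kind.
HOLE = ("placeholder",)  # stands for $enter_the_new_statement;


def fill(node, statement):
    """Return a copy of `node` with every HOLE replaced by `statement`."""
    if node == HOLE:
        return statement
    if isinstance(node, tuple):
        return tuple(fill(child, statement) for child in node)
    return node  # leaves (plain strings) are kept unchanged


def obfuscate_top_level(program, guard, selected):
    """Wrap each selected top-level statement in its own copy of `guard`."""
    return [fill(guard, stmt) if selected(stmt) else stmt for stmt in program]


program = [
    ("assign", "$b", "0"),
    ("echo", "fct(0)"),
    ("echo", "T::foo()"),
]
# The guard condition is written as a plain string to keep the example short.
guard = ("if", "f(0) > 0 && C::g(0)", HOLE)
result = obfuscate_top_level(program, guard, lambda s: s[0] == "echo")
```

Each selected statement ends up inside its own copy of the `if`, while the assignment is left untouched and the pattern itself is not modified.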
A small schema explaining roughly how it works is available here: schema_obfuscation.png
The control flow obfuscating pattern applied here is one of the many I have for now (many more to come), and I guess this is kinda promising; lots of interesting studies should come now.
Currently the tool is only for PHP, but I should make it general by using my own AST node names, and then be able to do code transformations on C, C++, Java, etc.
There is no release of the tool (written in C++) right now; I will wait until it's more than correct and clean. I also need to do data obfuscation (using indirections etc.). The program will of course be public and free for everybody when it's ready.
Comments
comments about this at IRC..
<sirdarckcat> it would be nicer if the random generated vars wouldn't be so big
<sirdarckcat> it could aid to compress code
<kuza55_> well, its not meant to compress code - there are compression algorithms for that, :p - its meant to add more data to confuse people, e.g. unnecessary and confusing control structures
<sirdarckcat> hehe, it would be more confusing if the variable/fnction names were similar
<sirdarckcat> for example..
<sirdarckcat> OIOo1lO0OlO0OI10O
Greetz!!
As kuza55 said, the goal is only to modify the complexity of the source code. My target is really messing with source code security scanners... It's actually kinda easy to make them go crazy over the complexity...
So, frankly, when thinking about the structure of the program (and the AST behind it), the names are not directly a real problem (maybe when it comes to the context of the variables, but I guess no more than that).
Btw, I used random names for functions/variables only to avoid collisions with other functions/variables in the actual source code.