Okay, I know about the halting problem and all that theoretical stuff... But now that I'm actually working on a source code scanner myself, I have to say:

Damn! It's so complicated to write a source code scanner!

Dataflow is a real pain in the ass, and we know it's impossible to compute a complete and exact dataflow in general. But well, we need to approximate it somehow. Dataflow is the harder problem in theory, but what about control flow? Not really easier! I mean... it is easier, but there are so many things to understand, so many patterns to recognize in order to build a model of the source code... And I'm not even talking about interprocedural stuff, multi-file source code, etc.

So, I'd like to apologize to... well, I don't remember who these people are, but some source code scanners are good :) Well... for the moment! I'm really looking forward to seeing more high-tech stuff and AI in this kind of program...

Anyway, I'm currently building a core engine that works on an AST generated by yaxx (XML version). I have two short-term targets:

  • Real obfuscation (from one source code to an equivalent one with a different control flow... yes, not only renaming the variables, functions, classes, etc.)
  • A variable tracer (a tool for pen-testers: $_GET['foo'] -> ($foo <- htmlentities()) -> echo, or this kind of stack...)
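
To give an idea of what the variable tracer is supposed to report, here is a very rough sketch in Python. It is not the engine itself and it does not read the yaxx XML: the `Assign` and `Echo` classes, the `SOURCES` tuple and the `trace` function are all made up for the illustration, and the "program" is just a flat list of statements (no branches, no loops, no functions), which is exactly the part that is easy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

# Hypothetical list of user-controlled PHP superglobals treated as sources.
SOURCES = ("$_GET", "$_POST", "$_COOKIE")


@dataclass
class Assign:
    """Toy statement: target = func(source), or target = source if func is None."""
    target: str
    source: str
    func: Optional[str] = None


@dataclass
class Echo:
    """Toy statement: echo var (the sink we care about)."""
    var: str


Stmt = Union[Assign, Echo]


def trace(program: List[Stmt]) -> List[List[str]]:
    """Return, for each echo of a tracked variable, the chain from source to sink."""
    history: Dict[str, List[str]] = {}  # variable name -> chain that produced it
    reports: List[List[str]] = []
    for stmt in program:
        if isinstance(stmt, Assign):
            if stmt.source.startswith(SOURCES):
                chain = [stmt.source]                 # fresh user input
            elif stmt.source in history:
                chain = list(history[stmt.source])    # propagate an existing chain
            else:
                history.pop(stmt.target, None)        # overwritten by untracked data
                continue
            if stmt.func:
                chain.append(stmt.func + "()")        # remember the filter applied
            chain.append(stmt.target)
            history[stmt.target] = chain
        elif isinstance(stmt, Echo) and stmt.var in history:
            reports.append(history[stmt.var] + ["echo"])
    return reports


# Toy equivalent of:
#   $foo = htmlentities($_GET['foo']); $bar = $foo; $baz = 'const';
#   echo $bar; echo $baz;
program = [
    Assign("$foo", "$_GET['foo']", "htmlentities"),
    Assign("$bar", "$foo"),
    Assign("$baz", "'const'"),
    Echo("$bar"),
    Echo("$baz"),
]

for chain in trace(program):
    print(" -> ".join(chain))
# prints: $_GET['foo'] -> htmlentities() -> $foo -> $bar -> echo
```

The pen-tester then reads the chain and decides whether the filter in the middle is good enough for the sink. Everything that makes the real thing painful (conditions, loops, function calls, includes) is deliberately left out here.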