Subscribe to the RSS feed

Keyword - C

Entries feed - Comments feed

Sunday, August 10 2008

Why the "line of code" is indeed a good metric

When I first learned about source code metrics, I was amazed about people using the line of code for doing comparison with software. It was for me a lack of imagination.

At the beginning of the week, I started a small and fast experiment: extracting metrics from the SATE 2008 test cases. This experiment focuses on function-wise properties and therefore, I have to extract for each functions a couple of metrics:

  • McCabe's cyclomatic complexity which computes the code complexity, this is indeed a good metric to estimate the difficulty that a human will have to understand a given piece of code (very important for security related problems)
  • Line of Code
  • Line of Comments
  • Number of local variables
  • Number of parameters (which represents the coercion between the function and the whole program)
  • Number of function call
  • Number of function that are ``sources''
  • Number of function that are ``sinks''
  • Number of C standards functions (obviously, only for C test cases)

At first the the line of code was implemented cause it's an easy one to compute and it also gives an important value if we want to normalize the other metrics. We also decided to introduce the number of ``source/sinks'' for studying input validation weaknesses later on...

Anyway, after running some statistics on the output results, I was amazed by observing that the Pearson correlation coefficient between McCabe and Line of Code was never less than 0.90 (which could be compare to 90% as a correlation rate) (but I have to say that there is huge limitations in the parsers we are using for extracting information, for instance, the C is not pre-processed etc.). This result is only valid for C test cases, actually, the average of observed correlation in Java test case is around 0.60...

Of course further statistical analysis will be necessary to conclude anything on this subject, but if we were unlucky with the test cases selection, this may have been a source of the problem, but I don't think we were. Actually, this seems quite logical to think that these metrics a related, the longer the code is, the more complex in term of tests, loops etc. it can be, there is indeed more chance that a longer code contains more cycles :)

Oh well, I'll keep writing about especially since I expect to get results pretty soon...

Monday, July 28 2008

Trie based fast and massive replacement (Algorithm)

While working on the C++ version of scalp, I had to do massive simple transformations of a given text, ie. replacements of words by others.

Since the main way to do this (a loop which does a replacement at the time), is very inefficient, I decided to find something faster. I then came up with a tree based replacement algorithm; I believe this is kinda famous but I never heard about such algorithm, it basically uses a non compact trie in order to have an efficient search of the current word.

The main algorithm is very simple and similar to a state machine where the state depends on the next character in the trie. For example, if we want to to replace the words: "ba", "me", "mp" in a text, the trie will be this following one:

The idea is then to iterate over all the characters in the text, and for each letter determines whether this is a possible word to replace or not (simply by looking if the letter is a child of the trie root). Then, we iterate over the next letters in the text in order to see if the sequence of letters are an actual word to replace or not (every time, the same methodology is used: look in the children at the current state of our iterator in the trie).

This algorithm seems more efficient than the simple replace used in a loop since we will perform a descent in a tree and therefore replace a linear search by a logarithm one.


I ran a little statistical comparison between two algorithms: mine and the simple loop one. The test bed is quite simple and uses randomly generated text which contains the words to replace with a certain density. In order to create statistics, I made all the sizes varying and I aggregated the results from the same dictionary size. So, for a given size of a dictionary (let's say, 200 words to replace), a text has been generated with a density that vary from 0.1 to 0.5 (from 10% to 50% of the words in the text will be words to replace) and finally, the size of the text vary from 25 to 200 words (and words are randomly generated to be from a size 5 to 32).
As I said previously, the results from a same dictionary size has been aggregated since I've seen practically that the result mainly depends on the dictionnary size (it also obviously depends on the size of the text, but as this is a constant for the 2 algorithm, I can compute the mean of the different data to extract the average gain for a particular dictionary size).

Finally, here is the curve that shows the logarithm progress of the gain compared to the classical method):

The reference replace implementation which has been compared to the one I developed is the following (STL/C++ implementation):

void str_replace(string& where, const string& what, const string& by) {
  for (string::size_type i  = where.find(what);
                                 i != string::npos;
                                 i  = where.find(what, i + by.size()))
    where.replace(i, what.size(), by);
}
and has been used M times (M is the size of the dictionary).
I also decided to release a very-early version of this replace algorithm (which is not template yet): stree.h which use the great STL friendly tree structure from Kasper Peeters.

As for data information, the here is the code I used to generate the dictionary, and text with a certain density: genRandData.cpp
I <3 Bots!