When I first learned about source code metrics, I was amazed that people used lines of code to compare software. To me, it showed a lack of imagination.
At the beginning of the week, I started a small, quick experiment: extracting metrics from the SATE 2008 test cases. This experiment focuses on function-wise properties, so I have to extract a handful of metrics for each function:
- McCabe's cyclomatic complexity, which measures the complexity of the code. It is a good metric for estimating how difficult a given piece of code will be for a human to understand (very important for security-related problems)
- Line of Code
- Line of Comments
- Number of local variables
- Number of parameters (which represents the coupling between the function and the whole program)
- Number of function calls
- Number of functions that are ``sources''
- Number of functions that are ``sinks''
- Number of C standard functions (obviously, only for C test cases)
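To make a few of these metrics concrete, here is a rough sketch of how they could be approximated from the raw text of a single C function. It is not the extractor used in the experiment: the name `function_metrics`, the regular expressions, and the choice of what counts as a decision point are all assumptions made for illustration. Like the parsers mentioned in this post, it does not pre-process the C code, and it will be fooled by comment markers inside string literals.

```python
import re

# Keywords and operators counted as decision points (an assumption of this sketch).
DECISION_RE = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")
# Block comments and line comments in C source text.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def function_metrics(source: str) -> dict:
    """Return rough metrics for the text of one C function (hypothetical helper)."""
    comments = COMMENT_RE.findall(source)
    # Count the physical lines covered by each comment.
    comment_lines = sum(c.count("\n") + 1 for c in comments)
    # Remove comments so that keywords inside them are not counted as code.
    code = COMMENT_RE.sub("", source)
    # Lines of code: non-blank lines left once comments are removed.
    loc = sum(1 for line in code.splitlines() if line.strip())
    # Cyclomatic complexity approximated as decision points plus one.
    mccabe = len(DECISION_RE.findall(code)) + 1
    return {"loc": loc, "comment_lines": comment_lines, "mccabe": mccabe}
```

Calling `function_metrics(text_of_one_function)` returns a small dictionary per function, which is a convenient shape for running statistics afterwards.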
At first, the line of code metric was implemented because it's an easy one to compute, and it also provides an important value if we want to normalize the other metrics. We also decided to introduce the number of ``sources/sinks'' to study input validation weaknesses later on...
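Normalizing here simply means dividing each metric by the number of lines of code of the function. A minimal sketch, assuming the per-function metrics are stored in a dictionary with a `loc` entry (the name `normalize_by_loc` and the dictionary keys are hypothetical):

```python
def normalize_by_loc(metrics: dict) -> dict:
    """Divide every metric by the function's line count (hypothetical helper)."""
    loc = metrics["loc"]
    if loc <= 0:
        raise ValueError("a function needs at least one line of code")
    # Keep every metric except the line count itself, expressed per line of code.
    return {name: value / loc for name, value in metrics.items() if name != "loc"}
```

This makes a 10-line function and a 500-line function comparable: what is compared is the density of calls or branches rather than their raw counts.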
Anyway, after running some statistics on the output, I was amazed to see that the correlation coefficient between McCabe and Line of Code was never less than 0.90, that is, a correlation of about 90%. (I have to say that there are huge limitations in the parsers we use to extract this information; for instance, the C code is not pre-processed.) This result is only valid for the C test cases; the average correlation observed in the Java test cases is around 0.60...
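For readers who want to try this on their own data, a correlation between two per-function metrics can be computed as follows. This is a minimal sketch that assumes Pearson's coefficient; the statistic actually used in the experiment is not stated in this post, and the function name `pearson` is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the Line of Code value of each function and `ys` its McCabe value. The result lies between -1 and 1, and values close to 1 mean that the two metrics grow together.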
Of course, further statistical analysis will be necessary to conclude anything on this subject. If we were unlucky with the selection of test cases, that could have skewed the result, but I don't think we were. Actually, it seems quite logical that these metrics are related: the longer the code is, the more complex it can be in terms of tests, loops, etc., and longer code has more chances to contain more cycles :)
Oh well, I'll keep writing about this, especially since I expect to get results pretty soon...