Thursday, January 17, 2008

Reusable Component Identification from Existing Object-Oriented Programs

Software reuse comprises many different strategies, ranging from the technical perspective to the organizational and managerial ones. Among the technical factors in software reuse, a reusable asset repository plays an important role in reuse programs, since it stores valuable, experience-based knowledge. Despite its benefits, an asset repository must be populated with reusable artifacts in order to be useful to developers; otherwise, its adoption is seriously compromised. On the other hand, already developed software is available from several open repositories on the internet and from companies' own private repositories. The effort needed to identify reusable artifacts in these existing sources must therefore be considered.

My master's dissertation is motivated by this problem. We are trying to answer questions such as: how can we assist engineers in identifying component candidates in existing source code? What kinds of heuristics and metrics should be blended (and how) to get better results? How can we make the process scale to large systems?

We have analyzed component identification techniques and tools, mainly focused on software clustering. One early approach was presented by Caldiera and Basili in 1991, in a paper entitled "Identifying and Qualifying Reusable Software Components". They proposed cost, quality and usefulness as reusability factors, to be assessed through cyclomatic complexity, regularity, reuse frequency and code volume metrics. The approach was fully automated in a tool called CARE.
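To make the idea concrete, here is a rough sketch (my own illustration, not the actual CARE formulas) of how such metrics could be combined into a simple accept/reject filter for candidate modules; all names and thresholds below are invented for the example.

```java
import java.util.List;

public class ReusabilityFilter {

    /** Per-module measurements corresponding to the metrics named in the paper. */
    static class ModuleMetrics {
        final String name;
        final int cyclomaticComplexity;  // number of independent paths through the code
        final double regularity;         // how "regular" the implementation is, normalized to 0..1
        final int reuseFrequency;        // how many distinct places already call the module
        final int codeVolume;            // size of the module, e.g. Halstead volume or LOC

        ModuleMetrics(String name, int cc, double reg, int freq, int vol) {
            this.name = name;
            this.cyclomaticComplexity = cc;
            this.regularity = reg;
            this.reuseFrequency = freq;
            this.codeVolume = vol;
        }
    }

    /** Flags a module as a reuse candidate when every metric falls inside an acceptance band. */
    static boolean isCandidate(ModuleMetrics m) {
        return m.cyclomaticComplexity <= 10                  // cheap to understand and test
            && m.regularity >= 0.7                           // predictable, well-structured code
            && m.reuseFrequency >= 3                         // already used from several call sites
            && m.codeVolume >= 50 && m.codeVolume <= 2000;   // big enough to be worth packaging
    }

    public static void main(String[] args) {
        List<ModuleMetrics> modules = List.of(
            new ModuleMetrics("DateParser.parse", 4, 0.9, 12, 300),
            new ModuleMetrics("ReportGenerator.run", 35, 0.4, 1, 5000));
        for (ModuleMetrics m : modules) {
            System.out.println(m.name + " -> candidate? " + isCandidate(m));
        }
    }
}
```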

Another method, which identifies architectural component candidates in a hierarchy of procedural modules, was proposed by Girard and Koschke in 1997. Dominance analysis of the call-graph relation is performed to group functions and variables into modules and subsystems that serve as component candidates. In short, dominance analysis identifies nodes in a graph that can be grouped together according to the degree to which one node "dominates" the others.
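To illustrate the idea (a toy sketch of call-graph dominance, not Girard and Koschke's actual implementation), the classic iterative algorithm below computes, for each function, the set of functions that dominate it; a function that dominates a whole group of callees is a natural entry point for a module.

```java
import java.util.*;

public class DominanceAnalysis {

    /**
     * Iteratively computes Dom(n) = {n} ∪ (intersection of Dom(p) over all callers p of n),
     * starting from Dom(entry) = {entry}. Assumes every function appears as a key in the
     * call graph and is reachable from the entry function.
     */
    static Map<String, Set<String>> dominators(Map<String, List<String>> callGraph, String entry) {
        Set<String> all = callGraph.keySet();

        // Build the reverse map: which functions call a given function.
        Map<String, List<String>> callers = new HashMap<>();
        all.forEach(f -> callers.put(f, new ArrayList<>()));
        callGraph.forEach((caller, callees) ->
                callees.forEach(callee -> callers.get(callee).add(caller)));

        Map<String, Set<String>> dom = new HashMap<>();
        all.forEach(f -> dom.put(f, new HashSet<>(all)));   // start from "everything dominates f"
        dom.put(entry, new HashSet<>(Set.of(entry)));

        boolean changed = true;
        while (changed) {
            changed = false;
            for (String f : all) {
                if (f.equals(entry)) continue;
                Set<String> updated = new HashSet<>(all);
                for (String caller : callers.get(f)) updated.retainAll(dom.get(caller));
                updated.add(f);
                if (!updated.equals(dom.get(f))) { dom.put(f, updated); changed = true; }
            }
        }
        return dom;
    }

    public static void main(String[] args) {
        // Tiny hypothetical call graph: main -> parse -> lex, main -> report.
        Map<String, List<String>> g = Map.of(
            "main",   List.of("parse", "report"),
            "parse",  List.of("lex"),
            "lex",    List.of(),
            "report", List.of());
        dominators(g, "main").forEach((f, d) -> System.out.println(f + " is dominated by " + d));
    }
}
```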

In the same year, Sahraoui et al. presented an object identification approach based on the Galois lattice, as used in concept analysis. Concept analysis is a branch of lattice theory that can be used to identify similarities among a set of objects based on their common attributes; the objects are then clustered according to these commonalities.
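As a toy illustration of the derivation operators behind this kind of approach (a minimal sketch of the general concept analysis idea, not Sahraoui et al.'s actual algorithm), the code below describes routines by the record fields they access and closes each routine's field set back into a group of routines, which becomes an object/class candidate.

```java
import java.util.*;

public class ConceptGrouping {

    /** intent(R) = fields accessed by every routine in R. */
    static Set<String> intent(Set<String> routines, Map<String, Set<String>> uses) {
        Set<String> common = null;
        for (String r : routines) {
            if (common == null) common = new HashSet<>(uses.get(r));
            else common.retainAll(uses.get(r));
        }
        return common == null ? Set.of() : common;
    }

    /** extent(F) = routines that access every field in F. */
    static Set<String> extent(Set<String> fields, Map<String, Set<String>> uses) {
        Set<String> result = new TreeSet<>();
        uses.forEach((routine, accessed) -> {
            if (accessed.containsAll(fields)) result.add(routine);
        });
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical procedural code: which routines touch which record fields.
        Map<String, Set<String>> uses = new HashMap<>();
        uses.put("openAccount",  Set.of("accountId", "balance"));
        uses.put("deposit",      Set.of("accountId", "balance"));
        uses.put("printInvoice", Set.of("invoiceId", "total"));

        // The "object concept" of each routine: close its field set back into a routine group.
        for (String r : uses.keySet()) {
            Set<String> group = extent(intent(Set.of(r), uses), uses);
            System.out.println(r + " belongs to candidate object " + group);
        }
    }
}
```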

However, I am more inclined to think that Mitchell's approach is one of the best, due to its ability to explore many candidate decompositions at a time. He developed a software clustering tool called Bunch. Bunch produces subsystem decompositions by partitioning a graph of the entities and relations found in the source code, using a hill-climbing algorithm that iterates over partitions until it finds the best one.
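To give a flavour of how such a search works (a simplified sketch in the spirit of Bunch, with a toy quality function standing in for Bunch's real MQ measure), the code below starts from a random partition of a small dependency graph and keeps moving single entities between clusters as long as the score improves.

```java
import java.util.*;

public class HillClimbClustering {

    /** Toy modularization quality: +1 for every intra-cluster edge, -1 for every inter-cluster edge. */
    static int quality(Map<String, Integer> partition, List<String[]> edges) {
        int score = 0;
        for (String[] e : edges)
            score += partition.get(e[0]).equals(partition.get(e[1])) ? 1 : -1;
        return score;
    }

    /** Hill-climbing: repeatedly move one entity to the cluster that most improves the score. */
    static Map<String, Integer> cluster(List<String> entities, List<String[]> edges, int k) {
        Random rnd = new Random(42);                              // fixed seed so the demo is repeatable
        Map<String, Integer> partition = new HashMap<>();
        entities.forEach(e -> partition.put(e, rnd.nextInt(k)));  // random initial decomposition

        boolean improved = true;
        while (improved) {                                        // climb until no single move helps
            improved = false;
            for (String entity : entities) {
                int bestCluster = partition.get(entity);
                int bestScore = quality(partition, edges);
                for (int c = 0; c < k; c++) {
                    partition.put(entity, c);
                    int s = quality(partition, edges);
                    if (s > bestScore) { bestScore = s; bestCluster = c; improved = true; }
                }
                partition.put(entity, bestCluster);               // keep the best placement found
            }
        }
        return partition;
    }

    public static void main(String[] args) {
        List<String> entities = List.of("Parser", "Lexer", "Token", "Report", "Printer");
        List<String[]> edges = List.of(
                new String[]{"Parser", "Lexer"}, new String[]{"Lexer", "Token"},
                new String[]{"Parser", "Token"}, new String[]{"Report", "Printer"});
        System.out.println(cluster(entities, edges, 2));
    }
}
```

Bunch itself is more sophisticated than this: it runs many hill climbs from different random starting partitions and scores decompositions with its MQ measure, which rewards cohesive, loosely coupled clusters.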

Most current software clustering methods are concerned with finding partitions based on the strength of the edges between nodes. Although there are many possible ways to do this, combining different approaches is a good starting point for overcoming the downsides of any particular one.

I have presented the current state of this research in the Software Reuse Seminar course. The slides can be downloaded here.

2 comments:

Fred Durao said...

As a reuser, I would be quite happy if your application could be delivered as an Eclipse plug-in. Will your research fulfill my expectations? If not, sleep on it!

Cassio Melo said...

Fred, sure, the first version is running as an Eclipse plugin; however, for batch processing of a large amount of code I don't think that makes much sense. An Eclipse plugin belongs to the "under development" context, while our tool is designed to act on "legacy" code scattered across repositories. Thanks for the comments.