Saturday, September 29, 2007

Using data mining to improve search engines

I'll present some ideas related to the use of data mining to improve search engines. The first question is: where is the data from which the knowledge will be extracted? The focus of this discussion is on using historical data, such as log files, as a record of real search engine usage. On the other hand, we need to select the techniques that will extract the knowledge hidden in the raw data. The literature offers several techniques, such as classification, clustering, sequence analysis, and association rules, among others [see 1].
The direction selected in this discussion is using association rules [see 2] to analyze the relations among the data stored in the log file. These relations are used to aid users through suggestions, such as queries or download options.
The paper selected for the RiSE's Discussion was “Using Association Rules to Discover Search Engines Related Queries”, which shows the use of association rules to extract related queries from a log generated by a web site.
The first question was related to the transformation of log files into so-called “user sessions”: why do it? It is important because the algorithms that extract association rules require the records to be grouped into a set of transactions. However, when log files are used, these groups are not perfectly separated. The classic application of association rules is Market Basket Analysis [1], which deals with the organization of products in a supermarket. There, the transactions are defined by the consumers' receipts, on which the products bought are perfectly described and separated from those of other consumers. In a log file, however, the records are not sorted, and it is necessary to separate these lines into a set of transactions. Each transaction will contain several log lines that represent one user's activity. In the paper, the IP address and a time window were used; my work uses the session id to identify the users during a time window.
The quality of the recommendations was also discussed. This quality is measured using metrics such as support and confidence; however, the thresholds for these metrics are specific to each situation.
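To make these two steps concrete, here is a minimal sketch of grouping log lines into transactions and computing support and confidence for a candidate rule. This is not the paper's actual algorithm: the log fields, the 30-minute window, and the queries are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical log records: (session_id, timestamp in minutes, query).
log = [
    ("s1", 0,  "java tutorial"),
    ("s1", 5,  "jvm internals"),
    ("s2", 1,  "java tutorial"),
    ("s2", 9,  "jvm internals"),
    ("s2", 12, "garbage collection"),
    ("s3", 3,  "java tutorial"),
]

WINDOW = 30  # minutes; queries farther apart start a new transaction

# 1. Group log lines into transactions (one set of queries per session window).
transactions = defaultdict(set)
last_seen = {}
part = {}
for sid, ts, query in sorted(log):
    if sid in last_seen and ts - last_seen[sid] > WINDOW:
        part[sid] = part.get(sid, 0) + 1  # new time window -> new transaction
    last_seen[sid] = ts
    transactions[(sid, part.get(sid, 0))].add(query)

# 2. Support and confidence of a rule "a -> b" over those transactions.
def support(itemset):
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def confidence(a, b):
    return support({a, b}) / support({a})

print(confidence("java tutorial", "jvm internals"))  # 2 of 3 transactions -> ~0.67
```

A rule such as "java tutorial -> jvm internals" would then be suggested to users only when both support and confidence exceed thresholds tuned for the situation, as discussed above.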
This approach is commonly used in the web paradigm, but the idea can also improve component search engines, with recommendations of downloads, queries, and any other information that is stored in log files.
I have some critiques of this paper, such as the lack of detail about the data mining process: several algorithms could be used, so which one was applied? Another question is related to the experiment; I think that choosing a specific domain from which to extract the rules would make it possible to validate the suggestions with a specialist in that domain.

Wednesday, September 26, 2007

Open Call for M.Sc. and Ph.D. students in RiSE

The Reuse in Software Engineering (RiSE) research area is looking for new M.Sc. and Ph.D. candidates interested in software reuse. If you are a good student and are interested, contact us. The graduate program is hosted at the Federal University of Pernambuco, which is among the top five universities in Latin America. See the registration site.

Monday, September 24, 2007

What views are necessary to represent a SOA?

Service-Oriented Architecture (SOA) is a system architecture in which a collection of loosely coupled services communicate with each other using standard interfaces and message-exchange protocols. As an emerging technology in software development, SOA presents a new paradigm, and some authors affirm that it affects the entire software development cycle, including analysis, specification, design, implementation, verification, validation, maintenance, and evolution [see 1, 2 and 3].

In this context, we discussed the paper "SOA Views: A Coherent View Model of the SOA in the Enterprise", published at the IEEE International Conference on Services Computing in 2006. The authors, Ibrahim and Misic, proposed a set of nine views to represent an SOA-based software architecture: Business view, Interface view, Discovery view, Transformation view, Invocation view, Component view, Data view, Infrastructure view, and Test view.

In our discussion, the first question was: do current approaches, such as the RUP 4+1 View Model and SEI's ADD method, address the particularities of SOA design?

We agree with some of the views and consider them interesting within the SOA approach, such as the Interface view and the Discovery view. The first describes the service contract, and the second provides the information necessary to discover, bind, and invoke the service.

Additionally, I agree with the paper about having several views for SOA, because they can guide architects in constructing a solution that addresses the particularities of SOA and the quality attributes of this kind of enterprise system.

Finally, I think this paper misses the relation among the stakeholders and the quality attributes that each view can address. Besides, the paper does not show how each view can be represented. It is important for architects to have models that help them design the solution for each view. One example is using a UML sequence diagram for the Discovery view, showing how the consumer can find the services in the service registry.

Wednesday, September 19, 2007

No Evolution on SE?

Two weeks ago I participated in the EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), held on August 27-31 in Lübeck, Germany. I have participated in this conference since 2005 (it was held in Porto, Portugal, in 2005 and in Dubrovnik, Croatia, in 2006).

This conference attracts a very interesting audience from software companies such as Philips, Nokia, Sony Ericsson, and HP, among others, and from recognized institutes like the Fraunhofer Institute, Finland Research, and C.E.S.A.R., among others. In this way, interesting discussions and partnerships (with industry and academia) usually take place.

I presented two papers there: (1) a paper about a software component maturity model, in which I described the component quality model and the evaluation techniques proposed by our group in order to achieve a quality degree for software components; and (2) a paper about an experimental study on domain engineering, an interesting work accomplished by our group together with the university in order to evaluate a domain engineering process in a post-graduate course. Some researchers who watched those presentations believe that component certification is the future of software components and liked the work we have been developing, because this area is sometimes vague. The researchers also liked the experimental study report and commented that this is an interesting area that should grow in order to increase the number of proven and validated works (in academia or industry) in software engineering. The experimental software engineering area has received special attention from the software engineering community in recent years, due to the lack of such works and the difficulty of evaluating software engineering research.

A very interesting keynote speech was given by Ralf Reussner, who started his presentation with the question in the title of this post (No Evolution on SE?). He noted that since the NATO Conference (the first software engineering conference) we have seen the same questions and problems at software engineering conferences around the world: software project management problems, requirements changes, software project risks and their mitigation, and software reuse aspects, among others. Thus, the same problems continue to be presented and discussed to this day.
Additionally, an interesting point raised by Ralf Reussner is why other areas have no books like "Heart Transplantation in 21 Days" or "Nuclear Weapons for Dummies". In other words, our science/engineering is not regarded like the other sciences and engineering disciplines. Perhaps this is the reason why we have been discussing the same problems and questions about software engineering since 1968. And the question remains: "No evolution on SE?"

Tuesday, September 18, 2007

The bad side of Bug Repositories

In approximately the last eight years, bug repositories, especially in open source software, have gained much more attention from researchers, considerably increasing the literature about them. These repositories are being analyzed from an information retrieval perspective for software engineering (see 1 and 2), in an attempt to improve and automate some processes related to them. Bug repositories are systems that collect bugs found by users and developers during software usage.

As some people have noticed, the majority of open source software projects, and proprietary ones too, have organized their development processes around a bug repository system. This means that bug resolution, new features, and even improvements in the process are being dictated by bug reports. Here, by "bug" we mean software defects, change requests, feature requests, and issues in general.

The task of analyzing reported bugs is called bug tracking or bug triage, where the word "bug" could reasonably be replaced by issue, ticket, change request, defect, or problem, among many others. More interestingly, bug triage tasks are done, in general, by developers, and precious time is spent on them. Among the many sub-tasks in bug triage, we can cite: analyzing whether a bug is valid; trying to reproduce it; dependency checking, that is, verifying whether other bugs block this bug and vice-versa; verifying whether a similar bug has already been reported (duplicate detection); and assigning a reported bug to a developer.

Many other sub-tasks could be identified; however, in an attempt to show the problem that bug triage can be for final software quality, we'll concentrate our efforts on the duplicate bug detection task, which is currently done manually, as many others are.

In a paper by Gail Murphy, entitled "Coping with an open bug repository", we can see that almost 50% of the bugs reported during the development and improvement phases are invalid. That is, they are bugs that could not be reproduced (including the well-known "works for me" bugs), bugs that won't be resolved, duplicated bugs, bugs with low priority, and so on. And about 20% of the reports are duplicated bugs, that is, bugs that were reported earlier.

Putting it in numbers, let's suppose that a project receives about 120 bug reports per day (in some projects this average is much higher), and that a developer spends about 5 minutes analyzing one bug. Doing simple arithmetic, we see that 10 hours per day, or 10 person-hours, are spent only on this task (bug triage), and about 5 of those hours are wasted on bugs that do not improve the software's quality. For duplicated bugs alone, 2 hours are wasted. Now calculate it for a month, or for a year! In other words, automated invalid bug detection, in particular duplicate detection, is a field that deserves continued exploration, and many techniques have been tested. A good technique can save these wasted hours and put them into a healthier task.
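The arithmetic above can be written down explicitly; the 22-working-day month at the end is my own assumption:

```python
# Back-of-the-envelope triage cost, using the figures above.
reports_per_day = 120
minutes_per_report = 5
invalid_fraction = 0.50    # Murphy: ~50% of reports are invalid
duplicate_fraction = 0.20  # duplicates, taken here as 20% of all reports

triage_hours = reports_per_day * minutes_per_report / 60
invalid_hours = triage_hours * invalid_fraction
duplicate_hours = triage_hours * duplicate_fraction

print(triage_hours, invalid_hours, duplicate_hours)  # 10.0 5.0 2.0
print(duplicate_hours * 22)  # 44.0 hours per month (22 working days, my assumption)
```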

Another thing worth mentioning is that if a software product line approach is used, the problem of duplicated bug reports can increase significantly. Since the products share a common platform, many components are reused. That is, as the same component is used in many products, the probability of the same bug being reported by different people is higher. Moreover, the right component must be correctly identified in order to solve the bug; otherwise the problem will keep occurring across the product line.

One might not see it at first glance, but the analysis of bug repositories, especially the detection of duplicated bugs, has much to do with software reuse. Software reuse tries to reduce costs, make the software development process faster, increase software quality, and deliver other benefits. Improvements in the bug triage process aim to do exactly this!

Bug repositories come as a new challenge for the emerging Data Mining for Software Engineering field. Many techniques from intelligent information retrieval, data mining, machine learning, and even data clustering could be applied to solve these problems. Current research results have achieved only about 40% effectiveness, at most, in trying to automate these tasks, which characterizes a semi-automated solution.
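As one example of the information retrieval techniques mentioned, a duplicate-detection helper can rank existing reports by textual similarity to a new one. The sketch below uses TF-IDF weighting with cosine similarity; the bug summaries are invented, and a real system would also need stemming (note that "crash" and "crashes" do not match here) and a tuned cut-off before flagging anything as a duplicate.

```python
import math
from collections import Counter

# Toy repository of bug summaries; a new report is compared against all of
# them, and the top-ranked report is shown to the triager as a candidate.
repository = {
    101: "application crashes when saving a large file",
    102: "wrong date format in the report header",
    103: "crash on save with files over 2gb",
}
new_report = "program crashes while saving a big file"

def tf_idf_vectors(docs):
    """One sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

ids = list(repository)
vecs = tf_idf_vectors([repository[i] for i in ids] + [new_report])
query = vecs[-1]
scores = {i: cosine(query, vecs[k]) for k, i in enumerate(ids)}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # 101 0.25: bug 101 is the closest candidate
```

Note that bug 103 describes the same defect but scores zero without stemming, which illustrates why the reported effectiveness of such automated approaches is still modest.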

Post by Yguaratã C. Cavalcanti, M.Sc. candidate at CIn-UFPE and RiSE member.

Monday, September 17, 2007

RiSE members visit Virginia Tech in Falls Church

On Friday, September 14th, 2007, Daniel Lucrédio and I (Liana Barachisio) visited the Virginia Tech building in Falls Church, VA, for a meeting with professors Bill Frakes and Gregory Kulczycki. They briefly discussed their current research on formal methods applied to reengineering, domain engineering (the DARE process), testing, COTS, object-oriented metrics, and code generation.

We also presented RiSE's works, such as the Reuse Maturity Model, the Model-Driven Reuse approach, component certification and testing, and the RiSE tools: B.A.R.T., CORE, ToolDAy and LIFT. They were particularly interested in LIFT, a tool that retrieves information from legacy systems to aid system documentation, because of its results in a real project, and also because they are currently working on reengineering themselves.

Frakes was also interested in B.A.R.T.'s query reformulation work. Regarding ToolDAy, even though the adopted process is different from DARE's, he liked seeing that the tool is well developed and assembled, and said that DARE could use some improvement in this respect.

Frakes also gave us a more detailed presentation about the DARE environment, as well as the main concepts and current trends in software reuse, and we were pleased to see that RiSE has relevant works in most of them.

Besides getting to know each other's work, another goal of this meeting was to find options for possible cooperation between RiSE and their research group at Virginia Tech. One suggestion is to pursue co-funded projects; another option is to send Ph.D. and M.Sc. students to Virginia Tech to exchange ideas and experience, and vice-versa; we also discussed the possibility of joint development and tool integration. Since one of RiSE's goals is to develop practical reuse tools, we could benefit from the experience of both groups to deliver good solutions to the industry.

The meeting ended with many possibilities, and the next step is to start defining concrete options and suggestions to make this collaboration happen.

Software Reuse Knowledge Base

Often, students ask me how to keep a record of the papers they have to read. I remember that when I started my reuse studies for my Ph.D., my advisor, professor Silvio Meira, convinced me to write an abstract of each paper from my own point of view. I do not have all the papers in this base; my Ph.D. thesis had about 217 references, and I published about 66 papers on the theme. For students, this procedure is useful for learning to write, especially in English (my first abstract was horrible), and for keeping a history of the papers read during the M.Sc. or Ph.D. Moreover, it is useful when writing papers and the dissertation/thesis.

C.E.S.A.R and Avaya Labs started cooperation

The Recife Center for Advanced Studies and Systems (C.E.S.A.R) and Avaya Labs (U.S.) signed an agreement for cooperation over the next three years involving efforts in the software reuse area. The agreement, defined by the Reuse in Software Engineering (RiSE) group (the reuse group from C.E.S.A.R) and Avaya's research director, Dr. David Weiss, started with a project in the software reuse tools area.

In this project, Liana Barachisio, a software engineer and software reuse researcher at C.E.S.A.R, moved to Avaya Labs for five weeks to work with Dr. Weiss's team. Together, C.E.S.A.R and Avaya are identifying requirements for a software product line automation tool based on Avaya's process.

The idea is to participate in the development of an artifact in the software product line as a way to understand the process guidelines. Thus, better know-how can be brought to C.E.S.A.R, whose software product line area is just starting. After that, a comparison can be made between Avaya's consolidated software product line process and the one being applied at C.E.S.A.R, with the goal of identifying possible improvements on both sides.

Wednesday, September 12, 2007

Investments in reusable software

Today, we had another interesting discussion in the RiSE group, involving the work of David Rine and Robert Sonnemann entitled "Investments in reusable software. A study of software reuse investment success factors", published in the Journal of Systems and Software, v. 41, pages 17-32, 1998. Rine and Sonnemann support the theory that a set of success factors common among organizations exists and has some predictive relationship with software reuse. Their research also investigated whether reuse really influences software productivity and quality.

The success factors were grouped into the following categories: administration (management) commitment; investment strategy; business strategy; technology transfer; organizational structure; process maturity; product line approach; software architecture; component availability; and component quality. To measure experience in software reuse (reuse capability), productivity, quality, and the set of success factors, Rine and Sonnemann developed a questionnaire.

During the discussion, some topics and positions adopted by Rine and Sonnemann were questioned, such as: (1) Why specify five levels for the reuse capability model? Is the level-based approach the best choice? Why not use scenarios, for example, to suggest and help organizations identify their position in reuse practices? (2) The model to calculate the overall probability of reuse success is subjective. (3) The success factors are not a big surprise for us, but is that because we are reuse practitioners and researchers, or because they are the obvious choice? (4) The focus on productivity and quality is a better way to target organizations and to advocate in favor of reuse practices. (5) To get the support of management and of industry, we need ways to show the benefits, and the best way is to measure the activities, hence the use of metrics to assess software reuse capability. (6) More studies and details are needed to explain the reuse capability model, especially the process of assessing an organization's reuse capability and the process of implementing reuse in that organization according to the model.

So, I think this work is a very good contribution to the reuse adoption area. We know that reuse adoption is a great advantage for organizations, and this is the main reason why they "hide" some of the related activities, tasks, and information. I believe that our reuse adoption model can evolve in our environments (such as C.E.S.A.R and PITANG) until it reaches the maturity to be shared with other organizations.

Tuesday, September 11, 2007

Software Product Lines in action

This week, the main conference on software product lines around the world, the 11th International Software Product Line Conference (SPLC), started in Kyoto, Japan. At this conference, attendees will find several tutorials on themes involving product line adoption, domain-specific languages, reusable tests, generative programming, variability, etc. Moreover, the conference is the right place to meet the main names in the area from industry and academia. Regarding industry, I believe that SPLC is nowadays one of the conferences with the most industry participants. In this direction, an important event there is the Product Line Hall of Fame, organized by our partner David Weiss, from Avaya. Additionally, the state of the art is also discussed, with key papers in the area. If you missed it, next year the conference will be held in Limerick, Ireland.

RiSE publishes a survey about software reuse in the Brazilian industry scenario

The paper entitled "Software Reuse: The Brazilian Industry Scenario", authored by Daniel Lucrédio, Kellyton Brito, Alexandre Alvaro, Vinicius Garcia, Eduardo Almeida, Renata Fortes and Silvio Meira, will be published in the Journal of Systems and Software, one of the world's most important vehicles in the software engineering area. The study analyzed 57 small, medium, and large companies in the country, with the objective of identifying the decisive factors for adopting a software reuse program. It aimed at answering the main doubts and concerns of companies seeking to promote software reuse.

Similar studies have already been conducted in other countries, including surveys by Bill Frakes, from Virginia Tech; Maurizio Morisio, from Politecnico di Torino; and David Rine, from George Mason University. Now, with this survey from the RiSE group being published, the Brazilian scenario begins to figure as an important part of the reuse literature, serving as a basis for other reuse researchers and practitioners.

Tuesday, September 4, 2007

RiSS 2007 - RiSE Summer School on Software Reuse

Software reuse: it is surely a hot topic in software development, even though it started back in 1968. Last week, we announced the ICSR conference. Now, we would like to announce the First RiSE Summer School on Software Reuse (RiSS), organized by the Recife Center for Advanced Studies and Systems (C.E.S.A.R) through its Reuse in Software Engineering (RiSE) group. This will be the first event of its kind in the software reuse area anywhere in the world. There, you will have the opportunity to learn from and discuss with the main researchers and practitioners in software reuse. The speakers include Bill Frakes, Charles Krueger, Dirk Muthig, Ivica Crnkovic, Rubén Prieto-Díaz and Wayne Lim. The school covers reuse topics such as domain engineering, software product lines, component-based development (CBD), organizational and economic aspects, component libraries, and software reuse tools. However, those interested should be quick, because attendance is limited to 100 people.

What should Model-Driven Reuse look like?

Eclipse's GMF project reached version 2.0 in June 2007. After more than two years of development and presentations at ECOOP 2006 and OOPSLA 2006, among others, the project has reached the stability necessary to start being used in industry. Having developed a modeling tool myself, I was really impressed with the level of detail that is possible to achieve with GMF. Of course, I was also amazed by the fact that work that took approximately six months of my M.Sc. can now be done in 15 minutes. When combined with a code generation framework, such as JET, the possibilities are practically endless. Without much training, most developers can start creating their own modelers and generating Java, C#, VB, JavaScript, XML, and other kinds of source code.

This is, from my point of view, the major achievement of these two particular projects. Code generation and domain-specific modeling are no longer technologies restricted to extremely highly skilled (and expensive) employees, researchers, or companies. Maintaining modelers and generators no longer requires months of planning and implementation; it can be done directly by the developers.

This is where the problem begins. The most obvious (at least to me) application of this technology is to improve software reuse using product family/domain engineering ideas. So what is the best way to combine software reuse technology (components, repositories, design patterns, product lines, domain engineering, certification, ...) with model-driven development technology (platform-independent models, platform-specific models, model-to-text transformations, model-to-model transformations, ...)?

The best starting point for answering this question is the MODELWARE initiative. Several research groups and companies are gathered around different areas and have already delivered interesting reports, including an MDD Maturity Model and an MDD Process Framework. However, these not only fail to include specific reuse concerns, but are also better suited to European companies, which already have MDD in their knowledge base.

Thinking about introducing these technologies from scratch, we in the RiSE group are developing a model-driven reuse approach, including the needed activities and guidelines. The initial focus is on engineering-related activities, mainly in the implementation phase, with code generation and platform-specific modeling. The following figure shows a preliminary draft with three basic cycles.

The basic cycle is domain engineering, represented here by the RiDE (RiSE process for Domain Engineering) approach. Based on the results of the domain design phase, the modeler engineering cycle begins. This is where a domain-specific modeler is developed, based on the elements of the domain's architecture, such as variability points and architectural patterns.

During domain implementation, components are developed and the transformation engineering cycle starts. This cycle is responsible for developing transformations to be used together with the domain-specific modeler. A design-by-example approach is used.

The result of these cycles includes not only source code components, but also transformations that can be used to generate parts of the final product. For example, some specific components may be handcrafted, while controller components and basic infrastructure code can be generated. One practical example is the web domain: specific components for building dynamic web pages, such as a dynamic list or a date picker, may be handcrafted, while navigation code, such as Struts' descriptor file, can be generated from a web navigation modeler.
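A toy model-to-text transformation in this spirit is sketched below: a tiny web navigation model is turned into a Struts-like XML descriptor. The model format, the action names, and the generated file are illustrative assumptions, not the output of the actual RiSE tooling (which targets JET/GMF rather than plain scripts).

```python
# Hypothetical web navigation model:
# (action name, handler class, {outcome: target page})
navigation_model = [
    ("login",  "app.LoginAction",  {"success": "/home.jsp", "failure": "/login.jsp"}),
    ("search", "app.SearchAction", {"success": "/results.jsp"}),
]

def generate_struts_config(model):
    """Emit a Struts-style action-mapping descriptor from the model."""
    lines = ["<struts-config>", "  <action-mappings>"]
    for name, handler, outcomes in model:
        lines.append(f'    <action path="/{name}" type="{handler}">')
        for outcome, page in outcomes.items():
            lines.append(f'      <forward name="{outcome}" path="{page}"/>')
        lines.append("    </action>")
    lines += ["  </action-mappings>", "</struts-config>"]
    return "\n".join(lines)

print(generate_struts_config(navigation_model))
```

The point is not the template itself but the division of labor: the navigation model is edited in a domain-specific modeler, while this kind of transformation regenerates the infrastructure code on every change, leaving only the handcrafted components to be maintained manually.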

According to MODELWARE's MDD Maturity Model, the next step from the engineering perspective is to incorporate MDD further up, in the analysis and design phases, allowing the domain engineer to benefit from model-to-model transformations that generate parts of the design or automatically apply design patterns, performing a kind of model refactoring.

However, the terrain here is a little more obscure than in implementation. The problem, I think, is not the lack of tools, because model-to-model transformation engines based on Eclipse and EMF are available, such as ATL, and have already been tested and proven practical. For me, the problem is that the kind of work performed during analysis and design is much more conceptual, and therefore more likely to be performed erroneously by non-human workers, such as a computer-based transformer.

Therefore, except for some basic helper refactoring-like transformations, I think the use of MDD on these higher-level models will still have to wait some years before reaching the levels of automation that we can now use in implementation.