Saturday, September 29, 2007
Using data mining to improve search engines
I'll present some ideas related to the use of data mining to improve search engines. The first question is: where is the data from which the knowledge will be extracted? The focus of this discussion is the use of historical data, such as log files, as a record of the real use of a search engine. On the other hand, we need to select the techniques that will extract the knowledge hidden in the raw data. In the literature, there are several techniques, such as classification, clustering, sequence analysis, and association rules, among others [see 1].
The direction taken in this discussion is to use association rules [see 2] to analyze the relations between the data stored in the log file. These relations are used to aid users through suggestions, such as related queries or download options.
The paper selected for the RiSE discussion was "Using Association Rules to Discover Search Engines Related Queries", which shows the use of association rules to extract related queries from a log generated by a web site.
The first question was related to the transformation of log files into so-called "user sessions": why do it? It is important because the algorithms used to extract association rules require the records to be grouped into a set of transactions. However, when log files are used, these groups are not clearly separated. The classic application of association rules is Market Basket Analysis [1], which deals with the organization of products in a supermarket. In that case, the sessions are defined by the customer's receipt, in which the products bought are perfectly described and separated from those of other customers. In a log file, however, the records are not sorted, and it is necessary to split these lines into a set of transactions. Each transaction will contain several log lines that represent one user's activity. In the paper, the IP address and a time window were used; my work uses a session id to identify the users within a time window.
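As a rough illustration of this step (a sketch under assumptions, not the paper's implementation; the record layout, session id, and 30-minute window are hypothetical), the log lines could be grouped into transactions like this:

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical log record: (session_id, timestamp, query).
# One transaction per session id, split whenever the gap between
# consecutive records exceeds the time window.
def build_transactions(records, window=timedelta(minutes=30)):
    by_session = defaultdict(list)
    for session_id, timestamp, query in sorted(records):
        by_session[session_id].append((timestamp, query))

    transactions = []
    for events in by_session.values():
        current = [events[0][1]]
        for (prev_ts, _), (ts, query) in zip(events, events[1:]):
            if ts - prev_ts > window:
                transactions.append(set(current))
                current = []
            current.append(query)
        transactions.append(set(current))
    return transactions
```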
The quality of the recommendations was also discussed. This quality is measured using metrics such as support and confidence; however, the thresholds for these metrics are specific to each situation.
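As a minimal sketch of these two metrics (illustrative only; the query terms and thresholds are invented and this is not the paper's evaluation), support and confidence for a rule over the transactions built above could be computed as:

```python
def support(transactions, itemset):
    # Fraction of the transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # How often the consequent appears among transactions containing the antecedent.
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Example rule "component" -> "reuse": suggest "reuse" to users who searched
# "component" only if both metrics pass thresholds chosen for the situation.
# ok = support(transactions, {"component", "reuse"}) >= 0.01 and \
#      confidence(transactions, {"component"}, {"reuse"}) >= 0.5
```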
This approach is commonly used in the web context, but the same idea can be applied to improve component search engines, providing recommendations of downloads, queries, and any other information stored in log files.
I have some criticisms of this paper, such as the lack of detail about the data mining process: several algorithms could be used, so which one was used? Another question is related to the experiment; I think that choosing a specific domain from which to extract the rules could help validate the suggestions with a specialist in that domain.
Wednesday, September 26, 2007
Open Call for M.Sc. and Ph.D. students in RiSE
Monday, September 24, 2007
What views are necessary to represent a SOA?
Service-Oriented Architecture (SOA) is a system architecture in which a collection of loosely coupled services communicate with each other using standard interfaces and message-exchanging protocols. As an emerging technology in software development, SOA presents a new paradigm, and some authors affirm that it affects the entire software development cycle, including analysis, specification, design, implementation, verification, validation, maintenance, and evolution [see 1, 2 and 3].
In this context, we discussed the paper "SOA Views: A Coherent View Model of the SOA in the Enterprise", published at the IEEE International Conference on Services Computing in 2006. The authors, Ibrahim and Misic, proposed a set of nine views to represent an SOA-based software architecture: Business view, Interface view, Discovery view, Transformation view, Invocation view, Component view, Data view, Infrastructure view, and Test view.
In our discussion, the first question was: do current approaches, such as the RUP 4+1 View Model and the ADD method from SEI, address the particularities of SOA design?
We agreed with some of the views and considered them interesting within the SOA approach, such as the Interface view and the Discovery view. The first describes the service contract, and the second provides the information necessary to discover, bind, and invoke a service.
Additionally, I agree with the paper about having several views for SOA, because they can guide architects in constructing a solution that captures the particularities of SOA and addresses the quality attributes of this kind of enterprise system.
Finally, I think the paper misses the relation between the stakeholders and the quality attributes that each view can address. Besides, the paper does not show how each view can be represented. For architects, it is important to have models that help them design the solution for each view. One example is using a UML sequence diagram for the Discovery view, showing how a consumer can find services in the service registry.
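To make the Discovery view concrete, here is a minimal sketch of the publish/discover/bind/invoke interaction (purely illustrative; the registry class and service names are hypothetical and do not come from the paper):

```python
from typing import Callable, Dict

class ServiceRegistry:
    """Hypothetical in-memory stand-in for a UDDI-like service registry."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[..., object]] = {}

    def publish(self, contract: str, endpoint: Callable[..., object]) -> None:
        # Provider side: register an endpoint under its contract name.
        self._services[contract] = endpoint

    def discover(self, contract: str) -> Callable[..., object]:
        # Consumer side: find an endpoint that satisfies the contract.
        return self._services[contract]

# Discovery view in action: publish, discover, bind, invoke.
registry = ServiceRegistry()
registry.publish("CreditCheck", lambda customer_id: {"customer": customer_id, "approved": True})

credit_check = registry.discover("CreditCheck")  # discover + bind
print(credit_check("C-42"))                      # invoke
```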
Wednesday, September 19, 2007
No Evolution on SE?
This conference has a very interesting audience, from software companies such as Philips, Nokia, Sony/Ericsson, and HP, among others, and from recognized institutes such as the Fraunhofer Institute, Finland Research, and C.E.S.A.R. Because of this, interesting discussions and partnerships (with industry and academia) usually take place.
I presented two papers there: (1) a paper about a software component maturity model, in which I described the component quality model and the evaluation techniques proposed by our group in order to achieve a degree of quality in software components; (2) a paper about an experimental study on domain engineering, an interesting work carried out by our group together with the university in order to evaluate a domain engineering process in a post-graduate course. Some researchers who watched those presentations believe that component certification is the future of software components and liked the work we have been developing, since this area is still somewhat vague. On the other hand, the researchers liked the experimental study report and commented that this is an interesting area that should be strengthened in order to increase the number of proven and validated works (in academia or industry) in software engineering. The experimental software engineering area has received special attention from the software engineering community in recent years due to the lack of such works and the difficulty of evaluating software engineering research.
Tuesday, September 18, 2007
The bad side of Bug Repositories
As some people have noticed, the majority of open source projects, and proprietary ones too, have organized their development processes around a bug repository system. This means that bug resolution, new features, and even improvements to the process are being dictated by bug reports. Here, by bug we mean a software defect, a change request, a feature request, or an issue in general.
The task of analyzing reported bugs is called bug tracking or bug triage, where the word "bug" could reasonably be replaced by issue, ticket, change request, defect, problem, among many others. More interesting is to know that bug triage tasks are, in general, done by developers, and precious time is spent on them. Among the many sub-tasks of bug triage, we can cite: analyzing whether a bug is valid; trying to reproduce it; dependency checking, that is, verifying whether other bugs block this bug and vice-versa; verifying whether a similar bug has already been reported (duplicate detection); and assigning a reported bug to a developer.
Many other sub-tasks can be identified; however, to show the impact that bug triage can have on the final quality of the software, we will concentrate on the duplicate detection task, which, like many others, is currently done manually.
In a paper by Gail Murphy, entitled "Coping with an open bug repository", we can see that almost 50% of the bugs reported during the development and improvement phases are invalid: bugs that could not be reproduced (including the well-known "works for me" bugs), bugs that won't be fixed, duplicated bugs, bugs with low priority, and so on. Furthermore, about 20% of these invalid bugs are just duplicates, that is, bugs that had already been reported earlier.
Putting it in numbers, let's suppose that a project receives about 120 bug reports per day (in some projects this average is much higher) and that a developer spends about 5 minutes to analyze each one. Doing simple arithmetic, we see that 10 hours per day, or 10 person-hours, are spent on bug triage alone, and about 5 of those hours go to bugs that do not improve the software's quality; duplicated bugs alone account for roughly one of those hours every day. Now calculate it for a month, or for a year! In other words, the automated detection of invalid bugs, and especially of duplicated bugs, is a field that deserves to keep being explored; many techniques have been tested, and a good one can recover these wasted hours and put them into healthier tasks.
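A minimal sketch of this back-of-the-envelope calculation, using the figures assumed above:

```python
# Back-of-the-envelope triage cost, using the figures assumed in the text.
reports_per_day = 120
minutes_per_report = 5
invalid_fraction = 0.50      # reports that will not improve the product
duplicate_fraction = 0.20    # share of the invalid reports that are duplicates

total_hours = reports_per_day * minutes_per_report / 60
invalid_hours = total_hours * invalid_fraction
duplicate_hours = invalid_hours * duplicate_fraction

print(f"total triage effort:   {total_hours:.1f} person-hours/day")     # 10.0
print(f"spent on invalid bugs: {invalid_hours:.1f} person-hours/day")   # 5.0
print(f"spent on duplicates:   {duplicate_hours:.1f} person-hours/day") # 1.0
```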
Another thing worth mentioning is that if a software product line approach is used, the problem of duplicated bug reports can increase significantly. Since the products share a common platform, many components are reused; as the same component is used in many products, the probability of the same bug being reported by different people is higher. Moreover, the right component must be correctly identified in order to solve the bug; otherwise the problem will keep occurring across the product line.
One might not see it at first glance, but the analysis of bug repositories, especially the detection of duplicated bugs, has much to do with software reuse. Software reuse tries to reduce costs, make the software development process faster, and increase software quality, among other benefits. Improvements in the bug triage process aim to do exactly the same!
Bug repositories arrived as a new challenge for the emerging Data Mining for Software Engineering field. Many techniques from intelligent information retrieval, data mining, machine learning, and even data clustering could be applied to solve these problems. Current research results have achieved at most around 40% effectiveness in trying to automate these tasks, which characterizes a semi-automated solution.
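As a rough illustration of the information retrieval side (a sketch only, not any of the published approaches; it assumes scikit-learn is available and the report texts are invented), a new report could be ranked against the existing ones by TF-IDF cosine similarity, with the top matches offered to a human triager as duplicate candidates:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical summaries of already-filed reports.
existing_reports = [
    "crash when opening a large project",
    "toolbar icons disappear after restart",
    "application crashes opening big projects",
]
new_report = "editor crashes while opening a very large project"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(existing_reports + [new_report])

# Similarity of the new report against every existing one.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
candidates = sorted(zip(scores, existing_reports), reverse=True)

for score, summary in candidates:
    print(f"{score:.2f}  {summary}")  # top entries are duplicate candidates
```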
Monday, September 17, 2007
RiSE members visit Virginia Tech in Falls Church
We also presented RiSE's works, such as the Reuse Maturity Model, the Model-Driven Reuse approach, component certification and testing, and the RiSE tools: B.A.R.T., CORE, ToolDAy and LIFT. They were particularly interested in LIFT, a tool for retrieving information from legacy systems and aiding system documentation, because of its results in a real project and also because they are currently working on reengineering themselves.
Frakes was also interested in B.A.R.T.'s query reformulation work. Regarding ToolDAy, even though the adopted process is different from DARE's, he was pleased to see that the tool is well developed and assembled, and said that DARE could use some improvement in this respect.
Frakes also gave us a more detailed presentation about the DARE environment. He also presented the main concepts and current trends on software reuse, and we were pleased to see that RiSE has relevant works in most of them.
Besides getting to know each other's works, another goal of this meeting was to find options for possible cooperation between RiSE and their research group at Virginia Tech. One of the suggestions is to pursue co-funded projects; another option is to send Ph.D. and M.Sc. students to Virginia Tech to exchange ideas and experience, and vice-versa; we also discussed the possibility of joint development and tool integration. Since one of RiSE's goals is to develop practical tools for reuse, we could benefit from the experience of both groups to deliver good solutions to the industry.
The meeting ended with many possibilities, and the next step is to start defining concrete options and suggestions to make this collaboration happen.
Software Reuse Knowledge Base
C.E.S.A.R and Avaya Labs started cooperation
In this project, Liana Barachisio, software engineer and software reuse researcher at C.E.S.A.R, moved to Avaya Labs for five weeks to work together with Dr. David's team. C.E.S.A.R and Avaya are identifying requirements for a software product line automation tool based on Avaya's process.
The idea is to participate in the development of an artifact of the software product line as a way to understand the process guidelines. This way, better know-how can be brought back to C.E.S.A.R., whose software product line area is just starting. After that, a comparison can be made between Avaya's consolidated software product line process and the one being applied at C.E.S.A.R., with the goal of identifying possible improvements on both sides.
Wednesday, September 12, 2007
Investments in reusable software
The success factors were grouped into the following categories: administration (management) commitment; investment strategy; business strategy; technology transfer; organizational structure; process maturity; product line approach; software architecture; component availability; and component quality. To measure the experience in software reuse (reuse capability), productivity, quality, and the set of success factors, Rine and Sonnemann developed a questionnaire.
During the discussion, some topics and positions adopted by Rine and Sonnemann were questioned, such as: (1) why specify five levels for the reuse capability model? Is the level-based approach the best choice? Why not use scenarios, for example, to suggest and help organizations identify their position in reuse practices? (2) the model to calculate the overall probability of reuse success is subjective; (3) the success factors are not a big surprise for us, but is that because we are reuse practitioners and researchers, or because they are the obvious choice? (4) the focus on productivity and quality is a better way to reach organizations and to advocate in favor of reuse practices; (5) to get the support of management, and of industry, we need ways to show the benefits, and the best way is to measure the activities, hence the use of metrics to measure software reuse capability; (6) more studies and details are needed to explain the reuse capability model, especially the process of assessing the reuse capability of an organization and the process of implementing reuse in that organization according to the model.
In summary, I think this work is a very good contribution to the reuse adoption area. We know that reuse adoption is a great advantage for organizations, and this is the main reason why some activities, tasks, and information about it are kept "hidden". I believe that our reuse adoption model can evolve in our environments (such as C.E.S.A.R. and PITANG) until it reaches enough maturity to be shared with other organizations.
Tuesday, September 11, 2007
Software Product Lines in action
RiSE publishes a survey about software reuse in the Brazilian industry scenario
The paper entitled "Software Reuse: The Brazilian Industry Scenario", authored by Daniel Lucrédio, Kellyton Brito, Alexandre Alvaro, Vinicius Garcia, Eduardo Almeida, Renata Fortes and Silvio Meira, will be published in the Journal of Systems and Software, one of the world's most important vehicles in the Software Engineering area. The study analyzed 57 small, medium and large companies in the country, with the objective of identifying the decisive factors for adopting a software reuse program. The study aimed at answering the main doubts and concerns of the companies seeking to promote software reuse.
Similar studies had already been conducted in other countries, including surveys by Bill Frakes, from Virginia Tech, Maurizio Morisio, from Politecnico di Torino, and David Rine, from George Mason University. Now, with this survey from the RiSE group being published, the Brazilian scenario begins to figure as an important part of the reuse literature, serving as a basis for other reuse researchers and practitioners.
Tuesday, September 4, 2007
RiSS 2007 - RiSE Summer School on Software Reuse
What should Model-Driven Reuse look like?
This is, from my point of view, the major achievement of these two particular projects. Code generation and domain-specific modeling are no longer technologies restricted to extremely highly skilled (and expensive) employees, researchers or companies. Maintaining modelers and generators does not require months of planning and implementation, but can be done directly by the developers.
This is where the problem begins. The most obvious (at least for me) application for this technology is to improve software reuse using product families/domain engineering ideas. Therefore, what is the best way to combine software reuse technology (components, repositories, design patterns, product lines, domain engineering, certification, ...) with model-driven development technology (platform-independent models, platform-specific models, model-to-text transformations, model-to-model transformations, ...) ?
The best starting point for answering this question is the Modelware initiative. Several research groups and companies are gathered around different areas and have already delivered interesting reports, including an MDD Maturity Model and an MDD Process Framework. However, these not only fail to include specific reuse concerns, but are also more suited to European companies, which already have MDD in their knowledge base.
Thinking about introducing these technologies from scratch, we in the RiSE group are developing a model-driven reuse approach, including the needed activities and guidelines. The initial focus is on engineering-related activities, mainly in the implementation phase, with code generation and platform-specific modeling. The following figure shows a preliminary draft with three basic cycles.
During domain implementation, components are developed and the transformation engineering cycle starts. This cycle is responsible for developing transformations to be used together with the domain-specific modeler. A design-by-example approach is used.
The result of these cycles includes not only source code components, but also transformations that can be used to generate parts of the final product. For example, some specific components may be handcrafted, while controller components and basic infrastructure code can be generated. One practical example is the web domain: specific components for building dynamic web pages, such as a dynamic list or a date picker, may be handcrafted, while navigation code, such as Struts's descriptor file, can be generated from a web navigation modeler.
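As a minimal model-to-text sketch of this idea (an illustration only, not our actual generator; the navigation model, action names, and paths are invented), a tiny navigation model could be turned into a Struts-like descriptor fragment:

```python
from xml.sax.saxutils import escape

# Hypothetical web navigation model: page -> list of (action, target page).
navigation_model = {
    "login":   [("authenticate", "home")],
    "home":    [("search", "results"), ("logout", "login")],
    "results": [("back", "home")],
}

def generate_struts_like_descriptor(model):
    # Emits a Struts-flavoured action mapping for each navigation edge.
    lines = ["<action-mappings>"]
    for page, transitions in model.items():
        for action, target in transitions:
            lines.append(
                f'  <action path="/{escape(page)}/{escape(action)}" '
                f'type="app.{action.capitalize()}Action">'
            )
            lines.append(f'    <forward name="success" path="/{escape(target)}.jsp"/>')
            lines.append("  </action>")
    lines.append("</action-mappings>")
    return "\n".join(lines)

print(generate_struts_like_descriptor(navigation_model))
```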
According to Modelware's MDD Maturity Model, the next step from the engineering perspective is to incorporate MDD into the analysis and design phases, allowing the domain engineer to benefit from model-to-model transformations to generate parts of the design or to automatically apply design patterns, performing some kind of model refactoring.
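As a very rough sketch of what such a model-to-model step could look like (a toy example, not ATL and not Modelware's framework; the model format and stereotype names are invented), an "extract interface" refactoring could be applied automatically to every class marked as a service:

```python
from copy import deepcopy

# Toy "class diagram" model: each element is a dict with a name and operations.
source_model = [
    {"name": "OrderService", "operations": ["place", "cancel"], "stereotype": "service"},
    {"name": "OrderEntity",  "operations": ["total"],           "stereotype": "entity"},
]

def extract_interfaces(model):
    """Model-to-model step: for every <<service>> class, generate an interface
    and make the original class realize it (a simple 'program to an interface'
    refactoring applied automatically)."""
    target = deepcopy(model)
    for element in list(target):
        if element.get("stereotype") == "service":
            interface = {
                "name": "I" + element["name"],
                "operations": list(element["operations"]),
                "kind": "interface",
            }
            element["realizes"] = interface["name"]
            target.append(interface)
    return target

for element in extract_interfaces(source_model):
    print(element)
```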
However, the terrain is a little more obscure in these cases than in implementation. The problem here, I think, is not even the lack of tools, because there are model-to-model transformation engines based on Eclipse and EMF available, such as ATL, which have already been tested and proven to be practical. For me, the problem is that the kind of work performed during analysis and design is much more conceptual and, therefore, more likely to be performed erroneously by non-human workers, such as a computer-based transformer.
Therefore, except for some basic helper, refactoring-like transformations, I think that the use of MDD in these higher-level models will still have to wait some years before reaching the same level of automation that we can now achieve in implementation.