Showing posts with label search and retrieval. Show all posts

Tuesday, November 20, 2007

Towards a Query Reformulation Approach for Component Retrieval

This post covers topics from the seminar we had today for the I.N.1.0.3.8 - Advanced Seminars in Software Reuse course at Cin/UFPE. The title is "Towards a Query Reformulation Approach for Component Retrieval". The presentation (see) evaluates several papers on query reformulation, exploring their approaches and strategies.

Software is built more quickly when a reuse process is adopted, but this is not enough if there is no market to absorb the resulting software components. The component market still faces a wide range of difficulties, such as the lack of an efficient search engine.

One of the biggest problems in component search and retrieval is increasing the relevance of the results, since users do not normally formulate the query in the appropriate way. The searcher's view of the problem does not necessarily match what is actually in the component repository. Several approaches try to solve this problem. The seminar focuses on the query reformulation technique, which reduces the conceptual gap between problem and solution by refining the query based on previously formulated, stored queries.
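To make the idea of refining a query from stored queries more concrete, here is a minimal sketch in Python. It is not the method of any of the papers from the seminar: the function names, the use of Jaccard term overlap as the similarity measure, and the `top_n` parameter are all my own assumptions, chosen only for illustration.

```python
def jaccard(a, b):
    """Term overlap between two term collections (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_reformulations(query, stored_queries, top_n=3):
    """Suggest stored queries that share terms with the user's query.

    Hypothetical helper: similarity is plain Jaccard overlap on
    whitespace-separated, lowercased terms (an assumption).
    """
    terms = query.lower().split()
    scored = []
    for stored in stored_queries:
        score = jaccard(terms, stored.lower().split())
        # Skip unrelated queries (score 0) and identical ones (score 1).
        if 0 < score < 1:
            scored.append((score, stored))
    # Highest overlap first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [stored for _, stored in scored[:top_n]]
```

A real engine would probably also weight stored queries by how often they led to a successful retrieval, but that information is not assumed here.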

Code Finder was one of the first attempts to implement component search and retrieval by query reformulation. The paper "Interactive Internet search: keyword, directory and query reformulation mechanisms compared" finds that query reformulation improves the relevance of the retrieved documents but increases search time. The work "Using Ontologies for Database Query Reformulation" performs query reformulation using ontology rules, for query optimization and for data integration. Another very interesting study is "Lexical analysis for modeling web query reformulation", which lexically analyzes searcher behavior through Query Clarity and Part-of-Speech.

My initial proposal is to develop a query reformulation engine for BART, using techniques that will be evaluated, such as ontologies, keyword order, and others. A comparison matrix of approaches will be prepared to compare the several existing techniques and help in choosing the right one.

by Dimitri Malheiros

Tuesday, October 23, 2007

Changing the focus on Search and Retrieval: From Software Assets to Interactive Multimedia Diary for Home

Often in this blog [1, 2], we have discussed an old and important topic in software reuse: the search and retrieval of software assets. Several approaches try to improve on this complex problem, such as folksonomy, facets, ontologies, data mining, context, etc. In RiSE, led at different moments by Vinicius Garcia and Daniel Lucredio, we have discussed several questions about it. However, another point of view in this area is being explored by researchers at the University of Tokyo. In their research [see the full paper], search and retrieval is still the main problem, but the point of view is a little different. Can you imagine a multimedia diary for your home? Yes, that is their focus. Imagine questions [for a system] such as: When did I get up on the first of this morning? Or [this one can be good for some classmates] who left the lights on in the study room last night? Or: Am I working at home during my stay?

Their motivation is that automated capture and retrieval of experiences taking place at home is interesting for several reasons. First, the home offers an environment where a variety of memorable events and experiences take place [imagine your first soup, steps, etc]. Thus, the work on multimedia capture and retrieval focuses on the development of algorithms for person tracking, key frame extraction, media handover, and lighting change detection, and on the design of strategies that help to navigate huge amounts of multimedia data. The studies were conducted at the National Institute of Information and Communications Technology’s Ubiquitous Home in Kyoto, Japan, in an environment simulating a two-bedroom house equipped with 17 cameras and 25 microphones for continuous video and audio acquisition, in conjunction with pressure-based floor sensors. Some challenges are associated with floor sensor data retrieval, audio retrieval, and lighting changes, besides user interaction. In their prototype, the user retrieves video, audio, and key frames through a graphical interface based on some queries. But with the queries, you have to think about the gap between what the user asks and the semantic level. For example, consider queries such as “retrieve video showing the regions of the house where people were at 20:00” and “What was I doing after dinner?”.
The preliminary evaluation with real-life families showed that the research is going well. The algorithm results [retrieval] were 73 percent for footstep segmentation accuracy, 80 percent for frames, and 92 percent for audio.
As you might expect, the researchers said that the main difficulty in the capture is the large amount of disk space it consumes. Moreover, for faster access, the video data is stored as frames and the audio as 1-minute clips, resulting in low compression of the data.
RiSE has this problem as well. However, our focus is on islands of source code and docs. But that is another story.

Saturday, September 29, 2007

Using data mining to improve search engines


I'll present some ideas related to the use of data mining to improve search engines. The first question is: where is the data from which you will extract the knowledge? The focus of this discussion is on using historical data, such as log files, as a record of the real use of a search engine. On the other hand, we need to select the techniques used to extract the knowledge hidden in the raw data. In the literature there are several techniques, such as classification, clustering, sequence analysis, and association rules, among others [see 1].
The direction selected in this discussion is to use association rules [see 2] to analyze the relations between the data stored in the log file. These relations are used to aid users through suggestions such as queries or options to download.
The paper selected for the RiSE's Discussion was “Using Association Rules to Discover Search Engines Related Queries”, which shows the use of association rules to extract related queries from a log generated by a web site.
The first question was about the transformation of log files into so-called “user sessions”: why do it? It is important because the algorithms used to extract association rules need the records to be grouped into a set of transactions. However, when log files are used, these groups are not cleanly separated. The classic setting for association rules is Market Basket Analysis [1], which is associated with the organization of products in a supermarket. In this case, the sessions are defined by the customer's receipt: on the receipt, the products bought are fully described and separated from those of other customers. In a log file, however, the records are not sorted, and it is necessary to separate the lines into a set of transactions. Each transaction will contain several log lines that represent one user's activity. In the paper, the IP address and a time window were used. My work uses the session id to identify users during a time window.
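The grouping of log lines into transactions can be sketched roughly as follows. This is only an illustration under my own assumptions: the record layout `(ip, timestamp_in_seconds, query)`, the function name `sessionize`, the 30-minute default, and the reading of the "time window" as a maximum gap between consecutive records from the same IP are not taken from the paper.

```python
from collections import defaultdict


def sessionize(records, window_seconds=1800):
    """Group (ip, timestamp, query) log records into sessions.

    Assumption: a new session starts for an IP when the gap since its
    previous record is larger than window_seconds.
    """
    by_ip = defaultdict(list)
    for ip, timestamp, query in records:
        by_ip[ip].append((timestamp, query))

    sessions = []
    for ip in sorted(by_ip):
        current = []
        last_timestamp = None
        # Log lines are not sorted, so order each IP's records by time.
        for timestamp, query in sorted(by_ip[ip]):
            if last_timestamp is not None and timestamp - last_timestamp > window_seconds:
                sessions.append(current)
                current = []
            current.append(query)
            last_timestamp = timestamp
        if current:
            sessions.append(current)
    return sessions
```

Keying on a session id instead of the IP address would only change what the records are grouped by; the rest of the sketch stays the same.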
The quality of the recommendations was also discussed. This quality is measured using metrics such as support and confidence. However, the parameter values used are specific to each situation.
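For reference, support and confidence can be computed over a list of sessions with the standard definitions: the support of a set of queries is the fraction of sessions containing all of them, and the confidence of a rule A -> B is support(A and B) divided by support(A). The sketch below assumes each session is simply a list of query strings; the function names are mine, not the paper's.

```python
def support(sessions, items):
    """Fraction of sessions that contain every query in `items`."""
    wanted = set(items)
    if not sessions:
        return 0.0
    hits = sum(1 for session in sessions if wanted <= set(session))
    return hits / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the sessions."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / base
```

A rule would then be kept as a suggestion only when both values pass thresholds chosen for the situation at hand, which is exactly the part that has to be tuned case by case.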
This approach is commonly used on the web, but the idea can also be used to improve component search engines, by recommending downloads, queries, and any other information that is stored in log files.
I have some critiques of this paper, such as the lack of detail on the data mining process: several algorithms can be used, so which one was used? Another question concerns the experiment: I think that choosing a specific domain from which to extract the rules could help validate the suggestions with the aid of a specialist in that domain.