Saturday, September 29, 2007

Using data mining to improve search engines

I'll present some ideas about using data mining to improve search engines. The first question is: where is the data from which you will extract knowledge? This discussion focuses on historical data, such as log files, as a record of real search engine use. We also need to select the techniques used to extract the knowledge hidden in the raw data. The literature describes several techniques, such as classification, clustering, sequence analysis, and association rules, among others [see 1].
The direction selected in this discussion is to use association rules [see 2] to analyze the relations between the data stored in the log file. These relations are used to help users through suggestions, such as related queries or download options.
The paper selected for the RiSE's discussion was “Using Association Rules to Discover Search Engines Related Queries”, which shows the use of association rules to extract related queries from a log generated by a web site.
The first question concerned the transformation of log files into so-called “user sessions”: why do it? It is important because the algorithms used to extract association rules need the records to be grouped into a set of transactions. When log files are used, however, these groups are not cleanly separated. The classic application of association rules is Market Basket Analysis [1], which deals with how products are organized in a supermarket. In that case, the sessions are defined by the customer's receipt: the products bought are clearly listed and separated from those of other customers. In a log file, however, the records are not sorted, so the lines must be split into a set of transactions. Each transaction contains several log lines that represent one user's activity. In the paper, the IP address and a time window were used. My work uses the session ID to identify users during a time window.
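To make the session step concrete, here is a minimal sketch of how log lines might be grouped into transactions by IP address and a time window. The record layout (ip, timestamp in seconds, query), the 30-minute idle threshold, and the function name sessionize are all assumptions for illustration; the paper's exact window rule is not described here.

```python
from collections import defaultdict

def sessionize(records, window_seconds=1800):
    """Group (ip, timestamp, query) records into transactions.

    Assumption: a new transaction starts whenever the same IP has been
    idle for longer than window_seconds.
    """
    by_ip = defaultdict(list)
    for ip, ts, query in records:
        by_ip[ip].append((ts, query))

    transactions = []
    for ip in sorted(by_ip):
        events = sorted(by_ip[ip])  # order each user's lines by time
        current = [events[0][1]]
        last_ts = events[0][0]
        for ts, query in events[1:]:
            if ts - last_ts > window_seconds:
                # Idle gap exceeded: close the current transaction.
                transactions.append(current)
                current = []
            current.append(query)
            last_ts = ts
        transactions.append(current)
    return transactions

# Hypothetical log lines, interleaved between two users.
log = [
    ("10.0.0.1", 0, "xml parser"),
    ("10.0.0.2", 5, "logging"),
    ("10.0.0.1", 60, "sax parser"),
    ("10.0.0.1", 5000, "logging"),
]
sessions = sessionize(log)
```

With a session ID instead of an IP address, the same grouping would apply with the ID as the key.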
The quality of the recommendations was also discussed. This quality is measured using metrics such as support and confidence. However, the parameter values used are specific to each situation.
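As a small illustration of these two metrics, the sketch below computes them with their standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A -> B is support(A and B) divided by support(A). The sample transactions and function names are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical user sessions, each a set of queries.
transactions = [
    {"xml parser", "sax parser"},
    {"xml parser", "dom parser"},
    {"xml parser", "sax parser", "logging"},
    {"logging"},
]
s = support(transactions, {"xml parser", "sax parser"})       # 2 of 4 sessions
c = confidence(transactions, {"xml parser"}, {"sax parser"})  # 2 of the 3 sessions with "xml parser"
```

Minimum thresholds for both values are then chosen per data set, which is why the parameters differ from one situation to another.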
This approach is commonly used on the web, but the idea can also be used to improve component search engines through recommendations of downloads, queries, and any other information stored in log files.
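A rough sketch of how such recommendations could be produced from sessions follows. It mines only rules between pairs of items, which is a simplification and not the algorithm from the paper; the thresholds, sample sessions, and function names are assumptions.

```python
from collections import Counter
from itertools import permutations

def mine_pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Return {item: [(suggested_item, confidence), ...]} for pair rules."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        # Ordered pairs (a, b) stand for the candidate rule a -> b.
        pair_counts.update(permutations(sorted(items), 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        sup = count / n
        conf = count / item_counts[a]
        if sup >= min_support and conf >= min_confidence:
            rules.setdefault(a, []).append((b, conf))
    for a in rules:
        # Highest confidence first; ties broken alphabetically.
        rules[a].sort(key=lambda pair: (-pair[1], pair[0]))
    return rules

def suggest(rules, query, limit=3):
    """Suggest up to `limit` items related to `query`."""
    return [b for b, _ in rules.get(query, [])[:limit]]

# Hypothetical sessions from a component search engine log.
sessions = [
    ["xml parser", "sax parser"],
    ["xml parser", "sax parser", "logging"],
    ["xml parser", "dom parser"],
    ["logging"],
]
rules = mine_pair_rules(sessions)
suggestions = suggest(rules, "sax parser")
```

The items in a session could equally be downloaded components instead of queries; the mining step does not change.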
I have some criticisms of this paper, such as the lack of detail about the data mining process: several algorithms can be used, so which one was used? Another question concerns the experiment: I think that choosing a specific domain from which to extract the rules could help validate the suggestions with the help of a specialist in that domain.


Eduardo Almeida said...

Martins, my question is about your statement "...classification, clustering, sequence analysis, association rules; among others...". Do you have any data, experiment, or argument to convince people that association rules are the best approach for this? I would like to see more discussion in this direction. Moreover, what do you think about the evaluation process, since you need to have some data?

Alexandre Martins said...

Each technique is used for specific purposes. My direction concerns the relations between the stored assets, and association rules help discover these relations.

The question about the experiment is very important. I must have data in order to extract the knowledge. I think that in the first step the data are not real, and this step is used to build the first prototype. However, the validation depends on real data. Projects may be based either on existing data to be mined or on future systems that will provide it. The second option introduces a dependency and forces the generation of artificial (but coherent) data to validate the project.

I need data to validate and tune my system. If I don't have it, then I need to create it in a controlled environment: not as big as real data, but coherent.