Information content of the frequency characteristics of N-grams of website text fragments for search systems

Abstract

V.A. Strotsev

Article received: 29.12.2012

Working on the Internet is impossible without search systems. The quality of the responses to a user's request depends to a large extent on the keywords. However, in some circumstances the user cannot formulate the request precisely enough, and the number of responses returned is large. In such situations, an additional criterion for selecting relevant answers can be whether a text document belongs to a particular implicit group. The group is implicit in the sense that membership is determined not by direct comparison with reference (key) words, but by matching semantic features whose wording is absent from the search text. Such text classification can be implemented on the basis of the frequency characteristics of N-grams. The purpose of this work is to assess, through a study of their information content, whether the frequency characteristics of N-grams of website text fragments can be used to improve search engines. A method is developed that selects informative N-gram indicators using correlation analysis and has low computational requirements. Its application shows that the frequency characteristics of N-grams carry sufficient information content to improve search engines. The developed method is applicable to many writing systems, because it does not rely solely on alphabetic ones.

Keywords: N-gram, request, semantic features, search engine
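
As an illustration of what "frequency characteristics of N-grams" and "correlation analysis" can mean in practice, the following is a minimal Python sketch. It is not the author's method, which the abstract does not specify in detail. It assumes character-level N-grams, relative frequencies as the characteristic, and a Pearson correlation between the frequency profiles of two text fragments. The function names `ngram_frequencies`, `pearson`, and `profile_correlation` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_frequencies(text, n=3):
    """Relative frequency of each character N-gram in `text`.

    Assumption: text is lowercased and runs of whitespace are collapsed;
    the abstract does not state how fragments are normalized.
    """
    normalized = " ".join(text.lower().split())
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either sequence is constant."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def profile_correlation(text_a, text_b, n=3):
    """Correlate the N-gram frequency profiles of two text fragments.

    The profiles are aligned over the union of N-grams seen in either
    fragment; an N-gram missing from one fragment has frequency 0.
    """
    freq_a = ngram_frequencies(text_a, n)
    freq_b = ngram_frequencies(text_b, n)
    vocabulary = sorted(set(freq_a) | set(freq_b))
    if len(vocabulary) < 2:
        return 0.0
    xs = [freq_a.get(gram, 0.0) for gram in vocabulary]
    ys = [freq_b.get(gram, 0.0) for gram in vocabulary]
    return pearson(xs, ys)
```

Under these assumptions, a fragment whose profile correlates strongly with a reference profile for a group could be treated as a candidate member of that group. Because the sketch operates on Unicode characters rather than on a fixed alphabet, it can be applied to text in non-alphabetic scripts as well.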