skip to main content

Efficient text searching

Ricardo Baeza-Yates Gaston H Gonnet

1989

Localização: IME - Inst. Matemática e Estatística  IMRE SIMON  (CIS QA758.T B142e e.2 )(Acessar)

  • Título:
    Efficient text searching
  • Autor: Ricardo Baeza-Yates
  • Gaston H Gonnet
  • Assuntos: Search theory; Algorithms; Text processing (Computer science); ALGORITMOS E ESTRUTURAS DE DADOS (TESE DOUTORADO); ALGORITMOS PARA PROCESSAMENTO (TESE DOUTORADO)
  • Notas: Thesis (Ph. D.)--University of Waterloo, 1989 ; Includes bibliographical references and index
  • Descrição: Abstract: "This thesis presents and analyses new algorithms for text searching. We place special emphasis on the average case of each algorithm, because we are interested in practical applications. We consider plain text and preprocessed text. In preprocessed (indexed) text our main result is to show that automaton searching over a digital tree requires time sublinear in the size of the text for any regular expression; this is the first algorithm with a searching time complexity better than linear. We also show that for a restricted set of regular expressions, searching requires logarithmic time in the expected case. In all these algorithms, the time is independent of the size of the answer.
    For searching in plain text, our main results are the following. First, we analyze the expected case of the best known string matching algorithms, obtaining bounds for the Knuth-Morris-Pratt algorithm, and the Boyer-Moore algorithm and several of its variations. We also present improvements to Boyer-Moore type algorithms, in particular for searching English text. Second, we analyze three new, simple, and fast algorithms to solve the problem of string matching with mismatches. These algorithms are eminently practical and can be extended to other variations of this problem. Finally, we present a new approach to pattern matching, based on a numerical representation of the state of the search. We apply this method to patterns including classes of symbols, complement, 'don't care' symbols, multiple patterns, and mismatches. The search time is linear for most practical cases. One of the advantages of these algorithms is that they are suitable for hardware implementation."
  • Títulos relacionados: Série:University of Waterloo Computer Science Dept Research report CS-89-17
  • Data de criação/publicação: 1989
  • Formato: xi, 129 p ill 26 cm.
  • Idioma: Inglês

Buscando em bases de dados remotas. Favor aguardar.