Text analyzer for search indexing
Provides text processing pipelines:
Tokenization
Lowercasing
Stopword removal
Stemming
Synonym expansion
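The first stages listed above can be sketched in base R. This is an illustrative sketch only, not the package's implementation: `sketch_analyze` and `sketch_stopwords` are hypothetical names, the stopword list is a made-up sample, and stemming and synonym expansion are left out.

```r
# Illustrative sketch only: `sketch_analyze` and `sketch_stopwords` are
# hypothetical names and are not part of TextAnalyzer.
sketch_stopwords <- c("the", "a", "an", "of", "and")

sketch_analyze <- function(text, stopwords = sketch_stopwords) {
  # Tokenization: extract every run of ASCII letters and digits
  tokens <- regmatches(text, gregexpr("[a-zA-Z0-9]+", text))[[1]]
  # Lowercasing
  tokens <- tolower(tokens)
  # Stopword removal: keep only tokens that are not in the stopword set
  tokens[!tokens %in% stopwords]
}

sketch_analyze("The Indexing of Documents")
# returns c("indexing", "documents")
```

Lowercasing before stopword removal lets a lowercase stopword list match capitalized input such as "The". The order TextAnalyzer itself uses is not stated here, so treat this ordering as an assumption.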
Public fields
lowercase: Convert to lowercase
remove_stopwords: Remove stopwords
stopwords: Set of stopwords
stemmer: Stemmer object
synonyms: Synonym dictionary
min_token_length: Minimum token length
max_token_length: Maximum token length
token_pattern: Regex pattern for tokens
Methods
Method new()
Create a new TextAnalyzer
Usage
TextAnalyzer$new(
  lowercase = TRUE,
  remove_stopwords = FALSE,
  stopwords = NULL,
  use_stemmer = FALSE,
  synonyms = NULL,
  min_token_length = 1,
  max_token_length = 100,
  token_pattern = "[a-zA-Z0-9]+"
)
Arguments
lowercase: Convert text to lowercase (default: TRUE)
remove_stopwords: Remove stopwords (default: FALSE)
stopwords: Custom stopwords (default: NULL, which uses ENGLISH_STOPWORDS)
use_stemmer: Apply stemming (default: FALSE)
synonyms: Named list of synonyms (default: NULL)
min_token_length: Minimum token length (default: 1)
max_token_length: Maximum token length (default: 100)
token_pattern: Regex pattern for tokens (default: "[a-zA-Z0-9]+")
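To show how the length bounds and a named list of synonyms might act on tokens, here is a minimal base R sketch. The helpers `filter_by_length` and `expand_synonyms` are hypothetical and are not part of TextAnalyzer. Appending synonyms after the original token (rather than replacing it) is an assumption, as is treating both length bounds as inclusive.

```r
# Hypothetical helpers for illustration; not the package's implementation.
filter_by_length <- function(tokens, min_len = 1, max_len = 100) {
  # Keep tokens whose character count lies within [min_len, max_len]
  n <- nchar(tokens)
  tokens[n >= min_len & n <= max_len]
}

expand_synonyms <- function(tokens, synonyms) {
  # For each token, emit the token followed by any synonyms stored under
  # its name; `[[` on a list returns NULL for a name that is absent
  out <- lapply(tokens, function(tok) c(tok, synonyms[[tok]]))
  as.character(unlist(out, use.names = FALSE))
}

synonyms <- list(car = c("automobile", "vehicle"))
tokens <- filter_by_length(c("a", "car", "park"), min_len = 2)
expand_synonyms(tokens, synonyms)
# returns c("car", "automobile", "vehicle", "park")
```

Here the names of the list are the tokens to match and the values are character vectors of alternatives, which is one plausible reading of "Named list of synonyms".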