public class ArabicLetterTokenizer extends LetterTokenizer
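The standard LetterTokenizer fails on diacritics: a combining mark is not in the Letter category, so it is treated as a token boundary and a vocalized word is split at each diacritic. This tokenizer also accepts nonspacing marks as token characters. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.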
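The category difference behind this can be checked with the plain JDK. The following is a minimal sketch, not Lucene code: the class name `DiacriticCategories` and the helper `isNonspacingMark` are hypothetical names chosen for illustration, and the two sample characters are an Arabic letter and an Arabic vowel diacritic.

```java
// Sketch only: uses java.lang.Character, no Lucene classes are involved.
class DiacriticCategories {

    /** Hypothetical helper: true if c is in the Unicode category Mn (nonspacing mark). */
    static boolean isNonspacingMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        char beh = '\u0628';    // ARABIC LETTER BEH
        char fatha = '\u064E';  // ARABIC FATHA, a vowel diacritic

        System.out.println(Character.isLetter(beh));    // prints "true"
        System.out.println(Character.isLetter(fatha));  // prints "false"
        System.out.println(isNonspacingMark(fatha));    // prints "true"
    }
}
```

A predicate based only on `Character.isLetter` therefore rejects the diacritic, which is why a letter-only tokenizer ends the token there.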
Nested classes/interfaces inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State
Constructor and Description |
---|
ArabicLetterTokenizer(AttributeSource.AttributeFactory factory, java.io.Reader in) |
ArabicLetterTokenizer(AttributeSource source, java.io.Reader in) |
ArabicLetterTokenizer(java.io.Reader in) |
Modifier and Type | Method and Description |
---|---|
protected boolean | isTokenChar(char c) Allows for Letter category or NonspacingMark category |
Methods inherited from class CharTokenizer: end, incrementToken, next, next, normalize, reset
Methods inherited from class Tokenizer: close, correctOffset
Methods inherited from class TokenStream: getOnlyUseNewAPI, reset, setOnlyUseNewAPI
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
public ArabicLetterTokenizer(java.io.Reader in)
public ArabicLetterTokenizer(AttributeSource source, java.io.Reader in)
public ArabicLetterTokenizer(AttributeSource.AttributeFactory factory, java.io.Reader in)
protected boolean isTokenChar(char c)
Overrides: isTokenChar in class LetterTokenizer
See Also: LetterTokenizer.isTokenChar(char)
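To illustrate the effect of accepting both categories, here is a standalone sketch that splits text into runs of letters and nonspacing marks. It is an assumption-laden approximation written against the plain JDK, not the Lucene implementation: the class `LetterAndMarkSplitter` and its `tokenize` method are hypothetical, and the predicate only mirrors the one-line description of isTokenChar given above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: imitates the described behavior without using Lucene.
class LetterAndMarkSplitter {

    /** Assumed predicate: accept Letter or NonspacingMark characters. */
    static boolean isTokenChar(char c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    /** Splits text into maximal runs of characters accepted by isTokenChar. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-token character closes the current run.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A vocalized word (three letters, each followed by FATHA),
        // a space, then an unvocalized four-letter word.
        String text = "\u0643\u064E\u062A\u064E\u0628\u064E \u0643\u062A\u0627\u0628";
        System.out.println(tokenize(text).size()); // prints "2"
    }
}
```

With a letter-only predicate, the first word in this input would instead break into three single-letter tokens, one per consonant.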
Copyright © 2000-2016 Apache Software Foundation. All Rights Reserved.