Supported languages
The following table lists all languages and scripts with dedicated tokenization support in Charabia:

| Language / Script | Segmentation | Normalization |
|---|---|---|
| Latin (English, French, Spanish, Italian, Portuguese, etc.) | CamelCase segmentation | Decomposition, lowercase, nonspacing-marks removal |
| German | CamelCase + compound word decomposition | Same as Latin |
| Swedish | Specialized normalization | Decomposition, lowercase |
| Greek | Default | Decomposition, lowercase, final sigma handling |
| Cyrillic / Georgian (Russian, Ukrainian, Bulgarian, etc.) | Default | Decomposition, lowercase |
| Armenian | Default | Decomposition, lowercase |
| Arabic | Article (ال) segmentation | Decomposition, digit conversion, nonspacing-marks removal |
| Persian | Specialized segmentation | Decomposition, normalization |
| Hebrew | Default | Decomposition, nonspacing-marks removal |
| Turkish | Default | Specialized case folding (dotted/dotless i) |
| Chinese (CMN) | jieba-based dictionary segmentation | Decomposition, kvariant conversion |
| Japanese | lindera IPA dictionary segmentation | Decomposition |
| Korean | lindera KO dictionary segmentation | Decomposition |
| Thai | Dictionary-based segmentation | Decomposition, nonspacing-marks removal |
| Khmer | Dictionary-based segmentation | Decomposition |
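
To make the Latin row concrete, here is a minimal Python sketch of CamelCase segmentation followed by decomposition, nonspacing-marks removal, and lowercasing. It is an approximation built on Python's standard library, not Charabia's implementation: the function names are hypothetical, the use of NFKD is an assumption about which decomposition form applies, and the CamelCase rule only recognizes ASCII letters.

```python
import re
import unicodedata


def split_camel_case(word):
    """Split at every lowercase-to-uppercase boundary (ASCII letters only)."""
    # Insert a space at each zero-width boundary, then split on whitespace.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()


def normalize(token):
    """Decompose, drop nonspacing marks (Unicode category Mn), then lowercase."""
    # Assumption: compatibility decomposition (NFKD) stands in for "Decomposition".
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.lower()


def tokenize_latin(text):
    """Whitespace split, then CamelCase segmentation, then normalization."""
    tokens = []
    for word in text.split():
        for piece in split_camel_case(word):
            tokens.append(normalize(piece))
    return tokens


if __name__ == "__main__":
    print(tokenize_latin("openSearch Café"))
```

With this sketch, `openSearch` yields two tokens and `Café` loses its accent, so a query for `cafe` would match both spellings. Scripts that rely on dictionary-based segmentation, such as Chinese, Japanese, or Thai, cannot be approximated this way.
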
Multilingual hybrid search
Meilisearch’s keyword-based search relies on Charabia for tokenization, but hybrid search and semantic search use embedding models that can handle languages independently of the tokenizer. Many embedding providers offer multilingual models that work across 100+ languages out of the box:

| Provider | Multilingual model | Dimensions |
|---|---|---|
| Cohere | embed-multilingual-v3.0 | 1024 |
| Cohere | embed-multilingual-light-v3.0 | 384 |
| Voyage AI | voyage-multilingual-2 | 1024 |
| AWS Bedrock | cohere.embed-multilingual-v3 | 1024 |
| Hugging Face | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 |

With a multilingual embedding model, you can:

- Search across languages: a query in English can match documents written in French, German, or Japanese.
- Simplify multilingual indexing: instead of creating one index per language, a single index with a multilingual embedder can serve multiple languages.
- Complement keyword search: combine Charabia’s keyword tokenization with semantic embeddings in hybrid search for the best of both approaches.
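
As a rough illustration of the single-index approach, the Python sketch below builds the two JSON payloads involved: embedder settings for the Hugging Face model from the table, and a hybrid search request. It only constructs the payloads and does not contact a server. The embedder name `multilingual`, the helper functions, and the `title` and `description` document fields are hypothetical, and the exact settings accepted may vary between Meilisearch versions.

```python
import json


def embedder_settings(name, model):
    """Settings payload, intended for PATCH /indexes/{index_uid}/settings."""
    return {
        "embedders": {
            name: {
                "source": "huggingFace",
                "model": model,
                # Assumption: documents have `title` and `description` fields.
                "documentTemplate": "{{doc.title}} {{doc.description}}",
            }
        }
    }


def hybrid_search_body(query, embedder, semantic_ratio=0.5):
    """Search payload, intended for POST /indexes/{index_uid}/search."""
    # 0.0 means keyword results only, 1.0 means semantic results only.
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0.0 and 1.0")
    return {
        "q": query,
        "hybrid": {"embedder": embedder, "semanticRatio": semantic_ratio},
    }


if __name__ == "__main__":
    settings = embedder_settings(
        "multilingual",
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    )
    search = hybrid_search_body("chaussures de course", "multilingual", 0.7)
    print(json.dumps(settings, indent=2))
    print(json.dumps(search, indent=2, ensure_ascii=False))
```

A `semanticRatio` closer to 1.0 leans on the embedder, which helps when the query and the documents are in different languages. A value closer to 0.0 favors Charabia's keyword matching.
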

For guidance on structuring multilingual datasets, see Handling multilingual datasets.