Language - Meilisearch Documentation

Meilisearch is multilingual and works with datasets in any language. Its tokenizer, Charabia, provides optimized segmentation and normalization for a wide range of languages and scripts.

Supported languages

The following table lists all languages and scripts with dedicated tokenization support in Charabia:

Language / Script	Segmentation	Normalization
Latin (English, French, Spanish, Italian, Portuguese, etc.)	CamelCase segmentation	Decomposition, lowercase, nonspacing-marks removal
German	CamelCase + compound word decomposition	Same as Latin
Swedish	Specialized normalization	Decomposition, lowercase
Greek	Default	Decomposition, lowercase, final sigma handling
Cyrillic / Georgian (Russian, Ukrainian, Bulgarian, etc.)	Default	Decomposition, lowercase
Armenian	Default	Decomposition, lowercase
Arabic	Article (ال) segmentation	Decomposition, digit conversion, nonspacing-marks removal
Persian	Specialized segmentation	Decomposition, normalization
Hebrew	Default	Decomposition, nonspacing-marks removal
Turkish	Default	Specialized case folding (dotted/dotless i)
Chinese (CMN)	jieba-based dictionary segmentation	Decomposition, kvariant conversion
Japanese	lindera IPA dictionary segmentation	Decomposition
Korean	lindera KO dictionary segmentation	Decomposition
Thai	Dictionary-based segmentation	Decomposition, nonspacing-marks removal
Khmer	Dictionary-based segmentation	Decomposition
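
The normalization steps named in the table can be approximated with Python's standard library. The sketch below is only an illustration of what "decomposition, lowercase, nonspacing-marks removal" means for Latin text. It is not Charabia's implementation, and the choice of the NFKD form and the function name `normalize` are assumptions made for this example.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate Latin-style normalization: decompose, drop marks, lowercase."""
    # Compatibility decomposition splits "é" into "e" + U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFKD", text)
    # Nonspacing marks have Unicode category "Mn"; dropping them strips accents.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.lower()


print(normalize("Crème Brûlée"))  # creme brulee
```

With this kind of normalization applied to both documents and queries, a search for `creme brulee` can match a document containing `Crème Brûlée`.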

Languages not listed above still work with Meilisearch. Any language that uses whitespace to separate words benefits from the default Latin pipeline. Results may be less relevant for unlisted languages that do not use spaces between words.

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please open an issue in the Meilisearch repository.

Read more about our tokenizer.

Multilingual hybrid search

Meilisearch’s keyword-based search relies on Charabia for tokenization, but hybrid search and semantic search use embedding models that can handle languages independently of the tokenizer. Many embedding providers offer multilingual models that work across 100+ languages out of the box:

Provider	Multilingual model	Supported dimensions
Cohere	`embed-v4.0`	256, 512, 1024, or 1536
Cohere	`embed-multilingual-v3.0`	1024
Voyage AI	`voyage-4`	256, 512, 1024, or 2048
Jina	`jina-embeddings-v4`	128, 256, 512, 1024, or 2048
AWS Bedrock	`cohere.embed-v4:0`	256, 512, 1024, or 1536
Hugging Face	`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`	384

Using a multilingual embedding model allows you to:

Search across languages: a query in English can match documents written in French, German, or Japanese.
Simplify multilingual indexing: instead of creating one index per language, a single index with a multilingual embedder can serve multiple languages.
Complement keyword search: combine Charabia’s keyword tokenization with semantic embeddings in hybrid search for the best of both approaches.

For multilingual datasets, consider using hybrid search with a multilingual embedder alongside localized attributes for keyword matching. This gives you accurate tokenization per language for keyword search and cross-language understanding for semantic search.
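
As a rough sketch of that setup, the Python snippet below builds two HTTP requests: one that configures a multilingual embedder together with localized attributes, and one that runs a hybrid search. The instance URL, API key, index name `products`, embedder name `multilingual`, and the `*_ja` / `*_fr` attribute patterns are all hypothetical placeholders. The payload fields (`embedders`, `localizedAttributes`, `hybrid`, `semanticRatio`) follow the Meilisearch settings and search APIs as we understand them, so check them against the API reference for your version before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7700"  # assumption: a local Meilisearch instance
API_KEY = "MASTER_KEY"              # placeholder, replace with your own key
INDEX = "products"                  # hypothetical index name


def build_request(method: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the Meilisearch HTTP API."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )


# One index, one multilingual embedder, plus per-language keyword tokenization.
settings_request = build_request("PATCH", f"/indexes/{INDEX}/settings", {
    "embedders": {
        "multilingual": {  # hypothetical embedder name
            "source": "huggingFace",
            "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        }
    },
    "localizedAttributes": [
        {"attributePatterns": ["*_ja"], "locales": ["jpn"]},
        {"attributePatterns": ["*_fr"], "locales": ["fra"]},
    ],
})

# Hybrid search: semanticRatio 0.5 weighs keyword and semantic results equally.
search_request = build_request("POST", f"/indexes/{INDEX}/search", {
    "q": "wireless headphones",
    "hybrid": {"embedder": "multilingual", "semanticRatio": 0.5},
})

# To actually send a request: urllib.request.urlopen(settings_request)
```

Nothing is sent until `urllib.request.urlopen` is called, so the payloads can be inspected or adapted first. A `semanticRatio` closer to 1 favors semantic matches, while a value closer to 0 favors keyword matches.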

For guidance on structuring multilingual datasets, see Handling multilingual datasets.

Improving our language support

While we have employees from all over the world at Meilisearch, we don’t speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages.

If you’d like to request optimized support for a language, please upvote the related discussion in our product repository or open a new one if it doesn’t exist.

If you’d like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the tokenizer contribution guide before making a PR.

FAQ

What do you mean when you say Meilisearch offers optimized support for a language?

Optimized support for a language means Meilisearch has implemented internal processes specifically tailored to parsing that language, leading to more relevant results. This includes specialized segmentation (how text is split into words) and normalization (how characters are standardized for matching).
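
To make "segmentation" concrete, here is a minimal Python sketch of CamelCase segmentation for whitespace-separated text. It is an illustration only, not how Charabia is implemented: the regular expression and the function name `segment` are assumptions chosen for this example.

```python
import re

# A zero-width boundary between a lowercase letter and an uppercase letter.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def segment(text: str) -> list[str]:
    """Split on whitespace, then split each word at CamelCase boundaries."""
    tokens = []
    for word in text.split():
        tokens.extend(CAMEL_BOUNDARY.split(word))
    return tokens


print(segment("getUserName now"))  # ['get', 'User', 'Name', 'now']
```

Splitting `getUserName` into separate tokens is what lets a query such as `user` match it. Languages without spaces between words need a different approach, such as the dictionary-based segmentation listed in the table above.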

My language does not use whitespace to separate words. Can I still use Meilisearch?

Yes. For keyword search, results may be less relevant than for fully optimized languages. However, you can use hybrid search with a multilingual embedding model to get strong semantic results regardless of tokenization support.

My language does not use the Roman alphabet. Can I still use Meilisearch?

Yes. Charabia supports many non-Latin scripts including Cyrillic, Greek, Arabic, Hebrew, Armenian, Thai, Chinese, Japanese, and Korean. Multilingual embedding models also work across all writing systems.

Does Meilisearch plan to support additional languages in the future?

Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project.
