
Patching LMDB: how we made Meilisearch’s vector store 3× faster

We patched LMDB to support nested read transactions on uncommitted writes, eliminating full database scans and making Meilisearch's vector store 3× faster.

18 Mar 2026 · 12 min read

This post was originally published on Clément Renault's blog.


Meilisearch uses LMDB, a lightning-fast memory-mapped key-value store, for its indexes under the hood. To perform approximate nearest neighbors (ANN) search, we developed a homemade LMDB-based HNSW named Hannoy.

If you are wondering what a Hierarchical Navigable Small World (HNSW) is, we recommend reading this Wikipedia article. If you're feeling lazy: it is a multi-layer graph with increasing connections between nodes that improves the time needed to find the nearest neighbors of a target embedding. An embedding, also commonly defined as a vector, is a set of floating-point numbers; the length of it is called its dimensions, usually between 512 and 3072, and represents the meaning of a text, an image, a sound, etc. Embeddings that are mathematically closer are considered semantically closer to each other by the embedder that generated them.
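To make "mathematically closer" concrete, here is a minimal sketch (hypothetical helper, not Hannoy code) of cosine similarity, one of the common distance measures between embeddings:

```rust
/// Cosine similarity between two embeddings of equal dimension:
/// dot(a, b) / (|a| * |b|). Values close to 1.0 mean the vectors
/// point in the same direction, i.e. are semantically close.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must share dimensions");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    let a = [1.0, 0.0, 1.0];
    let b = [1.0, 0.0, 1.0];
    let c = [0.0, 1.0, 0.0];
    println!("{:.2}", cosine_similarity(&a, &b)); // identical vectors -> 1.00
    println!("{:.2}", cosine_similarity(&a, &c)); // orthogonal vectors -> 0.00
}
```

An HNSW answers the question "which stored embeddings maximize this similarity to a query embedding?" without comparing the query against every single vector.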

This article will not dive into the gears of an HNSW but rather into those of LMDB and our vector store. We will talk about a missing feature of our key-value store that we eventually implemented to unlock an even more powerful search engine, enabling indexation to scale linearly with the number of CPUs and to have lower time complexity.

Reminder of the facts

It's been some time since we talked about an unsafe but safe approach to reading the entries of an LMDB write transaction — one that is not committed yet and whose content is not visible to a newly created transaction until it is committed. Here, you can read the more detailed article describing this technique.

The trick is to do a full scan of all the entries we will need to access in order to insert the newly arrived embeddings, and collect the pointers and associated lengths in a very basic Rust HashMap. We confirmed the validity and safety of this approach with Howard Chu, author and maintainer of LMDB. This way, we can wrap this HashMap in a type and unsafe-ly implement Sync on it so we can use multiple threads to access those memory-mapped pointers (embeddings, HNSW graph nodes).

For people who prefer reading code to long paragraphs, here is the data structure we are talking about. You can clearly see the full scan — that's the only for loop in this code sample. That's where the problem lies, and we'll discuss the impact of this loop soon.

pub struct ImmutableItems<'t, D> {
    item_ids: RoaringBitmap,
    constant_length: Option<usize>,
    offsets: Vec<*const u8>,
    _marker: marker::PhantomData<(&'t (), D)>,
}

impl<'t, D: Distance> ImmutableItems<'t, D> {
    pub fn new(rtxn: &'t RoTxn, database: Database<D>, index: u16) -> heed::Result<Self> {
        let mut item_ids = RoaringBitmap::new();
        let mut constant_length = None;
        let mut offsets = Vec::new();

        for result in database.items()? {
            let (item_id, bytes) = result?;
            assert_eq!(*constant_length.get_or_insert(bytes.len()), bytes.len());
            item_ids.push(item_id);
            offsets.push(bytes.as_ptr());
        }

        Ok(ImmutableItems { item_ids, constant_length, offsets, _marker: marker::PhantomData })
    }

    pub fn get(&self, item_id: ItemId) -> heed::Result<Option<Leaf<'t, D>>> {
        let len = match self.constant_length {
            Some(len) => len,
            None => return Ok(None),
        };
        let ptr = match self.item_ids.rank(item_id).checked_sub(1).and_then(|offset| self.offsets.get(offset)) {
            Some(ptr) => *ptr,
            None => return Ok(None),
        };
        let bytes = unsafe { slice::from_raw_parts(ptr, len) };
        ItemCodec::bytes_decode(bytes).map(Some)
    }
}

unsafe impl<D> Sync for ImmutableItems<'_, D> {}

Here is an extract of what happens in Hannoy for a customer that has nearly 19M (1024-dimension) binary-quantized embeddings in its vector store and is inserting 7,000 embeddings. As you may have noticed, a lot of time is spent retrieving the item IDs, fetching item and link pointers, and finally writing the items to disk:

"progressTrace": {
    // [..]
    "processing tasks > indexing > writing embeddings to database > retrieving the items ids": "559.56s",
    "processing tasks > indexing > writing embeddings to database > retrieve the updated items": "3.31ms",
    "processing tasks > indexing > writing embeddings to database > fetch item pointers": "5.54s",
    "processing tasks > indexing > writing embeddings to database > fetch links pointers": "134.15s",
    "processing tasks > indexing > writing embeddings to database > resolve graph entry points": "228.10µs",
    "processing tasks > indexing > writing embeddings to database > building the graph > inserting items": "57.39ms",
    "processing tasks > indexing > writing embeddings to database > building the graph": "57.43ms",
    "processing tasks > indexing > writing embeddings to database > patch old new deleted links": "21.40s",
    "processing tasks > indexing > writing embeddings to database > writing the items": "413.43s",
    "processing tasks > indexing > writing embeddings to database > deleting the links": "2.04s",
    "processing tasks > indexing > writing embeddings to database > write the metadata": "4.10s",
    "processing tasks > indexing > writing embeddings to database": "1140.28s",
    // [..]
    "processing tasks > indexing > finalizing > committing": "16.23s",
    "processing tasks > indexing > finalizing > computing stats": "22.37ms",
    "processing tasks > indexing > finalizing": "16.62s",
    "processing tasks > indexing": "2750.89s",
    "processing tasks": "2751.21s"
  },
  "writeChannelCongestion": {
    "attempts": 48317068,
    "blocking_attempts": 48126087,
    "blocking_ratio": 0.9960473775863647
  },
  "internalDatabaseSizes": {
    // [..]
    "vectorStore": "8.65 GiB (+132 KiB)",
    "documents": "60.8 GiB (+1.05 MiB)"
  }

Stop computing useless stats

The part where we retrieve the item IDs corresponds to fetching the internal IDs of newly inserted or removed embeddings. At the time, this step was also computing the mean degree of the graph to help us debug the vector store, which required a full scan of it, an operation to avoid, especially as the total number of embeddings grows over time. We recently removed this computation, reducing the time from 559s to about 2–3s and bringing the complexity down to O(n), where n is the number of newly inserted or removed embeddings.

Computing the mean degree stat was added to Hannoy very early on as a sanity check, since we were doing concurrent insertions into the graph and weren't sure each point would be connected enough. Now that we have plenty of tests in our CI, we are much more confident the graph is well-formed.

Full scans are harmful

Here is where removing the ImmutableItems and ImmutableLinks data-structure tricks has the most effect: fetching the item/link pointers. This step is no longer necessary, so we can simply remove it from the Meilisearch progress trace. Removing these full scans is very beneficial: iterating over every entry just to collect pointers was polluting the cache with useless, unnecessary disk pages. Retrieving the item IDs had already warmed the cache with those same pages in order to compute the useless stats, which is why the two fetching steps didn't look that large; once each step has to do the full work itself, you can clearly see it taking a long time.

But what's the new trick?

This is the core of the article, and it's very important to understand LMDB's semantics here. LMDB is an MVCC, ACID-first, memory-mapped key-value store. A lot of complex words, but what's important is that it allows opening a write transaction at the same time you open any number of read transactions, without the write transaction blocking the read transactions. However, it doesn't support concurrent writes; opening a write transaction blocks any thread attempting to open another one until the existing transaction commits or aborts.

Read transactions can only see entries that were committed by a write transaction before the read one was opened. This also means that opening a read transaction from the thread using the write transaction will only show entries without the changes currently being made in that write transaction. It's a very useful semantic that we already use in the new Meilisearch indexer, but it doesn't help in this particular situation. Note that write transactions can also read their own writes.
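This visibility rule can be illustrated with a toy model, in plain Rust rather than LMDB: a "read transaction" snapshots the committed state at open time, while a "write transaction" buffers changes that only land on commit. All names here are hypothetical; this mimics LMDB's semantics, not its implementation.

```rust
use std::collections::HashMap;

/// A toy MVCC store: readers snapshot the committed map at open time,
/// writers buffer changes that only become visible on commit.
struct Store { committed: HashMap<String, String> }

struct ReadTxn { snapshot: HashMap<String, String> }

struct WriteTxn { pending: HashMap<String, String> }

impl Store {
    fn read_txn(&self) -> ReadTxn { ReadTxn { snapshot: self.committed.clone() } }
    fn write_txn(&self) -> WriteTxn { WriteTxn { pending: HashMap::new() } }
    fn commit(&mut self, wtxn: WriteTxn) { self.committed.extend(wtxn.pending); }
}

impl ReadTxn {
    fn get(&self, key: &str) -> Option<&String> { self.snapshot.get(key) }
}

impl WriteTxn {
    fn put(&mut self, key: &str, value: &str) {
        self.pending.insert(key.to_string(), value.to_string());
    }
    /// Write transactions can read their own uncommitted writes.
    fn get<'a>(&'a self, store: &'a Store, key: &str) -> Option<&'a String> {
        self.pending.get(key).or_else(|| store.committed.get(key))
    }
}

fn main() {
    let mut store = Store { committed: HashMap::new() };
    let mut wtxn = store.write_txn();
    wtxn.put("doc:1", "hello");

    // A reader opened while the write is uncommitted sees nothing...
    let rtxn = store.read_txn();
    assert_eq!(rtxn.get("doc:1"), None);
    // ...but the writer sees its own changes.
    assert_eq!(wtxn.get(&store, "doc:1").map(String::as_str), Some("hello"));

    store.commit(wtxn);
    // A reader opened after the commit sees the entry.
    assert_eq!(store.read_txn().get("doc:1").map(String::as_str), Some("hello"));
}
```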

In Meilisearch, when we receive new documents to index, we want them to become available to search atomically. We don't want the database to be in a half-valid state, with references to internal document/embedding IDs pointing to not-yet-available data. Exposing half-valid internal data structures could break search requests that start in the middle of two valid states. A way to fix this could be to ignore certain changes until they are marked as valid, e.g., tombstones for not-yet-valid document IDs. But we don't plan to work on this; we already have enough complexity around soft deletion.

We prefer opening a write transaction (i.e., RwTxn) at the beginning of task processing, passing it as &mut RwTxn, and letting the task-processing function write to the database as needed. Once it is done, it returns, and we commit the transaction if everything went right (i.e., RwTxn::commit(self)), thereby consuming it.
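The commit-consumes-the-transaction pattern can be sketched in plain Rust. The types below are hypothetical stand-ins that mirror heed's `RwTxn::commit(self)` signature, not the real ones:

```rust
/// A toy write transaction: `commit` takes `self` by value, so the
/// transaction cannot be used again afterwards; the compiler enforces it.
struct RwTxn { writes: Vec<(String, String)> }

impl RwTxn {
    fn new() -> Self { RwTxn { writes: Vec::new() } }
    fn put(&mut self, key: &str, value: &str) {
        self.writes.push((key.to_string(), value.to_string()));
    }
    /// Consuming `self` models `RwTxn::commit(self)`.
    fn commit(self) -> usize { self.writes.len() }
}

/// Task processing borrows the transaction mutably, as in Meilisearch.
fn process_task(wtxn: &mut RwTxn) {
    wtxn.put("doc:1", "indexed");
    wtxn.put("doc:2", "indexed");
}

fn main() {
    let mut wtxn = RwTxn::new();
    process_task(&mut wtxn);
    let committed = wtxn.commit(); // consumes the transaction
    println!("{committed}"); // prints "2"
    // wtxn.put("x", "y"); // would not compile: wtxn was moved
}
```

Because `commit` moves the transaction, either all of a task's writes become visible at once or none do; there is no way to keep writing through a committed handle.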

The tricky part is when we receive embeddings to insert: we parse them from JSON and store them into our vector store as raw, optimal floating-point values using a write transaction — they're therefore only visible to that write transaction, but we need to unlock the power of parallelization: being able to read the HNSW graph from multiple threads. As described above, our previous method was to store all key-value entry pointers (pointers into the memory map), since we didn't know in advance which ones we would need to update the HNSW layers. LMDB scales linearly with the number of cores used to read a database.

Ideally, new embeddings should be readable from multiple threads without committing the write transaction and opening new read transactions. While committing and opening multiple read transactions might work, it would expose orphan embeddings (those not yet connected to the graph) to search requests. This would complicate the algorithm and make it less clean, as it would require tracking and potentially removing these embeddings in case of errors or other issues.

We wondered whether it would be possible to patch LMDB to run multiple readers on an uncommitted write transaction and query uncommitted changes from multiple threads. If we could do this, we could get rid of our large, slow-to-initialize HashMap and read uncommitted graph nodes and connections lazily, on demand.

Designing a new API for LMDB

We went through the APIs LMDB exposes and settled on a very interesting one: LMDB lets you create nested write transactions. If you provide a parent write transaction when creating a new one, you get a new write transaction that applies its own set of changes and can either commit or abort, rolling back only the subset of changes it made. The child transaction can read the parent's changes. The parent transaction is unusable while its only child transaction is alive. The idea was to create as many read transactions as needed from the write transaction and to move those read transactions onto different threads to read uncommitted changes. I briefly discussed what the API could look like with Howard (@hyc), and he gave us permission to work on this very complex C codebase 😮‍💨

So we went on an adventure to implement this new API. It was mostly about modifying the behavior of the mdb_txn_begin function, which threw an error when called with both a parent transaction and the MDB_RDONLY flag. We allowed starting another nested read-only transaction even when the parent transaction had been suspended, enabling multiple concurrent nested read transactions. As you can see from the linked thread, we weren't following C99 but rather C11, as we had to implement an atomically reference-counted (ARC) system to ensure that only the last dropped nested read transaction frees the allocations, using atomics, which require C11 or compiler extensions. We didn't disable the parent transaction while child read transactions were still alive because we were using our LMDB Rust wrapper, heed, which handles this via lifetimes. We finally implemented everything C users needed so we could propose a well-working version of the nested read transactions feature.

// Disclaimer: We are not C developers

MDB_env *env;
MDB_dbi dbi;
MDB_val key, data;
MDB_txn *parent_txn, *txn;

mdb_env_create(&env);
mdb_env_set_maxreaders(env, 10);
mdb_env_set_mapsize(env, 10485760);
mdb_env_open(env, "./testdb", MDB_FIXEDMAP, 0664);

mdb_txn_begin(env, NULL, 0, &parent_txn);
// We can now create as many nested read transactions as we want from the parent transaction
mdb_txn_begin(env, parent_txn, MDB_RDONLY, &txn);

Integrating the API into Hannoy

It was time to replace our slow-to-initialize ImmutableItems and ImmutableLinks data structures with the following one: a wrapper for nested read transactions that lazily fetches items and graph links on demand. As you can see below, we generate and collect enough nested read transactions from the current write transaction to fill the threads in the current thread pool. Every time FrozenReader::item(&self, ItemId) or FrozenReader::links(&self, ItemId, usize) is invoked (by a rayon thread), we pull a transaction from the channel and store it in thread-local storage (TLS).

pub(crate) struct FrozenReader<'t, D> {
    rtxns_pool: crossbeam_channel::Receiver<RoTxn<'t, WithoutTls>>,
    rtxns: thread_local::ThreadLocal<RoTxn<'t, WithoutTls>>,
    index: u16,
    database: Database<D>,
}

impl<'t, D: Distance> FrozenReader<'t, D> {
    pub fn new(wtxn: &'t RwTxn<'_>, index: u16, database: Database<D>) -> Result<Self> {
        // We make sure to have one more thread so the current/main thread has a nested rtxn.
        let num_threads = rayon::current_num_threads() + 1;
        let (sender, rtxns_pool) = crossbeam_channel::bounded(num_threads);

        // Sequentially generate read transactions from the writer transaction
        for _ in 0..num_threads {
            let rtxn = wtxn.nested_read_txn()?;
            sender.try_send(rtxn).unwrap();
        }

        Ok(Self { rtxns_pool, rtxns: thread_local::ThreadLocal::new(), index, database })
    }

    pub fn item<'a>(&'a self, item_id: ItemId) -> Result<Item<'a, D>> {
        let rtxn = self.rtxns.get_or(|| self.rtxns_pool.try_recv().unwrap());
        let key = Key::item(self.index, item_id);
        // key is a `Key::item` so returned result must be a Node::Item
        self.database.get(rtxn, &key)?.and_then(|node| node.item()).ok_or(Error::missing_key(key))
    }

    pub fn links<'a>(&'a self, item_id: ItemId, level: usize) -> Result<Links<'a>> {
        let rtxn = self.rtxns.get_or(|| self.rtxns_pool.try_recv().unwrap());
        let key = Key::links(self.index, item_id, level as u8);
        // key is a `Key::links` so returned result must be a Node::Links
        self.database.get(rtxn, &key)?.and_then(|node| node.links()).ok_or(Error::missing_key(key))
    }
}

Note that the + 1 added to the number of nested read transactions serves a purpose. When calling the item and links methods, the channel was sometimes empty because threads from the rayon pool had pulled all the transactions and stored them in their own thread-local storage. When we accessed this data structure from the main thread — which is not part of the rayon thread pool — one read transaction was missing, so we accounted for it. Thanks, @dureuill, for the help on this one.
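The off-by-one can be reproduced with a toy pool built on std only (std::sync::mpsc in place of crossbeam, and a hypothetical `drain_pool` helper): N worker threads plus the main thread each pull one token, so the pool needs N + 1 tokens.

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

/// Spawns `num_workers` threads that each take one token from a shared
/// pool, takes one more from the main thread, and returns how many tokens
/// were successfully pulled. The pool holds `num_workers + 1` tokens,
/// mirroring the `rayon::current_num_threads() + 1` sizing of the rtxn pool.
fn drain_pool(num_workers: usize) -> usize {
    let (sender, receiver) = mpsc::sync_channel(num_workers + 1);
    for id in 0..num_workers + 1 {
        sender.try_send(id).unwrap();
    }
    // std's Receiver is not Sync, so guard it with a Mutex to share it
    // across threads (the article uses a crossbeam channel instead).
    let receiver = Mutex::new(receiver);

    thread::scope(|s| {
        let handles: Vec<_> = (0..num_workers)
            .map(|_| {
                let receiver = &receiver;
                s.spawn(move || {
                    // Each worker keeps its token, like a nested read
                    // transaction stored in thread-local storage.
                    receiver.lock().unwrap().try_recv().is_ok() as usize
                })
            })
            .collect();
        // The main thread also needs a token; without the +1, this
        // try_recv would sometimes find the pool empty.
        let main_token = receiver.lock().unwrap().try_recv().is_ok() as usize;
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>() + main_token
    })
}

fn main() {
    // All 4 workers and the main thread get a token: 5 in total.
    assert_eq!(drain_pool(4), 5);
}
```

Sizing the channel at exactly `num_workers` would make the result flaky: whichever thread arrived last, often the main one, would find the pool drained.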

The ThreadLocal data structure stores data in a thread's local storage, eliminating the need for inter-thread synchronization to access it. We only need to synchronize when pulling a read transaction from the crossbeam channel; once it's stored thread-locally, we can retrieve it freely. It may seem odd to use read transactions with the WithoutTls marker, i.e., MDB_NOTLS, which indicates they can be moved between threads safely because they do not use TLS. The reason is that we need to create the nested read transactions from the main thread, as RwTxn is !Sync and cannot be accessed concurrently.

All of these constraints led us to design this approach, which satisfies our needs. It's memory-safe by design: there cannot be any writes while an Item or a Links is borrowed, because the FrozenReader data structure borrows the write transaction for its whole lifetime. In terms of Rust's borrow rules, you cannot use the RwTxn while a FrozenReader that borrows it is alive. You must drop the FrozenReader to write anything to the database, and that's exactly what we do: we collect the different graph modifications we want to perform in memory, apply them once we are done reading the database in parallel, and then write the changes synchronously using our original write transaction. Note that even without this safe, compile-time, lifetime-based design, the parent transaction would throw an error if used while child transactions were still alive.
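This borrow-based exclusion can be sketched with hypothetical stand-in types: while a reader holds a shared borrow of the write transaction, the compiler rejects any mutable use of it.

```rust
struct RwTxn { data: Vec<u32> }

/// Holds a shared borrow of the write transaction for its whole lifetime,
/// like FrozenReader does: no writes can happen while it exists.
struct FrozenReader<'t> { wtxn: &'t RwTxn }

impl<'t> FrozenReader<'t> {
    fn item(&self, idx: usize) -> Option<u32> { self.wtxn.data.get(idx).copied() }
}

fn main() {
    let mut wtxn = RwTxn { data: vec![10, 20, 30] };

    // Phase 1: read through the frozen reader and plan modifications.
    let reader = FrozenReader { wtxn: &wtxn };
    let planned_write = reader.item(1).map(|v| v + 1);
    // wtxn.data.push(40); // would not compile: wtxn is borrowed by `reader`
    drop(reader);

    // Phase 2: the reader is gone, the write transaction is usable again.
    if let Some(v) = planned_write { wtxn.data.push(v); }
    assert_eq!(wtxn.data, vec![10, 20, 30, 21]);
}
```

The two-phase shape, read in parallel first and write afterwards, falls directly out of the borrow checker here.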

Results: over 3x faster

We updated this customer to the brand-new version of Meilisearch featuring our patched vector store. As a reminder, we were inserting 7,000 embeddings in about 1,140s, which is roughly 6 embeddings per second. In the following progress trace, we are indexing ~13,300 embeddings in about 668s, resulting in 20 embeddings per second — more than three times faster than before:

"progressTrace": {
  // [..]
  "processing tasks > indexing > writing embeddings to database > retrieve the updated items": "3.80ms",
  "processing tasks > indexing > writing embeddings to database > resolve graph entry points": "173.33µs",
  "processing tasks > indexing > writing embeddings to database > building the graph > inserting items": "27.05s",
  "processing tasks > indexing > writing embeddings to database > building the graph": "27.05s",
  "processing tasks > indexing > writing embeddings to database > patch old new deleted links": "232.62s",
  "processing tasks > indexing > writing embeddings to database > writing the items": "402.52s",
  "processing tasks > indexing > writing embeddings to database > deleting the links": "2.13s",
  "processing tasks > indexing > writing embeddings to database > write the metadata": "3.89s",
  "processing tasks > indexing > writing embeddings to database": "668.22s",
  // [..]
  "processing tasks > indexing > finalizing > committing": "24.15s",
  "processing tasks > indexing > finalizing > computing stats": "18.27ms",
  "processing tasks > indexing > finalizing": "24.60s",
  "processing tasks > indexing": "1773.29s",
  "processing tasks": "1773.73s",
  "writing tasks to disk > task": "415.73ms",
  "writing tasks to disk": "415.82ms"
},
"writeChannelCongestion": {
  "attempts": 14819408,
  "blocking_attempts": 14435620,
  "blocking_ratio": 0.9741023182868958
},
"internalDatabaseSizes": {
  // [..]
  "vectorStore": "8.64 GiB (+412 KiB)",
  "documents": "61.01 GiB (+2.55 MiB)"
}

Note that this customer uses sharding to speed up embedding indexation, so the effective throughput is even higher than 20 embeddings per second. Also note that we are using AWS EBS to ensure we don't lose data. This disk technology is great because it replicates data across different NVMe drives, but communication between the machine and the disks goes over the network, which is "slow": any page fetch (even with read-ahead configured) takes an average of 1 millisecond. For a disk, that's slow; for a network, it's quite impressive.

So, the key concern in this kind of environment is not to waste a single bit of disk cache. The disk cache prevents you from triggering page faults too often and from fetching too many pages from disk. With this new patch, we ensure only the necessary pages are loaded into the cache — not all available pages in LMDB. We avoid evicting important pages. As a result, you can see that building the graph and patching old/new/deleted links now take longer, because those tasks are lazily loading only the necessary pages that the previous retrieving-the-item-IDs, fetch-item-pointers, and fetch-links-pointers steps were loading upfront. Better pages are now in the disk page cache!

In conclusion, this new LMDB feature is extremely useful and, as a bonus, does not require any API changes. We have been using it in production since December 10th, 2025. It would be very helpful if the LMDB team could review this work. A significant amount of personal time was dedicated to it, with care taken to meet all the requirements to eventually get it upstream. Maintaining a fork and keeping it up to date with various upstream patches is not ideal. Thanks for your time and for the wonderful tools you provide.
