I see a lot of people writing cache-update code like this: delete the cache first, then update the database, and let subsequent reads reload the data into the cache. This logic is incorrect. Imagine two concurrent operations, an update and a query. The update deletes the cache entry; the query misses the cache, reads the old value from the database, and writes it back into the cache; only then does the update write the database. The cache now holds stale data, and it stays stale until the entry expires or is deleted again.
I don't know why so many people use this logic. After I posted about it on Weibo, many people replied with remarkably complex and strange solutions, so I decided to write this article to walk through several cache update design patterns.
Here, let's not discuss whether updating the cache and updating the data are part of the same transaction or if there is a possibility of failure. Let's assume that updating the database and updating the cache can both be successful (let's get the logic for success right first).
There are four design patterns for updating the cache: Cache Aside, Read Through, Write Through, and Write Behind Caching. Let's take a look at each of these patterns one by one.
Cache Aside Pattern
This is the most commonly used pattern. The specific logic is as follows:
- Invalidation: The application first looks for the data in the cache. If it is not there, it retrieves the data from the database and then stores it in the cache.
- Hit: The application looks for the data in the cache and returns it if found.
- Update: The data is first written to the database. After a successful update, the cache entry is invalidated.
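The three steps above can be sketched in a few lines of Python. This is a minimal illustration, using plain dicts as stand-ins for the cache (e.g. memcached/Redis) and the database; the names `cache`, `db`, `read`, and `update` are all illustrative:

```python
cache = {}                 # stands in for memcached/Redis
db = {"user:1": "Alice"}   # stands in for the database

def read(key):
    # Hit: return the cached value directly.
    if key in cache:
        return cache[key]
    # Miss: load from the database, then populate the cache.
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

def update(key, value):
    # Update the database first...
    db[key] = value
    # ...then invalidate (delete) the cache entry rather than updating it.
    cache.pop(key, None)
```

Note that `update` deletes the cache entry instead of writing the new value into it; the next read repopulates the cache from the database.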
Note that we update the database first and then invalidate the cache. So, can this approach avoid the problem mentioned earlier? Let's imagine.
We have two concurrent operations: a query and an update. This time the cache is not deleted up front; the database is updated first. While the update is in progress the cache entry is still valid, so a concurrent query may briefly read the old data. But as soon as the update completes it invalidates the cache, and subsequent queries load the fresh value from the database. This avoids the problem described at the beginning, where subsequent queries keep returning stale data indefinitely.
This is a standard design pattern, and even Facebook's paper "Scaling Memcache at Facebook" uses this strategy. Why not update the cache after writing to the database? You can take a look at this question and answer on Quora: "Why does Facebook use delete to remove the key-value pair in Memcached instead of updating the Memcached during write request to the backend?" The main reason is to avoid dirty data caused by two concurrent write operations.
So, does Cache Aside completely eliminate concurrency issues? Not quite. Consider a read that misses the cache and fetches the old value from the database; before it writes that value into the cache, a write operation updates the database and invalidates the cache; the earlier read then puts its now-stale value into the cache, resulting in dirty data.
However, while this case is theoretically possible, its probability in practice is very low. It requires a cache miss concurrent with a write, and the read must fetch from the database before the write commits yet populate the cache after the write invalidates it. In reality, database writes are much slower than reads and may lock the table, so the odds of all these conditions lining up are generally small.
This point is also made in the Quora answer: either guarantee consistency with a protocol such as 2PC or Paxos, or accept and minimize the probability of dirty data under concurrency. Facebook chose the latter, because 2PC is too slow and Paxos is too complex. And in any case, it is best to set an expiration time on cached entries.
Read/Write Through Pattern
As we can see, in the Cache Aside pattern the application code has to manage two data stores itself: the cache and the database (the repository), which makes the application code more verbose. The Read/Write Through pattern delegates the database operations to the cache itself, making the application layer simpler: the application treats the backend as a single store, and that store maintains its own cache.
Read Through
The Read Through pattern updates the cache during a query operation. In other words, when the cache is invalidated (expired or LRU eviction), the Read Through pattern uses the cache service itself to load the data, making it transparent to the application.
Write Through
The Write Through pattern is similar to the Read Through pattern, but it occurs during data updates. When there is a data update, if the cache is not hit, the database is updated directly and then returned. If the cache is hit, the cache is updated, and then the cache itself updates the database (this is a synchronous operation).
The following image is from the Wikipedia page on cache; you can think of the "Memory" in it as the database in our example.
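A rough sketch of the two patterns combined: the application talks only to the cache object, and the cache itself loads from and writes to the backing store synchronously. `ReadWriteThroughCache` and `backing_store` are illustrative names, not a real library API, and the write path is simplified (it always updates both cache and store):

```python
class ReadWriteThroughCache:
    def __init__(self, backing_store):
        self._store = backing_store  # stands in for the database
        self._cache = {}

    def get(self, key):
        # Read Through: on a miss, the cache loads the data itself,
        # transparently to the application.
        if key not in self._cache:
            value = self._store.get(key)
            if value is not None:
                self._cache[key] = value
        return self._cache.get(key)

    def put(self, key, value):
        # Write Through: update the cache, then synchronously write
        # the same value to the backing store before returning.
        self._cache[key] = value
        self._store[key] = value
```

The key difference from Cache Aside is who owns the database access: here the cache does, so the application never touches the store directly.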
Write Behind Caching Pattern
Write Behind is also known as Write Back. Readers familiar with the Linux kernel will recognize write back: it is exactly the strategy used by the Linux file system's page cache. Yes, these fundamentals keep reappearing, which is why fundamentals matter; I have made this point many times.
In the Write Back pattern, a data update touches only the cache, not the database; the cache asynchronously flushes updates to the database in batches. The advantage of this design is that data I/O becomes extremely fast (it operates directly on memory), and because flushing is asynchronous, Write Back can merge multiple updates to the same data, yielding a significant performance improvement.
However, the trade-off is that the data is not strongly consistent and may be lost (this is why an unclean shutdown of a Unix/Linux machine can lose data). In software design we can never make a flawless design: just like the time-space trade-off in algorithm design, there are trade-offs between strong consistency and high performance, and between high availability and strong consistency. Software design always involves trade-offs.
In addition, the implementation logic of Write Back is more complex because it needs to track which data has been updated and needs to be persisted to the persistent layer. The write back in the operating system will only be truly persisted when the cache needs to be invalidated, such as when there is not enough memory or when the process exits. This is called lazy write.
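A minimal sketch of this idea, with a dirty set tracking which entries still need to be persisted. The names are illustrative; a real implementation would flush on a timer, on eviction, or under memory pressure (the lazy write described above), and must live with the crash-loses-data risk:

```python
class WriteBehindCache:
    def __init__(self, backing_store):
        self._store = backing_store
        self._cache = {}
        self._dirty = set()  # keys updated in cache but not yet persisted

    def put(self, key, value):
        # Fast path: touch memory only; repeated writes to the same key
        # are naturally merged into one eventual flush.
        self._cache[key] = value
        self._dirty.add(key)

    def get(self, key):
        if key not in self._cache and key in self._store:
            self._cache[key] = self._store[key]
        return self._cache.get(key)

    def flush(self):
        # Lazily persist all dirty entries to the store in one batch.
        for key in self._dirty:
            self._store[key] = self._cache[key]
        self._dirty.clear()
```

Note how two `put` calls to the same key result in a single write to the store when `flush` runs; this merging is where much of the performance win comes from.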
Wikipedia's page on cache includes a flowchart of write back that illustrates this logic.
A few more words
- The design patterns above are not specific to updating MySQL with memcached/Redis. They are designs found throughout computer architecture: CPU caches, file-system caches, on-disk caches, database caches, and so on. These cache update patterns are very old and have been tested for a long time, which is why they are considered engineering best practices. Just follow them.
- Sometimes we assume that people who can do macro-level system architecture must simply have a lot of experience. In fact, many designs in macro system architecture come from these micro-level things. For example, the principles behind many virtualization technologies in cloud computing are similar to traditional virtual memory; the I/O models in Unix, scaled up, become the synchronous and asynchronous models in architecture; the pipeline invented by Unix is a data-flow computing architecture; and many designs in TCP reappear in communication between systems. If you study these micro-level aspects carefully, you will find that many of the designs are very ingenious. So please allow me to express a strong opinion here: if you want to do good architecture, you must first thoroughly understand computer architecture and many old-fashioned basic technologies.
- In software development or design, I highly recommend referring to existing designs and ideas before starting, and reading the corresponding guidelines, best practices, or design patterns. Only once you understand these existing things should you decide whether to reinvent the wheel. Don't base software design on false assumptions.
- The discussion above did not consider the cache and the persistence layer as one overall transaction. For example, what if updating the cache succeeds but updating the database fails, or vice versa? If you need strong consistency here, you need the two-phase commit protocol: prepare, then commit/rollback. Examples include Java 7's XAResource and MySQL 5.7's XA Transaction; some caches, such as EhCache, also support XA. Of course, strong-consistency mechanisms like XA come with degraded performance.