Introducing Ristretto: A High-Performance Go Cache

https://blog.dgraph.io/post/introducing-ristretto-high-perf-go-cache/


This post made it to the top of the Golang subreddit and is trending in the top 10 on the front page of Hacker News. Do engage in the discussion there and show us some love by giving us a star.

With over six months of research and development, we’re proud to announce the initial release of Ristretto: A High Performance, Concurrent, Memory-Bound Go cache. It is contention-proof, scales well and provides consistently high hit-ratios.

You can now also watch the talk Manish gave at the latest Go Bangalore meetup!

Preface

It all started with needing a memory-bound, concurrent Go cache in Dgraph. We looked around for a solution, but we couldn’t find a great one. We then tried using a sharded map, with shard eviction to release memory, which caused us memory issues. We then repurposed Groupcache’s LRU, using mutex locks for thread safety. After having it around for a year, we noticed that the cache suffered from severe contention. A commit to remove that cache caused our query latency to dramatically improve by 5-10x. In essence, our cache was slowing us down!

We concluded that the concurrent cache story in Go is broken and must be fixed. In March, we wrote about the State of Caching in Go, mentioning the problem of databases and systems requiring a smart memory-bound cache which can scale to the multi-threaded environment Go programs find themselves in. In particular, we set these as the requirements for the cache:

  1. Concurrent
  2. High cache-hit ratio
  3. Memory-bounded (limit to configurable max memory usage)
  4. Scale well as the number of cores and goroutines increases
  5. Scale well under non-random key access distribution (e.g. Zipf).

After publishing the blog post, we built a team to address the challenges mentioned therein and create a Go cache library worthy of being compared to non-Go cache implementations. In particular, we looked at Caffeine, a high-performance, near-optimal caching library based on Java 8. It is used by many Java-based databases, like Cassandra, HBase, and Neo4j. There's an article about the design of Caffeine here.

Ristretto: Better Half of Espresso


We have since read the literature, extensively tested implementations and discussed every variable there is to consider in writing a cache library. And today, we are proud to announce that it is ready for the wider Go community to use and experiment with.

Before we begin explaining the design of Ristretto, here’s a code snippet which shows how to use it:

package main

import (
    "fmt"
    "time"

    "github.com/dgraph-io/ristretto"
)

func main() {
    cache, err := ristretto.NewCache(&ristretto.Config{
        NumCounters: 1e7,     // Num keys to track frequency of (10M).
        MaxCost:     1 << 30, // Maximum cost of cache (1GB).
        BufferItems: 64,      // Number of keys per Get buffer.
    })
    if err != nil {
        panic(err)
    }

    cache.Set("key", "value", 1) // set a value
    // wait for value to pass through buffers
    time.Sleep(10 * time.Millisecond)

    value, found := cache.Get("key")
    if !found {
        panic("missing value")
    }
    fmt.Println(value)
    cache.Del("key")
}

Guiding Principles


Ristretto is built on three guiding principles:

  1. Fast Accesses
  2. High Concurrency and Contention Resistance
  3. Memory Bounding.

In this blog post, we’ll discuss these three principles and how we achieved them in Ristretto.

Fast Access


As much as we love Go and its opinionated stance on features, some of Go's design decisions prevented us from squeezing out all the performance we wanted. The most notable one is Go's concurrency model. Due to the focus on CSP, most other forms of atomic operations have been neglected. This makes it hard to implement lock-free structures that would be useful in a cache library. For example, Go does not provide thread-local storage.
(Note: thread-local storage means that a variable accessed by only one thread needs no locking; even though threads share memory, the variable can be treated as if it lived inside that thread, since no other thread ever touches it.)

At its core, a cache is a hash map with rules about what goes in and what goes out. If the hash map doesn’t perform well, then the whole cache will suffer. As opposed to Java, Go does not have a lockless concurrent hashmap. Instead, thread safety in Go is achieved via explicitly acquiring mutex locks.

We experimented with multiple implementations (using the store interface within Ristretto) and found sync.Map performs well for read-heavy workloads but deteriorates for write workloads. Considering there’s no thread-local storage, we found the best overall performance with sharded mutex-wrapped Go maps. In particular, we chose to use 256 shards to ensure that this would perform well even with a 64-core server.

With a shard based approach, we also needed to find a quick way to calculate which shard a key should go in. This requirement and the concern about long keys consuming too much memory led us to using uint64 for keys, instead of storing the entire key. The rationale was that we’ll need the hash of the key in multiple places and doing it once at entry allowed us to reuse that hash, avoiding any more computation.
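To illustrate the approach (a minimal sketch, not Ristretto's actual store; the shard count, type, and method names are assumptions), a sharded, mutex-wrapped map keyed by the uint64 hash might look like this:

package store

import "sync"

const numShards = 256 // enough shards to spread contention even on a 64-core machine

// shardedMap spreads keys across numShards mutex-protected maps so that
// goroutines touching different shards never contend on the same lock.
type shardedMap struct {
    shards [numShards]struct {
        sync.RWMutex
        data map[uint64]interface{}
    }
}

func newShardedMap() *shardedMap {
    m := &shardedMap{}
    for i := range m.shards {
        m.shards[i].data = make(map[uint64]interface{})
    }
    return m
}

// Get picks the shard directly from the (already hashed) key and takes only
// that shard's read lock.
func (m *shardedMap) Get(key uint64) (interface{}, bool) {
    s := &m.shards[key%numShards]
    s.RLock()
    val, ok := s.data[key]
    s.RUnlock()
    return val, ok
}

func (m *shardedMap) Set(key uint64, val interface{}) {
    s := &m.shards[key%numShards]
    s.Lock()
    s.data[key] = val
    s.Unlock()
}

Because the key is already a uint64 hash, shard selection is just a modulo (or mask) on it, so the hash computed once at the API boundary is reused for both sharding and map lookup.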

To generate a fast hash, we borrowed runtime.memhash from Go Runtime. This function uses assembly code to quickly generate a hash. Note that the hash has a randomizer that is initialized whenever the process starts, which means the same key would not generate the same hash on the next process run. But, that’s alright for a non-persistent cache. In our experiments, we found that it can hash 64-byte keys in under 10ns.

BenchmarkMemHash-32    200000000     8.88 ns/op
BenchmarkFarm-32       100000000    17.9  ns/op
BenchmarkSip-32         30000000    41.1  ns/op
BenchmarkFnv-32         20000000    70.6  ns/op
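For reference, the usual way to reach runtime.memhash from user code is the go:linkname technique sketched below. This is a rough sketch rather than Ristretto's exact code; note that the package also needs an (even empty) assembly file so the compiler accepts the body-less declaration, and newer Go releases place restrictions on linkname use.

package hashutil

import "unsafe"

//go:noescape
//go:linkname memhash runtime.memhash
func memhash(p unsafe.Pointer, h, s uintptr) uintptr

// MemHash hashes a byte slice with the runtime's assembly hash function.
// The seed is randomized at process start, so hashes are not stable across
// restarts -- fine for a non-persistent cache.
func MemHash(data []byte) uint64 {
    if len(data) == 0 {
        return 0
    }
    return uint64(memhash(unsafe.Pointer(&data[0]), 0, uintptr(len(data))))
}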

We then used this hash as not only the stored key but also to figure out the shard the key should go into. This does introduce a chance of key collision; that's something we plan to deal with later.

Concurrency and Contention Resistance


Achieving high hit ratios requires managing metadata about what's present in the cache and what should be present in the cache. This becomes very hard when balancing the performance and scalability of the cache across goroutines. Luckily, there's a paper called BP-Wrapper that describes a system framework for making any replacement algorithm almost lock contention-free. The paper describes two ways to mitigate contention: prefetching and batching. We only use batching.
(Note: prefetching means doing as much contention-free work as possible before acquiring the lock, so that only the operations that can actually contend are performed while holding it, shortening the critical section.)
(Note: batching means that instead of acquiring the lock once per operation, a batch of potentially contending operations is collected and processed under a single lock acquisition, reducing how often the lock is taken.)

Batching works pretty much how you’d think. Rather than acquiring a mutex lock for every metadata mutation, we wait for a ring buffer to fill up before we acquire a mutex and process the mutations. As described in the paper, this lowers contention considerably with little overhead.

We apply this method for all Gets and Sets to the cache.

Gets

All Gets to the cache are, of course, immediately serviced. The hard part is to capture the Get, so we can keep track of the key access. In an LRU cache, typically a key would be placed at the head of a linked list. In our LFU based cache, we need to increment an item's hit counter. Both operations require thread-safe access to a cache global struct. BP-Wrapper suggests using batching to process the hit counter increments, but the question is how to implement this batching without acquiring yet another lock.

This might sound like a perfect use case of Go channels, and it is. Unfortunately, the throughput performance of channels prevented us from using them. Instead, we devised a nifty way to use sync.Pool to implement striped, lossy ring buffers that have great performance with little loss of data.

Any item stored in the Pool may be removed automatically at any time without notification. That introduces one level of lossy behavior. Each item in Pool is effectively a batch of keys. When the batch fills up, it gets pushed to a channel. The channel size is deliberately kept small to avoid consuming too many CPU cycles to process it. If the channel is full, the batch is dropped. This introduces a secondary level of lossy behavior. A goroutine picks up this batch from the internal channel and processes the keys, updating their hit counter.

AddToLossyBuffer(key):
    stripe := b.pool.Get().(*ringStripe)
    stripe.Push(key)
    b.pool.Put(stripe)

Once buffer fills up, push to channel:
    select {
    case p.itemsCh <- keys:
        p.stats.Add(keepGets, keys[0], uint64(len(keys)))
        return true
    default:
        p.stats.Add(dropGets, keys[0], uint64(len(keys)))
        return false
    }

p.itemsCh processing:
    func (p *tinyLFU) Push(keys []uint64) {
        for _, key := range keys {
            p.Increment(key)
        }
    }

The performance benefits of using a sync.Pool over anything else (slices, striped mutexes, etc.) are mostly due to the internal usage of thread-local storage, something not available as a public API to Go users.
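The ringStripe type referenced in the snippet above is not shown there. Below is a minimal sketch of how such a striped, lossy buffer could be built on sync.Pool; the field names and the consumer callback are assumptions, not Ristretto's exact API.

package buffer

import "sync"

// ringStripe buffers keys until it reaches capacity, then hands the whole
// batch to a consumer. Because stripes live in a sync.Pool, goroutines on
// different Ps usually get different stripes and never contend.
type ringStripe struct {
    consumer func(keys []uint64) bool // returns false if the batch was dropped
    data     []uint64
    capacity int
}

func (s *ringStripe) Push(key uint64) {
    s.data = append(s.data, key)
    if len(s.data) >= s.capacity {
        // Hand off the full batch; the consumer may drop it if it is busy.
        if s.consumer(s.data) {
            s.data = make([]uint64, 0, s.capacity)
        } else {
            s.data = s.data[:0]
        }
    }
}

// ringBuffer is the lossy, striped buffer: the pool may garbage collect a
// stripe (and the keys it holds) at any time, which is the first level of
// lossiness described above.
type ringBuffer struct {
    pool *sync.Pool
}

func newRingBuffer(consumer func([]uint64) bool, capacity int) *ringBuffer {
    return &ringBuffer{
        pool: &sync.Pool{
            New: func() interface{} {
                return &ringStripe{
                    consumer: consumer,
                    data:     make([]uint64, 0, capacity),
                    capacity: capacity,
                }
            },
        },
    }
}

// Push records one Get. Fetching a stripe from the pool rides on the
// runtime's per-P (roughly thread-local) storage, which is what makes it cheap.
func (b *ringBuffer) Push(key uint64) {
    stripe := b.pool.Get().(*ringStripe)
    stripe.Push(key)
    b.pool.Put(stripe)
}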

Sets

The requirements for the Set buffer are slightly different from those for Gets. In Gets, we buffer up the keys, only processing them once the buffer fills up. In Sets, we want to process the keys as soon as possible. So, we use a channel to capture the Sets, dropping them on the floor if the channel is full to avoid contention. A couple of background goroutines pick Sets from the channel and process them.

This approach, as with Gets, is designed to optimize for contention resistance. But it comes with a few caveats, described below.

select {
case c.setBuf <- &item{key: hash, val: val, cost: cost}:
    return true
default:
    // drop the set and avoid blocking
    c.stats.Add(dropSets, hash, 1)
    return false
}
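The consumer side of setBuf is not shown above. A rough, simplified sketch of the background goroutine that drains it might look like the following; the policy and store interfaces here are stand-ins for Ristretto's internals, not its real API.

package cache

// item mirrors the shape used in the snippet above.
type item struct {
    key  uint64
    val  interface{}
    cost int64
}

// policy decides admission and picks victims; store holds the actual values.
type policy interface {
    // Add returns the victims to evict and whether the item was admitted.
    Add(key uint64, cost int64) ([]uint64, bool)
}

type store interface {
    Set(key uint64, val interface{})
    Del(key uint64)
}

type cache struct {
    setBuf chan *item
    policy policy
    store  store
}

// processItems runs in a background goroutine, draining setBuf and applying
// the buffered Sets: the policy decides admission and chooses victims, and
// the store is updated accordingly.
func (c *cache) processItems() {
    for i := range c.setBuf {
        victims, added := c.policy.Add(i.key, i.cost)
        if added {
            c.store.Set(i.key, i.val)
        }
        for _, victim := range victims {
            c.store.Del(victim)
        }
    }
}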

Caveats


Sets in Ristretto are queued into a buffer, control is returned to the caller, and the buffer is then applied to the cache. This has two side effects:

  1. There is no guarantee that a set would be applied. It could be dropped immediately to avoid contention or could be rejected later by the policy.

  2. Even if a Set gets applied, it might take a few milliseconds after the call has returned to the user. In database terms, it is an eventual consistency model.

If, however, a key is already present in the cache, Set would update the key immediately. This is to avoid a cached key holding a stale value.
(Note: in other words, Sets for newly created keys are only eventually consistent, while Sets for existing keys take effect immediately.)

Contention Resistance


Ristretto is optimized for contention resistance. This performs really well under heavy concurrent load, as we’ll see with throughput benchmarks below. However, it would lose some metadata in exchange for better throughput performance.

Interestingly, that information loss doesn’t hurt our hit ratio performance because of the nature of key access distributions. If we do lose metadata, it is generally lost uniformly while the key access distribution remains non-uniform. Therefore, we still achieve high hit ratios and the hit ratio degradation is small as shown by the following graph.

Memory Bounding


Key Cost


An infinitely large cache is practically impossible. A cache must be bounded in size. Many cache libraries would consider cache size to be the number of elements. We found that approach naive. Surely it works in a workload where values are of identical size. Most workloads, however, have variable-sized values. One value could cost a few bytes, another a few kilobytes and yet another, a few megabytes. Treating them as having the same memory cost isn’t realistic.

In Ristretto, we attach a cost to every key-value. Users can specify what that cost is when calling Set. We count this cost against the MaxCost of the cache. When the cache is operating at capacity, a heavy item could displace many lightweight items. This mechanism is nice in that it works well for all different workloads, including the naive approach where each key-value costs 1.
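As a usage sketch (the Set signature follows the example at the top of this post; the helper name is ours), a caller caching variable-sized blobs might charge each entry its length in bytes, effectively turning MaxCost into a byte budget:

package blobcache

import "github.com/dgraph-io/ristretto"

// cacheBlob charges each entry its size in bytes, so the cache's MaxCost
// acts as a byte budget rather than an entry count.
func cacheBlob(cache *ristretto.Cache, key string, blob []byte) bool {
    return cache.Set(key, blob, int64(len(blob)))
}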

Admission Policy via TinyLFU


“What should we let into the cache?”

is answered by the admission policy. The goal, obviously, is to let in new items if they are more “valuable” than the current items. However, this becomes a challenge if you consider the overhead (latency and memory) required to track relevant item information pertaining to the “value” question.

Despite being a well-documented strategy for increasing hit ratios, most Go cache libraries have no admission policy at all. In fact, many LRU eviction implementations assume the latest key as the most valuable.

Moreover, most of the Go cache libraries use pure LRU or an approximation of LRU as their eviction policy. The quality of LRU approximation notwithstanding, some workloads are just better suited to LFU eviction policies. We’ve found this to be the case in our benchmarks using various traces.

For our admission policy, we looked at a new and fascinating paper called TinyLFU: A Highly Efficient Cache Admission Policy. At a very high level, TinyLFU provides three methods:

  • Increment(key uint64) - increments the access counter for a key
  • Estimate(key uint64) int (referred as ɛ) - estimates the access frequency of a key
  • Reset - resets the counters

The paper explains it best, but TinyLFU is an eviction-agnostic admission policy designed to improve hit ratios with very little memory overhead. The main idea is to only let in a new item if its estimate is higher than that of the item being evicted. We implemented TinyLFU in Ristretto using a Count-Min Sketch. It uses 4-bit counters to approximate the frequency of access for the item (ɛ). This small cost per key allows us to keep track of a much larger sample of the global keyspace than would be possible using a normal key-to-frequency map.

TinyLFU also maintains the recency of key access by a Reset function. After N key increments, the counters get halved. So, a key that has not been seen for a while would have its counter get reset to zero; paving the way for more recently seen keys.
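To make the idea concrete, here is a simplified sketch (not Ristretto's implementation) of a Count-Min Sketch with 4-bit counters packed two per byte, including the halving Reset just described. The per-row XOR "hash" and seed handling are deliberate simplifications.

package tinylfu

import "math/rand"

// cmRow is one row of a Count-Min Sketch: 4-bit counters packed two per byte.
type cmRow []byte

func newCmRow(numCounters uint64) cmRow { return make(cmRow, numCounters/2) }

// increment bumps the 4-bit counter at position n, saturating at 15.
func (r cmRow) increment(n uint64) {
    idx, shift := n/2, (n&1)*4
    if v := (r[idx] >> shift) & 0x0f; v < 15 {
        r[idx] += 1 << shift
    }
}

// get returns the 4-bit counter at position n.
func (r cmRow) get(n uint64) byte {
    return (r[n/2] >> ((n & 1) * 4)) & 0x0f
}

// reset halves both 4-bit counters in every byte, decaying old history.
func (r cmRow) reset() {
    for i := range r {
        r[i] = (r[i] >> 1) & 0x77
    }
}

// sketch is a small Count-Min Sketch: a few rows indexed with different
// seeds; Estimate takes the minimum across rows to bound over-counting.
type sketch struct {
    rows  [4]cmRow
    seeds [4]uint64
    mask  uint64 // numCounters - 1; numCounters must be a power of two
}

func newSketch(numCounters uint64) *sketch {
    s := &sketch{mask: numCounters - 1}
    for i := range s.rows {
        s.rows[i] = newCmRow(numCounters)
        s.seeds[i] = rand.Uint64()
    }
    return s
}

func (s *sketch) Increment(key uint64) {
    for i := range s.rows {
        // XOR with a per-row seed is a crude stand-in for a proper per-row hash.
        s.rows[i].increment((key ^ s.seeds[i]) & s.mask)
    }
}

func (s *sketch) Estimate(key uint64) int64 {
    min := byte(15)
    for i := range s.rows {
        if v := s.rows[i].get((key ^ s.seeds[i]) & s.mask); v < min {
            min = v
        }
    }
    return int64(min)
}

// Reset implements TinyLFU's decay: after enough increments, halve everything.
func (s *sketch) Reset() {
    for i := range s.rows {
        s.rows[i].reset()
    }
}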

Eviction Policy via Sampled LFU


When the cache reaches capacity, every incoming key should displace one or more keys present in the cache. Not only that, the ɛ of the incoming key should be higher than the ɛ of the key being evicted. To find a key with a low ɛ, we used the natural randomness provided by Go map iteration to pick a sample of keys and loop over them to find the key with the lowest ɛ.

We then compare the ɛ of this key against the incoming key. If the incoming key has a higher ɛ, then this key gets evicted (eviction policy). Otherwise, the incoming key is rejected (admission policy). This mechanism is repeated until the incoming key’s cost can be fit into the cache. Thus, a single incoming key may displace more than one key. Note that the cost of the incoming key does not play a factor in choosing the eviction keys.

With this approach, the hit ratios are within 1% of the exact LFU policies for a variety of workloads. This means we get the benefits of admission policy, conservative memory usage, and lower contention in the same little package.

// Snippet from the Admission and Eviction Algorithm
incHits := p.admit.Estimate(key)
for ; room < 0; room = p.evict.roomLeft(cost) {
    sample = p.evict.fillSample(sample)
    minKey, minHits, minId := uint64(0), int64(math.MaxInt64), 0
    for i, pair := range sample {
        if hits := p.admit.Estimate(pair.key); hits < minHits {
            minKey, minHits, minId = pair.key, hits, i
        }
    }
    if incHits < minHits {
        p.stats.Add(rejectSets, key, 1)
        return victims, false
    }
    p.evict.del(minKey)
    sample[minId] = sample[len(sample)-1]
    sample = sample[:len(sample)-1]
    victims = append(victims, minKey)
}
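The fillSample helper above is not shown in the snippet. A minimal sketch of how Go's randomized map iteration can supply the sample (the field, type names, and sample size here are assumptions) might be:

package policy

const sampleSize = 5 // number of candidate keys to consider per eviction round

// pair couples a key with the cost recorded for it.
type pair struct {
    key  uint64
    cost int64
}

type sampledLFU struct {
    keyCosts map[uint64]int64
}

// fillSample tops the slice up to sampleSize entries. Go map iteration order
// is randomized, so simply ranging over the map yields a cheap random sample
// without any extra bookkeeping.
func (e *sampledLFU) fillSample(in []pair) []pair {
    if len(in) >= sampleSize {
        return in
    }
    for key, cost := range e.keyCosts {
        in = append(in, pair{key, cost})
        if len(in) >= sampleSize {
            return in
        }
    }
    return in
}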

DoorKeeper

Before we place a new key in TinyLFU, Ristretto uses a bloom filter to first check if the key has been seen before. Only if the key is already present in the bloom filter, is it inserted into the TinyLFU. This is to avoid polluting TinyLFU with a long tail of keys that are not seen more than once.

When calculating ɛ of a key, if the item is included in the bloom filter, its frequency is estimated to be the Estimate from TinyLFU plus one. During a Reset of TinyLFU, the bloom filter is also cleared out.
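A simplified sketch of how the doorkeeper check could wrap the counter updates, building on the Count-Min sketch shown earlier (the bloom filter is reduced to an assumed interface here, and the names are ours):

package tinylfu

// doorkeeper is the interface the admission policy needs from the bloom
// filter; a real implementation would be an actual bloom filter.
type doorkeeper interface {
    Has(key uint64) bool
    Add(key uint64)
    Clear()
}

// admissionPolicy glues the doorkeeper to the Count-Min sketch: first-time
// keys stop at the bloom filter, only repeat keys reach the sketch.
type admissionPolicy struct {
    door doorkeeper
    freq *sketch
}

func (p *admissionPolicy) Increment(key uint64) {
    if !p.door.Has(key) {
        p.door.Add(key) // first sighting: remember it, don't count it yet
        return
    }
    p.freq.Increment(key)
}

// Estimate adds one for the access that the doorkeeper absorbed.
func (p *admissionPolicy) Estimate(key uint64) int64 {
    hits := p.freq.Estimate(key)
    if p.door.Has(key) {
        hits++
    }
    return hits
}

// Reset decays the sketch and clears the doorkeeper along with it.
func (p *admissionPolicy) Reset() {
    p.freq.Reset()
    p.door.Clear()
}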

Metrics

While optional, it is important to understand how a cache is behaving. We wanted to ensure that tracking metrics related to the cache is not only possible, but that the overhead of doing so is low enough to be turned on and kept on.

Beyond hits and misses, Ristretto tracks metrics like keys and their cost being added, updated and evicted, Sets being dropped or rejected, and Gets being dropped or kept. All these numbers help understand the cache behavior on various workloads and pave the way for further optimizations.

We initially used atomic counters for these. However, the overhead was significant. We narrowed the cause down to False Sharing. Consider a multi-core system, where different atomic counters (8 bytes each) fall in the same cache line (typically 64 bytes). Any update made to one of these counters causes the others to be marked invalid. This forces a cache reload for all the other cores holding that cache line, thus creating write contention on it.

To achieve scalability, we ensure that each atomic counter completely occupies a full cache line, so every core is working on a different cache line. Ristretto does this by allocating 256 uint64s for each metric, leaving 9 unused uint64s between each active uint64. To avoid extra computation, the key hash is reused to determine which uint64 to increment.

Add:
    valp := p.all[t]
    // Avoid false sharing by padding at least 64 bytes of space between two
    // atomic counters which would be incremented.
    idx := (hash % 25) * 10
    atomic.AddUint64(valp[idx], delta)

Read:
    valp := p.all[t]
    var total uint64
    for i := range valp {
        total += atomic.LoadUint64(valp[i])
    }
    return total

When reading the metric, all the uint64s are read and summed up to get the latest number. With this approach, metrics tracking only adds around 10% overhead to the cache performance.
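The allocation of those padded counters is not shown above. Under the layout the snippet assumes (25 active slots, each 10 uint64s apart, within 256 slots per metric), it could look roughly like the sketch below; the metric names and constructor are illustrative.

package metrics

import "sync/atomic"

type metricType int

const (
    hit metricType = iota
    miss
    keepGets
    dropGets
    dropSets
    rejectSets
    doNotUse // number of metric types
)

// metrics keeps 256 uint64s per metric type, backed by one contiguous block.
// Only every 10th slot is incremented (25 active counters), so two active
// counters are always at least 80 bytes apart and never share a cache line.
type metrics struct {
    all [doNotUse][]*uint64
}

func newMetrics() *metrics {
    m := &metrics{}
    for t := range m.all {
        backing := make([]uint64, 256) // one contiguous block per metric
        slots := make([]*uint64, 256)
        for i := range slots {
            slots[i] = &backing[i]
        }
        m.all[t] = slots
    }
    return m
}

func (m *metrics) Add(t metricType, hash, delta uint64) {
    // Same indexing as the snippet above: spread updates over 25 padded slots.
    idx := (hash % 25) * 10
    atomic.AddUint64(m.all[t][idx], delta)
}

func (m *metrics) Get(t metricType) uint64 {
    var total uint64
    for _, v := range m.all[t] {
        total += atomic.LoadUint64(v)
    }
    return total
}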

Benchmarks

Now that you understand the various mechanisms present in Ristretto, let’s look at the Hit ratio and Throughput benchmarks compared to other popular Go caches.

Hit Ratios

Hit ratios were measured using Damian Gryski’s cachetest along with our own benchmarking suite. The hit ratio numbers are the same across both utilities, but we added the ability to read certain trace formats (LIRS and ARC, specifically) as well as CSV output for easier graphing. If you want to write your own benchmarks or add a trace format, check out the sim package.

To get a better idea of the room for improvement, we added a theoretically optimal cache implementation, which uses future knowledge to evict items with the least amount of hits over its entire lifetime. Note that this is a clairvoyant LFU eviction policy, where other clairvoyant policies may use LRU. Depending on the workload, LFU or LRU may be better suited, but we found clairvoyant LFU useful for comparisons with Ristretto’s Sampled LFU.

Search

This trace is described as “disk read accesses initiated by a large commercial search engine in response to various web search requests.”

Database

This trace is described as “a database server running at a commercial site running an ERP application on top of a commercial database.”

Looping

This trace demonstrates a looping access pattern. We couldn’t include Fastcache, Freecache, or Bigcache implementations in this and the following benchmark because they have minimum capacity requirements that would skew the results. Some trace files are small and require small capacities for performance measurements.

CODASYL

This trace is described as “references to a CODASYL database for a one hour period.” Note that Ristretto’s performance suffers in comparison to the others here. This is because of the LFU eviction policy being a bad fit for this workload.

Throughput

Throughput was measured using the same utility as the previous blog post, which generates a large number of keys and alternates between goroutines for Getting and Setting according to the workload.

All throughput benchmarks were run on an Intel Core i7-8700K (3.7GHz) with 16 GB of RAM.

Mixed: 25% Writes, 75% Reads

Read: 100% Reads

Write: 100% Writes

Future Improvements

As you may have noticed in the CODASYL benchmarks, Ristretto’s performance suffers in LRU-heavy workloads. However, for most workloads, our Sampled LFU policy performs quite well. The question then becomes “How can we get the best of both worlds?”

In a paper called Adaptive Software Cache Management, this exact question is explored. The basic idea is placing an LRU “window” before the main cache segment, and adaptively sizing that window using hill-climbing techniques to maximize the hit ratio. Caffeine has already seen great results by doing this, something we believe Ristretto can benefit from as well in the future.

Special Thanks

We would like to sincerely thank Ben Manes. His depth of knowledge and dedicated, selfless sharing has been a large factor in any progress we’ve made and we are honored to have had many conversations with him about all things caching. Ristretto would just not have been possible without his guidance, support and 99.9% availability on our internal Slack channel.

We would also like to thank Damian Gryski for his help with benchmarking Ristretto and writing a reference TinyLFU implementation.

Conclusion

We set out with the goal of making a cache library competitive with Caffeine. While not completely there, we did create something significantly better than most others in the Go world at the moment by using some new techniques that others can learn from.

Some initial experiments with using this cache in Dgraph are looking promising. And we hope to integrate Ristretto into both Dgraph and Badger in the upcoming months. Do check it out and perhaps use Ristretto to speed up your workloads!


We are building a fast, transactional and distributed graph database. If you like what you read above, come join the team!

Get started with Dgraph https://docs.dgraph.io
Star us on GitHub https://github.com/dgraph-io/dgraph
Ask us questions https://discuss.dgraph.io

Fortune 500 companies are using Dgraph in production. If you need production support, talk to us.

Top Image: Blue Bottle Coffee Espresso Preparation Guide.
