Increasingly, SoCs contain multicore processors, multiple separate processors, accelerators, and high-performance DMA devices. They also have cache memories, memories local to a block or core that are used to improve performance. This causes a big problem that goes under the name cache coherence. For example, you might have heard of cache-coherent interconnect and wondered what it is. There are many books that provide a gentle introduction to a technical subject for non-specialists, such as Programming for Poets, Physics for Poets, or, for the latest up-to-the-minute hot topic, TensorFlow for Poets. This post is my contribution to the genre.

The standard analogy for a cache is a desk and a bookcase. The desk can hold only a few books, but they are right there at your fingertips and you can look at them quickly. The bookcase holds many more books, but you have to get up and walk across the room, which is much slower. Cache memories are similar, smaller but faster than main memory. With both memory and books, the assumption is that it makes sense to put something where it can be accessed fast, because you will probably need it again soon.

A Brief History of Caches

Let's start with the simplest of caches. It is a little unclear exactly which was the first cache implementation, but the first one that I know about is the Titan computer at the Cambridge Computer Laboratory, where I was an undergraduate. This was a prototype of the Ferranti Atlas 2 computer and was operational from 1964. Initially it had no cache, but a 32-word instruction cache was added. Today, we would call it a one-way associative (direct-mapped) cache.

The way it worked was simple: each cycle, the low-order five bits of the instruction address were used to index a particular row of the cache. The high-order bits were then compared with the address side of the cache. If they matched, that meant that the instruction was already in the cache and it would be loaded from there. If the high-order bits did not match, then two things happened. First, the instruction was fetched from the much slower ferrite core memory (and executed). Second, the instruction was stored in the cache, in the row indexed by the low-order five bits of the instruction address, and the high-order bits were stored in the address side of the cache. This meant that if the instruction was executed again (before a different instruction with the same low-order five address bits) then it would be loaded from the fast cache instead of the slow core memory (a small code sketch of this lookup appears below).

At first glance, it might seem that a 32-word cache would be too small to make any meaningful difference to performance. However, the design meant that any loop of fewer than 32 words would execute completely out of the cache (assuming that the instructions were stored in consecutive addresses). Titan was not microcoded, so operations such as clearing or copying an area of memory were done with instruction loops and would run fast. The same applied to tight inner loops of operations like searching for a value in an array. These operations might represent only a small amount of the static code, but dynamically they could account for a lot of the instructions executed.

Caches then developed in several dimensions. First, in addition to caching instructions they would cache data too, initially with separate instruction and data caches, since otherwise the data being accessed would push out the instructions and loops would no longer be cached.
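To make that one-way (direct-mapped) lookup concrete, here is a minimal sketch of a Titan-style 32-row instruction cache, written in Python for readability. The class and names are my own invention for illustration; the real machine did all of this in parallel hardware, not software.

    class DirectMappedCache:
        def __init__(self, rows=32):
            self.rows = rows
            self.tags = [None] * rows   # the "address side" of the cache
            self.data = [None] * rows   # the cached instruction words

        def fetch(self, address, core_memory):
            row = address % self.rows    # low-order five bits pick the row
            tag = address // self.rows   # high-order bits form the tag
            if self.tags[row] == tag:
                return self.data[row]    # hit: instruction comes from the cache
            # Miss: fetch from the slow core memory, then remember the word in
            # this row, evicting whatever instruction happened to be there.
            word = core_memory[address]
            self.tags[row] = tag
            self.data[row] = word
            return word

A loop that fits in 32 consecutive words never has two instructions competing for the same row, so after the first iteration every fetch is a hit, which is exactly why such a tiny cache made a difference.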
Caches became multi-way, meaning that the low-order bits of the address would still determine the cache row, but there would then be several candidate values retained in that row, each with its own high-order bits. If any of the high-order bits matched, then the corresponding data would be loaded directly from the cache. If not, it would be fetched from memory. A new complexity was deciding which of the several values to evict from the cache, but that is another topic.

When microprocessors came along, and as semiconductor technology advanced enough that chips had room for both a processor and a cache, multi-level caches appeared. The smallest but fastest cache would be on the same chip as the microprocessor itself, known as the level-1 cache (often written L1$, a play on cache sounding like cash, for which $ is the symbol). A fast static memory would hold a larger cache, the level-2 cache (L2$), backed by the slow DRAM main memory.

The problems that would show up as cache coherence started to occur with multi-level caches. If a DMA device like a disk controller needed to write a block out, the correct values to write might not be in the DRAM that the disk controller would access. One way to handle that was to ensure that any memory writes by the processor were written immediately to DRAM as well as to the cache. Writes to memory were much rarer than reads, and doing this did not impact performance much. Alternatively, the cache could be flushed before any DMA transfer to ensure that DRAM held the accurate values. As caches got larger, this became less appropriate, since most of the values being flushed were not ones that the DMA transfer was going to reference anyway, but DMA transfers were occasional events occurring a few dozen times per second at most, so it didn't really matter.

Multicore

When multicore processors arrived, the typical architecture had each core with its own L1 cache and a shared L2 cache. This created a new problem, since the correct value corresponding to a given address might be in either, or neither, of the L1 caches. If core 0 (hardware engineers like to count from zero) accessed a given address, and the correct value was in the L1 cache for core 1, then it would get the wrong value. Writing all stores through the L1 caches to the L2 cache would mean that the L2 cache was always up to date. But more importantly, the other caches could watch the bus, see the write to memory, and either load the new value or at least invalidate the stale value that they held. This is known as snooping (a small sketch of this appears below).

This architecture assumes that the multicore processor has been designed as a whole, with all the various sub-blocks cognizant of the rest of the system. Each L1 cache knew which bus to watch, all the cores ran the same operating system, which knew when a DMA transfer was happening, and so forth. But as systems became more complex, these assumptions broke down. In particular, having to design every component on the assumption that it was part of a complete pre-determined system made IP-based design impossible. It had to be possible to design a block of IP separately, so that it would work in any environment, without requiring it to be reconfigured to account for every other block on the rest of the chip. However, the problem of making sure that every access to memory from every core or block read the correct value remained.

To give you an idea of how complex the architectures can get, here is an example of an Arm system put together with cache-coherent interconnect. This dates from 2014; architectures have only gotten more complex since then.
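Before getting to how the interconnect solves this, here is a minimal sketch, again in Python, of the snooping idea just described: every write is broadcast on the shared bus, and the other L1 caches watch the bus and invalidate any stale copy they hold. The class and method names are illustrative only and do not correspond to any real bus protocol.

    class Bus:
        def __init__(self):
            self.caches = []

        def attach(self, cache):
            self.caches.append(cache)

        def broadcast_write(self, writer, address):
            # Every other cache "snoops" the write and drops its stale copy.
            for cache in self.caches:
                if cache is not writer:
                    cache.snoop(address)

    class SnoopingL1Cache:
        def __init__(self, bus):
            self.lines = {}   # address -> value currently held in this L1
            self.bus = bus
            bus.attach(self)

        def write(self, address, value):
            self.lines[address] = value              # update our own copy
            self.bus.broadcast_write(self, address)  # write through and snoop

        def snoop(self, address):
            self.lines.pop(address, None)            # our copy is now stale

    bus = Bus()
    core0, core1 = SnoopingL1Cache(bus), SnoopingL1Cache(bus)
    core0.lines[0x40] = 1   # both cores have previously read address 0x40
    core1.lines[0x40] = 1
    core0.write(0x40, 2)    # core 0 writes it; core 1's copy is invalidated
    assert 0x40 not in core1.lines

Each core can then trust its own L1 cache, as long as every block on the chip plays by these rules; keeping that promise is what moves into the cache-coherent interconnect described next.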
Cache-Coherent Interconnect

The solution was in the interconnect. The concept of snooping, in which each cache "listened" to the bus and saw other caches' updates, was extended. Individual cores and blocks would simply assume that their cache was always up to date. It was the job of the interconnect to be like Picard's crew and "make it so." To make this work with reasonable performance, each line stored in each cache needed to be in one of several states:

Modified: this cache holds the only valid value for this cache line (address)
Exclusive: this cache has the only copy of the line, but it is unmodified
Shared: this cache is one of several that has a copy of this line, and it matches main memory
Invalid: this line is not valid and cannot be used

This is the MESI cache-coherence protocol (from the initials). I won't run through all the transitions, but the biggest one is that when a line needs to be written while it is in the shared state, the line being written has to move to the modified state, and the equivalent cache lines in all the other caches have to become invalid (a small sketch of this transition appears at the end of the post). Only certain combinations of states are valid for the same line in two different caches: if one cache holds a line as modified or exclusive, every other cache must hold it as invalid, and only shared and invalid copies can coexist. It is the job of the cache-coherent interconnect, working behind the scenes, to ensure that two caches never get into a forbidden combination.

An additional Owned state is sometimes added, giving the MOESI protocol, when writes are not automatically written through to memory. Ownership of the line can then move between caches without updating main memory. The data values are transferred by the cache-coherent interconnect, which has to become even more complex, transferring data between caches and not just state-update messages.

Verification

Even without going into all the details, I think it is obvious that verifying that the cache-coherent interconnect does its job correctly is tricky. Indeed, it is considered one of the most complicated problems in all of verification. It is so complicated that the protocol itself needs to be verified, even before worrying about whether it is correctly implemented. Signals don't travel instantly from cache to cache, so there have to be acknowledgements, and while a cache is waiting for an answer to its own question, it still has to respond to other caches' queries.

I will cover this in more detail in tomorrow's post in the context of CCIX, the cache-coherent interconnect for accelerators. Or you can look at my post Decoding Formal Club: Arm and Arteris, where Oski and Arteris used JasperGold to verify the Arteris cache-coherent interconnect. I have seen four presentations over the last couple of years on verifying cache-coherent interconnect (from Arm, NVIDIA, Arteris, and Cadence) and none of them escaped without discovering at least one bug in the protocol itself, even before getting to the actual implementation at the RTL level. You can see why Vigyan Singhal of Oski said, "nobody is getting cache-coherent interconnect out without formal".

For More Information

There is a JasperGold Formal Verification Platform page. For more general information on cache architecture, multiprocessors, and related topics, my go-to book is always the latest edition of "Hennessy and Patterson", Computer Architecture: A Quantitative Approach. But unless you are obsessive, don't buy the current (5th) edition, since a new edition is coming out in December.
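As a postscript for anyone who wants to play with the ideas, here is a minimal sketch in Python of the single MESI transition described above, a write to a line held in the shared state. It models only the state bookkeeping; the messages, timing, and acknowledgements that make real interconnect verification so hard are entirely omitted, and none of the names correspond to any vendor's API.

    from enum import Enum

    class State(Enum):
        MODIFIED = "M"
        EXCLUSIVE = "E"
        SHARED = "S"
        INVALID = "I"

    def write_line(writer, other_caches, address):
        # The writing cache's line becomes Modified; every other cache's copy
        # of the same line must become Invalid.
        writer[address] = State.MODIFIED
        for cache in other_caches:
            if cache.get(address, State.INVALID) is not State.INVALID:
                cache[address] = State.INVALID

    # Example: two cores hold line 0x40 as Shared, then core 0 writes it.
    core0 = {0x40: State.SHARED}
    core1 = {0x40: State.SHARED}
    write_line(core0, [core1], 0x40)
    assert core0[0x40] is State.MODIFIED and core1[0x40] is State.INVALID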
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.