Z-Core: Performance Analysis
Benchmarking Z-Core — DOOM FPS across cache configurations, then CoreMark/MHz, STREAM and pointer-chase results.
Benchmarking has always played a key role in developing and analyzing the performance of computing systems. I personally like to refer to it as the communication language between engineers and end-users. This concept applies not only to computer architecture but to countless other fields. You cannot simply explain that a processor features out-of-order execution, a high-performance 4-way set-associative cache, a Gshare branch predictor, and many other fancy features to someone who just wants the hardware for playing games or 3D rendering. This is where benchmarking comes into play. It creates a common understanding of performance between both parties, but it is imperative that everyone agrees on one key factor: performance metrics.
When these metrics are closely tied to the functioning of a system, quantifying its performance becomes much easier. Metrics can be direct measurements of hardware features, such as FLOPS or memory bandwidth. Alternatively, they can lean toward a more abstract approach, like GeekBench or AnTuTu, which lie in the higher abstraction layer of computing systems benchmarks. These synthetic benchmarks report a number that is supposed to correctly represent the performance of the device. On the one hand, this makes it easier to rank systems, while offering a unified score that is independent of specific system configurations and very easy to run. However, they might not accurately reflect real-world performance, especially since any vendor can tune its processor to excel specifically at the benchmark’s task.
For this performance analysis, I have decided to follow a more entertaining and visual approach for Z-Core. I will be showing the FPS (Frames Per Second) of DOOM across different Z-Core configurations by varying the presence and sizes of the caches. After this, I will present the results for CoreMark/MHz, STREAM, and pointer chase benchmarks, which will make this post look a bit more serious :).
For those who do not know Z-Core, it is a 5-stage pipelined RISC-V processor implementing the RV32IMZicsr ISA. The processor features multiple peripherals, including a Timer, GPIOs, a UART, and a VGA controller. Additionally, one of the main features that recently pushed Z-Core’s performance to the next level is the memory subsystem and the two caches it incorporates: a 32KiB direct-mapped instruction cache, and a 2-way set-associative data cache with writeback and write-allocate policies and LRU replacement. For further information on the Z-Core microarchitecture and design choices, check out the Z-Core detailed description.
When DOOM was first run on Z-Core, the processor design had no instruction or data caches. The program was directly running from main memory of the DE10-Lite board, which corresponds to a 64MB SDRAM running at 100MHz. This creates a massive bottleneck in the execution of the game, as performance is completely dominated by the high latency of the AXI4-Lite transaction plus the latency of the SDRAM itself. This bottleneck is translated into a performance of less than one FPS, which made DOOM unplayable.
Introducing Caches to the Design
When a data cache is added to Z-Core, it is game-changing (pun intended!): it delivers a 5x increase in FPS, reaching 15-20 frames per second when running DOOM. A summary of the results is shown in Figure 1.
Such a large data cache is feasible in my design thanks to the capacity of the FPGA itself. When design constraints are tighter, however, a designer might have to move to a smaller cache subsystem. For this reason, I tested how Z-Core performs on DOOM (running at a 320x200 resolution) when varying the size of the data cache while keeping a 32KiB instruction cache.
Evaluating Z-Core with a larger data cache might have yielded better performance. However, due to limited hardware resources, 32KiB is the maximum size available for testing.
CoreMark benchmark
Moving away from DOOM, Z-Core was also evaluated with the CoreMark benchmark. EEMBC’s CoreMark is a simple yet sophisticated benchmark designed specifically to test the functionality of a processor core [1]. This makes it possible to directly compare the performance of Z-Core with other processors, such as the PicoRV32 RISC-V softcore [5].
Z-Core achieves 3.06 CoreMark/MHz, which places it between an Arm Cortex-M0 and a Cortex-M3/4 (Figure 3). This makes Z-Core a good fit for embedded and low-power applications. Despite these results, it is important to note that Z-Core is an academic personal project and does not compete with industry-standard processors in terms of reliability and ecosystem.
Memory Bandwidth and Access Latency
Finally, a lower-level approach to benchmarking the Z-Core memory system is presented, providing key memory metrics such as bandwidth and latency in order to better understand the performance of the SoC.
First, the results of the STREAM benchmark are presented for both main-memory and L1 bandwidth (Figure 4). STREAM is a benchmark that measures the sustained memory bandwidth of a processor [2]. By measuring the bandwidth of Z-Core, we can better estimate its performance for memory-bound applications. STREAM was chosen over lmbench [3] and Mess [4] because of its simplicity and easier portability to Z-Core.
Figure 4 — STREAM sustained bandwidth: L1 data cache vs main memory.
These results give us key information about the performance of the memory system, but I wanted to go a step further and dive into the Z-Core pipeline to understand the values reported by the benchmark.
STREAM cache — single-cycle ideal throughput
When the working set fits in the data cache, every lw/sw hits and completes in a single cycle. With branch prediction working and the pipeline fully fed, the inner loop sustains a throughput of 1 instruction/cycle, so we can read the ideal number of cycles per loop iteration straight off the assembly and turn it into a bandwidth. For a loop issuing m memory operations (4 bytes each) every c cycles at 50 MHz (20 ns/cycle), the ideal bandwidth is (m × 4 B) / (c × 20 ns) = 200 · m / c MB/s. The four STREAM kernels are analysed in turn:

    for (j = 0; j < N; j++) c[j] = a[j];

    lw   a3, 0(a5)
    addi a5, a5, 4
    addi a4, a4, 4
    sw   a3, -4(a4)
    bne  a5, s10, .L4

Copy moves 2 memory operations in 5 cycles. Notice there is no load-use hazard:
the compiler interleaves the two pointer-increment addis between the load and the
store, so by the time the store needs a3 the loaded value has already been read.

    for (j = 0; j < N; j++) b[j] = scalar * c[j];

    lw   a2, 0(a5)
    addi a5, a5, 4
    addi a3, a3, 4
    slli a4, a2, 1
    add  a4, a4, a2
    sw   a4, -4(a3)
    bne  a5, s8, .L5

Scale issues 2 memory operations every 7 cycles. The multiply by the scalar never reaches the
M-extension multiplier because scalar = 3, so the compiler reduces the operation to
slli + add, which is equivalent to 3·x = (x << 1) + x, two cheap single-cycle ALU ops.
Scale pipeline
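The strength reduction described above can be checked with a small C sketch. It assumes the STREAM arrays hold 32-bit unsigned integers (the listings only show lw/sw, so the element type is a guess), and the function names are purely illustrative.

```c
#include <stdint.h>

/* What the C source asks for: a multiply by the scalar (3 in this build). */
uint32_t scale_mul(uint32_t x)
{
    return 3u * x;
}

/* What the compiler emitted instead: slli + add, i.e. (x << 1) + x.
 * Unsigned arithmetic wraps modulo 2^32, so both forms always agree. */
uint32_t scale_shift_add(uint32_t x)
{
    return (x << 1) + x;
}
```

Both functions return the same value for every 32-bit input, which is why the compiler is free to avoid the M-extension multiplier here.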

    for (j = 0; j < N; j++) c[j] = a[j] + b[j];

    lw   a4, 0(a5)
    lw   a1, 0(a3)
    addi a5, a5, 4
    addi a3, a3, 4
    add  a4, a4, a1
    sw   a4, 0(a2)
    addi a2, a2, 4
    bne  a5, s10, .L6

Add touches three arrays, so it issues 3 memory operations every 8 cycles. Both loads are scheduled at the top of the loop, so the dependent add sits four instructions after the first load and three after the second, which avoids potential load-to-use hazards.
Add pipeline

    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];

    lw   a1, 0(a3)
    lw   a0, 0(a4)
    addi a4, a4, 4
    slli a5, a1, 1
    add  a5, a5, a1
    add  a5, a5, a0
    sw   a5, 0(a2)
    addi a3, a3, 4
    addi a2, a2, 4
    bne  a4, s5, .L7

Triad is the heaviest kernel: two loads, a strength-reduced scalar multiply, two adds and a store; a total of 3 memory operations every 10 cycles.
Triad pipeline
The measured data-cache bandwidths (79.68, 57.03, 74.84 and 59.90 MB/s) land almost exactly on the 80, 57.14, 75 and 60 MB/s the assembly predicts. Hitting the ideal single-cycle instruction and cache throughput across all four kernels is the signature of a clean, well-tuned cache subsystem and branch predictor. The small differences (< 1 %) probably come from early cold misses and branch mispredictions.
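As a sanity check, the predicted figures can be recomputed from the instruction counts with a short C sketch. It assumes what the text states (a 50 MHz clock, 4-byte lw/sw accesses, one instruction per cycle) and takes 1 MB as 10^6 bytes. The function names are illustrative and not part of the Z-Core sources.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumptions taken from the text: 50 MHz core clock, 4-byte lw/sw. */
#define CLOCK_HZ     50000000.0
#define BYTES_PER_OP 4.0

/* Ideal bandwidth in MB/s (1 MB = 10^6 bytes) for a loop issuing
 * mem_ops loads/stores every `cycles` cycles, with every access hitting. */
double ideal_bandwidth_mbs(int mem_ops, int cycles)
{
    double bytes_per_second = mem_ops * BYTES_PER_OP * CLOCK_HZ / cycles;
    return bytes_per_second / 1e6;
}

/* Gap between model and measurement, in percent of the ideal value. */
double gap_percent(double ideal, double measured)
{
    double diff = ideal > measured ? ideal - measured : measured - ideal;
    return 100.0 * diff / ideal;
}

/* Prints ideal vs measured bandwidth for the four STREAM kernels. */
void print_l1_model(void)
{
    static const struct {
        const char *name;
        int mem_ops, cycles;   /* read off the assembly listings */
        double measured;       /* MB/s reported by STREAM */
    } kernels[] = {
        {"Copy",  2,  5, 79.68},
        {"Scale", 2,  7, 57.03},
        {"Add",   3,  8, 74.84},
        {"Triad", 3, 10, 59.90},
    };

    for (size_t i = 0; i < sizeof kernels / sizeof kernels[0]; i++) {
        double ideal = ideal_bandwidth_mbs(kernels[i].mem_ops, kernels[i].cycles);
        printf("%-5s ideal %6.2f MB/s, measured %6.2f MB/s, gap %.2f %%\n",
               kernels[i].name, ideal, kernels[i].measured,
               gap_percent(ideal, kernels[i].measured));
    }
}
```

Calling print_l1_model() from a host-side main is enough to regenerate the comparison. Only the per-kernel (memory operations, cycles) pairs need to change if the compiler schedules the loops differently.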
STREAM memory — main-memory bandwidth
Evaluating the bandwidth of main memory is trickier. A memory operation no longer costs a single cycle, so to compute the theoretical bandwidth we first need the actual latency of main memory. We measure it with a pointer-chase microbenchmark: an array wired into a randomly ordered chain of pointers, where each access depends on the previous one (see Figure 5). Serialising the accesses like this defeats overlap and exposes the raw round trip to memory. The measured latency comes out at around 31 cycles per memory access.
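A minimal sketch of such a pointer chase is shown below; it is not the exact code used for the measurement. It makes a few assumptions: the chain is stored as array indices rather than raw pointers, the random order comes from Sattolo's shuffle (which guarantees a single cycle through every slot), and rand() is a good enough random source. The names build_chain and chase are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Link n slots into one random cycle (Sattolo's algorithm): following
 * next[i] repeatedly visits every slot once before returning to the start. */
void build_chain(uint32_t *next, uint32_t n, unsigned seed)
{
    for (uint32_t i = 0; i < n; i++)
        next[i] = i;
    if (n < 2)
        return;

    srand(seed);
    for (uint32_t i = n - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % i;   /* j in [0, i-1], never i itself */
        uint32_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` hops. Each load address depends on the
 * previous load's result, so the accesses cannot overlap. Returning the
 * final index keeps the compiler from optimising the loop away. */
uint32_t chase(const uint32_t *next, uint32_t start, uint32_t steps)
{
    uint32_t idx = start;
    for (uint32_t s = 0; s < steps; s++)
        idx = next[idx];
    return idx;
}
```

To turn this into a latency figure, one would read a cycle counter before and after chase() and divide the difference by the number of hops. The array should be much larger than the data cache so that most hops miss. How the cycle counter is read is platform-specific and intentionally left out here.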
Before moving into the calculation, one microarchitectural detail matters: the extra traffic generated by the writeback, write-allocate policy. When a store misses in a cache with a writeback, write-allocate policy, the accessed line is first fetched from memory. The data is then written into the newly fetched cache line, which sets its dirty bit. When that dirty line is later replaced, it is evicted from the cache and must be written back to main memory. Together, these two policies add two memory operations per cache line, while STREAM only accounts for a single sw instruction. Effectively, the bandwidth the benchmark reports is the useful bandwidth observed by the application, while the memory controller is actually moving more data because of these extra accesses. In other words, memory instructions ≠ memory accesses. Accounting for the write-allocate fetch and the writeback of each cache line gives the peak theoretical bandwidth of each STREAM kernel, listed in the table below.
These values sit very close to the ones STREAM reports (4.26, 4.15, 4.66 and 4.64 MB/s), with the small differences most likely down to occasional variability in memory access latency.
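The gap between what STREAM counts and what main memory actually moves can be sketched in a few lines of C. This is one reading of the policy described above, under the assumption that every array is larger than the cache: each loaded word is fetched once, and each stored word is fetched on write-allocate and later written back. It only models traffic, not the timing behind the theoretical bandwidths, and the function names are illustrative.

```c
/* Words STREAM counts per loop iteration: one per lw, one per sw. */
int reported_words(int loads, int stores)
{
    return loads + stores;
}

/* Words that cross the memory bus per iteration, amortised over a cache
 * line: one fetch per loaded word, and for each stored word one
 * write-allocate fetch plus one eventual writeback. */
int memory_words(int loads, int stores)
{
    return loads + 2 * stores;
}

/* How much more data memory moves than the benchmark reports. */
double traffic_ratio(int loads, int stores)
{
    return (double)memory_words(loads, stores) / reported_words(loads, stores);
}
```

Under these assumptions Copy and Scale (one load, one store) move 3 words for every 2 counted, a ratio of 1.5, while Add and Triad (two loads, one store) move 4 words for every 3 counted, a ratio of about 1.33.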
Theoretical vs measured bandwidth
| Level | Kernel | Theoretical (MB/s) | Measured (MB/s) | Error |
|---|---|---|---|---|
| L1 cache | Copy | 80.00 | 79.68 | 0.40 % |
| L1 cache | Scale | 57.14 | 57.03 | 0.19 % |
| L1 cache | Add | 75.00 | 74.84 | 0.21 % |
| L1 cache | Triad | 60.00 | 59.90 | 0.17 % |
| Main memory | Copy | 4.21 | 4.26 | 1.17 % |
| Main memory | Scale | 4.12 | 4.15 | 0.72 % |
| Main memory | Add | 4.65 | 4.66 | 0.21 % |
| Main memory | Triad | 4.58 | 4.64 | 1.29 % |
The analytical model tracks the measured STREAM bandwidth to within ~1.3 % on every kernel, at both levels of the hierarchy. That close agreement is the real takeaway: it confirms that the cycle-level reasoning captures what the silicon actually does, namely single-cycle hits with no hidden stalls in the L1 cache, and a ~31-cycle latency plus the write-allocate traffic from main memory, which STREAM never reports. When the numbers worked out on paper and the hardware agree this closely, it means that the memory subsystem is behaving exactly as designed :D.
This performance deep dive into Z-Core has not only been a good excuse to play some DOOM ;), but has also established Z-Core as more than a digital waveform project: it is a piece of hardware capable of running games, achieving its expected performance, and sitting between established small processor cores.
As I mentioned earlier, performance analysis and benchmarking are crucial: we now know the capabilities of Z-Core, what it can achieve and what it cannot. I have really enjoyed this performance analysis journey, and having analytical, ideal numbers match the measured performance is what every CPU design engineer wants. With Z-Core, that has been achieved.
References:
[1] EEMBC, “CoreMark — An EEMBC Benchmark.” https://www.eembc.org/coremark/
[2] J. D. McCalpin, “STREAM: Sustainable Memory Bandwidth in High Performance Computers,” University of Virginia. https://www.cs.virginia.edu/stream/
[3] L. McVoy and C. Staelin, “lmbench: Portable Tools for Performance Analysis,” USENIX Annual Technical Conference, 1996.
[4] P. Esmaili-Dokht et al., “A Mess of Memory System Benchmarking, Simulation and Application Profiling,” 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024.
[5] M. Jahnke, L. Bublitz and U. Kulau, “Performance Evaluation of PicoRV32 RISC-V Softcore for Resource-Constrained Devices,” 2023 IEEE Nordic Circuits and Systems Conference (NorCAS), Aalborg, Denmark, 2023.