
Z-Core: Performance Analysis

Benchmarking Z-Core — DOOM FPS across cache configurations, then CoreMark/MHz, STREAM and pointer-chase results.

RISC-V · Performance · Benchmarking · FPGA

Benchmarking has always played a key role in developing and analyzing the performance of computing systems. I personally like to refer to it as the communication language between engineers and end-users. This concept applies not only to computer architecture but to countless other fields. You cannot simply explain that a processor features out-of-order execution, a high-performance 4-way set-associative cache, a Gshare branch predictor, and many other fancy features to someone who just wants the hardware for playing games or 3D rendering. This is where benchmarking comes into play. It creates a common understanding of performance between both parties, but it is imperative that everyone agrees on one key factor: performance metrics.

When these metrics are closely tied to the functioning of a system, quantifying its performance becomes much easier. Metrics can be direct measurements of hardware features, such as FLOPS or memory bandwidth. Alternatively, they can lean toward a more abstract approach, like GeekBench or AnTuTu, which lie in the higher abstraction layer of computing systems benchmarks. These synthetic benchmarks report a number that is supposed to correctly represent the performance of the device. On the one hand, this makes it easier to rank systems, while offering a unified score that is independent of specific system configurations and very easy to run. However, they might not accurately reflect real-world performance, especially since any vendor can tune its processor to excel specifically at the benchmark’s task.

For this performance analysis, I have decided to follow a more entertaining and visual approach for Z-Core. I will be showing the FPS (Frames Per Second) of DOOM across different Z-Core configurations by varying the presence and sizes of the caches. After this, I will present the results for CoreMark/MHz, STREAM, and pointer chase benchmarks, which will make this post look a bit more serious :).

For those who do not know Z-Core, it is a 5-stage pipelined RISC-V processor implementing the RV32IMZicsr ISA. The processor features multiple peripherals, including a Timer, GPIOs, a UART, and a VGA controller. Additionally, one of the main features that recently pushed Z-Core's performance to the next level is the memory subsystem and the two caches it incorporates: a 32KB direct-mapped instruction cache, and a 2-way set-associative data cache with write-back and write-allocate policies and LRU replacement. For further information on the Z-Core microarchitecture and design choices, check out the Z-Core detailed description.

DOOM running on Z-Core @ 320x200

When DOOM was first run on Z-Core, the processor design had no instruction or data caches. The program ran directly from the main memory of the DE10-Lite board, a 64MB SDRAM running at 100MHz. This creates a massive bottleneck in the execution of the game, as performance is completely dominated by the high latency of the AXI4-Lite transaction plus the latency of the SDRAM itself. This bottleneck translates into less than one FPS, which made DOOM unplayable.


Introducing Caches to the Design

Adding a data cache to Z-Core is a game changer (pun intended!): it delivers up to a 5x increase in FPS over the instruction-cache-only configuration, reaching 15-20 frames per second when running DOOM. Figure 1 summarizes the results.

Figure 1: DOOM FPS summary (160×120 resolution)

| Cache configuration | FPS |
|---|---|
| No $I nor $D caches | < 1 |
| 32KB $I, no $D cache | 4 |
| 32KB $I & 32KB $D cache | 15-20 |

Introducing such a big data cache is feasible in my design thanks to the resources of the FPGA itself. However, when design constraints are tighter, a designer might have to move to a smaller cache subsystem. For this reason, I tested how Z-Core performs on DOOM (running at a 320x200 resolution) when varying the size of the data cache while keeping a 32KiB instruction cache.

Figure 2: DOOM FPS for different data cache sizes (320×200 resolution)

| Data cache size | FPS |
|---|---|
| 128B | 4 |
| 256B | 4.5 |
| 512B | 5 |
| 1KiB | 5.5 |
| 2KiB | 6 |
| 4KiB | 6.5 |
| 8KiB | 8 |
| 16KiB | 9 |
| 32KiB | 10 |

Evaluating Z-Core with a larger data cache might have yielded better performance. However, due to the limited hardware resources, 32KiB is the maximum size available for testing.


CoreMark benchmark

Moving away from DOOM, Z-Core was also evaluated with the CoreMark benchmark. EEMBC's CoreMark is a simple, yet sophisticated benchmark designed specifically to test the functionality of a processor core [1]. This makes it possible to directly compare the performance of Z-Core with other processors, such as the PicoRV32 RISC-V softcore [5].

The results show a CoreMark/MHz score of 3.06, which places Z-Core between an Arm Cortex-M0 and a Cortex-M3/4 (Figure 3). This makes Z-Core a good fit for embedded and low-power applications. Despite these results, it is key to note that Z-Core is an academic personal project and does not compete with industry-standard processors in terms of reliability and ecosystem.

Figure 3: CoreMark/MHz results for multiple cores

| Core | CoreMark/MHz |
|---|---|
| PicoRV32 on LPFPGA | 0.16 |
| ESP32-S2 (Xtensa LX7) | 1.97 |
| STM32L0 (ARM Cortex-M0) | 2.35 |
| Z-Core on Altera MAX 10 | 3.06 |
| Intel Atom N280 | 3.16 |
| STM32L4 (ARM Cortex-M4) | 3.32 |

Memory Bandwidth and Access Latency

Finally, a lower-level approach to benchmarking the Z-Core memory system is presented, providing key memory metrics such as bandwidth and latency in order to better understand the performance of the SoC.

First, the results for the STREAM benchmark are presented for both main memory and L1 bandwidth (Figure 4). STREAM is a benchmark that measures the sustained memory bandwidth of a processor [2]. By measuring the bandwidth of Z-Core we can better estimate its performance for memory-bound applications. STREAM was chosen as the memory benchmark over lmbench [3] or Mess [4] due to its simplicity and easier portability to Z-Core.

STREAM sustained bandwidth (MB/s)

| Kernel | DCache BW | Main memory BW |
|---|---|---|
| Copy | 79.68 | 4.26 |
| Scale | 57.03 | 4.15 |
| Add | 74.84 | 4.66 |
| Triad | 59.90 | 4.64 |

Figure 4 — STREAM sustained bandwidth: L1 data cache vs main memory.

These results give us key information about the performance of the memory system, but I wanted to go a step further and dive into the Z-Core pipeline to understand the values reported by the benchmark.

STREAM cache — single-cycle ideal throughput

When the working set fits in the data cache, every lw/sw hits and completes in a single cycle. With branch prediction working and the pipeline fully fed, the inner loop sustains 1 instruction/cycle, so we can read the ideal cycles per loop iteration straight from the assembly and turn it into a bandwidth. For a loop issuing m memory operations every c cycles at 50 MHz (20 ns/cycle):

$$\mathrm{BW} = \frac{m\ \text{ops} \times 4\ \text{bytes/op}}{c\ \text{cycles} \times 20\ \text{ns/cycle}}$$
STREAM Copy
C
for (j = 0; j < N; j++)
  c[j] = a[j];
RISC-V
lw    a3, 0(a5)
addi  a5, a5, 4
addi  a4, a4, 4
sw    a3, -4(a4)
bne   a5, s10, .L4

Copy moves 2 memory operations in 5 cycles. Notice there is no load-use hazard: the compiler interleaves the two pointer-increment addis between the load and the store, so by the time the store needs a3 the loaded value has already been read.

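$$\mathrm{BW} = \frac{2\ \text{ops} \times 4\ \text{bytes/op}}{5\ \text{cycles} \times 20\ \text{ns/cycle}} = 80\ \text{MB/s}$$

Copy: clean steady-state pipeline (no bubbles)

| Instruction / cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| lw a3, 0(a5) | IF | ID | EX | MEM | WB | | | | |
| addi a5, a5, 4 | | IF | ID | EX | MEM | WB | | | |
| addi a4, a4, 4 | | | IF | ID | EX | MEM | WB | | |
| sw a3, -4(a4) | | | | IF | ID | EX | MEM | WB | |
| bne a5, s10, .L4 | | | | | IF | ID | EX | MEM | WB |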
The load of a3 is issued three instructions before the store consumes it. The compiler-scheduled pointer increments cover the load latency, so the loop runs at one instruction per cycle with no stalls.
STREAM Scale
C
for (j = 0; j < N; j++)
  b[j] = scalar * c[j];
RISC-V
lw    a2, 0(a5)
addi  a5, a5, 4
addi  a3, a3, 4
slli  a4, a2, 1
add   a4, a4, a2
sw    a4, -4(a3)
bne   a5, s8, .L5

Scale issues 2 memory operations every 7 cycles. The multiply by the scalar never reaches the M-extension multiplier because scalar = 3, so the compiler reduces the operation to slli + add, which is equivalent to 3·x = (x << 1) + x, two cheap single-cycle ALU ops.

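$$\mathrm{BW} = \frac{2\ \text{ops} \times 4\ \text{bytes/op}}{7\ \text{cycles} \times 20\ \text{ns/cycle}} = 57.14\ \text{MB/s}$$

Scale pipeline: 7 cycles / iteration

| Instruction / cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| lw a2, 0(a5) | IF | ID | EX | MEM | WB | | | | | | |
| addi a5, a5, 4 | | IF | ID | EX | MEM | WB | | | | | |
| addi a3, a3, 4 | | | IF | ID | EX | MEM | WB | | | | |
| slli a4, a2, 1 | | | | IF | ID | EX | MEM | WB | | | |
| add a4, a4, a2 | | | | | IF | ID | EX | MEM | WB | | |
| sw a4, -4(a3) | | | | | | IF | ID | EX | MEM | WB | |
| bne a5, s8, .L5 | | | | | | | IF | ID | EX | MEM | WB |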
a2 is loaded three instructions before slli consumes it, so again there is no load-use stall.
STREAM Add
C
for (j = 0; j < N; j++)
  c[j] = a[j] + b[j];
RISC-V
lw    a4, 0(a5)
lw    a1, 0(a3)
addi  a5, a5, 4
addi  a3, a3, 4
add   a4, a4, a1
sw    a4, 0(a2)
addi  a2, a2, 4
bne   a5, s10, .L6

Add touches three arrays, so it issues 3 memory operations every 8 cycles. Both loads are scheduled at the top of the loop, so the dependent add sits four instructions after the first load and three after the second, which avoids load-use hazards.

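$$\mathrm{BW} = \frac{3\ \text{ops} \times 4\ \text{bytes/op}}{8\ \text{cycles} \times 20\ \text{ns/cycle}} = 75\ \text{MB/s}$$

Add pipeline: 8 cycles / iteration

| Instruction / cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lw a4, 0(a5) | IF | ID | EX | MEM | WB | | | | | | | |
| lw a1, 0(a3) | | IF | ID | EX | MEM | WB | | | | | | |
| addi a5, a5, 4 | | | IF | ID | EX | MEM | WB | | | | | |
| addi a3, a3, 4 | | | | IF | ID | EX | MEM | WB | | | | |
| add a4, a4, a1 | | | | | IF | ID | EX | MEM | WB | | | |
| sw a4, 0(a2) | | | | | | IF | ID | EX | MEM | WB | | |
| addi a2, a2, 4 | | | | | | | IF | ID | EX | MEM | WB | |
| bne a5, s10, .L6 | | | | | | | | IF | ID | EX | MEM | WB |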
Both loads complete well before the add consumes their results — no stalls.
STREAM Triad
C
for (j = 0; j < N; j++)
  a[j] = b[j] + scalar * c[j];
RISC-V
lw    a1, 0(a3)
lw    a0, 0(a4)
addi  a4, a4, 4
slli  a5, a1, 1
add   a5, a5, a1
add   a5, a5, a0
sw    a5, 0(a2)
addi  a3, a3, 4
addi  a2, a2, 4
bne   a4, s5, .L7

Triad is the heaviest kernel: two loads, a strength-reduced scalar multiply (slli + add), one more add, a store, three pointer increments and the branch, for a total of 3 memory operations every 10 cycles.

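$$\mathrm{BW} = \frac{3\ \text{ops} \times 4\ \text{bytes/op}}{10\ \text{cycles} \times 20\ \text{ns/cycle}} = 60\ \text{MB/s}$$

Triad pipeline: 10 cycles / iteration

| Instruction / cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lw a1, 0(a3) | IF | ID | EX | MEM | WB | | | | | | | | | |
| lw a0, 0(a4) | | IF | ID | EX | MEM | WB | | | | | | | | |
| addi a4, a4, 4 | | | IF | ID | EX | MEM | WB | | | | | | | |
| slli a5, a1, 1 | | | | IF | ID | EX | MEM | WB | | | | | | |
| add a5, a5, a1 | | | | | IF | ID | EX | MEM | WB | | | | | |
| add a5, a5, a0 | | | | | | IF | ID | EX | MEM | WB | | | | |
| sw a5, 0(a2) | | | | | | | IF | ID | EX | MEM | WB | | | |
| addi a3, a3, 4 | | | | | | | | IF | ID | EX | MEM | WB | | |
| addi a2, a2, 4 | | | | | | | | | IF | ID | EX | MEM | WB | |
| bne a4, s5, .L7 | | | | | | | | | | IF | ID | EX | MEM | WB |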
Ten independent-enough instructions stream through back to back; the dependency distances all exceed the load latency, so the loop holds one instruction per cycle.

The measured data-cache bandwidths (79.68, 57.03, 74.84 and 59.90 MB/s) land almost exactly on the 80, 57.14, 75 and 60 MB/s the assembly predicts. Hitting the ideal single-cycle instruction and cache throughput across all four kernels is the signature of a clean, well-tuned cache subsystem and branch predictor. The small differences (< 1%) might come from early cold misses and branch mispredictions.

STREAM memory — main-memory bandwidth

Evaluating the bandwidth of main memory is trickier. A memory operation no longer costs a single cycle, so to compute the theoretical bandwidth we first need the actual latency of main memory. We measure it with a pointer-chase microbenchmark: an array wired into a randomly distributed chain of pointers, where each access depends on the previous one (See Figure 5). Serialising the accesses like this defeats overlap and exposes the raw round-trip to memory. The measured latency comes out at around 31 cycles per memory access.

Figure 5: Pointer chase — every load depends on the previous one
| Cell index | [0] | [1] | [2] | [3] | [4] |
|---|---|---|---|---|---|
| Stored next index | 3 | 4 | 0 | 1 | 2 |
Each cell stores the index of the next cell to read, so the loads form a single dependent chain (0 → 3 → 1 → 4 → 2 → 0). With no independent accesses to overlap, each load must wait for the previous one to return, exposing the latency of a single main-memory access.

Before moving into the calculation, one microarchitectural detail matters: the traffic generated by the write-back, write-allocate policy. When a store misses in a write-back, write-allocate cache, the target line is first fetched from memory. The data is then written into the freshly fetched line, which sets its dirty bit. When that dirty line is later replaced, it is evicted from the cache and must be written back to main memory. Together, these two policies generate two memory accesses (a line fetch and a write-back) for what STREAM counts as a single sw instruction. Effectively, the bandwidth the benchmark reports is the useful bandwidth observed by the application, while the memory controller is actually moving more data because of these extra accesses. In other words, memory instructions ≠ memory accesses. Accounting for the write-back and write-allocate per cache line, the peak theoretical bandwidth of each STREAM kernel is computed as follows:

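$$\mathrm{BW}_{\text{copy}} = \frac{2\ \text{mem inst} \times 4\ \text{bytes}}{(31\ \text{cycles} \times 3\ \text{accesses} + 2\ \text{cycles}) \times 20\ \text{ns/cycle}} = 4.21\ \text{MB/s}$$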
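$$\mathrm{BW}_{\text{scale}} = \frac{2\ \text{mem inst} \times 4\ \text{bytes}}{(31\ \text{cycles} \times 3\ \text{accesses} + 4\ \text{cycles}) \times 20\ \text{ns/cycle}} = 4.12\ \text{MB/s}$$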
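$$\mathrm{BW}_{\text{add}} = \frac{3\ \text{mem inst} \times 4\ \text{bytes}}{(31\ \text{cycles} \times 4\ \text{accesses} + 5\ \text{cycles}) \times 20\ \text{ns/cycle}} = 4.65\ \text{MB/s}$$
$$\mathrm{BW}_{\text{triad}} = \frac{3\ \text{mem inst} \times 4\ \text{bytes}}{(31\ \text{cycles} \times 4\ \text{accesses} + 7\ \text{cycles}) \times 20\ \text{ns/cycle}} = 4.58\ \text{MB/s}$$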

These values sit very close to the ones STREAM reports (4.26, 4.15, 4.66 and 4.64 MB/s), with the small differences most likely down to the occasional memory access latency variability.

Theoretical vs measured bandwidth

STREAM — analytical model vs measurement
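| Level | Kernel | Theoretical (MB/s) | Measured (MB/s) | Error |
|---|---|---|---|---|
| L1 cache | Copy | 80.00 | 79.68 | 0.40 % |
| L1 cache | Scale | 57.14 | 57.03 | 0.19 % |
| L1 cache | Add | 75.00 | 74.84 | 0.21 % |
| L1 cache | Triad | 60.00 | 59.90 | 0.17 % |
| Main memory | Copy | 4.21 | 4.26 | 1.17 % |
| Main memory | Scale | 4.12 | 4.15 | 0.72 % |
| Main memory | Add | 4.65 | 4.66 | 0.21 % |
| Main memory | Triad | 4.58 | 4.64 | 1.29 % |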
Error = |theoretical − measured| / measured.

The analytical model tracks the measured STREAM bandwidth to within ~1.3 % on every kernel, at both levels of the hierarchy. That close agreement is the real takeaway: it confirms the cycle-level reasoning captures what the hardware actually does, namely single-cycle hits with no hidden stalls in the L1 cache, and a ~31-cycle latency plus the write-allocate traffic from main memory, which STREAM never reports. When the numbers on paper and the hardware agree this closely, it means that the memory subsystem is behaving exactly as designed :D.


This performance deep dive into the Z-Core has not only been useful to play some DOOM games ;), but also to consolidate Z-Core as not just a digital waveform project, but a piece of hardware capable of running games, achieving expected performance, and sitting between state-of-the-art small processors.

As I mentioned earlier, performance analysis and benchmarking are crucial: we now know the capabilities of Z-Core, what it can achieve and what it cannot. I have really enjoyed this performance analysis journey, and having analytical and ideal numbers match the measured performance is what every CPU design engineer wants. With Z-Core, that has been achieved.

References:

[1] EEMBC, “CoreMark — An EEMBC Benchmark.” https://www.eembc.org/coremark/
[2] J. D. McCalpin, “STREAM: Sustainable Memory Bandwidth in High Performance Computers,” University of Virginia. https://www.cs.virginia.edu/stream/
[3] L. McVoy and C. Staelin, “lmbench: Portable Tools for Performance Analysis,” USENIX Annual Technical Conference, 1996.
[4] P. Esmaili-Dokht et al., “A Mess of Memory System Benchmarking, Simulation and Application Profiling,” 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024.
[5] M. Jahnke, L. Bublitz and U. Kulau, “Performance Evaluation of PicoRV32 RISC-V Softcore for Resource-Constrained Devices,” 2023 IEEE Nordic Circuits and Systems Conference (NorCAS), Aalborg, Denmark, 2023.
