Reliable Benchmarking
Abstract
Benchmarking is a widely used method in experimental computer science, in particular, for the comparative evaluation of tools and algorithms. As a consequence, a number of questions need to be answered in order to ensure proper benchmarking, resource measurement, and presentation of results, all of which are essential for researchers, tool developers, and users, as well as for tool competitions. We identify a set of requirements that are indispensable for reliable benchmarking and resource measurement of time and memory usage of automatic solvers, verifiers, and similar tools, and discuss limitations of existing methods and benchmarking tools. Fulfilling these requirements in a benchmarking framework is currently (on Linux systems) only possible by using the cgroup and namespace features of the kernel. We developed BenchExec, a ready-to-use, tool-independent, and open-source implementation of a benchmarking framework that fulfills all presented requirements, making reliable benchmarking and resource measurement easy. Our framework is able to work with a wide range of different tools, has proven its reliability and usefulness in the International Competition on Software Verification, and is used by several research groups worldwide to ensure reliable benchmarking. Finally, we present guidelines on how to present measurement results in a scientifically correct and comprehensible way.
BenchExec: A Framework for Reliable Benchmarking
Our benchmarking framework BenchExec is available on GitHub (documentation).
Machine Architectures
As an indication of how complex and divergent today's machines can be, we show two exemplary machine architectures. In the figures linked below, each CPU is represented by a node labeled with “Socket” and the physical package id, each physical CPU core by a node labeled with “Core” and the core id, and each processing unit by a node labeled with “PU” and the processor id. A processing unit is what a process running under Linux sees as a core. Nodes whose label starts with “L” represent the various caches (“L” for level). Contiguous regions of memory are represented by nodes labeled with “NUMANode” and the node id, and each memory region is grouped together with the CPU cores that it belongs to in a green unlabeled node. The numeric ids in the figures are those that the Linux kernel assigns to the respective unit. Such figures can be created with the tool lstopo from the Portable Hardware Locality (hwloc) project.
Both examples are systems with a NUMA architecture. The first system has two AMD Opteron 6380 16-core CPUs and a total of 137 GB of RAM. On this CPU, two virtual cores together always form what AMD calls a “module”: a physical core that has separate integer-arithmetic units and level-1 data cache for each virtual core, but shared floating-point units, level-1 instruction cache, and level-2 cache. The cores of each CPU are split into two groups of eight virtual cores. The memory of the system is split into four regions, each of which is coupled with one group of cores. This means that, for example, core 0 can access the memory of NUMANode 0 directly and thus fast, whereas accesses to NUMANode 1 are indirect and thus slower, and accesses to NUMANodes 2 and 3 are even slower because they need to go via inter-CPU communication channels. Note that on this machine, the Linux kernel assigned adjacent processor ids and different core ids to the two virtual cores of each physical core (“module”), whereas virtual cores with the same core id on the same CPU actually belong to different physical cores.
The second system has two Intel Xeon E5-2650 v2 eight-core CPUs and a total of 135 GB of RAM (caches omitted for space reasons). This CPU model has hyper-threading, so two virtual cores share the integer-arithmetic and floating-point units as well as the caches of each physical core. The memory in this system is also split into two memory regions, one per CPU. Note that the numbering differs from that of the other machine: here, the virtual cores of one physical core have the same core id, which uniquely identifies a physical core on a CPU. The processor ids for virtual cores, however, are not consecutive but jump between the CPUs.
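The processor-id and core-id assignments discussed above can also be inspected without lstopo by reading the kernel's sysfs files directly. The following is a minimal Python sketch, assuming a Linux system with the standard sysfs layout; it prints, for every processing unit, the socket and physical core it belongs to.

```python
# Minimal sketch: print the socket (physical package) and physical-core id
# for every processing unit, read from the standard Linux sysfs topology
# files. lstopo presents the same information graphically.
from pathlib import Path

def read_topology():
    """Return sorted (processor id, physical package id, core id) triples."""
    topo = []
    for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        processor_id = int(cpu.name[3:])  # "cpu17" -> 17
        package = int((cpu / "topology/physical_package_id").read_text())
        core = int((cpu / "topology/core_id").read_text())
        topo.append((processor_id, package, core))
    return sorted(topo)

for processor, package, core in read_topology():
    print(f"PU {processor}: Socket {package}, Core {core}")
```

On the first machine, this would show sibling PUs with adjacent processor ids but different core ids; on the second machine, sibling PUs share one core id.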
Note that both presented systems could appear identical at a cursory glance, because they have the same number of processing units and approximately the same amount of RAM. However, they differ in their architecture and (depending on the workload) could deliver substantially different performance even when running at the same frequency.
Benchmark Results
Here we provide the full results and the raw data for the benchmarks listed in Section 4 of our paper. All tables were created with BenchExec 1.9 and can be used as described in the section “Result Tables” below. Note: for the benchmark results belonging to the SPIN'15 version of the paper, please see below.
We executed benchmarks on a machine with two Intel Xeon E5-2650 v2 eight-core CPUs and 68 GB of RAM per CPU. The CPUs support hyper-threading (with two virtual cores per physical core) and Turbo Boost (base frequency is 2.6 GHz, with up to 3.4 GHz if only one core is used). The machine has a NUMA architecture.
As an example of a benchmarked tool we took the verifier CPAchecker in revision 17829 from the project repository, with its configuration for predicate analysis. The machine was running a 64-bit Ubuntu 14.04 with Linux 3.13 and OpenJDK 1.7 as Java virtual machine. As benchmark set we used 2731 C programs from SV-COMP'15 (excluding categories not supported by this configuration of CPAchecker, and those programs for which CPAchecker times out on this machine, because including timeouts would skew the results). In the paper, we present results only for the 2414 programs that CPAchecker could solve correctly, but for completeness we present on this page the raw results for all 2731 programs. All runs were restricted to one virtual core of the machine, 4.0 GB of memory, and 900 s of CPU time. Except where noted, Turbo Boost was disabled such that all used cores were running at 2.6 GHz. Apart from our benchmarks, the machine was completely unused. All experiments were conducted with BenchExec 1.2.
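These limits can be enforced with the runexec component of BenchExec. The following is a sketch of a corresponding invocation, wrapped in Python; the flag names follow the current runexec documentation and may differ for older versions such as BenchExec 1.2, and the tool command line is only a placeholder.

```python
# Sketch: one tool execution with the limits used above (one virtual core,
# 4.0 GB of memory, 900 s of CPU time) via BenchExec's runexec.
# Flag names follow the current runexec documentation; older versions
# may differ. The tool command line is a placeholder.
import subprocess

subprocess.run([
    "runexec",
    "--timelimit", "900",     # CPU-time limit in seconds
    "--memlimit", "4GB",      # memory limit (older versions expect plain numbers)
    "--cores", "0",           # pin the run to processing unit 0
    "--memoryNodes", "0",     # allow only memory of NUMA node 0
    "--output", "run.log",    # file that receives the tool output
    "--",
    "./verifier", "program.c",  # placeholder tool command
], check=True)
```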
Note that the actual performance impact of certain hardware features differs according to the characteristics of the benchmarked tool and the benchmarking machine. For example, a tool that uses little memory but fully utilizes its CPU core(s) will be influenced more by hyper-threading than by non-local memory, whereas for a tool that relies more on memory accesses it might be the other way around. In particular, the results for CPAchecker and our machine that are shown here are not generalizable and show only that there is such an impact. Because the magnitude of the impact is neither predictable nor necessarily deterministic, it is important to avoid these problems for reproducible benchmarking.
Benchmarks Regarding Hyper-Threading
To show the impact of hyper-threading, we executed the verifier twice in parallel on one CPU of our machine. In one instance of the benchmark, we assigned each of the two parallel runs one virtual core from separate physical cores of the same CPU. In a second instance of the benchmark, we assigned each of the two parallel runs one virtual core from the same physical core, such that both runs had to share the hardware resources of one physical core.
In the table, the first columns show the results for an appropriate assignment of CPU cores (each parallel run is executed on a separate physical core). The columns to the right show the results for a benchmark run where we used an inappropriate assignment of CPU cores (both parallel runs on the same physical core).
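Which processing units are hyper-threading siblings (i.e., share one physical core) can be read from sysfs, and a run can then be pinned accordingly. Below is a minimal Python sketch, assuming Linux; the PU chosen for pinning at the end is a machine-specific placeholder. (BenchExec's --cores option performs such pinning itself.)

```python
# Sketch: determine which processing units share a physical core and pin
# the current process to one PU. Linux-only; the PU id at the bottom is a
# placeholder and must match the actual topology.
import os
from pathlib import Path

def sibling_groups():
    """Return the groups of PU ids sharing a physical core, e.g. {"0,16", ...}."""
    return {
        (cpu / "topology/thread_siblings_list").read_text().strip()
        for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*")
    }

print(sibling_groups())
# Appropriate assignment: give the two parallel runs PUs from different
# groups; inappropriate assignment: give them the PUs of one group.
os.sched_setaffinity(0, {0})  # pid 0 = this process; pin it to PU 0
```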
Benchmarks Regarding Shared Memory Bandwidth and Caches
To show the impact of a shared memory connection and a shared Level 3 cache for multiple physical cores, we executed benchmarks with 1 to 8 parallel executions of the verifier on the same CPU (each on its own physical core), i.e., with 1 to 8 used physical cores that share a common Level 3 cache and the memory bandwidth. The second virtual core of every physical core was unused, and Turbo Boost was disabled.
The table contains eight groups of columns with 1 to 8 used physical cores, respectively.
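Outside of BenchExec (which assigns cores to parallel runs automatically), such a setup can be reproduced by starting several pinned instances manually. The following Python sketch starts one instance per given processing unit using taskset from util-linux; the PU ids and the tool command line are placeholders and must be chosen to match the actual topology, e.g., one PU per physical core as determined with lstopo.

```python
# Sketch: start one tool instance per given processing unit, each pinned
# with taskset (util-linux), and wait for all of them. PU ids and the
# tool command line are placeholders.
import subprocess

def run_parallel(pus, tool_cmd):
    procs = [
        subprocess.Popen(["taskset", "-c", str(pu), *tool_cmd])
        for pu in pus
    ]
    for p in procs:
        p.wait()

# Placeholder: four PUs assumed to lie on four distinct physical cores.
run_parallel([0, 1, 2, 3], ["./verifier", "program.c"])
```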
Benchmarks Regarding Turbo Boost
To show the impact of Turbo Boost, we executed benchmarks with the same setup as in the previous section, but now with Turbo Boost enabled. This means that the CPU uses a higher frequency that depends on the current load of its cores. Without Turbo Boost, a used core of this CPU always runs at 2.6 GHz, whereas with Turbo Boost, a single used core of this CPU can run at 3.4 GHz, if the CPU is otherwise idle, and even if all eight cores are used, they can still run at 3.0 GHz.
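Whether Turbo Boost is currently enabled can be checked, and changed, via sysfs. A minimal sketch, assuming the intel_pstate driver; systems with the acpi-cpufreq driver expose /sys/devices/system/cpu/cpufreq/boost instead (with inverted meaning), and writing these files requires root privileges.

```python
# Sketch: query and disable Turbo Boost via sysfs, assuming the
# intel_pstate driver. On acpi-cpufreq systems, the corresponding file is
# /sys/devices/system/cpu/cpufreq/boost (1 = boost enabled). Needs root.
from pathlib import Path

no_turbo = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

print("Turbo Boost disabled:", no_turbo.read_text().strip() == "1")
no_turbo.write_text("1")  # "1" disables Turbo Boost, "0" enables it
```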
The table contains eight groups of columns with 1 to 8 used physical cores, respectively. To see the influence of Turbo Boost, compare the results with the results from the previous section.
Benchmarks Regarding Non-Uniform Memory Access (NUMA)
To show the impact of non-uniform memory access, we executed 16 instances of the verifier in parallel, one instance per physical core of the two CPUs of our machine. In one instance of the benchmark, we assigned memory to each run that was local to the CPU the run was executed on. In a second instance of the benchmark, we deliberately forced each of the 16 runs to use only memory from the other CPU, such that all memory accesses were indirect.
In the table, the first columns show the results for using memory that is attached to the same CPU that executes the benchmark run. The columns to the right show the results for a benchmark run where we deliberately used memory that is attached to the other CPU of the machine, making all memory accesses indirect and thus slower.
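Such memory-node binding can be implemented with the cpuset cgroup controller, which is also what BenchExec uses for core and memory-node assignment. Below is a minimal sketch, assuming cgroups v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset; the cgroup name and the id values are machine-specific placeholders, and root privileges are required.

```python
# Sketch: confine a process to the PUs of CPU 0 but to the memory of the
# *other* NUMA node via a cgroups-v1 cpuset (needs root). The cgroup name
# and all id values are placeholders.
import os
from pathlib import Path

cg = Path("/sys/fs/cgroup/cpuset/benchmark-run")  # hypothetical cgroup
cg.mkdir(exist_ok=True)
(cg / "cpuset.cpus").write_text("0-7")  # PUs of CPU 0 (check with lstopo)
(cg / "cpuset.mems").write_text("1")    # allow only memory of NUMA node 1
(cg / "tasks").write_text(str(os.getpid()))  # move this process in
# For the appropriate (local) assignment, set cpuset.mems to "0" instead.
```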
Benchmarks Regarding Multiple CPUs
To show the impact of multiple CPUs used in parallel, we executed benchmarks with 2 to 16 parallel executions of the verifier across all CPUs. We used both CPUs of the machine, executing 1 to 8 parallel instances of the verifier on each CPU.
The table contains eight groups of columns with 1 to 8 instances of the verifier per CPU, respectively. To see the influence of using two CPUs, compare the results with the results from the benchmarks on shared memory bandwidth above, where only one CPU was used.
Result Tables
Examples for the result tables that BenchExec produces are available from the links above.
The header of the tables contains information about the machine that was used to execute the benchmark, the resource limits, and the tool name, version, and configuration. The body contains the result given by the tool (colored according to whether it is correct) and the resource measurements. The footer contains aggregate values for the measurements.
By clicking on any cell in the column “status”, one can see the output of the tool for this run. The header bar allows access to features such as filtering of columns and rows, and to dynamically generated quantile and scatter plots for arbitrary columns of the table. Plots can be further configured with the buttons in the lower-left corner, and the shown columns can be changed with the drop-down boxes above the plot.
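Such tables are generated from the XML result files with BenchExec's table-generator. A minimal sketch of an invocation, wrapped in Python for consistency with the other sketches; the result-file name is a placeholder.

```python
# Sketch: generate an interactive HTML table (and a CSV file) from a
# BenchExec result file using table-generator. The file name is a
# placeholder; .bz2-compressed result files are also accepted.
import subprocess

subprocess.run(["table-generator", "results.xml.bz2"], check=True)
```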
Benchmark Results SPIN'15
Here we provide the full results and the raw data for the benchmarks listed in our SPIN'15 paper. All tables were created with BenchExec and can be used as described in the section “Result Tables” above.
We executed benchmarks using the predicate analysis of the verifier CPAchecker in revision 15307 of the project repository. We used 4011 C programs from SV-COMP'15 (excluding categories not supported by CPAchecker) and a CPU-time limit of 900s.
Note that the actual performance impact differs according to the resource-usage characteristics of the benchmarked tool. For example, a tool that uses very little memory but fully utilizes its CPU core(s) will be influenced more by hyper-threading than by non-local memory, whereas for a tool that relies more on memory accesses it might be the other way around. In particular, the results for CPAchecker that are shown here are not generalizable and show only that there is such an impact. As the size of the impact is neither predictable nor necessarily deterministic, it is important to rule out these factors for reliable and accurate benchmarking in any case.
Benchmarks Regarding Hyper-Threading
To show the influence of hyper-threading, we executed benchmarks on a machine with a single Intel Core i7 4770 3.4 GHz CPU (with four physical cores and hyper-threading) and 33 GB of memory. We executed the verifier twice in parallel and assigned one virtual core and 4.0 GB of memory to each run. In one instance of the benchmark, we assigned each of the two parallel runs a virtual core from separate physical cores. In a second instance of the benchmark, we assigned each of the two parallel runs one virtual core from the same physical core, such that both runs had to share the hardware resources of this one physical core.
In the table, the first columns show the results for an appropriate assignment of CPU cores (each parallel run is executed on a separate physical core). The columns to the right show the results for a benchmark run where we used an inappropriate assignment of CPU cores (both parallel runs on the same physical core). For the 2472 programs from the benchmark set that CPAchecker could solve on this machine, 13 hours of CPU time were necessary using two separate physical cores and 19 hours of CPU time were necessary using the same physical core, an increase of 41% caused by the inappropriate core assignment. (Results with a timeout have to be discarded in this comparison, as the consumed time is always the same in timeout cases.)
Benchmarks Regarding Non-Uniform Memory Access (NUMA)
To show the influence of non-uniform memory access, we executed benchmarks on a NUMA machine with two Intel Xeon E5-2650 v2 2.6 GHz CPUs with 63 GB of local memory each. We executed the verifier twice in parallel, assigning all cores of one of the CPUs and 60 GB of memory to each of the two runs. In one instance of the benchmark, we assigned memory to each run that was local to the CPU the run was executed on. In a second instance of the benchmark, we deliberately forced each of the two runs to use only memory from the other CPU, such that all memory accesses were indirect.
In the table, the first columns show the results for using memory that is attached to the same CPU that executes the benchmark run. The columns to the right show the results for a benchmark run where we deliberately used memory that is attached to the other CPU of the machine, making all memory accesses indirect and thus slower. For the 2483 programs from the benchmark set that CPAchecker could solve on this machine, 19 hours of CPU time were necessary using local memory and 21 hours of CPU time were necessary using remote memory, an increase of 11% caused by the inappropriate memory assignment. The wall time also increased, by 9.5%. (Results with a timeout have to be discarded in this comparison, as the consumed time is always the same in timeout cases.)