Testing is a widely used method to assess software quality. Coverage criteria and coverage measurements are used to ensure that the constructed test suites adequately test the given software. Since manually developing such test suites is too expensive in practice, various automatic test-generation approaches have been proposed. Since the approaches have different strengths, combining them is necessary to obtain stronger tools. We study cooperative combinations of verification approaches for test generation, with high-level information exchange. We present CoVeriTest, a hybrid approach for test-case generation that iteratively applies different conditional model checkers. It allows one to adjust the level of cooperation and to assign individual time budgets per verifier. In our experiments, we combine explicit-state model checking and predicate abstraction (from CPAchecker) to systematically study different CoVeriTest configurations. Moreover, CoVeriTest achieves higher coverage than state-of-the-art test-generation tools for some programs.
We performed three types of experiments: an internal comparison of CoVeriTest configurations, a comparison of CoVeriTest against its component analyses, and a comparison of CoVeriTest against the participants of Test-Comp'19.
Next to the analysis components, which we fixed, CoVeriTest allows one to configure the time limit of each component (per iteration) and how information is reused between analysis executions of the same CoVeriTest run. We used six different time limit pairs to investigate whether CoVeriTest should switch often, sometimes, or rarely between analyses, and whether it should distribute the time equally among the component analyses or not. Furthermore, we used nine reuse types, ranging from no exchange, over reusing only one's own results, to full cooperation. For details on the different reuse types, we refer to our paper.
For the comparison of these 54 configurations, we used all 6703 C programs from the largest publicly available benchmark collection of C verification tasks. Since all configurations are run by the same tool (CPAchecker), we decided to disable test-case generation (thus reducing the amount of data) and only compare the number of covered test goals reported by CPAchecker.
The following table presents the experimental data of our study of the different CoVeriTest configurations. For each combination of reuse type and time limit pair, the table provides the raw data (compressed XML file). Additionally, the table contains one HTML and one CSV file per reuse type. These files show for each test task (program) the experimental data for the different time limit pairs, i.e., configurations using the particular reuse type with varying time limits. Similarly, the table contains one HTML and one CSV file per time limit pair. These files provide for each test task (program) the experimental data for the different reuse types. The last row presents the log files (zip-compressed). There is one log file per configuration and test task. Log files are grouped by the time limit pair used in the CoVeriTest configuration (reflecting how the experiments were run).
The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to look at the distribution of the achieved coverage.
Since a program may have a large number of unreachable test goals, we decided not to compare the total coverage, i.e., the number of covered test goals divided by the total number of test goals, but a relative measure (called relative coverage) that compares against the best value achieved. Hence, when comparing a set of configurations, we first compute for each program (test task) the largest number of covered test goals reported by any configuration in the set. To compute the relative coverage of a configuration, we then divide the number of test goals covered by that configuration by this largest number.
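For illustration (with hypothetical numbers): if three configurations cover 40, 48, and 50 of a program's test goals, the largest reported number is 50, and the relative coverage values are 40/50 = 0.8, 48/50 = 0.96, and 50/50 = 1.0.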
As explained above, we have two parameters for CoVeriTest that we can control in our experiments, namely the time limit pair and the reuse type. In our paper, we studied which time limit pair to use for a particular reuse type and, based on this, which CoVeriTest configuration is best; we also studied which reuse type to use for a particular time limit pair.
The following nine figures show six box plots for each of the nine reuse types. Each box plot shows the distribution of the relative coverage. For the per-task comparison of the coverage, we refer to the HTML and CSV tables above. Looking at the figures, we observe that the first four reuse types perform best with few switches, while the last five reuse types perform best when preferring the predicate analysis. Since the first four reuse types let an analysis forget all information found in a previous run, switching seems best avoided for them. If this is not the case (as for the last five reuse types shown), the more powerful analysis (predicate analysis) is preferred.
Figures (box plots, one per reuse type): plain | condv | condp | condv,p | reuse-prec | reuse-arg | condv+r | condp+r | condv,p+r
Based on the previous results, we chose for each reuse type the best time limit pair and compared the different reuse types. Next, we provide the box plot comparison known from above as well as the CSV and HTML tables of this comparison. The result of the comparison is that the best configuration uses reuse type reuse-arg and time limit pair 20×80.
HTML Table | CSV Table |
The following six figures show nine box plots for each of the six time limit pairs. Each box plot shows the distribution of the relative coverage. For the per-task comparison of the coverage, we refer to the HTML and CSV tables above. Looking at the figures, we observe that for each time limit pair there always exists a configuration that is better than the first four reuse types. Thus, CoVeriTest should be configured to reuse information that an analysis generated in a previous iteration of the same CoVeriTest run. Furthermore, either reuse type reuse-arg or condv performed best.
Figures (box plots, one per time limit pair): 10×10 | 50×50 | 100×100 | 250×250 | 80×20 | 20×80
Based on the previous results, we chose for each time limit pair the best reuse type and compared the different time limit pairs. Next, we provide the box plot comparison known from above as well as the CSV and HTML tables of this comparison. The result of the comparison is that the best configuration uses reuse type condv+r and time limit pair 20×80.
HTML Table | CSV Table |
Depending on which parameter is fixed first, the best CoVeriTest configuration differs slightly. While the time limit pair is the same, the reuse type differs. Already in our paper, we mentioned that these two configurations are close. The reason is that we use relative coverage instead of total coverage: since the set of compared configurations differs, the maximal coverage value, and thus the relative coverage value, can differ for some test tasks. A direct comparison of only these two configurations revealed that the configuration (condv+r, 20×80) achieves higher coverage for more tasks, while the configuration (reuse-arg, 20×80) achieves higher coverage in total. Thus, we follow our paper and refer to the latter configuration as the best CoVeriTest configuration.
As already mentioned, CoVeriTest interleaves the two analyses: predicate and value analysis. To investigate whether CoVeriTest's interleaving is beneficial, we compare the best CoVeriTest configuration (reuse-arg, 20×80) with the two component analyses and with a parallel execution of both analyses.
For the comparison, we again used all 6703 C programs from the largest publicly available benchmark collection of C verification tasks. Since all configurations are still run by the same tool CPAchecker, we decided to disable test-case generation, too.
The following table presents the experimental data (XML and log files) of the three non-CoVeriTest test-case generation experiments. The experimental data for the best CoVeriTest configuration is provided in the table for the comparison of CoVeriTest configurations. Additionally, the table contains HTML and CSV tables comparing the best CoVeriTest configuration with each of the other three configurations.
Analysis Type | Raw Data | CoVeriTest vs. Analysis (HTML) | CoVeriTest vs. Analysis (CSV)
---|---|---|---
Parallel Analysis (Value+Predicate) | Result XML | HTML Table | CSV Table |
Predicate Analysis | Result XML | HTML Table | CSV Table |
Value Analysis | Result XML | HTML Table | CSV Table |
Log Files |
The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to use scatter plots that contain one point per test task. Each point (x, y) represents the coverage x achieved by CoVeriTest and the coverage y achieved by the other analysis on the same task.
Looking at the scatter plots, we observe that CoVeriTest outperforms the value analysis. In the other two cases, the picture is less clear: CoVeriTest is beneficial for certain classes of programs, but not always best.
To study when and whether CoVeriTest improves over state-of-the-art test-case generation tools, we compare the branch coverage achieved by CoVeriTest's best configuration (reuse-arg,20×80) with the branch coverage achieved by the participants of the International Competition on Software Testing (Test-Comp'19). In our paper, we only compared against the best two tools.
For our comparison, we used all 1720 programs used by Test-Comp'19 in the category Code Coverage. Furthermore, we could no longer use the reported number of covered goals for the comparison, but had to (1) generate test cases with CoVeriTest and (2) let the Test-Comp validator measure the branch coverage. In the following, we show six scatter plots. Each scatter plot compares the branch coverage achieved by CoVeriTest against the branch coverage achieved by one Test-Comp participant. We ordered the scatter plots by the success of the tool in Test-Comp'19 (best first). Note that we did not include the two ESBMC participants because they do not output a single test case in the branch coverage category.
Looking at the scatter plots, we observe that CoVeriTest neither dominates nor is dominated by any of the approaches. Thus, it complements existing approaches.
The following table provides the raw data for the above comparison. The generated tests are available in our artifact. Furthermore, note that we did not produce the raw data for Test-Comp'19 ourselves, but link to the raw data of Test-Comp'19, which we used in the comparison.
Analysis Type | Raw Data (XML) | Raw Data (Log Files)
---|---|---
Test-case generation | Result XML | Log Files |
Test-case validation | Result XML | Log Files |
Test-Comp'19 results | All Raw |
BASEDIR
├── benchmark_defs
├── benchmarks
│   ├── sv-benchmarks
│   ├── sv-benchmarks-testcomp19
├── CoVeriTest.pdf
├── CPAchecker
├── debs
├── experiments
│   ├── data
│   ├── figures
│   ├── table
├── README.md
├── validation
│   ├── coverage-val.sh
│   ├── tbf-test-suite-validator
│   ├── tool-info
│   ├── unzip_testcases.py
│   ├── zip_testcases.py
Install the required software dependencies from the folder debs and set up BenchExec for <USER> using the following commands.
cd $BASEDIR/debs
sudo dpkg -i *.deb
sudo adduser <USER> benchexec
reboot
Before we explain how to replicate our results, we show on the example program test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c how to run the different CoVeriTest configurations. We assume that all run commands are executed from $BASEDIR/CPAchecker. For our tutorial, we structured all run commands in the same way. First, each command sets up the tool environment and the total time limit of the run via scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats. The parameter -spec tells CoVeriTest which functions cause program termination. The next two parameters differ per configuration and define the reuse type as well as the time limits. Finally, the program is given.
In the following, we list for each of the nine reuse types the command line required to execute the CoVeriTest configuration that uses this reuse type and time limit pair 10×10 (i.e., each of the two analyses is granted 10 s per iteration). To use a different time limit pair (x, y), adapt the command line as follows: replace the first occurrence of _10 by _x and the second by _y (an adapted example is shown after the nine commands below).
Note that the configurations target programs with a 32-bit architecture. To generate tests for programs with a 64-bit architecture, add the parameter -64. Furthermore, we disabled the output of test cases. To enable test-case generation, add the parameter -setprop testcase.values=%d.test.txt. The test cases are written to $BASEDIR/CPAchecker/output. Each test case contains one test input value per line.
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-noreuse \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::noreuse_10,\
config/testCaseGeneration-predicateAnalysis.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-v2p \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-cmc-condition.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-use-cmc-condition.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-p2v \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-use-cmc-condition.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-cmc-condition.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-bidirectional \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-and-use-cmc-condition.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-and-use-cmc-condition.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-precision-reuse \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::reuse-precision_10,\
config/testCaseGeneration-predicateAnalysis.properties::reuse-precision_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-continue \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::continue_10,\
config/testCaseGeneration-predicateAnalysis.properties::continue_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-v2p \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-cmc-condition.properties::continue_10,\
config/components/testCaseGeneration-predicate-use-cmc-condition-precision.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-p2v \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-use-cmc-condition-precision.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-cmc-condition.properties::continue_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-bidirectional \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-and-use-cmc-condition-precision.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-and-use-cmc-condition-precision.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
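To illustrate the adaptations described before the command list, the following sketch modifies the first command (reuse type plain) to use time limit pair 20×80 and to enable test-case output; the placement of the additional -setprop parameter is our choice, and the command is meant as an illustration rather than a configuration for which we report results.

scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-noreuse \
-setprop testcase.values=%d.test.txt \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::noreuse_20,\
config/testCaseGeneration-predicateAnalysis.properties::noreuse_80 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c

With test-case output enabled, the generated files appear in $BASEDIR/CPAchecker/output, each containing one test input value per line.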
As already explained above, we performed three types of experiments: an internal comparison of CoVeriTest configurations, a comparison of CoVeriTest against its component analyses, and a comparison of CoVeriTest against the Test-Comp participants. In the following, we explain how to repeat the experiments of each of the three types. If your system has swap memory, you need to turn it off.
sudo swapoff -a
You can later turn it on again by executing the following command.
sudo swapon -a
To replicate our results, you need 8 CPU cores and 15 GB of memory.
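To check whether your machine provides these resources, you can use standard Linux commands (a convenience check only, not part of our benchmark setup):

nproc    # number of available CPU cores (should report at least 8)
free -h  # total memory (should report at least 15 GB)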
The results of the experiments can be found in the sub-folder results, which is created below the location from which you started the respective benchexec command, i.e., either $BASEDIR/CPAchecker or $BASEDIR/validation.
To run the CoVeriTest configurations with time limit pair 10×10, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_10_10.xml
To run the CoVeriTest configurations with time limit pair 50×50, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_50_50.xml
To run the CoVeriTest configurations with time limit pair 100×100, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_100_100.xml
To run the CoVeriTest configurations with time limit pair 250×250, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_250_250.xml
To run the CoVeriTest configurations with time limit pair 80×20, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_80_20.xml
To run the CoVeriTest configurations with time limit pair 20×80, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_20_80.xml
To compare the best CoVeriTest configuration with the single analyses used by CoVeriTest as well as with the parallel execution of these analyses, we need to run three analyses: the value analysis, the predicate analysis, and the parallel execution of value and predicate analysis. To regenerate the experimental data for these three analyses, execute the following commands. For CoVeriTest, we reuse the experimental data from the first experiment.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation-single+parallel.xml
To compare our best CoVeriTest configuration with the tools participating in the International Competition on Software Testing (Test-Comp'19), we first let the best CoVeriTest configuration (reuse-arg, 20×80) generate test cases for the test tasks of Test-Comp'19 that support the property coverage-branches.prp. Thereafter, we use the Test-Comp validator to measure for each task the branch coverage achieved by CoVeriTest on that task. Since the CoVeriTest version used in the evaluation does not support the test-case format of the validator, we use a wrapper script (coverage-val.sh) that converts the test cases and then calls the validator.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-case-generation-compare.xml
cd $BASEDIR/validation
export PYTHONPATH=$PYTHONPATH:.
benchexec --container ../benchmark_defs/test-case-coverage-val.xml
If you want to use the test cases you generated with the commands in step one, you need to adjust the paths to the test cases of your benchmark accordingly. To this end, adapt the text ending in test.zip in the option and requiredfiles tags of the benchmark file test-case-coverage-val.xml. Thereafter, you can use the above command.
We compared the produced coverage values with the results of the test-case generation tools participating in Test-Comp'19. Note that this comparison only makes sense because we used the same resource limits and infrastructure as Test-Comp'19.