Testing is a widely used method to assess software quality. Coverage criteria and coverage measurements are used to ensure that the constructed test suites adequately test the given software. Since manually developing such test suites is too expensive in practice, various automatic test-generation approaches have been proposed. Since the approaches have different strengths, combining them is necessary to obtain stronger tools. We study cooperative combinations of verification approaches for test generation, with high-level information exchange. We present CoVeriTest, a hybrid approach for test-case generation that iteratively applies different conditional model checkers. It allows one to adjust the level of cooperation and to assign individual time budgets per verifier. In our experiments, we combine explicit-state model checking and predicate abstraction (from CPAchecker) to systematically study different CoVeriTest configurations. Moreover, CoVeriTest achieves higher coverage than state-of-the-art test-generation tools for some programs.
We performed three types of experiments: an internal comparison of CoVeriTest configurations, a comparison of CoVeriTest against its component analyses, and a comparison of CoVeriTest against the participants of Test-Comp'19.
Next to the analysis components, which we fixed, CoVeriTest allows one to configure the time limit of each component (per iteration) and how information is reused between analysis executions of the same CoVeriTest run. We used six different time limit pairs to investigate whether CoVeriTest should switch often, sometimes, or rarely between analyses, and whether it should distribute the time equally among the component analyses or not. Furthermore, we used nine reuse types, ranging from no exchange, over reusing only one's own results, to full cooperation. For details on the different reuse types, we refer to our paper.
For the comparison of these 54 configurations, we used all 6703 C programs from the largest publicly available benchmark collection of C verification tasks. Since all configurations are run by the same tool (CPAchecker), we decided to disable test-case generation (thus reducing the amount of data) and only compare the number of covered test goals reported by CPAchecker.
The following table presents the experimental data of our study of the different CoVeriTest configurations. For each combination of reuse type and time limit pair, the table provides the raw data (compressed XML file). Additionally, the table contains one HTML and one CSV file per reuse type. These files show for each test task (program) the experimental data for the different time limit pairs, i.e., configurations using the particular reuse type with varying time limits. Similarly, the table contains one HTML and one CSV file per time limit pair. These files provide for each test task (program) the experimental data for the different reuse types. The last row presents the log files (zip-compressed). There is one log file per configuration and test task. Log files are grouped by the time limit pair used in the CoVeriTest configuration (reflecting how the experiments were run).
The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to look at the distribution of the achieved coverage.
Since a program may have a large number of unreachable test goals, we decided not to compare the total coverage, i.e., the number of covered test goals divided by the total number of test goals, but a relative measure (called relative coverage) that compares against the best value achieved. Hence, when comparing a set of configurations, we first compute for each program (test task) the largest number of covered test goals reported by any configuration in the set. To compute the relative coverage of a configuration, we then divide the number of test goals covered by that configuration by this largest number.
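For illustration (with hypothetical numbers): if three configurations cover 40, 48, and 50 of a program's test goals, the largest reported number is 50, and the relative coverage values are 40/50 = 0.8, 48/50 = 0.96, and 50/50 = 1.0.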
As explained above, we have two parameters for CoVeriTest that we can control in our experiments, namely the time limit pair and the reuse type. In our paper, we studied which time limit pair to use for a particular reuse type and, based on this, which CoVeriTest configuration is best; we also studied which reuse type to use for a particular time limit pair.
The following nine figures show six box plots for each of the nine reuse types. Each box plot shows the distribution of the relative coverage. For the per-task comparison of the coverage, we refer to the HTML and CSV tables above. Looking at the figures, we observe that the first four reuse types perform best with few switches, while the last five reuse types perform best when preferring the predicate analysis. Since the first four reuse types let an analysis forget all information found in a previous run, switching seems best avoided for them. If this is not the case (as for the last five reuse types shown), the more powerful analysis (predicate analysis) is preferred.
Figures (box plots, one per reuse type): plain | condv | condp | condv,p | reuse-prec | reuse-arg | condv+r | condp+r | condv,p+r
Based on the previous results, we chose for each reuse type the best time limit pair and compared the different reuse types. Next, we provide the box plot comparison known from above as well as the CSV and HTML tables of this comparison. The result of the comparison is that the best configuration uses reuse type reuse-arg and time limit pair 20×80.
HTML Table | CSV Table |
The following six figures show nine box plots for each of the six time limit pairs. Each box plot shows the distribution of the relative coverage. For the per-task comparison of the coverage, we refer to the HTML and CSV tables above. Looking at the figures, we observe that for each time limit pair there always exists a configuration that is better than the first four reuse types. Thus, CoVeriTest should be configured to reuse information that an analysis generated in a previous iteration of the same CoVeriTest run. Furthermore, either reuse type reuse-arg or condv performed best.
Figures (box plots, one per time limit pair): 10×10 | 50×50 | 100×100 | 250×250 | 80×20 | 20×80
Based on the previous results, we chose for each time limit pair the best reuse type and compared the different time limit pairs. Next, we provide the box plot comparison known from above as well as the CSV and HTML tables of this comparison. The result of the comparison is that the best configuration uses reuse type condv+r and time limit pair 20×80.
HTML Table | CSV Table |
Depending on which parameter is fixed first, the best CoVeriTest configuration differs slightly. While the time limit pair is the same, the reuse type differs. Already in our paper, we mentioned that these two configurations are close. The reason is that we use relative coverage instead of total coverage: since the set of compared configurations differs, the maximal coverage value, and thus the relative coverage value, can differ for some test tasks. A direct comparison of only these two configurations revealed that the configuration (condv+r, 20×80) achieves higher coverage for more tasks, while the configuration (reuse-arg, 20×80) achieves higher coverage in total. Thus, we follow our paper and refer to the latter configuration as the best CoVeriTest configuration.
As already mentioned, CoVeriTest interleaves the two analyses: predicate and value analysis. To investigate whether CoVeriTest's interleaving is beneficial, we compare the best CoVeriTest configuration (reuse-arg, 20×80) with the two component analyses and with a parallel execution of both analyses.
For the comparison, we again used all 6703 C programs from the largest publicly available benchmark collection of C verification tasks. Since all configurations are still run by the same tool CPAchecker, we decided to disable test-case generation, too.
The following table presents the experimental data (XML and log files) of the three non-CoVeriTest test-case generation experiments. The experimental data for the best CoVeriTest configuration is provided in the table for the comparison of CoVeriTest configurations. Additionally, the table contains HTML and CSV tables comparing the best CoVeriTest configuration with each of the other three configurations.
Analysis Type | Raw Data | CoVeriTest vs. Analysis (HTML) | CoVeriTest vs. Analysis (CSV)
---|---|---|---
Parallel Analysis (Value+Predicate) | Result XML | HTML Table | CSV Table |
Predicate Analysis | Result XML | HTML Table | CSV Table |
Value Analysis | Result XML | HTML Table | CSV Table |
Log Files |
The above table lets one compare the coverage results individually per test task (program). To compare the different configurations, we need to aggregate the results over all test tasks. We decided to use scatter plots that contain one point per test task. Each point (x, y) represents the coverage x achieved by CoVeriTest and the coverage y achieved by the other analysis on the same task.
Looking at the scatter plots, we observe that CoVeriTest outperforms the value analysis. In the other two cases, the picture is less clear: CoVeriTest is beneficial for certain classes of programs, but not always best.
To study when and whether CoVeriTest improves over state-of-the-art test-case generation tools, we compare the branch coverage achieved by CoVeriTest's best configuration (reuse-arg,20×80) with the branch coverage achieved by the participants of the International Competition on Software Testing (Test-Comp'19). In our paper, we only compared against the best two tools.
For our comparison, we used all 1720 programs used by Test-Comp'19 in the category Code Coverage. Furthermore, we could no longer use the reported number of covered goals for the comparison, but had to (1) generate test cases with CoVeriTest and (2) let the Test-Comp validator measure the branch coverage. In the following, we show six scatter plots. Each scatter plot compares the branch coverage achieved by CoVeriTest against the branch coverage achieved by one Test-Comp participant. We ordered the scatter plots by the success of the tool in Test-Comp'19 (best first). Note that we did not include the two ESBMC participants because they do not output a single test case in the branch coverage category.
Looking at the scatter plots, we observe that CoVeriTest neither dominates nor is dominated by any of the approaches. Thus, it complements existing approaches.
The following table provides the raw data for the above comparison. The generated tests are available in our artifact. Furthermore, note that we did not produce the raw data for Test-Comp'19 ourselves, but link to the raw data of Test-Comp'19, which we used in the comparison.
Analysis Type | Raw Data (XML) | Raw Data (Log Files)
---|---|---
Test-case generation | Result XML | Log Files |
Test-case validation | Result XML | Log Files |
Test-Comp'19 results | All Raw |
BASEDIR
├── benchmark_defs
├── benchmarks
│   ├── sv-benchmarks
│   ├── sv-benchmarks-testcomp19
├── CoVeriTest.pdf
├── CPAchecker
├── debs
├── experiments
│   ├── data
│   ├── figures
│   ├── table
├── README.md
├── validation
│   ├── coverage-val.sh
│   ├── tbf-test-suite-validator
│   ├── tool-info
│   ├── unzip_testcases.py
│   ├── zip_testcases.py
Install the required software dependencies from the folder debs and set up BenchExec for <USER> using the following commands.
cd $BASEDIR/debs
sudo dpkg -i *.deb
sudo adduser <USER> benchexec
reboot
Before we explain how to replicate our results, we show on the example program test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c how to run the different CoVeriTest configurations. We assume that all run commands are executed from $BASEDIR/CPAchecker. For our tutorial, we structured all run commands in the same way. First, each command sets up the tool environment and the total time limit of the run via scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats. The parameter -spec tells CoVeriTest which functions cause program termination. The next two parameters differ per configuration and define the reuse type as well as the time limits. Finally, the program is given.
In the following, we list for each of the nine reuse types the command line required to execute the CoVeriTest configuration that uses this reuse type and time limit pair 10×10 (i.e., each of the two analyses is granted 10 s per iteration). To use a different time limit pair (x, y), adapt the command line as follows: replace the first occurrence of _10 by _x and the second by _y (an adapted example is shown after the nine commands below).
Note that the configurations target programs with a 32-bit architecture. To generate tests for programs with a 64-bit architecture, add the parameter -64. Furthermore, we disabled the output of test cases. To enable test-case generation, add the parameter -setprop testcase.values=%d.test.txt. The test cases are written to $BASEDIR/CPAchecker/output. Each test case contains one test input value per line.
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-noreuse \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::noreuse_10,\
config/testCaseGeneration-predicateAnalysis.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-v2p \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-cmc-condition.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-use-cmc-condition.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-p2v \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-use-cmc-condition.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-cmc-condition.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-bidirectional \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-and-use-cmc-condition.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-and-use-cmc-condition.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-precision-reuse \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::reuse-precision_10,\
config/testCaseGeneration-predicateAnalysis.properties::reuse-precision_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-continue \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::continue_10,\
config/testCaseGeneration-predicateAnalysis.properties::continue_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-v2p \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-cmc-condition.properties::continue_10,\
config/components/testCaseGeneration-predicate-use-cmc-condition-precision.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-p2v \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-use-cmc-condition-precision.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-cmc-condition.properties::continue_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-cmc-bidirectional \
-setprop interleavedAlgorithm.intermediateStatistics=EXECUTE \
-setprop interleavedAlgorithm.configFiles=config/components/testCaseGeneration-value-generate-and-use-cmc-condition-precision.properties::noreuse_10,\
config/components/testCaseGeneration-predicate-generate-and-use-cmc-condition-precision.properties::noreuse_10 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c
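To illustrate the adaptations described before the command list, the following sketch modifies the first command (reuse type plain) to use time limit pair 20×80 and to enable test-case output; the placement of the additional -setprop parameter is our choice, and the command is meant as an illustration rather than a configuration for which we report results.

scripts/cpa.sh -benchmark -heap 10000M -timelimit 900s -setprop log.consoleLevel=SEVERE -stats \
-spec config/specification/sv-comp-terminatingfunctions.spc \
-testCaseGeneration-interleaved-value+predicate-noreuse \
-setprop testcase.values=%d.test.txt \
-setprop interleavedAlgorithm.configFiles=config/testCaseGeneration-valueAnalysis.properties::noreuse_20,\
config/testCaseGeneration-predicateAnalysis.properties::noreuse_80 \
../benchmarks/sv-benchmarks/c/locks/test_locks_5_true-unreach-call_true-valid-memsafety_false-termination.c

With test-case output enabled, the generated files appear in $BASEDIR/CPAchecker/output, each containing one test input value per line.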
As already explained above, we performed three types of experiments: an internal comparison of CoVeriTest configurations, a comparison of CoVeriTest against its component analyses, and a comparison of CoVeriTest against the Test-Comp participants. In the following, we explain how to repeat the experiments of each of the three types. If your system has swap memory, you need to turn it off.
sudo swapoff -a
You can later turn it on again by executing the following command.
sudo swapon -a
To replicate our results, you need 8 CPU cores and 15 GB of memory.
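To check whether your machine provides these resources, you can use standard Linux commands (a convenience check only, not part of our benchmark setup):

nproc    # number of available CPU cores (should report at least 8)
free -h  # total memory (should report at least 15 GB)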
The results of the experiments can be found in the sub-folder results, which is created below the location from which you started the respective benchexec command, i.e., either $BASEDIR/CPAchecker or $BASEDIR/validation.
To run the CoVeriTest configurations with time limit pair 10×10, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_10_10.xml
To run the CoVeriTest configurations with time limit pair 50×50, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_50_50.xml
To run the CoVeriTest configurations with time limit pair 100×100, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_100_100.xml
To run the CoVeriTest configurations with time limit pair 250×250, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_250_250.xml
To run the CoVeriTest configurations with time limit pair 80×20, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_80_20.xml
To run the CoVeriTest configurations with time limit pair 20×80, use the following commands.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation_20_80.xml
To compare the best CoVeriTest configuration with the single analyses used by CoVeriTest as well as with the parallel execution of these analyses, we need to run three analyses: the value analysis, the predicate analysis, and the parallel execution of value and predicate analysis. To regenerate the experimental data for these three analyses, execute the following commands. For CoVeriTest, we reuse the experimental data from the first experiment.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-generation-single+parallel.xml
To compare our best CoVeriTest configuration with the tools participating in the International Competition on Software Testing (Test-Comp'19), we first let the best CoVeriTest configuration (reuse-arg, 20×80) generate test cases for the test tasks of Test-Comp'19 that support the property coverage-branches.prp. Thereafter, we use the Test-Comp validator to measure for each task the branch coverage achieved by CoVeriTest on that task. Since the CoVeriTest version used in the evaluation does not support the test-case format of the validator, we use a wrapper script (coverage-val.sh) that converts the test cases and then calls the validator.
cd $BASEDIR/CPAchecker
benchexec --container ../benchmark_defs/test-case-generation-compare.xml
cd $BASEDIR/validation
export PYTHONPATH=$PYTHONPATH:.
benchexec --container ../benchmark_defs/test-case-coverage-val.xml
If you want to use the test cases you generated with the commands in step one, you need to adjust the paths to the test cases of your benchmark accordingly. To this end, adapt the text ending in test.zip in the option and requiredfiles tags of the benchmark file test-case-coverage-val.xml. Thereafter, you can use the above command.
We compared the produced coverage values with the results of the test-case generation tools participating in Test-Comp'19. Note that this comparison only makes sense because we used the same resource limits and infrastructure as Test-Comp'19.