Looking at the Empirical Evidence for Using Pairwise and Combinatorial Software Testing

By Justin Hunter · Mar 25, 2013

This post addresses some comments and skeptical (in a good way) questions raised by Phil Kirkham in response to our recent post, The Software Testing Community Needs More Empirical Studies.

As background to my answers: the studies and dozens of proof-of-concept pilot projects that I’ve been directly involved with have sought to answer these three questions:

1) Is it actually faster to generate tests with Hexawise than to create and document them manually?

Consistent findings: Yes. It takes, on average, about 40% less time to create and document tests using Hexawise, because it allows testers to partially automate the test selection and test documentation steps.

2) Is it possible to generate smaller sets of tests that will be as thorough as, or more thorough than, larger sets of manually created tests, and allow testers to find more defects in less test execution time?

Consistent findings: Yes. Typically more than twice as many defects per tester hour. See, e.g., the IEEE Computer article written with 3 PhDs showing a 2.4-fold increase in defects found per tester hour. A more recent set of 10 proof-of-concept pilot projects at an insurance firm revealed 3.0 times as many defects per tester hour. See: Does pairwise testing really work? Evidence, data, and case studies.
This is because Hexawise-generated tests (or any pairwise tests, for that matter) consistently have dramatically less wasteful repetition than manually selected tests do, and because Hexawise-generated tests leave no potential dual-mode faults untested (that is, no potential pairwise defects involving test inputs that the tester has contemplated and included in their model).
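To make that coverage claim concrete, here is a minimal Python sketch using a hypothetical three-parameter model (the parameter and value names are illustrative, not taken from any of the studies). It enumerates every possible pair of values across different parameters and reports which pairs a given suite covers and which it misses; a pairwise-generated suite covers all of them, while a typical manually selected suite repeats some pairs and misses others.

```python
from itertools import combinations, product

# Hypothetical parameters and values for a simple checkout form
# (illustrative only; not drawn from any of the studies cited above).
parameters = {
    "browser":  ["Chrome", "Firefox", "Safari"],
    "payment":  ["Visa", "PayPal"],
    "shipping": ["Standard", "Express"],
}

def all_possible_pairs(params):
    """Every 2-way combination of values across two different parameters."""
    pairs = set()
    for p1, p2 in combinations(sorted(params), 2):
        for v1, v2 in product(params[p1], params[p2]):
            pairs.add(((p1, v1), (p2, v2)))
    return pairs

def covered_pairs(tests):
    """The parameter-value pairs actually exercised by a test suite."""
    pairs = set()
    for test in tests:
        for (p1, v1), (p2, v2) in combinations(sorted(test.items()), 2):
            pairs.add(((p1, v1), (p2, v2)))
    return pairs

# A manually selected suite often repeats some pairs and misses others.
manual_suite = [
    {"browser": "Chrome",  "payment": "Visa",   "shipping": "Standard"},
    {"browser": "Chrome",  "payment": "Visa",   "shipping": "Express"},
    {"browser": "Firefox", "payment": "PayPal", "shipping": "Standard"},
]

target  = all_possible_pairs(parameters)
covered = covered_pairs(manual_suite)
print(f"Pairs covered: {len(covered)} of {len(target)}")   # 8 of 16
print("Untested pairs:", sorted(target - covered))
```

Run the same check against a pairwise-generated suite for this model and every one of the 16 pairs is reported as covered, which is exactly the "no dual-mode faults left untested" property described above.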

3) Finding more defects per tester hour is certainly nice, but do Hexawise-generated tests find MORE defects in total?

Consistent findings: Yes. Much smaller sets of Hexawise-generated tests have consistently found more defects, on average by about 13%.

 

In answer to specific questions:

Phil Kirkham: “Seems a very basic measure of a tester’s productivity. How about the severity of the defects?”

JH: Agreed.

In my experience over more than 5 years of helping teams conduct these proof-of-concept pilot projects, pairwise and Hexawise-generated tests are just plain more effective at finding defects. They find more of ALL kinds of defects. My experience has been that the types of defects being found are not skewed toward less significant types of defects, nor are severe defects being missed. A case in point: at my old firm, one of the early adopters of orthogonal array testing approaches ran pilot project after pilot project with teams of testers reporting into him. I can’t remember the exact number of pilots he had conducted, perhaps 20 or so, before he encountered a single defect that escaped the Hexawise-generated tests but was found by the much longer set of manually selected tests. So, a short, blunt, honest answer to your excellent question is: “Believe it or not, Phil, it almost never matters. This approach will find ALL of the defects you otherwise would have found. Plus additional ones.”*

*A major caveat that calls into question both my specific answer (to your question concerning severity) and the results of all of the studies I’ve been involved with: testers like you and me are strong proponents of Exploratory Testing. These studies, though, treat the test inputs and test cases as “frozen.” You have the test cases in list A (created manually) and the test cases in list B (created with Hexawise). The ideas about what can be changed from test to test (parameters) and how each of those things can be changed (values) are identical in both lists. The difference is that list A has lots of wasteful repetition and lots of gaps in coverage. List B has neither. That’s the only difference. Then one tester executes the tests from list A and another tester executes the tests from list B. But what if you have an unskilled tester following rote scripts executing one set of tests and someone like you, Rob Sabourin, Michael Bolton, James Bach, Shmuel Gershon, Ajay Balamurugadas, etc., executing the second set? Whoa! All bets are off. What would happen is that skilled Exploratory Testers would use the Hexawise-generated test ideas (which they would not want to be overly detailed) and go “off script” to explore interesting test ideas that they cooked up in real time as they were doing their testing. So skilled Exploratory Testers would be able to find defects (presumably including serious ones) that the written test cases, whether manually created or created by Hexawise, would not lead them to directly. That’s an important topic for another time. I’ll be talking about Exploratory Combinatorial Testing at the Conference of the Association for Software Testing (CAST) this year in my hometown of Madison, Wisconsin. Since you’re also going, perhaps we could collaborate and you could share your experiences (good or bad) with the attendees. I’ll happily give you 10 minutes of my speaking time if you’d like.

PK: Are you only counting functional defects?

JH: Not explicitly. The directions I give to teams running these pilot projects are: try to answer the three questions above, and report defects. We’re not after a count of “failed test cases”; we’re after a number of defects. Having said that, as you might suspect, most of the defects reported tend to be functional defects.

PK: What about all the other types of defect?

JH: They’re not reported as often, but we count them too. In situations where one tester reports a “hard to spot” bug (e.g., one that might take a more experienced tester to identify), it raises the possibility that the bug is being reported not because one set of tests is superior to the other but because one tester is better than the other. Accordingly, in an effort to keep an apples-to-apples comparison, we talk with the tester and try to determine, with the tester’s input, whether they would have found that same defect with the other set of tests. If the answer is yes, we report the defect as “found” in both sets of tests. This doesn’t happen as often as you might suspect it would.

PK: Does the project type matter?

JH: Yes. Benefits tend to be relatively smaller when a disproportionately high percentage of the tests are small, discrete, one-off tests; larger when there are more than 5 parameters that interact in meaningful ways; and easier to capture when the System Under Test does not have a lot of conditional branching logic.
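As a rough illustration of why the benefit grows with the number of interacting parameters, here is a short Python sketch. The system modeled (8 parameters with 4 values each) is a made-up example, not one of the pilot projects; it simply compares the exhaustive combination count with the much smaller number of distinct value pairs a pairwise suite has to cover.

```python
from math import prod
from itertools import combinations

# Hypothetical system: 8 parameters, 4 values each (illustrative only).
value_counts = [4] * 8

# Testing every combination of every value.
exhaustive = prod(value_counts)

# Distinct 2-way (pairwise) value combinations that need to be covered.
pairs_to_cover = sum(a * b for a, b in combinations(value_counts, 2))

# A pairwise suite needs at least (largest count x second-largest count)
# tests, since each test covers only one value pair per parameter pair.
counts = sorted(value_counts)
lower_bound = counts[-1] * counts[-2]

print(f"Exhaustive tests:        {exhaustive}")         # 65536
print(f"Distinct pairs to cover: {pairs_to_cover}")     # 448
print(f"Pairwise suite needs >=  {lower_bound} tests")  # 16
```

The exhaustive count grows multiplicatively with every parameter added, while the pairwise lower bound stays at the product of the two largest value counts; generated pairwise suites for a model like this typically come to a few dozen tests rather than tens of thousands, which is where the efficiency gains come from.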

PK: How about the devs they are working with and the practices they follow?

JH: I don’t have enough empirical evidence to say definitively. It’s used successfully by thousands of testers on waterfall projects and thousands more on Agile projects.

PK: What about the experience of the tester? Does that make a difference?

JH: Even more important than the experience level of the tester are, in order, (1) analytical ability, (2) willingness to try new things, and (3) willingness to ask questions. By my estimate, about 50% of the testers I come across at our clients (almost all at Fortune 2000 firms) would not be able to design excellent sets of pairwise tests from scratch. This is because above-average analytical ability is required for testers to select parameters and values from their Systems Under Test in a thoughtful way. Getting back to experience level, some of our strongest users at our clients are straight out of college. They start work, get exposed to Hexawise, “get it,” and don’t look back. Interestingly, some testers who have been testing for, say, 10 years or more, while experienced, sometimes seem too set in their ways to embrace this rather different approach to designing tests.

PK: If they are working with top-of-the-range developers (as some lucky testers are, cough cough), then there aren’t that many functional bugs to be found and you’re looking at browser compatibility, usability, race conditions. Is combinatorial testing going to find these more quickly?

JH: Yes. Absolutely. If you’d like to collaborate to test that and help gather empirical evidence that you could share at CAST, I would be happy to work with you to do just that. If your experience contradicts what I’m saying here, you’d have the floor to tell CAST participants what your actual experience was.

PK: I read the study in the link. 97% of defects could be found by pairwise combinatorial testing? Really? ALL types of defects? Really? How can pairwise find a defect caused by a missing or ambiguous or inconsistent requirement, or a performance or security issue?

JH: The statistics I quote are a lot lower than that. The pie chart I use averages out several studies done by PhDs that have found, on average, 84% of defects could be triggered by testing for all combinations of 2 test inputs. The 97% figure is eyebrow-raising on its own (regardless of industry). Given that it came from the medical device industry in the United States (one of the most litigious areas in the history of the world?), that statistic is particularly mind-boggling. What the PhDs in that study did was take a look at all of the medical devices that had been taken off the market in the United States as a result of software defects. Then they investigated how many test inputs would be required to trigger each of those defects. The authors of that study found that an astonishing 97% of those defects could have been triggered by just 2 test inputs.

PK: Love your passion and enthusiasm, and I do have a beta of Hexawise to see if it can do anything for my productivity. And I might agree that there is a lack of empirical studies, not just in the testing community but in the software community as a whole, into the effectiveness (or not) of how software is produced.

JH: Thanks. I hope you have positive experiences using Hexawise, and I’m happy to help if you ever have any questions about using it on your projects.