I run into the same problem quite often: people have a hard time distinguishing between good and bad tests.

So what are ‘bad tests’?

Bad tests are those that don’t make the tester learn as much as possible with each test step. Said another way: bad tests are repetitive.

Repetitive tests are time wasters. They make testing a mundane task to be completed. Repetitive tests don’t emphasize the complexity inherent in most systems today. And they let things (read: bugs, defects, faults) slip through into production.

Experiment Time!

Question: How bad are bad tests?

Research: (not much out there)

Hypothesis: Bad tests are deceptively bad.

Experiment:


  1. Choose some tests.
  2. Model their ideas in Hexawise.
  3. Lock-in bad tests as Requirements.
  4. Find out how many of the interactions bad tests cover.
  5. Then see how many tests are needed to cover all pairwise interactions.

Step 1: Choosing Tests

I took a set of tests that a group of testers all agreed covered the functionality that needed to be covered for the story they were testing.

Step 2: Test Designing

I sat down and analyzed them. Repetitive as any I’d seen. I sifted through them to pull out the main testing ideas. These I would use as parameters and values. I entered these into Hexawise.

Step 3: Lock-in Bad Tests

I used our Requirements feature to lock in their repetitive tests, so Hexawise would be forced to use their tests first. (This allowed me to model how many interactions each of their tests covered.)

Step 4: Analyzing Bad Tests

This is the chart Hexawise produced. In 30 tests, they covered 47% of the total possible pairwise interactions.

30 bad tests

Before we go on: Do you notice that there are little plateaus? From Test 5 to 6, 7 to 8, 14 to 15, 20 to 21, 22 to 23, 26 to 27, and 29 to 30? That means those were tests that did not include ANY new pairs. No new information could have been learned from those tests. Learning literally plateaued. 7 times. Nearly 1 out of 4 tests didn't teach the testers anything.
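The plateau effect can be reproduced outside the tool. The Ruby sketch below is a minimal illustration, not how Hexawise computes its chart. It assumes each test is a Hash of parameter name to value; the parameter names, values, and the four tests are hypothetical. For each test it counts how many previously unseen pairs the test adds and reports cumulative pairwise coverage.

```ruby
require 'set'

# Every value pair, across every pair of parameters, that could be covered.
def all_pairs(parameters)
  parameters.keys.sort.combination(2).flat_map do |p1, p2|
    parameters[p1].product(parameters[p2]).map { |v1, v2| [[p1, v1], [p2, v2]] }
  end.to_set
end

# The pairs covered by one test (sorted by parameter name so pairs compare equal).
def pairs_in(test)
  test.sort.combination(2).to_set
end

# One row per test: how many new pairs it added and the coverage so far.
def coverage_report(parameters, tests)
  total = all_pairs(parameters).size
  seen = Set.new
  tests.each_with_index.map do |test, index|
    new_pairs = pairs_in(test) - seen
    seen.merge(new_pairs)
    { test: index + 1, new_pairs: new_pairs.size,
      coverage: (100.0 * seen.size / total).round(1) }
  end
end

# Hypothetical model and tests, for illustration only.
parameters = {
  "account" => %w[regular VIP],
  "atm"     => %w[bank other],
  "amount"  => %w[$200 $600]
}
tests = [
  { "account" => "regular", "atm" => "bank",  "amount" => "$200" },
  { "account" => "regular", "atm" => "bank",  "amount" => "$600" },
  { "account" => "regular", "atm" => "bank",  "amount" => "$200" }, # repeats test 1
  { "account" => "VIP",     "atm" => "other", "amount" => "$600" }
]

coverage_report(parameters, tests).each do |row|
  note = row[:new_pairs].zero? ? "  <- plateau: no new pairs" : ""
  puts "Test #{row[:test]}: +#{row[:new_pairs]} pairs, #{row[:coverage]}% covered#{note}"
end
```

In this toy example the third test repeats the first, so it adds no new pairs and the coverage figure stays flat, which is exactly what a plateau in the chart means.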

Then I unleashed the Kraken Hexawise.

Step 5: Let Hexawise Optimize

I removed those Requirements to see how many tests Hexawise needs to cover all of the pairwise interactions in this specific functionality.

30 good tests

Okay, to be honest, I wanted Hexawise to do it in like 20 tests. (More Coverage. Fewer Tests.) But it used 30 (More Coverage). BUT (and this is a big BUT *snickers*) Hexawise covered 100% of the pairwise interactions in 30 tests.

Lessons Learned

No one would have guessed their tests were that bad by just reading through them. They looked perfectly fine. They read like good test cases. But as we started to visualize their coverage, we saw that perhaps they weren't achieving all they could. And when we compared bad tests to Hexawise tests, more coverage (in the same number of tests) is a clear winner.

In short:

  • Bad tests are deceptively bad
  • Sometimes you have to prove it
  • Pairwise tests can alleviate bad-test-initis

By: Jordan Weck on Jul 2, 2014

Categories: Combinatorial Software Testing, Pairwise Software Testing, Software Testing Efficiency

Some of those using Hexawise use Gherkin as their testing framework. Gherkin is based on a given [a], when [b] --> then [c] format. The idea is that this helps make communication clear and makes sure business rules are understood properly. Portions of this post may be a bit confusing for new Hexawise users, so links are provided for more details on various topics. But if you don't need to create output for Gherkin and you are confused, you can just skip this post.

A simple Gherkin scenario: Making an ATM withdrawal

Given a regular account
  And the account was originally opened at Goliath National
  And the account has a balance of $500 
When using a Goliath National Bank ATM
  And making a withdrawal of $200 
Then the withdrawal should be handled appropriately 

Hexawise users want to be able to specify the parameters (used in given and when statements) and then import the set of Hexawise generated test cases into a Gherkin style output.

In this example we will use the Hexawise sample test plan (Gherkin example), which you can access in your Hexawise account.

I'll get into how to export the Hexawise created test plans so they can be used to create Gherkin data tables below (we do this ourselves at Hexawise).

In the then field we default to an expected value of "the withdrawal should be handled appropriately." This is something that may benefit from some explanation.

If we want to provide exact details on what happens for every variation of parameter values in each test script, those details have to be created manually. That creates a great deal of work that has very little value. It is also an expensive way to manage for the long term, as each of those details has to be updated every time something changes. So in general, using a "behaves as expected" default value is best, with extra details provided only when worthwhile.

For some people, this way of thinking can be a bit difficult to take in at first and they have to keep reminding themselves how to best use Hexawise to improve efficiency and effectiveness.


To enter the default expected value, mouse over the final step in the auto scripts screen. When you mouse over that step you will see the "Add Expected Results" link. Click that and add your expected result text.


The expected value entered on the last step with no conditions (the when drop down box is blank) will be the default value used for the export (and therefore the one imported into Gherkin).

In those cases where providing special notes to testers is deemed worth the extra effort, Hexawise has 2 ways of doing this. If a special expected value exists for the particular conditions in an individual test case, then that special expected value will be exported (and therefore used for Gherkin).

Conditional expected results can be entered using the auto scripts feature.

Or we can use the requirements feature when we want to require a specific set of parameter values to be tested. If we choose 2-way coverage (the default, pairwise coverage), every pair of parameter values will be tested at least once.

But if we want a specific set of, say, 3 exact parameter values ([account type] = VIP, [withdrawal ATM] = bank-owned ATM, [withdrawal amount] = $600), then we need to include that as a requirement. Each required test script added also includes the option to include an expected result. The sample plan includes a required test case with those parameters and an expected result of "The normal limit of $400 is raised to $600 in the special case of a VIP account using a Goliath National Bank owned ATM."

So, the most effective way to use Hexawise to create a pairwise (or higher) test plan that is then used to create Gherkin data tables is to have the then case be similar to "behaves as expected." And when there is a need for special expected results, use the auto script or requirements features to include those details. Doing so will result in the expected result entered for that special case being the value used in the Gherkin table for then.
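The default-plus-exceptions rule described above can be modelled in a few lines. This Ruby sketch is only an illustration of the rule, not Hexawise's implementation: the constant and method names are hypothetical, and the single special case is the VIP example from the sample plan.

```ruby
# Default "then" text used when no special case applies.
DEFAULT_EXPECTED = "the withdrawal should be handled appropriately"

# Special cases: every listed condition must match for the override to apply.
SPECIAL_CASES = [
  {
    conditions: {
      "account type"      => "VIP",
      "withdrawal ATM"    => "bank-owned ATM",
      "withdrawal amount" => "$600"
    },
    expected: "The normal limit of $400 is raised to $600 in the special " \
              "case of a VIP account using a Goliath National Bank owned ATM."
  }
].freeze

# Return the special expected result if one matches, otherwise the default.
def expected_result(test_case)
  match = SPECIAL_CASES.find do |special|
    special[:conditions].all? { |name, value| test_case[name] == value }
  end
  match ? match[:expected] : DEFAULT_EXPECTED
end

vip_case = {
  "account type"      => "VIP",
  "withdrawal ATM"    => "bank-owned ATM",
  "withdrawal amount" => "$600"
}
regular_case = vip_case.merge("account type" => "regular")

puts expected_result(vip_case)
puts expected_result(regular_case)
```

Only the handful of combinations that genuinely behave differently need their own text; everything else falls through to the default, which is what keeps the maintenance cost low.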

When you click the auto script button, the tests are generated and you can download them using the export icon.


Then select the option to download as a csv file.


You will download a zip file that you can then unzip to get 2 folders with various files. The file you want to use for this is the combinations.txt file in the csv directory.

The Ruby code we use to convert the commas to the pipes (|) used for Gherkin is:

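#!/usr/bin/env ruby
require 'csv'

# Read the exported combinations file (adjust the name to match your export).
tests = CSV.read("combinations.csv")
table = []
tests.each do |test|
  # Skip the first column and join the remaining values with pipes.
  table << "| " + test[1..-1].join(" | ") + " |\n"
end
IO.write("gherkin.txt", table.join)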

Of course, you can use whatever method you wish to convert the format; this is just what we use. See this explanation for a few more details on the process.
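As one possible variation on the script above, the Ruby sketch below pads each column so the resulting table is easier to read in a feature file. It makes assumptions that the export may not satisfy: that the first row holds the parameter names, that the first column should be dropped (as the script above does), and that every row has the same number of columns. The method name and the sample data are hypothetical.

```ruby
require 'csv'

# Convert CSV text into an aligned, pipe-delimited Gherkin table.
def gherkin_table(csv_text)
  # Drop the first column of every row, as in the original script.
  rows = CSV.parse(csv_text).map { |row| row[1..-1] }
  # Width of each column is the length of its longest cell.
  widths = rows.transpose.map { |column| column.map { |cell| cell.to_s.length }.max }
  rows.map do |row|
    cells = row.each_with_index.map { |cell, i| cell.to_s.ljust(widths[i]) }
    "| " + cells.join(" | ") + " |"
  end.join("\n")
end

# Hypothetical export contents, for illustration only.
sample = <<~CSV
  Test,account type,withdrawal amount
  1,regular,$200
  2,VIP,$600
CSV

puts gherkin_table(sample)
```

The padding is cosmetic only; Gherkin ignores the extra whitespace inside table cells.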

Now you have your Gherkin file to use however you wish. And as the code is changed over time (perhaps adding parameter value options, new parameters, etc.) you can just regenerate the test plan and export it. Then convert it and the updated Gherkin test plan is available.


Related: Create a Risk-based Testing Plan With Extra Coverage on Higher Priority Areas - Hexawise Tip: Using Value Expansions and Value Pairs to Handle Dependent Values - Designing a Test Plan with Dependent Parameter Values

By: John Hunter on Mar 27, 2014

Categories: Hexawise test case generating tool, Hexawise tips, Scripted Software Testing, Software Testing Efficiency, Testing Strategies

Since creating Hexawise, I've worked with executives at companies around the world who have found themselves convinced of the value of pairwise testing. And then they need to convince their organization of the value.

They often follow this path: first thinking "pairwise testing is a nice method in theory, but not applicable in our case," then "pairwise is nice in theory and might be applicable in our case," then "pairwise is applicable in our case," and finally "how do I convince my organization?"

In this post I review my history helping convince organizations to try and then adopt pairwise, and combinatorial, software testing methods.

About 8 years ago, I was working at a large systems integration firm and was asked to help the firm differentiate its testing offerings from the testing services provided by other firms.

I admittedly did not know much about software testing then but, by happy coincidence, my father was a leading expert in the field of Design of Experiments. Design of Experiments is a field that has applicability in many areas (including agriculture, advertising, manufacturing, etc.). The purpose of Design of Experiments is to provide people with tools and approaches that help them learn as much actionable information as possible in as few tests as possible.

I Googled "Design of Experiments Software Testing." That search led me to Dr. Madhav Phadke (who, by coincidence, had been a student of my father). More than 20 years ago now, Dr. Phadke and his colleagues at AT&T Bell Labs had asked the question you're asking now. They did an experiment using pairwise test design / orthogonal array test design to identify a subset of tests for AT&T's StarMail system. The results of that testing effort were extraordinarily successful and well-documented.

Shortly after doing that, while working at that systems integration firm, I began to advocate to anyone and everyone who would listen that this approach to designing tests promised to be both (a) more thorough and (b) more efficient, requiring (in most but not all cases) significantly fewer tests. Results from 17 straight projects confirmed that both of these statements were true. Consistently.


Repeatable Steps to Confirm Whether This Approach Delivers Efficiency and Thoroughness Improvement (and/or document a business case/ROI calculation)

How did we demonstrate that this test design approach led to both more thorough testing and much more efficient testing? We followed these steps:

  1. Take an existing set of 30 - 100 tests that have already been created, reviewed, and approved for testing (but which have not yet been executed).

  2. Using the test ideas included in those existing tests, design a set of pairwise tests (often approximately half as many tests as were in the original set). When putting your tests together, if there are particular, known, high-priority scenarios that stakeholders believe are important to test, it is important to make sure that you "force" your pairwise test generator to include such high-priority scenarios.

  3. Have two different testers execute the two sets of tests at the same time (e.g., before developers start fixing any defects that are uncovered by testers executing either set of tests).

  4. Document the following:

  • How long did it take to execute each set of tests?

  • How many unique defects were identified by each set of tests?

  • How long did it take to create and document each set of tests?*


*This third measurement was usually an estimate because a significant number of teams had not tracked the amount of time it took to create the original set of tests.
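Once those measurements are recorded, the comparison is simple arithmetic. The Ruby sketch below shows one way to compute defects found per tester hour for each set of tests. The struct, method names, and all of the numbers are hypothetical and purely illustrative; they are not results from any of the projects described here.

```ruby
# Measurements for one set of tests in a bake-off.
BakeOffResult = Struct.new(:name, :tests, :unique_defects, :execution_hours) do
  # Unique defects found per hour of test execution.
  def defects_per_hour
    unique_defects.to_f / execution_hours
  end
end

# How many times more productive the candidate set was than the baseline.
def productivity_ratio(baseline, candidate)
  (candidate.defects_per_hour / baseline.defects_per_hour).round(2)
end

# Illustrative numbers only; substitute your own measurements.
original = BakeOffResult.new("original tests", 60, 12, 20.0)
pairwise = BakeOffResult.new("pairwise tests", 30, 14, 10.0)

[original, pairwise].each do |result|
  puts "#{result.name}: #{result.defects_per_hour.round(2)} defects per tester hour"
end
puts "ratio: #{productivity_ratio(original, pairwise)}x"
```

Counting unique defects matters: a defect found by both sets should be credited to both, so the comparison reflects the tests rather than the order in which defects happened to be logged.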

The results in 17 different pairwise testing "bake-off" projects conducted at my old firm included:

  • Defects found per tester hour during test execution: when pairwise tests were used, more than twice as many defects were found per tester hour

  • Total defects found: pairwise tests found as many or more defects in every single project (despite the fact that in almost every case there were significantly more tests in each original set of tests)

  • Defects found by pairwise tests but missed by traditional tests: a lot (I forget the exact number)

  • Defects found by traditional tests but missed by pairwise tests: zero

  • Amount of time to select and document tests: much less time required when a pairwise test generator was used (As mentioned above, precise measurements were difficult to gather here)


More recent project benefits have included these:



Those experiences - combined with the realization that many Fortune 500 firms were starting to try to implement smarter test design methods to achieve these kinds of benefits but were struggling to find a test design tool that was user-friendly and would integrate into their other tools - led me to the decision to create Hexawise.


Additional Advice and Lessons Learned Based on My Experiences

Once the value of pairwise software testing has been tested at a specific organization, it is very common for the proponent of pairwise testing to find themselves saying:

I have already elaborated some test plans that would save us up to 50% effort with that method. But now my boss and other colleagues are asking me for a proof that these pairwise test cases suffice to make sure our software is running well.

In that case, my advice is three-fold:

First, appreciate how your own thinking has evolved and understand that other people will need to follow a similar journey (and that others won't have as much time to devote as you have had to experience learnings first-hand).

When I was creating Hexawise, George Box, a Design of Experiments expert with decades of experience explaining to skeptical executives how Design of Experiments could deliver transformational improvements to their organizations' efficiency and effectiveness, told me "Justin, you'll experience three phases of acceptance and acceptance will happen more gradually than you would expect it to happen. First, people will tell you 'It won't work.' Next, they'll say 'It won't work here.' Eventually," he said with a smile, "they'll tell you 'Of course this works. I thought of it first!'"

When people hear that you can double their test execution productivity with a new approach, they won't initially believe you. They'll be skeptical. Most people you're explaining this approach to will start with the thought that "it is nice in theory but not applicable to what I'm doing." It will take some time and experience for people to understand and appreciate how dramatic the benefits are.

Second, people will instinctively be dismissive of pairwise testing case study after case study after case study that show how effective this approach has been for testers in virtually all types of industries and all types and phases of testing. George Box was right when he predicted that people will often respond with 'It won't work here.' Sometimes it is hard not to smile when people take that position.

Case in point: I will be talking to a senior executive at a large capital markets firm soon about how our tool can help them transform the efficiency and effectiveness of their testing group. And I can introduce them to a client of ours that is using our test design tool extensively in every single one of their most important IT projects. Will that executive take me up on my offer? I hope so, but based on past experience, I suspect odds are good that he'll instead react with 'Yes, yes, sure, if companies were people, that company would be our company's identical twin, but still... It won't work here.'

Third, at the end of the day, the most effective approach I have found to address that understandable skepticism and to secure organizational-level buy-in and commitment is to gather hard, indisputable evidence on multiple projects that the approach works at the company itself through a bake-off approach (e.g., following those four steps outlined above). A few words of advice though.

My proposed approach isn't for the faint of heart. If you're working at a large company with established approaches, you'll need patience and persistence.

Even after you gather evidence that this approach works in Business Units A, B, and C, someone from Business Unit D will be unconvinced by the compelling and irrefutable evidence you have gathered and tell you 'It won't work here. Business Unit D is unique.' The same objections are likely to arise with results from "Type of Testing" A, B, and C. As powerful and widely-applicable as this test design approach is, always remember (and be clear with stakeholders) that it is not a magical silver bullet.

James Bach raises several valid limitations with using this approach. In particular, this approach won't work unless you have testers who have relatively strong analytical skills driving the test design process. Since pairwise test case generating tools are dependent upon thoughtful test designers to identify appropriate test inputs to vary, this approach (like all test design approaches) is subject to a "garbage in / garbage out" risk.

Project leads will resist "duplicating effort." But unless you do an actual bake-off stakeholders won't appreciate how broken their existing process is. There's inevitably far more wasteful repetition hidden away in standard tests than people realize. When you start reporting a doubling of tester productivity on several projects, smart managers will take notice and want to get involved. At that point - hopefully - your perseverance should be rewarded.


Some benefits data and case studies that you might find useful:


If you can't change your company, consider changing companies

Lastly, remember that your new-found skills are in high demand whether or not they're valued at your current company. And know that, despite your best efforts and intentions, your efforts might not convince skeptics. Some people inevitably won't be willing to take the time to understand. If you find yourself in a situation where you want to use this test design approach (because you know that these approaches are powerful, practical, and widely-applicable) but you don't have management buy-in, then consider whether or not it would be worth leaving your current employer to join a company that will let you use your new-found skills.

Most of our clients, for example, are actively looking for software test designers with well developed pairwise and combinatorial test design skills. And they're even willing to pay a salary premium for highly analytical test designers who are able to design sets of powerful tests. (We publicize such job openings in the LinkedIn Hexawise Guru group for testers who have achieved "Guru" level status in the self-paced computer-based-training modules in the tool).


Related: Looking at the Empirical Evidence for Using Pairwise and Combinatorial Software Testing - Systematic Approaches to Selection of Test Data - Getting Known Good Ideas Adopted

By: Justin Hunter on Nov 21, 2013

Categories: Pairwise Software Testing, Software Testing, Software Testing Efficiency, Testing Case Studies, Testing Strategies

Hexawise allows you to adjust testing coverage to focus more thorough coverage on selected, high-priority areas. Mixed strength test plans allow you to select different levels of coverage for different parameters.

Increasing from pairwise to "trips" (3 way) coverage increases the test plan so that bugs that are the results of 3 parameters interacting can be found. That is a good thing. But the tradeoff is that it requires more tests to catch the interactions.

The mixed-strength option that Hexawise provides allows you to select a higher coverage level for some parameters in your test plan. That lets you control the balance between increased test thoroughness and the workload created by additional tests.
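To get a feel for why higher strength costs more, it helps to count the value combinations each coverage level has to include. The Ruby sketch below does that for a hypothetical model of four parameters with three values each (the parameter names A to D and the method name are made up for illustration). It counts interactions to be covered, not tests: a single test covers many interactions at once, so the number of tests a generator produces will be far smaller than these counts.

```ruby
# Number of distinct value combinations that t-way coverage must include,
# given a Hash of parameter name => number of values.
def interaction_count(value_counts, strength)
  value_counts.values.combination(strength).sum { |counts| counts.reduce(:*) }
end

# Hypothetical model: four parameters with three values each.
model = { "A" => 3, "B" => 3, "C" => 3, "D" => 3 }

# Hypothetical choice of high-priority parameters for 3-way coverage.
high_priority = %w[A B C]
priority_model = model.select { |name, _| high_priority.include?(name) }

puts "2-way on all parameters: #{interaction_count(model, 2)} pairs"
puts "3-way on all parameters: #{interaction_count(model, 3)} triples"
puts "3-way on #{high_priority.join(', ')} only: #{interaction_count(priority_model, 3)} triples"
```

For this model there are 54 pairs and 108 triples overall, but only 27 triples among the three high-priority parameters, which is why restricting the higher strength to a subset adds thoroughness at a much smaller cost.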




See our help section for more details on how to create a risk-based testing plan that focuses more coverage on higher priority areas.

As that example shows, Hexawise allows you to focus additional thoroughness on the 3 highest priority parameters with just 120 tests while also providing full pairwise coverage on all factors. Mixed strength test plans are a great tool to provide extra benefit to your test plans.


Related: How Not to Design Pairwise Software Tests - How to Model and Test CRUD Functionality - Designing a Test Plan with Dependent Parameter Values

By: John Hunter on Nov 6, 2013

Categories: Combinatorial Testing, Efficiency, Hexawise test case generating tool, Hexawise tips, Software Testing Efficiency, Testing Strategies

This post addresses some comments and skeptical (in a good way) questions raised by Phil Kirkham in response to our recent post: The Software Testing Community Needs More Empirical Studies.

As background to my answers: The studies and dozens of proof of concept pilot projects that I’ve been directly involved with have sought to answer these 3 questions:

1) Is it actually faster to generate tests with Hexawise than creating and documenting them manually?

Consistent findings: Yes. It takes, on average, about 40% less time to create and document tests using Hexawise because using Hexawise allows testers to partially automate test selection and test documentation steps.

2) Is it possible to generate smaller sets of tests that will be as thorough or more thorough than larger sets of manually created tests and allow testers to find more defects in less test execution time?

Consistent findings: Yes. Typically more than twice as many defects per tester hour. See, e.g., the IEEE Computer article written with 3 PhDs showing an increase in defects found per tester hour of 2.4 times. A more recent set of 10 proof of concept pilot projects at an insurance firm revealed 3.0 times as many defects per tester hour. See: Does pairwise testing really work? Evidence, data, and case studies.
This is because Hexawise-generated tests (or any pairwise tests, for that matter) consistently have dramatically less wasteful repetition than manually selected tests will and because Hexawise-generated tests leave no potential dual-mode faults untested (that is, no potential pairwise defects involving test inputs that have been contemplated by the tester and included in their models).

3) Finding more defects per tester hour is certainly nice, but do Hexawise-generated tests find MORE defects?

Consistent findings: Yes. Much smaller sets of Hexawise tests have consistently found more defects, on average by about 13%.


In answer to specific questions:

Phil Kirkham: “Seems a very basic measure of a testers productivity How about the severity of the defects ?”

JH: Agreed.

In my experience in more than 5 years of helping teams conduct these proof of concept pilot projects, pairwise and Hexawise-generated tests are just plain more effective at finding defects. They find more of ALL kinds of defects. My experience has been that the types of defects being found are not skewed towards less significant types of defects or missing severe defects. A case in point: at my old firm, one of the early adopters of orthogonal array testing approaches ran pilot project after pilot project with teams of testers reporting into him. I can’t remember the exact number of pilots he had conducted, perhaps 20 pilot projects or so, before he experienced a single defect that escaped the Hexawise-generated tests that was found by the much longer set of manually selected tests. So, a short, blunt, honest answer to your excellent question, is “Believe it or not, Phil, it almost never matters. This approach will find ALL of the defects you otherwise would have found. Plus additional ones.”*

*Major caveat here that calls into question my specific answer here (to your question concerning severity) as well as all of the results from all of the studies I’ve been involved with. Testers like you and me are strong proponents of Exploratory Testing. These studies, though, treat the test inputs and test cases as “frozen.” You have the test cases in list A (created manually) and the test cases in list B (created by using Hexawise). The ideas about what can be changed from test to test (parameters) and how each of those things can be changed (values) are identical in both lists. The difference is that list A has lots of wasteful repetition and lots of gaps in coverage. List B has neither. That’s the only difference. Then one tester executes the tests from list A and another tester executes the tests from list B. But what if you have an unskilled tester following rote scripts executing one set of tests and someone like you, Rob Sabourin, Michael Bolton, James Bach, Shmuel Gershon, Ajay Balamurugadas, etc., executing the second set? Whoa! All bets are off. What would happen is that skilled Exploratory Testers would use the Hexawise-generated test ideas (which they would not want to be overly-detailed), and go “off script” to explore interesting test ideas that they cooked up in real time as they were doing their testing. So skilled Exploratory Testers would be able to find defects (presumably including serious ones) that the written test cases, regardless of whether they were manually created or created by Hexawise, would not lead them to directly. That’s an important topic for another time. I’ll be talking about Exploratory Combinatorial Testing at the Conference of the Association of Software Testing – CAST – this year in my home town of Madison, Wisconsin. Since you’re also going, perhaps we could collaborate and you could share your experiences (good or bad) with the attendees. 
I’ll happily give you 10 minutes of my speaking time to share your experiences if you’d like.

PK: Are you only counting functional defects ?

JH: Not explicitly. The directions I give to teams running these pilot projects are: Try to answer the 3 questions above. Report defects. We’re not after a count of “failed test cases.” We’re after a number of defects. Having said that, as you might suspect, most of the defects reported tend to be functional defects.

PK: What about all the other types of defect ?

JH: They’re not reported as often but we count them too. In situations where one tester reports a “hard to spot” bug (e.g., one that might take a more experienced tester to identify), it raises the possibility that the bug is being reported not because one set of tests is superior to the other but because one tester is better than the other. Accordingly, in an effort to keep an apples to apples comparison, we talk with the tester and try to determine with the tester’s input whether the tester would have found that same defect with the other set of tests. If the answer is yes, we’d report the defect as “found” in both sets of tests. This doesn’t happen as often as you might suspect it would.

PK: Does the project type matter ?

JH: Yes. Benefits tend to be relatively smaller when there is a disproportionately high percentage of small, discrete, one-off tests. They tend to be higher when there are more than 5 parameters that interact in meaningful ways. And they are easier to capture when the System Under Test does not have a lot of conditional branching logic.

PK: How about the devs they are working with and the practices they follow ?

JH: I don’t have enough empirical evidence to say definitively. It’s used successfully by thousands of testers in waterfall projects and thousands of testers in Agile projects.

PK: What about the experience of the tester, does that make a difference?

JH: Even more important than the experience level of the tester are, in order, (1) analytical ability, (2) willingness to try new things, and (3) willingness to ask questions. By my estimates about 50% of the testers I come across at our clients (almost all at Fortune 2000 firms) would not be able to design excellent sets of pairwise tests from scratch. This is because above average analytical ability is required for testers to select parameters and values from their Systems Under Test in a thoughtful way. Getting back to experience level, some of our strongest users at our clients are straight out of college. They start work, get exposed to Hexawise, “get it” and don’t look back. Interestingly, some testers who have been testing for, say, 10 years or more – while experienced – sometimes seem to be too set in their ways to embrace this rather different approach to designing tests.

PK: If they are working with top of the range developers (as some lucky testers are cough cough) then there aren’t that many functional bugs to be found and you’re looking at browser compatibility, usability, race conditions – is combinatorial testing going to find these more quickly ?

JH: Yes. Absolutely. If you’d like to collaborate to test that and help gather empirical evidence that you could share at CAST, I would be happy to work with you to do just that. If your experience contradicts what I’m saying here, you’d have the floor to tell CAST participants what your actual experience was.

PK: I read the study in the link – 97% of defects could be found by pairwise combinatorial testing ? Really ? ALL types of defects ? Really ? How can pairwise find a defect caused by a missing or ambiguous or inconsistent requirement, or a performance or security ?

JH: The statistics I quote are a lot lower than that. The pie chart I use averages out several studies done by PhDs that have found, on average, 84% of defects could be triggered by testing for all combinations of 2 test inputs. The 97% figure is eyebrow-raising on its own (regardless of industry). Given that it was in the medical device industry in the United States (one of the most litigious areas in the history of the world?), that statistic is particularly mind-boggling. What the PhDs in that study did was take a look at all of the medical devices that had been taken off of the market in the United States as a result of software defects. Then they investigated how many test inputs would be required to trigger each of those defects. The authors of that study found that an astonishing 97% of those defects could have been triggered by just 2 test inputs.

PK: Love your passion and enthusiasm and I do have a beta of Hexawise to see if it can do anything for my productivity – and I might agree that there is a lack of empirical studies, not just among the testing community but the s/w community as a whole into the effectiveness or not of how software is produced

JH: Thanks. I hope you have positive experiences with using Hexawise and I’m happy to help you if you ever have any questions about using Hexawise on your projects.

By: Justin Hunter on Mar 25, 2013

Categories: Combinatorial Software Testing, Efficiency, Pairwise Software Testing, Software Testing Efficiency, Testing Strategies

Attempting to assess the relative benefits of more than 200 software development practices is not for the faint of heart. Context-specific considerations run the risk of confounding the conclusions at every turn. Even so, Capers Jones, a software development expert with dozens of years of experience and nearly twenty books related to software development to his credit, recently attempted the task. He's literally devoted decades of his career to assessing such things for clients. We're quite pleased with how using Hexawise fared in the analysis.

Scoring and Evaluating Software Methods, Practices and Results by Capers Jones (Vice President and CTO, Namcook Analytics) provides some great ideas on software project management. The article is based on Software Engineering Best Practices, with some new data taken from The Economics of Software Quality (two of the books Capers Jones has authored).

Software development, maintenance, and software management have dozens of methodologies and hundreds of tools available that are beneficial. In addition, there are quite a few methods and practices that have been shown to be harmful, based on depositions and court documents in litigation for software project failures.

In order to evaluate the effectiveness or harm of these numerous and disparate factors, a simple scoring method has been developed. The scoring method runs from +10 for maximum benefits to -10 for maximum harm.

The scoring method is based on quality and productivity improvements or losses compared to a mid-point. The mid point is traditional waterfall development carried out by projects at about level 1 on the Software Engineering Institute capability maturity model (CMMI) using low-level programming languages. Methods and practices that improve on this mid point are assigned positive scores, while methods and practices that show declines are assigned negative scores.

The data for the scoring comes from observations among about 150 Fortune 500 companies, some 50 smaller companies, and 30 government organizations. Negative scores also include data from 15 lawsuits.


The article provides guidance, based on the results achieved by many, and varied, organizations with respect to software projects.

finding and fixing bugs is overall the most expensive activity in software development. Quality leads and productivity follows. Attempts to improve productivity without improving quality first are not effective.


This is an extremely important point for business managers to understand. Those involved in software development professionally don't find this surprising. But business people often greatly underestimate the costs of maintaining and updating software. The costs of bugs introduced by fairly minor feature requests to a system that doesn't have good software test coverage or test plans often create far more trouble than business managers expect.

This is especially true because there is a high correlation between software applications that have poor software testing processes (including poor test coverage and poor or completely missing test plans) and those applications that were designed without long term maintenance in mind. Both deficiencies result from decisions made to minimize initial development costs and time. They both show a lack of appreciation for wise software engineering practices and software application project management.

The article discusses a complicating factor for assessing the most effective software development practices: the extremely wide differences in software engineering scope. Projects range from simple applications one software developer can create in a short period of time to massive applications requiring thousands of developer-years of effort.

In order to be considered a “best practice” a method or tool has to have some quantitative proof that it actually provides value in terms of quality improvement, productivity improvement, maintainability improvement, or some other tangible factors.

Looking at the situation from the other end, there are also methods, practices, and social issues have demonstrated that they are harmful and should always be avoided. ... Although the author’s book Software Engineering Best Practices dealt with methods and practices by size and by type, it might be of interest to show the complete range of factors ranked in descending order, with the ones having the widest and most convincing proof of usefulness at the top of the list. Table 2 lists a total of 220 methodologies, practices, and social issues that have an impact on software applications and projects.

The average scores shown in table 2 are actually based on the composite average of six separate evaluations:

  1. Small applications < 1000 function points

  2. Medium applications between 1000 and 10,000 function points

  3. Large applications > 10,000 function points

  4. Information technology and web applications

  5. Commercial, systems, and embedded applications

  6. Government and military applications

The data for the scoring comes from observations among about 150 Fortune 500 companies, some 50 smaller companies, and 30 government organizations and around 13,000 total projects. Negative scores also include data from 15 lawsuits.

The scoring method does not have high precision and the placement is somewhat subjective.

Top 10 tools and practices listed in the article:

| Rank | Practice | Score |
|---|---|---|
| 1 | Reusability (> 85% zero-defect materials) | 9.65 |
| 2 | Requirements patterns - InteGreat | 9.50 |
| 3 | Defect potentials < 3.00 per function point | 9.35 |
| 4 | Requirements modeling (T-VEC) | 9.33 |
| 5 | Defect removal efficiency > 95% | 9.32 |
| 6 | Personal Software Process (PSP) | 9.25 |
| 7 | Team Software Process (TSP) | 9.18 |
| 8 | Automated static analysis - code | 9.17 |
| 8 | Mathematical test case design (Hexawise) | 9.17 |
| 10 | Inspections (code) | 9.15 |


We are obviously thrilled that Hexawise is listed. We have seen the value our customers have achieved using mathematically based combinatorial software test plans (see several Hexawise case studies). It is great to see that value recognized in comparison to other software development practices and judged to be of such high value to software development projects.

The article makes it clear the importance of the results is not "the precision of the rankings, which are somewhat subjective, but in the ability of the simple scoring method to show the overall sweep of many disparate topics using a single scale."

The methodology behind the results shown in the article can be used to evaluate your organization's software development practice and determine opportunities for improvement. But, as stated above, software projects cover a huge range of scopes. The specific software project needs will drive which practices are most critical to achieving success for a specific project. The list in the article, of what practices have provided huge value and what practices have resulted in great harm, is a very helpful resource, but project managers, software developers, and testers need to apply their judgement to the information the article provides in order to achieve success.

A leading company will deploy methods that, when summed, total to more than 250 and average more than 5.5. Lagging organizations and lagging projects will sum to less than 100 and average below 4.0.


The use of Hexawise has been growing, and that has helped increase the number of software projects using best practices (those that score 9 or higher). However, as the article states, there is quite a need for improvement.

From data and observations on the usage patterns of software methods and practices, it is distressing to note that practices in the harmful or worst set are actually found on about 65% of U.S. Software projects as noted when doing assessments. Conversely, best practices that score 9 or higher have only been noted on about 14% of U.S. Software projects. It is no wonder that failures far outnumber successes for large software applications!


A score of 9 to 10 for a practice means that the practice results in a 20-30% improvement in the quality and productivity of software projects.

Conclusion: while your individual mileage may vary, this report provides further evidence that using Hexawise really does lead to large, measurable improvements in efficiency and effectiveness.

We are very proud of the success of Hexawise thus far; as a new year starts we see huge potential to help many organizations improve their software development efforts.

The article includes a list of references and suggested readings that is valuable. Included in that list are:

DeMarco, Tom; Controlling Software Projects, 1986, 296 pages.

Gilb, Tom and Graham, Dorothy; Software Inspections, 1994, 496 pages.

Jones, Capers; Applied Software Measurement, 3rd edition, 2008, 662 pages.

McConnell, Steve; Code Complete, 2nd edition, 2004, 960 pages. (I'm linking to the 2nd edition; the article references the 1st edition.)


Related: Maximizing Software Tester Value by Letting Them Spend More Time Thinking - A Powerful Software Test Design Approach - 3 Strategies to Maximize Effectiveness of Your Tests

By: John Hunter on Mar 18, 2013

Categories: Software Testing, Software Testing Efficiency, Testing Case Studies, Testing Strategies

I am passionate about pairwise software testing techniques. I have helped dozens of teams, for example, carefully measure the benefits that can be created when teams of testers adopt pairwise and related combinatorial testing approaches to identify the test cases they will execute (as compared to manual test case identification methods). What usually happens is that tester productivity doubles. (See Combinatorial Software Testing - pdf download).

I believe these approaches will be much more widely adopted in a few years than they are now for the simple reason that they consistently deliver dramatic benefits to both the speed of software test design and the efficiency and thoroughness of software test execution. As more teams try these methods for themselves, and measure the benefits they achieve with them, broader adoption seems highly likely to me.*

I see three main barriers to broader adoption by the testing community at large:

  1. The first barrier is that testers will not make an attempt to apply this method to their testing projects, so they will never find out how effective it is. Relatedly, ill-informed testers may try to apply the approach but do such a poor job at implementation that they do not generate benefits.

  2. The second barrier is that even testers who use the approach effectively a few times will not realize how much more effective it is making them. A dismissive thought process guilty of this might sound something like this: "Those 11 bugs I just found? Yeah. I found them because I'm a good tester; the fact that I happened to use pairwise tests just now? That's largely irrelevant. I'm sure I would have found them regardless."

  3. The third barrier is that testers unfamiliar with the basics of pairwise testing principles will design test cases without thinking about what they are doing, and achieve "garbage in / garbage out" results. The benefits that would have been so easily achieved in the testing project - like Lindsey Jacobellis' opportunity to win a gold medal for snowboarding - disappear in a groan-worthy moment of bone-headed stupidity.



This blog post addresses this third barrier. When testers sabotage their own test plans with a poor choice of inputs, they may well blame the test design strategy rather than themselves, which would be unfortunate. Here's one common problem I see (exaggerated a bit in this example to make my point).


Objective: create a set of tests that will check to see if the underwriting engine for a car insurance firm is calculating premium estimates correctly.

Our aspiring pairwise test designer enters stage left and identifies a set of parameters:

First Name, Last Name, Age of Primary Driver, Credit Score, Number of Cars, Number of Accidents, Number of Speeding Tickets, and Number of Additional Drivers

So far so good. We now have the initial ingredients for a thing of beauty; we have a set of parameters that could quickly result in a combinatorial explosion of possibilities and, ready to save the day, we have a test designer who has correctly identified this as an opportunity to achieve efficiency and thoroughness benefits through the application of pairwise testing methods. Our potential hero is a couple minutes away from creating a concise set of tests that will confirm not only that each of the data points in the plan works as it should but also that it works as it should in combination with each of the other data points in the test plan.

In other words, the plan will not only confirm that "Number of Accidents = 3" will impact premiums as it should on its own, but also that "Number of Accidents = 3" will work as it should when tested in combination with the other relevant inputs in the application, e.g.: 3 accidents with every relevant input for "Age of Primary Driver," 3 accidents with every relevant input for "Credit Score," 3 accidents with every relevant input for "Number of Cars," 3 accidents with every relevant input for "Number of Speeding Tickets," and 3 accidents with every relevant input for "Number of Additional Drivers."

He's seen the Promised Land of improved efficiency and effectiveness and he's ready to enter. Unfortunately, with his next move, he demonstrates he's a doofus. Entry to Promised Land denied. Check out the values he chose to enter for each of his parameters.



Notice anything wrong here?


Just for fun, let's take a close-up look at Lindsey's disastrous snowboarding maneuver here.


... and let's break down our shame-faced test designer's bone-headed move here. Can you notice what is wrong with his choices of values?

There are nine different parameters in the mix here. Of those, two ("First Name" and "Last Name") are the least important to our current objective of looking for problems in the underwriting engine calculations. And yet...

He's added ten values to each of them. Oops! Whenever you are putting together a pairwise (or 2-way) test plan, the number of tests required will never be lower than the product of the number of parameter values from the two parameters that have the highest number of values. In plain English, that high-falutin' previous sentence means: when you have a plan with 7 parameters that have a maximum of 4 values each, "10 largely irrelevant values X 10 largely irrelevant values = you're a big fat idiot" because you'll create a test plan that has 100 test cases (as compared to a test plan that could have covered the System Under Test more effectively with fewer than a quarter of the tests you've just created).
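The arithmetic behind that lower bound is easy to check for yourself. Here is a minimal Python sketch; the function name is made up for illustration, and the value counts are assumptions (the text only says the other seven parameters have at most 4 values each, and I have assumed the two name fields could be cut to a single value each).

```python
def pairwise_lower_bound(value_counts):
    """Fewest tests a 2-way plan can possibly have.

    Every value of the largest parameter must meet every value of the
    second-largest parameter, and each test holds only one such pair.
    """
    if len(value_counts) < 2:
        raise ValueError("pairwise coverage needs at least two parameters")
    largest, second_largest = sorted(value_counts, reverse=True)[:2]
    return largest * second_largest


# Assumed value counts: two name fields plus seven parameters with 4 values each.
bloated_plan = [10, 10, 4, 4, 4, 4, 4, 4, 4]  # ten names in each name field
trimmed_plan = [1, 1, 4, 4, 4, 4, 4, 4, 4]    # a single name in each name field

print(pairwise_lower_bound(bloated_plan))  # 100
print(pairwise_lower_bound(trimmed_plan))  # 16
```

The bound is only a floor: a real pairwise plan for the trimmed inputs will usually need somewhat more than 16 tests, but nowhere near 100.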


For more information on pairwise and combinatorial testing, I would recommend the following sources:


If you are attempting to use pairwise and/or combinatorial testing methods and running into questions, I'd sincerely like to help. Please consider one or more of the following:


Thank you,

Justin Hunter


*The manufacturing industry followed a similar pattern of adoption to similar methods that consistently delivered dramatic efficiency and effectiveness benefits. It took decades before multi-variate Design of Experiments methods were widely adopted by manufacturers even long after the benefits were proven to be dramatic and repeatable to anyone who would look at the clear, unambiguous, objectively-measurable evidence. Today, it is impossible to find a Fortune 500 manufacturing firm that does not regularly use multi-variate Design of Experiments in their manufacturing processes. One day it will be the same for Fortune 500 firms with respect to their adoption of multi-variate Design of Experiments methods of software testing.

By: Justin Hunter on Jan 29, 2013

Categories: Combinatorial Testing, Pairwise Software Testing, Software Testing, Software Testing Efficiency, Testing Strategies

I'll be talking at QAI's 12th Annual International Software Testing Conference on Dec 6th in Bangalore, India.

Topic: Conquering the Single Largest Challenge Facing Testers Today

"There's too much to test and not enough time to test it all." According to a recent survey conducted by Robert Sabourin, this is the single largest challenge facing test managers today. And this challenge clearly won't go away any time soon. Software is becoming increasingly complex and time pressures put on testing teams are becoming ever more extreme.

To survive and thrive as testers, we need to find ways to learn more in the limited time we have. This talk addresses:

  • Proven test design methods to learn as much as possible about a System Under Test as quickly as possible

  • How these methods were originally developed and refined in other (non-IT) industries over the last 80 years

  • How the recent Apple Maps disaster could have been easily avoided by implementing these methods

  • Real world case studies: these methods sound nice on paper, but do they actually work?

  • Reasons why these methods are being used at more than 100 Fortune 500 firms today

  • What does the future hold?


Attendees will learn about valuable testing strategies that are being used today by more than 100 Fortune 500 firms. In particular, attendees will hear about:

  • Practical test design approaches that they can begin implementing after the conference at their firms to:

    • Reduce the amount of time spent selecting and documenting test scripts
    • Reduce the number of tests needed for execution by creating unusually powerful tests
    • Increase the thoroughness of software test suites


Related: Efficient and Effective Test Design - A Fun Presentation on a Powerful Software Test Design Approach - Maximize Test Coverage Efficiency And Minimize the Number of Tests Needed

By: Justin Hunter on Nov 22, 2012

Categories: Combinatorial Testing, Efficiency, Pairwise Testing, Software Testing Efficiency, Software Testing Presentations

Hexawise includes an array of sample plans when a new user account is created. These provide concrete examples of how to categorize items when creating combinatorial test plans (also called pairwise test plans, orthogonal array test plans, etc.). Once you [sign in to your Hexawise account](http://hexawise.com/) (or set up a new, free account), looking at this [sample test plan](https://app.hexawise.com/share/HT3UG7M8) (which is similar to the situation raised in the question that follows) might be useful.

Within your Hexawise account you can copy the sample test plans that you are provided with and then make adjustments to them. This lets you quickly see what effects changes you make have on real test plans. And it also lets you see how easy it is to adjust as changes in priorities are made, or gaps are found in the existing test plan.


A Hexawise user sent us the following question.

What is the recommended approach to configuring a parameter with one or more values?

I have two parameters which are related.

If Parameter 1 = Yes, Parameter 2 allows the user to select one or more values out of a list of 25 - most of which are not equivalent.

For Parameter 2, is the recommended approach to handle this to create separate parameters each with a yes/no value? i.e. create one parameter for each non-equivalent value, and one parameter for the equivalent values. Then link each of these as a married pair to Parameter 1.

I'm open to suggestions as to alternatives.

Here's the screen in question. Parameter 1 = "Pilot", Parameter 2 = checkboxes for types of plans.

aviation question inline

Great question.

I would recommend that you use different parameters for each option (e.g., "Scheduled Commercial" as a parameter with "Selected, Not Selected" as your Values associated with it).

Also, I'd recommend following these 3 strategies to maximize the effectiveness of your tests.

First, consider using adjusted weightings. You may find it useful to weight certain values multiple times, e.g., have 4 values such as "Select, Do Not Select, Do Not Select, Do Not Select" to create 3 times as many tests with "Do Not Select" as "Select."

Second, use the MECE principle. The MECE principle states you should define your Values in a way that makes each of them "Mutually Exclusive" from the others in the list (no subsets should represent any other subsets, no overlaps) and "Collectively Exhaustive" as a group (the set of all subsets, taken together, should fully encompass all items, no gaps).

Third, avoid "ands" in your value names. As a general rule it is unwise to define values like "Old and Male" or "Young and Female", etc. A better strategy is to break those ideas into two separate Parameters, like so:

First Parameter = "Age" --- Values for "Age" = Old / Young

Second Parameter = "Gender" --- Values for "Gender" = Male / Female
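To make the first and third strategies concrete, here is a rough Python sketch of how such a model could be written down and sanity-checked before it goes into a test design tool. This is plain Python for illustration, not the Hexawise API; the helper names and the "Demographic" parameter are hypothetical.

```python
def weighted(value_weights):
    """Expand {"Select": 1, "Do Not Select": 3} into a value list with repeats."""
    return [value for value, weight in value_weights.items() for _ in range(weight)]


def values_with_and(parameters):
    """Return (parameter, value) pairs whose value name contains the word 'and'."""
    return [
        (name, value)
        for name, values in parameters.items()
        for value in values
        if " and " in f" {value.lower()} "
    ]


# One parameter per checkbox, as recommended above, with an adjusted weighting.
parameters = {
    "Pilot": ["Yes", "No"],
    "Scheduled Commercial": weighted({"Select": 1, "Do Not Select": 3}),
    # Hypothetical parameter that breaks the "avoid ands" rule on purpose.
    "Demographic": ["Old and Male", "Young and Female"],
}

print(parameters["Scheduled Commercial"])
# ['Select', 'Do Not Select', 'Do Not Select', 'Do Not Select']
print(values_with_and(parameters))
# [('Demographic', 'Old and Male'), ('Demographic', 'Young and Female')]
```

Anything the second function flags is a candidate for splitting into separate parameters, such as "Age" and "Gender" above.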


Related: Efficient and Effective Test Design - Context-Driven Usability Considerations, and Wireframing - Why isn't Software Testing Performed as Efficiently and Effectively as it could be?

By: John Hunter on Oct 25, 2012

Categories: Efficiency, Hexawise tips, Pairwise Software Testing, Software Testing, Software Testing Efficiency, Testing Strategies

84 percent coverage in 20 tests

Hexawise test coverage graph showing 83.9% coverage in just 20 tests


Among the many benefits Hexawise provides is creating a test plan that maximizes test coverage with each new scenario tested. The graph above shows that after just 20 tests, 83.9% of the possible pairs of values have been tested together. Read more about this in our case study of a mortgage application software test plan. Just 48 tests are needed to test every valid pair (3.7 million possible test combinations exist in this case). If you are lost now, this video may help.

The coverage achieved by the first few tests in the plan will be quite high (and the graph line will point up sharply) then the slope will decrease in the middle of the plan (because each new test will tend to test fewer net new pairs of values for the first time) and then at the end of the plan the line will flatten out quite a lot (because by the end, relatively few pairs of values will be tested together for the first time).

One of the benefits Hexawise provides is making that slope as steep as possible. The steeper the slope, the more efficient your test plan is. If you keep re-testing pairs and triples you have already covered, instead of taking the chance to test untested pairs and triples, you will have to create and run far more tests than if you intelligently create a test plan. With many interactions to test, it is far too complex to manually derive an intelligent test plan. A combinatorial testing tool, like Hexawise, that maximizes test plan efficiency is needed.

For any set of test inputs, there is a finite number of pairs of values that could be tested together (that can be quite a large number). The coverage chart answers the question: after each test, what percentage of the total number of pairs (or triples, etc.) that could be tested together have been tested together so far?
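If you want to see how a chart like that can be computed, here is a minimal Python sketch. It is not Hexawise's code, and the parameters and tests are invented for illustration; it simply counts, after each test, how many of the possible pairs of values have appeared together so far.

```python
from itertools import combinations, product


def all_pairs(parameters):
    """Every pair of values (from two different parameters) that could be tested together."""
    return {
        ((a, value_a), (b, value_b))
        for a, b in combinations(list(parameters), 2)
        for value_a, value_b in product(parameters[a], parameters[b])
    }


def coverage_curve(parameters, tests):
    """Cumulative percentage of pairs covered after each test, in execution order."""
    names = list(parameters)
    total = all_pairs(parameters)
    seen = set()
    curve = []
    for test in tests:
        seen |= {((a, test[a]), (b, test[b])) for a, b in combinations(names, 2)}
        curve.append(100.0 * len(seen) / len(total))
    return curve


# Hypothetical inputs: 3 parameters with 2 values each gives 12 possible pairs.
parameters = {
    "Browser": ["Chrome", "Firefox"],
    "OS": ["Windows", "macOS"],
    "Account": ["New", "Existing"],
}
tests = [
    {"Browser": "Chrome", "OS": "Windows", "Account": "New"},
    {"Browser": "Chrome", "OS": "Windows", "Account": "New"},  # a repeat
    {"Browser": "Firefox", "OS": "macOS", "Account": "Existing"},
    {"Browser": "Chrome", "OS": "macOS", "Account": "Existing"},
]

curve = coverage_curve(parameters, tests)
print([round(point, 1) for point in curve])  # [25.0, 25.0, 50.0, 66.7]
```

The second test repeats the first, so the curve stays flat at 25%. That is what a plateau on the coverage chart means: a test that covered no new pairs.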

The Hexawise algorithms achieve the following objectives that help testers find as many defects as possible in as few tests as possible. In each and every step of each and every test case, the algorithm chooses a test condition that will maximize the number of pairs that can be covered for the first time in the test case. (Or the maximum number of triplets or quadruplets, etc., based on the thoroughness setting defined by the user.) Allpairs (AKA pairwise) is a well known and easy to understand test design strategy. Hexawise lets users create pairwise sets of tests that cover every pair, and it also allows test designers to generate far more thorough sets of tests (3-way to 6-way coverage). This allows users to "turn up the coverage dial" and generate tests that cover every single possible triplet of test inputs together at least once (or every 4-way combination or 5-way combination or 6-way combination).
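To give a feel for the "maximize new pairs at each step" idea, here is a deliberately simple greedy sketch in Python. To be clear, this is not the Hexawise algorithm, which is not published here; it is a textbook-style greedy approach, and the parameter names and values are assumptions chosen to echo the car insurance example.

```python
from itertools import combinations, product


def greedy_pairwise(parameters):
    """Build tests one value at a time, always picking the value that
    covers the most not-yet-covered pairs with the values chosen so far."""
    names = list(parameters)
    uncovered = {
        frozenset([(a, value_a), (b, value_b)])
        for a, b in combinations(names, 2)
        for value_a, value_b in product(parameters[a], parameters[b])
    }
    tests = []
    while uncovered:
        # Seed each test with one uncovered pair so every test makes progress.
        test = dict(min(uncovered, key=sorted))
        for name in names:
            if name in test:
                continue

            def gain(value):
                return sum(
                    frozenset([(name, value), chosen]) in uncovered
                    for chosen in test.items()
                )

            test[name] = max(parameters[name], key=gain)
        uncovered -= {frozenset(pair) for pair in combinations(test.items(), 2)}
        tests.append({name: test[name] for name in names})
    return tests


def covers_all_pairs(parameters, tests):
    """True if every pair of values appears together in at least one test."""
    return all(
        any(t[a] == value_a and t[b] == value_b for t in tests)
        for a, b in combinations(list(parameters), 2)
        for value_a, value_b in product(parameters[a], parameters[b])
    )


# Hypothetical values: exhaustive testing would need 3 * 3 * 2 * 2 = 36 tests.
parameters = {
    "Age of Primary Driver": ["16-25", "26-65", "66+"],
    "Credit Score": ["Low", "Medium", "High"],
    "Number of Cars": ["1", "2+"],
    "Number of Accidents": ["0", "3"],
}
tests = greedy_pairwise(parameters)
print(len(tests), covers_all_pairs(parameters, tests))
```

Even a naive greedy pass like this should land well below the 36 exhaustive combinations while still covering every pair. A production tool does considerably more work to keep the plan short and to front-load coverage.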

Note that the coverage ratio Hexawise shows is based on the factors entered as items to be tested: not a code coverage percentage. Hexawise sorts the test plan to front load the coverage of the tuple pairs, not the coverage of the code paths. Coverage of code paths ultimately depends on how good a job the test designer did at extracting the relevant parameters and values of the system under test. You would expect there to be some loose correlation between coverage of identified tuple pairs and coverage of code paths in most typical systems.

If you want to learn more about these concepts, I would recommend Scott Sehlhorst's articles on pairwise and combinatorial test design. They are some of the clearest introductory articles about pairwise and combinatorial testing that I have seen. They also contain some interesting data points related to the correlation between 2-way / allpairs / pairwise / n-way coverage (in Hexawise) and the white box metrics of branch coverage, block coverage and code coverage (not measurable by Hexawise).

In Software testing series: Pairwise testing, for example, Scott includes these data points:


  • We measured the coverage of combinatorial design test sets for 10 Unix commands: basename, cb, comm, crypt, sleep, sort, touch, tty, uniq, and wc... The pairwise tests gave over 90 percent block coverage.


  • Our initial trial of this was on a subset Nortel’s internal e-mail system where we able cover 97% of branches with less than 100 valid and invalid testcases, as opposed to 27 trillion exhaustive testcases.


  • A set of 29 pair-wise... tests gave 90% block coverage for the UNIX sort command. We also compared pair-wise testing with random input testing and found that pair-wise testing gave better coverage.


Related: Why isn't Software Testing Performed as Efficiently and Effectively as it could be? - Video Highlight Reel of Hexawise – a pairwise testing tool and combinatorial testing tool - Combinatorial Testing, The Quadrant of Massive Efficiency Gains

Specific guidance on how to view the percentage of coverage graph for the test plan in Hexawise:


When working on your test plan in Hexawise, to get the checklist to be visible, click on the two downward arrows shown in the image:

How-To Progress Checklists-2 inline

Then you'll want to open up the "Advanced" list. So you might need to click here:

Advanced How-To Progress Checklist inline

Then the detailed explanation will begin when you click on "Analyze Tests"

Decreasing Marginal Returns inline


This post is adapted (with some new content added) from comments posted by Justin Hunter and Sean Johnson.

By: John Hunter on Feb 3, 2012

Categories: Combinatorial Software Testing, Combinatorial Testing, Efficiency, Multi-variate Testing, Pairwise Software Testing, Pairwise Testing, Scripted Software Testing, Software Testing, Software Testing Efficiency

There are good reasons James Bach is so well known among the testing community and constantly invited to give keynote presentations around the globe at software testing conferences. He's passionate about testing and educating testers; he's a gifted, energetic, and entertaining speaker with a great sense of humor; and he takes joy in rattling his saber and attacking well-established institutions and schools of thought that he disagrees with. He doesn't take kindly to people who make inflated claims of benefits that would materialize "if only you'd perform testing in XYZ way or with ABC tool" given that (a) he can always seem to find exceptions to such claims, (b) he doesn't shy away from confrontation, and (c) he (rightly, in my view) thinks that such benefits statements tend to discount the importance of critical thinking skills being used by testers and other important context-specific considerations.

Leave it up to James to create a list of 13 questions that would be great to ask the next software testing tool vendor who shows up to pitch his problem-solving product. In his blog post titled "The Essence of Heuristics," he posed this exact set of questions in a slightly different context, but as a software testing tool vendor myself, they really hit home. They are:


  1. Do they teach you how to tell if it’s working?
  2. Do they teach you how to tell if it’s going wrong?
  3. Do they teach you heuristics for stopping?
  4. Do they teach you heuristics for knowing when to apply it?
  5. Do they compare it to alternative heuristics?
  6. Do they show you why it works?
  7. Do they help you understand when it probably works best?
  8. Do they help you know how to re-design it, if needed?
  9. Do they let you own it?
  10. Do they ask you to practice it?
  11. Do they tell stories about how it has failed?
  12. Do they listen to you when you question or challenge it?
  13. Do they praise you for questioning and challenging it?


[Side note: Apparently I wasn't the only one who thought of Hexawise and pairwise / combinatorial test design approaches when they saw these 13 questions. I was amused that after I drafted this post, I saw Jared Quinert's / @xflibble's tweet just now:]


Where do I come down on each of James' 13 questions with respect to people I talk to about our test design tool, Hexawise, and the types of benefits and the size of benefits it typically delivers? Quite simply, "Yes" to all 13. I enjoy talking about exactly the kinds of questions that James raised in his list. In fact, when I sought out James to ask him questions at a conference in Boston earlier this year, it was because I wanted his perspective on many of the points above, particularly #11 (hearing stories about how James has seen pairwise and combinatorial approaches to test design fail) and #7 (hearing his views on where it works best and where it would be difficult to apply it). I'll save my specific answers for another post, but I am serious about wanting to share my thoughts on them; time constraints are holding me back today. I gave a speech at the ASQ World Conference on Quality Improvement in St. Louis last week, though, that addressed many, but not all, of James' questions.

I'm not your typical software tool vendor. Basically, my natural instincts are all wrong for sales. I agree with the premise that "a fool with a tool is still a fool"; when talking to target clients and/or potential partners, I'm inclined to point out deficiencies, limitations, and various things that could go wrong; I'm more of an introvert than an extrovert, etc. Not exactly the typical characteristics of a successful salesman... Having said that, I believe that we've built a very good tool that helps enable dramatic efficiency and thoroughness benefits in many testing situations, but our tool, along with the pairwise and combinatorial test design approaches that Hexawise enables, has its limitations. It is primarily by talking to software testers about their positive and negative experiences that our company is able to improve our tool, enhance our training, and provide honest, pragmatic guidance to users about where and how to use our tool (and where and how not to).

Tool vendors who defend their tools (and/or the approaches by which their tools help users solve problems) as magical, silver bullet solutions are being both foolish and dishonest. Tool vendors who choose not to engage in serious, honest and open discussions with users about the challenges that users have when applying their tools in different situations are being short-sighted. From my own experiences, I can say that talking about the 13 topics raised by James has been invaluable.

By: Justin Hunter on Jun 1, 2010

Categories: Combinatorial Testing, Design of Experiments, Hexawise test case generating tool, Pairwise Testing, Software Testing, Software Testing Efficiency, Uncategorized

Luis Fernández, an Associate professor at Universidad de Alcala is conducting a survey of software testers to gather data relating to, e.g., "Why isn't software testing conducted as efficiently and effectively as it should be?" and "What factors lead to software testing being 'under-appreciated' as a potential career path?"

His survey (as of March, 2010) is listed [here](http://www.cc.uah.es/encuestas/index.php?sid=28392&lang=en).

Personally, I agree that the following two issues (identified in his survey) are significant causes of inefficiency in software testing:

1) "People tend to execute testing in an uncontrolled manner until the total expenditure of resources in the belief that if we test a lot, in the end, we will cover or control all the system." (Or, at least, given the relatively undisciplined test case selection methods prevalent in the industry, my experience in analyzing manually selected test scenarios is that testers generally believe (a) they are covering a higher proportion of an application's possible combinations than they actually are and (b) they underestimate the amount of time that is spent during test execution unproductively repeating steps that they have previously tested)

2) "Many managers did not receive appropriate training on software testing so they do not appreciate its interest or potential for efficiency and quality."

It is unfortunate, but true, that many testing managers do not have any background whatsoever in combinatorial testing methods that (a) dramatically reduce the amount of time it takes to select and document test cases, and (b) will simultaneously improve test execution efficiency when applied correctly. See, for example, this

See also: This slideshow on efficient and effective test design

Please consider taking Fernández's short survey. It takes only 5-10 minutes to complete.

By: Justin Hunter on Mar 1, 2010

Categories: Combinatorial Testing, Efficiency, Software Testing, Software Testing Efficiency