| title | date | topics | related | abstract |
|---|---|---|---|---|
| The Logic Of Software Testing | 2025-12-12 | | | An inventory of the four inferential methods testers use daily — modus ponens, modus tollens, induction, and abduction — and the pitfalls of each. |
Most testers aren’t consciously aware that their work requires them to engage in a fairly complex and rigorous set of inferential procedures. Most of us grow up absorbing them intuitively, to one degree or another, through the accidents of experience, never giving them names of their own. Aristotle would say you do not know a thing until you are capable of giving it a proper name, and you cannot name things properly until you can define their fundamental essence. Given that as our motivation, let us make a proper inventory of the ways in which a tester thinks.
What Do You Know, and How Do You Know It?
Software testing is “knowledge work” in the most literal sense of the term. The work of the tester is intended to provide a body of evidence from which a software development organisation can establish a shared understanding of what they have built.
In philosophy, the field of study in which this work belongs is “epistemology”. Epistemology is concerned with what you know, how you know it, and why you believe it is true. It is from this field of philosophy that testers derive core concepts of their work. Notions like “belief”, “truth”, “fact”, “perception”, “judgement”, “opinion”, and of course “knowledge” itself.
Software testing is not itself an academic discipline. It is one applied practice of the discipline of epistemology. The difference cannot be stressed enough. Testers practice the art of discovering what can be known about software. For comparison, other applied practices of epistemology would be scientific fields like chemistry or physics. Mathematicians and philosophers (in other fields) also engage in the applied practice of the discipline of epistemology. What differentiates the scientist from the mathematician is the kind of knowledge their practice is interested in, and the methods required to obtain that knowledge. As noted in several previous posts, software testing is also interested in certain kinds of knowledge, and employs certain methods of its own, suited to obtaining knowledge particular to its field.
What methods are those, exactly? How is the tester moving from mere opinion to knowledge, using these methods? Let's explore them, one by one.
Four Key Ways We Reason As Testers
There are many forms of procedural reasoning that investigators, scientists, and indeed software testers engage in, when they are attempting to reach a confident conclusion. Among the variety of methods, there are four essential forms that occur most often in our daily lives, as technicians and testers. Let's examine each separately.
Conditional Logics: The Twin Pillars
The first two should be immediately recognisable from their descriptions, because we engage in them almost continuously throughout our daily lives:
- Modus Ponens - If P implies Q and P; then Q. Used to derive knowledge from conditional statements and established premises. This form of reasoning is typically known as "confirmation" or "validation", in practice. If we find "P", then the positive assertion in "Q" is true. We found "P", therefore "Q" is the case. Here's an example: If the marble jar is full, then John has put all of his marbles away. The marble jar is, in fact, full. Therefore, John has put all his marbles away. This is a useful mode of reasoning where certainties are high, such as in mathematics. However, as we'll next see, this mode of reasoning can be deceiving in more material contexts.
- Modus Tollens - If P implies Q and not Q; then not P. Applied to test and falsify knowledge claims by examining their logical consequences. This form of reasoning is often thought of as "falsification" or "exception reasoning". If "P" were the case, then the positive assertion in "Q" would follow by implication. However, we know from observation that "Q" is not the case. Therefore, "P" cannot be the case, either. Let's take the example from Modus Ponens and rework it: If the marble jar is full, then John has put all his marbles away. However, there are marbles on the floor, so John has not put all his marbles away. Therefore, the jar cannot really be full, however full it appears.
To put it simply, Modus Ponens is primarily about what we can know, and Modus Tollens is primarily about what we cannot know. Thinking about the examples in particular, we can see that the difference between these two forms of reasoning lies in where we focus our attention. To put it another way, the difference lies in where we make our observations. In the case of Modus Ponens, the empirical anchor is P. The example is primarily concerned with what state the jar is in. In the case of Modus Tollens, the empirical anchor is Q. That example is primarily concerned with what state the marbles are in.
In software testing, the gap between the two forms of reasoning can show up in a wide variety of ways. Perhaps the most common (and entertaining) symptom of this is the sarcastic developer slogan, "Works On My Machine!". The developer's focus is on his code. The tester's focus is on the target environment. The developer is operating in Modus Ponens mode; the tester is operating in Modus Tollens mode. Or, to put it more simply: the developer is asking himself, "what gets the code to work?", and the tester is asking himself, "what gets the code to stop working?"
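The contrast between the two modes can be sketched in code. This is a minimal illustration, not a recipe: the `slugify()` function and its claimed property are invented here purely as an example of a conditional claim a team might hold about its software.

```python
# Hypothetical function under test (an illustrative assumption).
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")

# Modus ponens (confirmation): pick a P known to hold, confirm Q follows.
# "If the input is a simple title, the output is a clean slug."
assert slugify("Hello World") == "hello-world"

# Modus tollens (falsification): hunt for observations where Q fails.
# Any counterexample refutes the universal claim "slugify always yields
# a clean, alphanumeric-and-hyphen slug" (not-Q, therefore not-P).
suspects = ["  padded  ", "Tabs\there", "emoji 🙂", ""]
broken = [s for s in suspects if not slugify(s).replace("-", "").isalnum()]
# `broken` now holds the counterexamples the confirmation test never sees.
```

The confirmation test passes happily; the falsification pass is where the "works on my machine" gap closes.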
Induction: Can you reproduce it?
The next most common form of reasoning can be seen in the question asked in this heading. Given a situation in which I observe X when I do Y, am I justified in concluding that X always follows Y? What criteria must be satisfied to qualify for justified belief? Repeating the scenario two times? Ten times? A hundred times? This is known as inductive generalization.
Our intuition might incline us to suspicion at such an approach to belief. How could any arbitrary number of repetitions justify a belief? But before we concede to our skepticism, consider the fact that the most common example of a conclusion drawn from such a generalization, is our automatic expectation that the sun will rise in the east tomorrow morning. In our own context, an example of inductive generalization that directly affects our business would be the phases of a drug trial, where conclusions are drawn from the results of thousands of trial volunteer repetitions.
Still, we should take care when using inductive generalization to reach conclusions about our work. There are two common traps of generalization of which we ought to be wary. Let's discuss each separately.
Dirty Inductions
The first is "weak" or "dirty" generalization. This form of induction too easily accepts a generalization as true, because no counterexample is available. The problem can be demonstrated by extending the drug trial metaphor from above. Let us consider: Children who get the measles vaccine almost never get measles. We conclude from this, that the measles vaccine prevents the contraction of measles.
But is this conclusion justified? To see why it is not, let me modify the analogy slightly: People who wear garlic necklaces are never attacked by vampires. So, we conclude, garlic necklaces are effective wards against vampire attack. What's the problem with this? Essentially, we haven't established an actual threat. Or, more precisely, we haven't established a causal relationship between the lack of vampires and the presence of garlic. The people who don't wear garlic necklaces also don't experience vampire attacks. So, what's really going on here?
Likewise with the way we've framed the story of the measles vaccine. We would need to establish a like-for-like threat between two similar groups of people, and then show a difference in the group that took the vaccine. In a scientific study, the untreated are known as a "control" group. In this kind of inductive generalization, we are attempting to find a way to rule out the people who don't wear garlic necklaces, so that we can establish a firm basis for our generalization about those who do wear garlic necklaces. This is what is known as "eliminative" or "comparative" induction, because it is an attempt to reason from two parallel lines of evidence in which one line has had a variable introduced, while the other has had it eliminated, providing an anchor for comparison.
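The eliminative structure can be sketched numerically. All the figures below are invented for illustration; the only point is that the inference rests on the *difference* between two parallel lines of evidence, not on the treated group's low rate alone.

```python
# A minimal sketch of eliminative / comparative induction.
# All counts are invented illustrative numbers.

def incidence(cases: int, group_size: int) -> float:
    return cases / group_size

treated_rate = incidence(cases=3, group_size=1000)   # variable introduced
control_rate = incidence(cases=90, group_size=1000)  # variable eliminated

# The garlic-necklace trap: treated_rate being low tells us nothing by
# itself. Only the gap between the two parallel groups licenses the
# generalization "the treatment prevents the outcome".
effect = control_rate - treated_rate
```

Without the control line, `treated_rate` alone is the garlic necklace: consistent with the claim, but evidence of nothing.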
Correlation Versus Causation
The second problem with inductive generalization is the problem of spurious association. This form of induction too easily accepts a causal claim from a consistent association. It assumes a causal link from a correlation, even if a causal agent is unknown. The fallacy in this assumption has been famously illustrated at the website Spurious Correlations, where, among other things, we can discover that the national per-capita consumption of margarine in the US is nearly perfectly correlated with the divorce rate in the state of Maine.
The central question this example leaves open: is the decline of margarine consumption responsible for the falling divorce rate? Or is the falling divorce rate the cause of the decline in margarine consumption? It turns out, both questions are illustrative of the "spurious association" problem. These sorts of simple linear associations are quite easy to find. What is much more difficult is maintaining the correlation. If you were to expand this graph out by 40 years, the relation would collapse. In addition, there are literally dozens of much more sensible common causes that could be identified to explain this brief period of correlation. As off-the-cuff examples: a decline in preference for vegetable oils, a change in economic policy in Maine making marriage more lucrative, the migration of people inclined to divorce to other states, the cost of margarine production, and so forth. There are presently over three thousand such examples of spurious association to be found on the Spurious Correlations website. They're quite fun to read.
However, sometimes correlation is indeed an excellent indicator of causation. The most famous modern instance of this is the history of cigarette smoking. For decades, cigarette smoking was strongly correlated with many health problems. Lung cancer, pulmonary disease, ischemic stroke, and several lesser health conditions were very often observed in the majority of the population that also regularly smoked cigarettes. Many people pointed to this as an obvious problem, but others (including many doctors) dismissed the concerns as a spurious association. It wasn't until the 1970s that medical science finally isolated actual causal agents, such as aromatic hydrocarbons, nitrosamines, and aldehydes, that were directly responsible for the cancers and pulmonary diseases associated with smoking. Over the same period, numerous experts and investigators failed to find "third variable" explanations that could break the strong correlation between the two phenomena, in the way one easily could with margarine and marriage.
Abduction: Elementary, My Dear Watson
This form of reasoning is perhaps the most common across all engineering disciplines. It is the kind of reasoning we do, when troubleshooting a problem, or searching for systemic causes to those problems. The plain English way to describe this form of reasoning, is "inference to the best explanation", and it works by taking an aggregation of weakly related observed phenomena, and hypothesizing plausible common causes, eliminating each until we land on a hypothesis that explains all the phenomena together.
Perhaps the most famous cultural illustration of this kind of reasoning can be found in Arthur Conan Doyle's Sherlock Holmes novels. Most of the mysteries he solved were (despite his declarations to the contrary) concluded abductively, not deductively. For example, here is how he did just that, in The Adventure of Silver Blaze:
- Observation (surprising fact): The watchdog did not bark on the night of the crime.
- Inductive generalization: “In my experience, dogs bark at strangers 99% of the time.”
- Abduction (what Holmes actually does):
  - Step 1: Speculation: There are three possibilities: the dog was removed; the dog was drugged; the dog knew the intruder.
  - Step 2: Elimination: Evidence establishes the dog was present, and not drugged (this, through inductive inferences).
  - Step 3: Abductive conclusion: “The absence of barking is surprising under the stranger hypothesis, but completely expected if the culprit was someone the dog knew. The ‘familiar person’ hypothesis explains the data far better than any rival. Therefore, the best explanation is that the thief was someone the dog knew.”
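Holmes's speculate-then-eliminate procedure has the same shape as everyday troubleshooting, and can be sketched as a tiny elimination loop. The hypotheses, their predictions, and the evidence below are simply the Silver Blaze example re-encoded; the dictionary encoding itself is an illustrative choice, not a general method.

```python
# Each hypothesis predicts what we should observe if it were true.
hypotheses = {
    "dog was removed": {"dog present": False},
    "dog was drugged": {"dog alert": False},
    "dog knew intruder": {"dog present": True, "dog alert": True},
}

# What the evidence actually established (Step 2 of the example).
evidence = {"dog present": True, "dog alert": True}

def consistent(predictions: dict, evidence: dict) -> bool:
    # A hypothesis survives only if none of its predictions clash with
    # an actual observation; unobserved predictions are left standing.
    return all(evidence.get(k, v) == v for k, v in predictions.items())

survivors = [h for h, preds in hypotheses.items() if consistent(preds, evidence)]
# Only "dog knew intruder" survives -- the best explanation *relative to
# the enumerated hypotheses*, which is exactly where abduction is fragile.
```

Note what the sketch makes explicit: the conclusion is only as good as the hypothesis list and the background generalizations feeding it, which is precisely the weakness discussed next.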
As the Holmes novels demonstrate, this form of reasoning can be, and usually is, an extremely powerful tool in everyday life. Indeed, so powerful that Holmes is often portrayed as nearly superhuman in his capacity to discern, compared to the police and his companion Dr. Watson. However, as the method would also indicate, this form of reasoning is heavily dependent upon experience. Or, to be a bit more explicit: the accumulation of empirical knowledge that can be used to reason inductively to generalizations ("In my experience, something does X when Y"). Thus, this form of reasoning is also perhaps the most susceptible to a whole host of problems. Among the most common:
- "Cherry-Picking" - selecting only the bits of evidence that validate a hypothesis we prefer.
- Availability Bias - taking the first thing that comes to mind as the most "plausible" hypothesis, regardless of whether it satisfies all explanatory criteria.
- "Satisficing" - taking the first hypothesis that "fits", out of convenience. Different from Availability Bias in that the hypothesis has to actually satisfy all the explanatory criteria.
- Emotionalism - confusing certain explanatory virtues with the truth. In other words, taking a hypothesis that is emotionally satisfying over a hypothesis that satisfies all the explanatory criteria.
- Undecidability - two or more hypotheses are equally capable of satisfying all the explanatory criteria, and no additional information tips the scales.
- Faulty Background Assumptions - if the generalizations we draw from experience are skewed or mistaken, the abduction will typically fail. What if Holmes' experience had been that only about 50% of dogs bark at strangers?
- Overconfidence - taking the "best explanation" as final and definitive.
Indeed, we can see hints of this in the Holmes story above. Why should we take the generalization as read? Perhaps, if we were to test Holmes' anecdotal experience, we would find that statistically speaking, only about 50% of dogs bark. Even if we assume this background knowledge, why are his three speculations the only possibilities? Perhaps the dog was poorly trained? Perhaps the dog was exhausted and the culprit was especially stealthy? Perhaps the dog was distracted by something else? The possibilities are actually endless. Holmes never checks any of these. He simply declares the “best” explanation based on his own background beliefs and moves on. This is why philosophers of science say Holmes is a brilliant literary example of abductive reasoning done badly in real life.
Conclusion: Think About It
This post is not intended to make philosophers out of engineers or testers. It's also not intended to advocate for any change in the way we do things. However, it is advocating for a more conscious approach to what we already do, day to day. Aristotle rightly points out that it is one thing to be really good at some particular thing, but quite another to know why you are good at it. To advance from "experienced practitioner" (an "empiric") to a genuine artisan (an "epistemic") is to graduate from knowledge of the what to knowledge of the why. It is knowledge of the why that gives us genuine mastery over our environment. And it is that mastery that elevates the quality of any endeavour. So, I would encourage everyone to spend just a little time thinking about how they think about their work.
