
The Science behind Allergy


Testing the Tests


Dr David L. J. Freed, MB, MD, MIBiol

A reliable, accurate test for diagnosing allergies is what we all need, and for at least 30 years numerous companies have been advertising tests which claim to be exactly that. But how do we test the claims?

Diagnostic tests are conventionally assessed against the following criteria (I will explain shortly why not all of them readily apply to allergy diagnostic tests):

1) Validity: is the test measuring what it purports to measure? Does the RAST, for example, which purports to measure IgE antibodies, actually measure IgE, or is some artifact interfering with the system to make it appear positive - perhaps an unsuspected extra specificity in the antiserum? (Not such an issue nowadays with monoclonals, but a big issue years ago.) Does the second antibody, for example, bind directly to the solid phase, or to the test antigen? What controls are performed to check for such artifacts, and how often?

2) Reproducibility: if I do this test ten times on the same patient on the same day, do I get ten identical results - and if not, why not, and what do I do about that variability in practice?

3) Sensitivity: in any population of patients with a given illness, say an allergy to wheat, what percentage give a positive result in the test (that is, a true positive)?

4) Positive predictive value: in a group of patients who all gave a positive result in the test, how many were actually allergic in clinical fact? (The mirror image of the previous question, but one which can give a totally different picture, as we shall see.)

5) Specificity: in a group of people who did not have the allergy in question, how many gave a negative result in the test?

These last three values are worked out by entering the appropriate numbers into what is called a contingency table, thus:

                                       Test positive   Test negative
  Clinically allergic to wheat
  Clinically not allergic to wheat

Obviously the ideal test gives consistent results in the face of all possible interfering factors (questions 1 and 2), and gives a score of 100% for the other three questions. Thus, for any 200 patients of whom 100 are allergic and 100 are not, our contingency table gives us the three pieces of information that we’d like:

 

             Test pos   Test neg   totals
  Clin pos     100          0       100
  Clin neg       0        100       100
  totals       100        100       200
           (100% predictive)

and

 

             Test pos   Test neg   totals
  Clin pos     100          0       100   (100% sensitive)
  Clin neg       0        100       100
  totals       100        100       200

and

 

             Test pos   Test neg   totals
  Clin pos     100          0       100
  Clin neg       0        100       100   (100% specific)
  totals       100        100       200

But in practice few tests achieve these ideals, and the clinician is left with a degree of uncertainty (which can be quantified, as we shall see). To take a real-life example (Manchester medical student volunteers surveyed in 1982-3):

 

                                            Grass-pollen   Grass-pollen
                                              RAST pos       RAST neg    totals
  Clinically grass-pollen sensitive
    (hay fever)                                   19             14         33
  Clinically not grass-pollen sensitive            2             48         50
  totals                                          21             62         83

which gives the RAST in this condition a sensitivity of 19/33 = 58%,
a specificity of 48/50 = 96%,
and a positive predictive value of 19/21 = 91%.

If this test is positive it is very likely that the patient is clinically allergic, since the test is highly specific; but because it is not very sensitive, almost half of the patients who are genuinely allergic will be wrongly judged “negative” [1]. In practice, sensitivity and specificity tend to pull against each other: tuning a test to raise one usually lowers the other.
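The arithmetic is worth spelling out once. Here is a minimal sketch in Python (purely my own illustration of the calculations above, not code from the original survey) showing how the three figures fall out of the four cells of the table:

    def contingency_metrics(tp, fn, fp, tn):
        # Sensitivity, specificity and positive predictive value
        # from the four cells of a 2x2 contingency table.
        sensitivity = tp / (tp + fn)   # true pos / all clinically positive
        specificity = tn / (fp + tn)   # true neg / all clinically negative
        ppv = tp / (tp + fp)           # true pos / all test-positives
        return sensitivity, specificity, ppv

    # Hay-fever survey: 19 true pos, 14 false neg, 2 false pos, 48 true neg
    sens, spec, ppv = contingency_metrics(tp=19, fn=14, fp=2, tn=48)
    print(f"sensitivity {sens:.1%}, specificity {spec:.1%}, PPV {ppv:.1%}")
    # -> sensitivity 57.6%, specificity 96.0%, PPV 90.5%
    #    (the 58%, 96% and 91% quoted above, to one decimal place)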

The problem: The underlying assumption for all these assessments is that we already know which patients are really allergic and which are not. Only then can we test the test. In the above example of hay fever we were fairly confident that we knew these figures because all the students gave a crystal-clear history (we excluded any who didn’t) and we confirmed the diagnosis when necessary by nasal challenge with grass pollen, to which a positive result is unmistakable and cannot readily be imagined or faked.

The same may be true of food allergy/intolerance if the symptoms are objective and clear (e.g. urticaria), but it is much harder if they are semi-subjective (like asthma), subjective (headaches, hyperactivity, fatigue and so on) or masked. In an attempt to overcome this difficulty Charles May in the USA introduced the double-blind challenge, and Ronnie Finn in this country, followed by David Pearson and his group, placed it on a statistically sound basis [2,3]. The conditions for doing this successfully are quite strict. Firstly, challenges can only be done once you have got the patient symptom-free, or almost so, so that if you succeed in provoking symptoms they will be noticed by the patient. You therefore need (a) a pretty shrewd suspicion of the likely culprits before you start, so the patient knows what to avoid, and (b) a condition in which the symptoms abate rapidly once the culprit food is withdrawn - if it takes six months you won’t get far. You then challenge the patient with the suspect foodstuff at suitable intervals, interspersed with placebo challenges (having first ascertained that the “placebo” is indeed inert for this patient), and the patient tells you after each challenge whether he thinks it was “active” or “placebo”.

Now, for a subjective and capricious symptom like headache, the patient could give the correct answer by lucky guess, and this happens about 50% of the time [3]. In mathematical parlance we say that the probability (p) is 50%, that is:

p = 0.5

So we do a second challenge. He could correctly identify both challenges by lucky chance, but the likelihood of that is lower, merely:

p = 0.5 x 0.5 = 0.25

which is less plausible but still possible. So we do it a third time, yielding:

p = (0.5)³ = 0.125.

Do you believe that now? Well, do you believe it’s possible to toss a coin and get heads three times in a row? Yes; it doesn’t often happen but it does sometimes. So do it five times, and if the patient still gets it consistently right you feel sure it’s no longer lucky chance, because now:

p = (0.5)⁵ = 0.03 (i.e. 3%)

and that gullible you’re not.

That’s all fine and dandy if the patient gets it consistently right, but what if he makes one or two mistakes – does that prove he’s not intolerant? No, but if you allow for that possibility you have to do more challenges [3] – and for that reason Finn and Cohen [2] did a series of ten challenges for each patient, for each foodstuff tested (it took several weeks).
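The statistics behind “do more challenges” are simply the cumulative binomial distribution. The sketch below (my own illustration of the reasoning in [2,3], not code from those papers) computes the probability that a patient with no genuine sensitivity, guessing at random, calls at least k of n challenges correctly:

    from math import comb

    def p_by_chance(n, k):
        # Probability of at least k correct identifications out of n
        # challenges under pure guessing (each call right with p = 0.5).
        return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    print(p_by_chance(5, 5))    # all 5 right: 0.03125, the 3% above
    print(p_by_chance(6, 5))    # one mistake in 6: ~0.11, unconvincing
    print(p_by_chance(10, 9))   # one mistake in 10: ~0.011, convincing again

The numbers show why, once mistakes are allowed for, a series as long as Finn and Cohen’s ten is not excessive.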

And after all that, challenge studies only prove the patient allergic/intolerant (within the limits of the statistics) if the overall result of the series is positive. If negative, they do not disprove it, for a whole host of reasons [4].

For that reason it is never possible, for many food intolerance states (namely those very conditions in which we most need reliable tests), to fill in the spaces in the contingency table given above, and the sensitivity, specificity and predictive values of food-allergy tests in these conditions never will and never can be known. Many of our conventional colleagues, sadly, have failed to grasp this last point and persist in the fallacy of pronouncing “not allergic” a patient who fails to identify all (or most) challenges correctly. The correct verdict in such a case is “not proven”.

A more serious difficulty (and a more controversial one) is the tendency of some conservative allergists to attribute a negative challenge series to psychogenesis or hyperventilation (especially if, within the series, some of the “placebos” were wrongly identified by the patient as active) [3,5]. This suggestion almost certainly contains truth in some cases, and exposes us to the jibe of our enemies that we are treating imaginary illnesses with equally imaginary cures. Since we Clinical Ecologists are the ones who want to expand the range of “allergy” to include masked allergies, the onus is on us to prove our case, and we can’t. Theron Randolph agreed, in private conversation with me, that masked allergy is a diagnosis that can only be made in retrospect, and the evidence of success is weak [6].

Thus after 35 years as an allergist, I find myself incompetent to make a confident diagnosis of allergy in most of the patients who come to me, lost souls who have largely been rejected by our profession and are in the greatest need of a robust diagnosis. Conventional allergists sidestep the problem by saying “there is no evidence of allergy” in these patients and discharging them back to the tender mercies of pharmacodoxy. Most of us resist doing that because that would deny the sufferers treatments that we know (through experience) stand a good chance of helping. But if we can’t make a diagnosis, what gives us licence to treat, and by what right do we practise as allergists?

To answer that we need to take a step back and ask a deeper question: why do patients come to doctors in the first place? Common sense suggests that they want to be made better, and whether or not the doctor can make an accurate diagnosis is less important to them. It just so happens that, for all sorts of good reasons, we doctors have set ourselves the task (at least in principle) of making accurate diagnoses before we treat.

But incompetence at diagnosis does not prevent successful treatment, at least in this field. We can be lousy diagnosticians but still restore health to a lot of previously sick people, as any doctor who prescribes medicines for migraine, back pain or IBS can attest (because these are descriptive labels, recognitions rather than diagnoses, and tell us little that the patient didn’t already know). The inescapable corollary is that yes, we are pleased if there is a strong placebo effect to help us.

And that belief is what keeps me going. With this in mind, and returning now to the theme of testing the tests, Waickman and I more recently proposed, when examining the value of diagnostic tests [6], a sixth, clinical criterion:

6) Utility value: what proportion of ill people (with clearly defined syndromes x, y and z) get better as a result of following the results of this test? Now that is a question that means something to us jobbing clinicians, matters to patients, and can be investigated by double-blind controlled trials, because now it is a treatment being tested, while the test behind it is tested only indirectly.

In a small unpublished trial of food-antibody tests at Manchester University in 1982, Janet Ditchfield and I tested 81 migraine sufferers for antibodies to a panel of 13 likely food triggers. For each patient we generated two lists: one of the foods to which the patient had antibodies, and one of the foods to which he or she did not. Some of the patients were given their “antibody-positive” lists and instructed to avoid the foods listed; the others were given the same instructions for their antibody-negative lists. Twenty-two patients (14 on the “positive” diet and 8 on the “negative”) completed the diets and reported back.

Seven of the 14 “antibody-positive” dieters reported improvement after two months, as did 4 of the 8 “negative” dieters - exactly half in each group. The “placebo” diets probably performed relatively well because all the “negative” nominated foods were in any case in the high-risk category (grains, dairy, citrus, caffeine). Incidentally, the prevalence of food antibodies in migraineurs and in healthy controls was not significantly different, nor was there any significant effect of age, sex, or drinking or smoking habits. The most antibody-provoking (though not necessarily symptom-provoking) foods in all groups were yeast, milk and mushroom [7], although the three commonest clinical culprits in migraine are milk, chocolate and benzoate [8].
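Those improvement rates can also be compared formally. The following sketch is an after-the-fact illustration of my own (no such calculation formed part of the trial); it applies Fisher’s exact test to the two groups:

    from scipy.stats import fisher_exact

    # Rows: antibody-positive diet, antibody-negative (placebo) diet
    # Columns: improved, not improved
    table = [[7, 7],   # 7 of 14 "positive" dieters improved
             [4, 4]]   # 4 of 8 "negative" dieters improved
    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio {odds_ratio:.2f}, p = {p_value:.2f}")
    # -> odds ratio 1.00, p = 1.00: no detectable advantage for the
    #    antibody-guided diet over the placebo diet in this tiny sample.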

We didn’t publish the work because the number of patients was small, the results were negative, and no-one really expected it to work in the first place. But ever since then my criterion for any food-allergy test has been this: how many patients with a given condition can I get better by applying a diet based on intelligent, informed guesswork, using no test at all (the answer is usually around 55%), and how well does the proffered test perform in comparison? Hunter’s results from Cambridge throw the comparison into even stronger relief, since that same percentage of his IBS patients get better on a standard diet, without even attempting to individualise it [9]. If a test achieves merely that same overall result, it adds nothing apart from the possible fringe benefit of improving patient compliance. So far, I am unaware of any food-allergy or intolerance test that gets more patients better than I can by taking a careful dietary history and using intelligent guesswork (although it could be that using a test in addition would improve results further). It would also be true to say, I think, that patients without access to an allergist could get useful clues from one of these tests, if it scores as well as we do, although that would entail the well-known risks of unsupervised elimination dieting.

I cannot resist adding, in closing, that those of us who desensitise using EPD and/or neutralisation really hold the trump cards, because we don’t actually need to know which foods (and environmentals) the patient is sensitive to - we can simply desensitise, blunderbuss-fashion if need be, and put a stop to (or at least alleviate) most reactions in most patients. That is a whole new ball game, with success rates (complete or worthwhile partial resolution of symptoms) in the region of 80%, sometimes approaching 90% [10]. But it gets us no nearer our goal of making confident allergy diagnoses.

REFERENCES

1) Freed DLJ, Wilson P, Downing NPD, Musgrave D.
The cytotoxic test in the diagnosis of immediate-type respiratory allergy.
Int J Biosoc Res 1986, 8: 1-9.

2) Finn R, Cohen NH. Food allergy – fact or fiction?
Lancet 1978, 1: 426-8.

3) Pearson DJ. Problems with terminology and with study design in food sensitivity.
In Dobbing J (ed) Food Intolerance, Baillière Tindall, London, 1987 pp 1-23.

4) Freed DLJ. False-negative food challenges.
Lancet 2002, 359: 980-1.

5) Pearson DJ, Rix KJB, Bentley SJ. Food allergy: how much in the mind?
Lancet 1983, 1: 1259-1261.

6) Freed DLJ, Waickman FJ. Laboratory diagnosis of food intolerance.
In Brostoff J, Challacombe SJ (eds) Food Allergy and Intolerance, Saunders, London, 2002, p 839.

7) Ditchfield J. Antibodies to Foodstuffs in Illness.
PhD Thesis, Manchester University, 1983.

8) Egger J. Food allergy and the central nervous system in childhood.
In Brostoff J, Challacombe SJ (eds) Food Allergy and Intolerance, Saunders, London, 2002, pp 695-701.

9) Mullan MMC, Hunter JO. Diagnosis of gastrointestinal food allergy and intolerance in adults.
ibid, pp 867-874.

10) Maberly DJ, Anthony HM, Birtwistle S. Polysymptomatic patients:
A two-centre outcome audit study.
J Nutr Env Med 1996, 6: 7-32.
