Dr Abram Hoffer MD – Lies, Damn Lies and Statistics: The Statistics Game

Townsend Letter for Doctors and Patients – October 2004.

Editor:

The first major medical discovery probably arose after mankind learned how to count. I visualize the first clinical study, an N=1, when a bright mother 10,000 years ago told her partner that a friend had eaten a certain plant and had died. Her family would no longer eat that plant. She might even have observed that that plant did not taste good and have associated toxic food with bitterness or some other bad taste.

This observation would establish the rule that bitterness is associated with poison. No statistics were necessary. All she needed to do was to count. And that is how the relatively few species of plant life that are safe to eat were discovered.

Animals also use the N=1 clinical trial. If a rat eats a poisoned bait and survives he will not thereafter eat the same bait, an automatic reflex that is life saving. This reflex has been used to train wild predators not to eat lambs, for example.

Our first clinical scientist did not demand a double-blind prospective randomized controlled experiment costing hundreds of thousands of dollars. Of course the end point was easy: alive or dead. We use the same end point for judging the efficacy of cancer treatment.

Counting remained the mainstay of clinical science. In the first recorded human clinical trial, Daniel from the Book of Daniel persuaded the overseer to allow his Israelite young men, who had been brought to the palace to learn administration, to eat the food they were accustomed to, rather than the richer food of the palace. After two weeks, the overseer noted that Daniel’s charges were healthier, and he permitted them thereafter to eat their own food. All he had to do was count the men who were well. This was not a double-blind randomized clinical trial and the holy 0.05 probability point was not invoked. Counting was adequate.

Sir Thomas Sydenham, considered the father of bedside or clinical medicine, also was able to count.

About 300 years ago he noted that his smallpox patients had a very high death count in the summer and a comparatively low death count in the winter. This was a direct attack on the ancient theory that disease was caused by the presence of humors under great pressure and that the fever ought to be increased to drive these humors out more effectively. Dr Sydenham should have kept his mouth shut. Instead he told his colleagues. He was challenged to a duel, he nearly lost his medical license from the Medical Society, and in the end he pleaded for help from one of the nobility.

Now he is honored by a plaque on one of the buildings near Parliament in London. He did not use statistics. He knew how to count and to keep track of what happened to his patients. Today increasing fever is not generally considered good medical practice.

Coming closer to our modern times, the earliest experiments by which a few vitamins were discovered were done by physicians who knew how to count. Dr. James Lind was the first physician who proved that certain foods would cure and prevent scurvy. He selected eight sailors. Two of them were given fruit containing vitamin C and the other six were treated with the most up-to-date techniques of that day, which included allowing two of the sick sailors to lie on the ground face down to inhale the vapors of the earth.

The two fortunate sailors on fruit recovered and were put to work nursing the other six who did not. But Dr. Lind was merely proving what was already known. English housewives of that day knew that scurvy grass was needed to cure “The Scurvy.” It was a tradition that was passed on from mother to daughter.

How can you provide a statistical analysis of a small series of two on the fruit and six who were deprived? But we now are convinced that scurvy is caused by an absence of vitamin C. It took the British Navy many years before they believed this and began to give their sailors lemons or limes. This simple trick saved England from an invasion by Napoleon. The French sailors could not stay out to sea very long before they became ill, while the healthier English sailors could stand out and wait for the French Navy to come to them.

Dr Takaki, the Japanese physician, proved eating rice bran would prevent beriberi. Many experiments were run but no statistic was required. The observations were clear and elegant.

Over the past 100 years counting was modified into statistics. This arose from the needs of English gentlemen who wanted to win money when they bet at Monte Carlo. It was observed that when one throws the dice, if the dice are properly designed and made, any one of six numbers in a large series of random tosses will come up one-sixth of the time. This was adapted to studies in agriculture using plants and animals. But probability theory was very carefully described. The two basic preconditions are: (1) that the phenomenon being studied must be invariant; that means it will be as active in 1,000 years as it is now, and of course, the dice were so designed that this is true; and (2) that in drawing a sample from any larger population, that sample must be truly representative of the larger population.
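The one-sixth figure for a fair die can be checked by simulation. What follows is only a minimal sketch in Python, not anything taken from the early betting studies. The function name `face_frequencies`, the fixed seed and the number of tosses are assumptions chosen for illustration.

```python
import random
from collections import Counter


def face_frequencies(tosses, seed=0):
    """Toss a simulated fair six-sided die and return each face's share of the tosses."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(tosses))
    return {face: counts[face] / tosses for face in range(1, 7)}


if __name__ == "__main__":
    for face, share in face_frequencies(60_000).items():
        print(f"face {face}: {share:.3f} (expected {1 / 6:.3f})")
```

The simulated die meets both preconditions by construction: it never changes, and every toss is drawn from the same population. That is exactly what cannot be assumed of patients.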

The first condition is never true in biosocial experiments because nature is not that accommodating and it will change no matter how much we might prefer that it remained static. The second condition is very hard to achieve because it is too expensive, too unwieldy and time-consuming.

Sir Lancelot Hogben (1) in his book analyzed the question of whether probability theory could be used in human clinical studies, and at the end of the book he concluded that it could not. How many modern statisticians have read Sir Lancelot Hogben? Nevertheless, probability theory remains the main method of doing these clinical studies.

A modern twist was added when it was realized that patients are human and they respond to such things as hope, faith, belief, and so on. This is the placebo effect which of course is real, very important and ought to be an essential element of therapy for every physician. Now the question arose, how does one distinguish what is the real therapeutic phenomenon; is it the placebo effect or is there a real therapeutic effect indeed?

I wonder why this question was ever raised. If the placebo is an essential element of all healing, removing it removes the healing and will injure the patients.

The answer is the double-blind. The fact that this method has itself never been tested seems not to matter. It has become the gold standard of clinical research. It is probably a very useful test for officials who have to decide if a drug has any value in treatment, and for editors who have to decide whether they should publish a paper or not. For total dependence on the P value = 0.05 removes the need to think, to reason, and to do so acutely and with wisdom.

I have been critical of the double blind for many years, even though under my direction we were the first psychiatrists to conduct these experiments, starting in 1952 (2-5). But I will not reiterate my objections to this test, which I consider not the gold standard of modern investigations but a rather inefficient, expensive method of doing human experimental trials. I consider it unethical since it is based upon lies: lies to the patient and to the doctor running the test. And I doubt that any more than a very small number of these trials are really blinded, as this is very difficult to do.

But I do want to discuss a rather new way of conducting these trials to maximize the chance that positive results will be obtained, if that is one of the objectives, and the way changes in percentages are emphasized rather than total numbers. For example, if 5% of the treated group responds compared to 3% of the placebo group, this is proclaimed as a 66% improvement. It is more honest to report that the 66% is an improvement in the percentage values, not in the response rate itself, which rose by only two percentage points.
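The gap between the two ways of reporting the same result can be made explicit with a few lines of arithmetic. This is a minimal sketch using the 5% and 3% figures from the example above. The function names are hypothetical and chosen only for illustration.

```python
def absolute_difference(treated_pct, placebo_pct):
    """Difference between the two response rates, in percentage points."""
    return treated_pct - placebo_pct


def relative_improvement(treated_pct, placebo_pct):
    """The same difference expressed as a percentage of the placebo response rate."""
    return 100.0 * (treated_pct - placebo_pct) / placebo_pct


if __name__ == "__main__":
    treated, placebo = 5.0, 3.0  # example figures from the text
    print(f"absolute: {absolute_difference(treated, placebo):.1f} percentage points")
    print(f"relative: {relative_improvement(treated, placebo):.1f}%")
```

The relative figure comes out at roughly two thirds, the "66% improvement" of the headline, while the absolute figure is two patients in every hundred.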

Reported in this way, the press and the unwary readers will not really know what is happening. A recent example is the report on Femara (Letrozole), which was recently critically examined by Ralph Moss in this journal. This report appeared in the New England Journal of Medicine, October 8, 2003. The study was a double-blind, prospective, randomized, placebo-controlled trial on postmenopausal women who had survived 5 years on tamoxifen. The primary endpoint was disease-free survival, not the death rate. This 5-year study was not completed because in the first interim analysis there was a statistically significant difference in the rate of metastasis and new disease between the groups. The US National Cancer Institute stated that, “…The study was stopped early because women taking Letrozole were much less likely to have their cancer return than women taking placebo, without serious side effects.”

About 2600 women were studied in each arm. Seventy-five from the Letrozole group had a recurrence, compared with 132 from the placebo group. The difference reached 0.001 significance. But of course even minor differences with huge sample sizes reach these significance levels. This means that patients on the drug lived a little longer before they were hit by recurrence or metastasis. In actual numbers, out of 2600 women on this powerful drug taken for 4 years, there were 57 fewer cases of metastasis and recurrence. About 2% of the 2600 patients benefited. Do you agree that the term “much less likely” really applies?
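The recurrence figures can be counted out the same way. The sketch below takes the 75 and 132 recurrences and the approximate arm size of 2600 from the text. The function name `summarize_events` and the assumption that both arms are exactly the same size are mine.

```python
def summarize_events(events_treated, events_placebo, n_per_arm):
    """Compare event counts between two arms assumed to be of equal size."""
    fewer_events = events_placebo - events_treated
    return {
        "fewer_events": fewer_events,
        # share of all treated women who were spared an event
        "absolute_reduction_pct": 100.0 * fewer_events / n_per_arm,
        # the same count expressed against the placebo event count
        "relative_reduction_pct": 100.0 * fewer_events / events_placebo,
    }


if __name__ == "__main__":
    result = summarize_events(75, 132, 2600)
    print(f"fewer recurrences: {result['fewer_events']}")
    print(f"absolute reduction: {result['absolute_reduction_pct']:.1f}% of women treated")
    print(f"relative reduction: {result['relative_reduction_pct']:.1f}% of placebo recurrences")
```

The same 57 women appear as about 2% of those treated or as about 43% of the placebo recurrences, depending on which denominator the reporter prefers.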

When one looks at the hard data, the number of patients who died, the situation suddenly looks quite different. From the placebo group 42 women died within the study period and from the treated group 31 died. But P did not reach the 0.05 level; it reached only 0.25, i.e. there was no significant difference. Again, 2600 women will have to take this new drug for 4 years in order to save 11 (0.4%) of them from dying. But what about side effects? From the treated group 5.8% or about 151 women developed osteoporosis, while from the placebo group 4.5% or 117 did, an increase of 34 cases. So in order to save 11 women, 2600 will have to be on a drug with major side effects for 4 years, of whom an additional 34 will develop osteoporosis.
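The balance of benefit and harm in this paragraph is again a matter of counting. The following sketch reproduces the figures from the death counts and osteoporosis percentages given in the text. The function names and the final ratio of extra osteoporosis cases per death avoided are my own additions for illustration, and the arm size of 2600 is the approximate figure quoted above.

```python
N_PER_ARM = 2600  # approximate arm size given in the text


def cases_from_percent(percent, n=N_PER_ARM):
    """Convert a reported percentage into an approximate number of women."""
    return round(percent * n / 100)


def benefit_and_harm():
    """Tally deaths avoided and extra osteoporosis cases from the reported figures."""
    deaths_avoided = 42 - 31  # placebo deaths minus treated deaths
    osteoporosis_treated = cases_from_percent(5.8)
    osteoporosis_placebo = cases_from_percent(4.5)
    extra_osteoporosis = osteoporosis_treated - osteoporosis_placebo
    return {
        "deaths_avoided": deaths_avoided,
        "deaths_avoided_pct": 100.0 * deaths_avoided / N_PER_ARM,
        "osteoporosis_treated": osteoporosis_treated,
        "osteoporosis_placebo": osteoporosis_placebo,
        "extra_osteoporosis": extra_osteoporosis,
        # derived ratio, not stated in the text
        "extra_cases_per_death_avoided": extra_osteoporosis / deaths_avoided,
    }


if __name__ == "__main__":
    for name, value in benefit_and_harm().items():
        print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

On these numbers there are roughly three additional cases of osteoporosis for every death avoided, and the death difference itself did not reach statistical significance.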

But the study did not proceed for the needed 5 years, which is probably lucky for the study. The incidence of side effects most likely increases with time, and one cannot deny the possibility that, had the study continued for the full 5 years, the difference in the incidence of side effects would have been even more significant. We would also have had a much more accurate estimate of the frequency of new cases of osteoporosis.

The US National Cancer Institute concluded that since only 4.5% of women on the drug had side effects compared to 3.6% of women on placebo, this is tolerable. They considered this difference not significant. But when 2.8% on the drug had recurrences compared to 5% in the placebo group, this was not of major significance. It is likely many of these women will have to add medication against osteoporosis. They recommend that all women on this drug should have their bone density checked and followed.

How many women, when told that their chance of not dying is improved by 0.4%, will agree to take letrozole?

Suppose that instead of Letrozole, the treated group had been on vitamin C. Would the almost universal reaction of excitement and optimism that emanated from the cancer specialists have been the same, or would they have found that the differences were not significant?

This is the type of large scale therapeutic trial that David Horrobin found unethical. He did not agree that patients volunteered for these trials because they are altruistic. They enter because they want to survive and the fact that a trial is being conducted gives them some additional hope. Secondly, he was concerned because large scale clinical studies use up skill, time and money, and make it much more difficult to conduct more effective smaller scale trials. Horrobin recommended that “…We largely abandon large scale trials looking for small effects and instead do large numbers of small trials, often in single centers, looking for large effects.” (7)

I agree with David Horrobin’s view. Drugs which yield a significant difference compared to another drug or a placebo control should be considered not effective if very large scale trials must be conducted in order to reveal this minor difference. For these tests we would depend much less on statistical significance and look for real differences, which are so obvious that a glance would show the results. We need to continue to count. We need to use statistical analyses very rarely.

A. Hoffer, MD PhD, FRCP (C)

References

  1. Hogben L. Statistical theory. The relationship of probability, credibility and error. W. W. Norton & Company (1967)
  2. Hoffer A, Osmond H. Some problems of stochastic psychiatry. J. Neuropsychiatry 4:97-111, Dec 1963. PMID: 14112294
  3. Hoffer A. An examination of the double-blind method as it has been applied to megavitamin therapy. Orthomolecular Psychiatry 2:107-114, 1973
  4. Hoffer A. Double blind studies. Can Med Assoc J 111:752 only, 1974
  5. Hoffer A. A theoretical examination of double-blind design. Can Med Assoc J 97:123-127, 1967
  6. Moss RW. The war on cancer; Femara takes the cancer world by storm. Townsend Letter for Doctors and Patients, 1/1/04
  7. Horrobin DF. Are large clinical trials in rapidly lethal diseases usually unethical? Lancet 361(9358):695-7, 2003 Feb 22. PMID: 12611394. DOI: 10.1016/S0140-6736(03)12571-8

Reprinted with permission by the Townsend Letter for Doctors & Patients – October 2004. Telephone (360.385.6021)
