Knowledge From Structure
azelastine nasal spray data
A recent paper in the Journal of the American Medical Association: Internal Medicine touts the apparent efficacy of an azelastine nasal spray in combating covid. Azelastine is an antihistamine available over the counter for many years under names like Astelin and Astepro, and is known to have anti-viral properties.
It looked interesting, but before I had time to dig into it, Micheal Hoeger, a prof at Tulane, posted on Twitter:
I read this article with such excitement, especially being published in JAMA IM. Unfortunately, the devil is in the details. When the analyses account for dropout (correctly), the findings largely fall away. This is buried in the Supplemental Materials.
Huh.
There are a bunch of ways to look at the data in this paper, and I’m going to talk about a few of them but only really look closely at two: the statistical significance of the effect, and the time structure of the data (that is: when the cases occur).
This kind of analysis mostly involves counting things. It’s incredibly hard to do well.
In the present case, the things we are counting are a) people and b) cases of covid.
The study involved 450 people, of whom 20 got covid over the 56-day study period. The people were divided into two groups: the treatment group, which consisted of 227 people who got azelastine spray, and the control group, which had 223 people. The control group got a dummy spray.
BUT: not everyone in each group completed the study. People were eliminated for various reasons, leaving just 179 in the treatment group and 174 in the control group.
Handling people who don’t make it to the end of a study is a hard problem. In cancer research we used to talk about “local control”, which meant the patient didn’t have a recurrence of the same disease in the same part of the body for five years post-treatment. The problem is that people who died due to other causes before the five years was out had less chance for their cancer to recur, and cancer is a disease of age, so the patient population is selected for susceptibility to other causes of death. Without careful handling of other causes, you could get 100% local control and nobody surviving five years.
I started out writing this as a report on my handling of their drop-out data, as their scenario-based approach is not very good. But in the course of writing up that analysis (which shows that a bit more than 5% of the time their drop-outs will have been infected with covid often enough to wipe out the significance of their primary analysis) I made the mistake of doing a more basic analysis of the structure of the data, and wasted a few evenings on it. That’s what I’m writing up here.
The raw numbers are: of the 179 people left in the treatment group at the end of the study, 5 got covid. Of the 174 people left in the control group, 13 got covid.
One interesting question is: given the treatment and control groups are about equal size, is the difference between 5 infections and 13 infections statistically significant? What this means in conventional statistics is, “How frequently would we expect to get a difference between groups that is at least this big by chance if there really is no difference between them?” This frequency is called the “p-value” of the result and is just the probability of two identical groups appearing at least this different due to random chance.
The authors of this study use something called a two-proportion z-test to get their p-values, which is a fairly reasonable choice.
It’s easy to reproduce the p-values found in the paper. Two lines of Python are sufficient:
from statsmodels.stats.proportion import proportions_ztest
print(proportions_ztest([5, 13], [179, 174], 0))
This gives a z-value of -1.9975 and a p-value of 0.046, so we can “reject the null hypothesis at a p-value of 0.05”. That means the result is (just) statistically significant, although this says nothing about how big the effect is. It’s easy to get a result that’s statistically significant but clinically insignificant. For example, a badly made die might come up six 1% more frequently than chance alone would predict, and after 50,000 rolls the difference would be apparent with a p-value of less than 0.00005, which is extremely statistically significant, but practically useless. For any practical number of rolls the die might as well be fair.
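As a cross-check on the two-proportion z-test above, the same statistic can be worked out by hand from the pooled infection rate. This is only a sketch: the helper name `two_proportion_z` is made up for illustration, and it assumes the pooled-variance form of the test with a two-sided p-value. It should agree with the statsmodels figures to the precision quoted.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # common rate if the groups are identical
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return z, p_value

# 5 of 179 infected in the treatment group, 13 of 174 in the control group
z, p = two_proportion_z(5, 179, 13, 174)
print(f"z = {z:.4f}, p = {p:.3f}")
```

Writing it out makes it clear that the only inputs are the four counts. Nothing about when the cases occurred enters the calculation.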
This disconnect between statistical significance and practical or clinical significance is one of the major weak points of non-Bayesian analysis. Bayesian analysis is better suited to asking questions like, “How plausible is it that this effect is at least a factor of two, given the data?”
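To make the “factor of two” question concrete, here is a rough sketch of how it could be posed. It is not an analysis from the paper or one I have carried through here: it assumes a flat Beta(1, 1) prior on each group’s infection rate and simply draws from the two resulting posteriors. The prior, the variable names and the number of draws are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Posterior for each group's infection rate under an assumed uniform
# Beta(1, 1) prior: Beta(1 + cases, 1 + non-cases).
treat = rng.beta(1 + 5, 1 + 179 - 5, size=n_draws)
control = rng.beta(1 + 13, 1 + 174 - 13, size=n_draws)

# Fraction of posterior draws in which the control rate is at least
# twice the treatment rate.
prob_factor_two = np.mean(control >= 2 * treat)
print(f"P(control rate >= 2 x treatment rate | data) ~ {prob_factor_two:.2f}")
```

The answer is a direct statement about the size of the effect, which is the thing a p-value does not give you. It does depend on the prior, so a different prior would shift it.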
For a study like this, my preferred way of approaching statistical significance is resampling: you run a Monte Carlo simulation of the experiment assuming there is no difference between the groups, and see how often you get a difference at least as big as the one you observe.
In this case, the overall infection rate was (5+13)/(179+174) = 18/353 = 0.051… or 5.1%. So one way to simulate the outcome is to generate 353 random numbers between 0 and 1, each of which represents a person, and then count how many are less than 0.051 in the first 179 and how many are less than 0.051 in the following 174. Those are the treatment and control covid cases for that run. Then do it again, maybe ten thousand times, and at the end you’ll have ten thousand runs of this study with the number of covid cases in each group… assuming there’s no difference between them.
Because of its use of random numbers, Monte Carlo simulation was named after the famous casino. The modern method was developed in the 1940s at Los Alamos by Stanislaw Ulam and John von Neumann, with the name usually credited to Nicholas Metropolis, although the Italian physicist Enrico Fermi had used similar random-sampling techniques a decade earlier. It is a powerful and relatively simple technique to figure out the odds without any of the complicated stuff that more sophisticated statistical analysis requires, which is why it is so rarely used.
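The recipe just described takes only a few lines. This is a minimal sketch rather than my exact script: the seed, the variable names and the choice of NumPy are assumptions. The “at least as big” criterion used here is 5 or fewer treatment cases together with 13 or more control cases.

```python
import numpy as np

rng = np.random.default_rng(42)
n_treat, n_control = 179, 174
rate = (5 + 13) / (n_treat + n_control)   # pooled infection rate, about 5.1%
n_runs = 10_000

# One uniform random number per person per run; a value below `rate`
# counts as a covid case.
draws = rng.random((n_runs, n_treat + n_control))
treat_cases = (draws[:, :n_treat] < rate).sum(axis=1)
control_cases = (draws[:, n_treat:] < rate).sum(axis=1)

# Fraction of no-difference runs at least as lopsided as the observed 5 vs 13.
extreme = (treat_cases <= 5) & (control_cases >= 13)
p_resample = extreme.mean()
print(f"resampling p-value ~ {p_resample:.4f}")
```

Because the estimate comes from a finite number of runs, it will wobble a little from seed to seed. More runs tighten it up.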
In the present case, if we run a Monte Carlo we get 5 or fewer cases in the treatment group and 13 or more cases in the control group 1.2% of the time when both groups have the same 5.1% chance of infection.
That is, if you ran this precise study 100 times, you would on average get a result like the one observed just once by chance alone, if there was no difference between the groups.
Different types of statistical test have different levels of power. So the p-value from resampling is 0.012, as opposed to the p-value from the two-proportions z-test, which is 0.046. Both of these are considered “significant” by the lax standard of “one in twenty times this result would come up by chance if the groups are really the same”, which is what p = 0.05 means.
The nice thing about Monte Carlo analysis is that it’s easy to start messing with the assumptions that lie behind more conventional statistical tests. What if, for example, the risk of infection varies with time, or has some other non-uniformity? Monte Carlo simulation lets us explore those possibilities in simple and transparent ways. This is important because if we set p = 0.05 (one time in twenty) as the threshold for “statistical significance”, there are a lot of non-statistical things that can mess up the results. It is incredibly easy for subtle systematic effects to create differences between groups that are larger than the commonplace statistical threshold, which is why so many published results are wrong, and why my own personal threshold for statistical significance is more like 0.001, not 0.05.
Systematic effects abound. Achieving anything like true randomness is fantastically hard. In radiation detection we manage it without too much difficulty because gamma rays and the like are more-or-less devoid of personality: there are very few dimensions they can vary on. People, on the other hand, are multi-dimensional. I mean, have you met any?
This multi-dimensionality allows for all kinds of subtle correlations, every one of which is a violation of strict randomness, and such things never average out: they always push the result in one direction or another, usually in the direction the experimenters want it to go, strangely enough. Some of this may be due to publication bias in non-scientific fields, where null results are more difficult to get past editors and referees than positive ones.
So having done enough work to see that maybe there is a statistically significant effect here, it’s worth looking at the structure of the data to see if there is any evidence of other non-statistical features.
When we do this, we see the time series of cases is peculiar:
This is odd. And remember Isaac Asimov’s observation: the most important statement in science is not, “Eureka!” (I’ve got it!) but, “That’s odd…”
The control group cases are bunched up in the first 20 days, with only 3 of 13 occurring after that. Looked at another way: all but one of the control group cases happen in the first half of the study. I’m not sure what the outlier is doing at day 63 of a 56-day study, but life is too short for me to go looking for it: I’m doing this analysis out of curiosity, and that only takes me so far these days.
The treatment group cases, on the other hand, are bunched up in the middle, between 20 and 45 days.
Assuming the risk of infection in each group was uniform across the whole 56 days of the study, what are the odds of that happening? This is particularly important when you realize that covid runs in waves, and those waves have significant structure on a scale of a month or two. They also have geographic structure, affecting different areas at different times. Any subtle correlation between control or treatment group and a particular time or place might therefore introduce a significant non-random element into the mix.
But is this casually-observed feature of the data statistically significant?
Let’s take the infection risks of the two groups at face value, and ask, “What are the odds of 10 out of 13 infections in the control group happening in the first 20 days?” and “What are the odds of all 5 of the infections in the treatment group happening between 20 and 45 days?” These questions are statistically generous, as the intervals we’re asking about are actually a bit bigger than strictly necessary. So if we get answers that say, “Not very likely” we have strong evidence that something non-random is going on, and something non-random in a randomized controlled trial (RCT) is a very bad thing. The odds of 10 out of 13 infections in the control group happening in a given interval are completely unrelated to the relationship between the control and the treatment group: it is just asking, “If the risk of infection is constant in time within this group, what are the odds of most of the infections happening so early? Or so bunched up in the middle?”
To do this properly requires that we knock out people already infected as time goes on. The probability of infection per day is 13/65 for the control group, but the number of susceptible individuals drops as people get infected, so the mean number of people infected on any given day drops slowly over the course of the study.
If we run a simple Monte Carlo for 1000 shots we find that we get 10 or more of 13 cases coming in at 20 days or less just 1.0% of the time: p = 0.01. If we ask how often we get 12 or 13 cases in just 31 days we get p = 0.03.
If we ask how often we get all five of five cases between 20 and 45 days we get p = 0.03, and remember: these are generous limits. The actual cases lie between 21 and 43 days, inclusive.
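For anyone who wants to poke at this, here is one way the constant-risk simulation could be set up. It is a sketch under stated assumptions, not my exact code. The per-person daily hazard is taken as the group rate (13/65 for the control group) divided by the group size, the 5/65 rate for the treatment group is assumed by analogy, and the function name `window_tail_prob` is made up. It counts runs with at least the stated number of cases inside the window, which may not match the criterion behind the numbers above, so the estimates need not agree exactly.

```python
import numpy as np

def window_tail_prob(group_size, group_rate_per_day, first_day, last_day,
                     min_cases, n_runs=20_000, seed=1):
    """Estimate how often at least `min_cases` infections fall on days
    first_day..last_day (inclusive) when everyone in the group faces the
    same constant daily hazard."""
    rng = np.random.default_rng(seed)
    hazard = group_rate_per_day / group_size   # assumed per-person daily risk
    # Day of first infection for each person. Each person contributes at
    # most one case, which is the "knock-out" of people already infected.
    infection_day = rng.geometric(hazard, size=(n_runs, group_size))
    in_window = (infection_day >= first_day) & (infection_day <= last_day)
    return np.mean(in_window.sum(axis=1) >= min_cases)

# Control group: 10 or more cases in the first 20 days.
p_control = window_tail_prob(174, 13 / 65, 1, 20, min_cases=10)
# Treatment group: 5 or more cases between days 20 and 45.
p_treat = window_tail_prob(179, 5 / 65, 20, 45, min_cases=5)
print(f"control: {p_control:.3f}, treatment: {p_treat:.3f}")
```

Drawing a geometric “day of first infection” for each person is equivalent to stepping through the study day by day and removing people once they are infected, but it runs in one vectorized pass.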
So the time structure of the data looks pretty improbable, and if we think there is a significant effect because p = 0.046 (based on the two-proportions z-test) or even p = 0.012 (based on my Monte Carlo) between the treatment and control group, we have to think there is something decidedly non-random in the temporal distribution of cases based on the same kind of analysis, because the p-values are similar.
The paper includes an analysis showing the time structure of the treatment and control cases are different from each other, but near as I can tell does not include this kind of analysis, which shows they are different from what one would expect if the odds of an individual being infected were constant with time.
That’s as far as I’m going to take this. I don’t really care how big the effect appears to be, because on the basis of what I’ve discussed here, I’m not convinced the effect is real. This isn’t my field, and it’s perfectly possible I’ve messed something up in the analysis, but the effect isn’t huge, its significance goes away if we treat the drop-outs properly even if we accept the rest of the paper at face value, and the time structure of the data as I understand it is not consistent with constant infection risk.
Personally, I’m not going to be adding this type of spray to my arsenal quite yet.


