Pandemic Babies II
On January 4, JAMA Pediatrics published a study claiming to show that babies born during the pandemic are developmentally delayed, even if they were never exposed to COVID-19. The study made national news a few days later, stirring up anxiety among new and prospective parents.
After reading this study, I was so aggravated by the statistical flaws that I e-mailed the editor with my concerns. I expected a polite, noncommittal reply, which I did receive. The editor also invited me to submit a formal commentary. My commentary was published in JAMA Pediatrics this Monday, accompanied by a response from the researchers.
In this newsletter I want to talk about why I wrote the editor, why he took my concerns seriously, how the researchers responded, and whether their response makes sense. Along the way I'll revisit some statistical points I laid out in an earlier newsletter, and I'll add a few remarks on the misuse of stats in science.
Why write the editor?
It's not hard to find misused or misinterpreted statistics in published studies. Usually I find this intriguing rather than aggravating, because it seems like there's no limit on how creatively statistics can be deployed. (I mean “creatively” in both the positive and the negative senses of the word.) Stats are used to help evaluate the authenticity of some of Shakespeare's writing, for example, and just last week I discovered that paleontologists have been squabbling for years about which stats are most suitable for determining the sex of dinosaur fossils (more on that in a future newsletter!).
What aggravated me about this particular study is that it used flawed statistics to justify conclusions that are broad, terrifying, and wrong. At one point the researchers even referred to "the potential for a significant public health crisis" that would affect "hundreds of millions of children". No wonder some people who heard about this study were alarmed! Ultimately, I wrote to the editor in hopes of influencing editorial policy in a small way, by calling attention to specific statistical problems and needlessly distressing conclusions.
Why did the editor publish my commentary?
I have zero expertise in pediatrics, in medicine more generally, or in epidemiology. The fact that the editor of one of the world's leading pediatric journals wanted to publish commentary from a non-expert tells me that my concerns were both understandable and legitimate. That is, even if everything I wrote was incorrect, the editor seems to have assumed that other readers might independently entertain these concerns, or that some of the more clinically oriented readers wouldn't realize that there was anything to be concerned about in the first place.
The original study
For this study, Columbia University researchers obtained data from 62 women who gave birth prior to the pandemic (between November 2017 and January 2020) as well as 255 women who gave birth during the pandemic. (This latter group was further subdivided into mothers who had vs. had not been exposed to COVID-19; the results turned out to be the same for these groups. None of the babies appear to have been exposed.)
All mothers completed an Ages and Stages Questionnaire (ASQ-3) for their 6-month-olds. The ASQ-3 consists of simple questions about the baby's communication, gross motor skills, fine motor skills, problem solving, and personal-social development. The main finding was that the pandemic babies showed slightly lower ASQ-3 scores, on average, than babies born prior to the pandemic. The researchers concluded that babies born during the pandemic exhibit developmental delays at 6 months, even if they'd never been exposed to COVID-19.
Why would the mere fact of being pregnant during the pandemic cause developmental delays? Well, maternal stress has been shown to adversely affect both pre- and postnatal development, and so the researchers speculated that pregnancy during the pandemic caused women to experience more stress than they would've experienced a few years earlier.
My concerns
I described some statistical concerns with this study in a prior newsletter. The commentary I wrote for JAMA Pediatrics is a highly abbreviated version of that newsletter, with a few additional details folded in. (You can see part of my commentary here on the JAMA Pediatrics website; the rest is behind their paywall. I’ve included the entire commentary in this newsletter as Appendix 1.)
Briefly, I raised three concerns with the study.
1. Group differences in developmental delay were determined by comparing raw scores on a screener (the ASQ-3).
This is widely understood to be an inappropriate way of using screeners. Here's how I put it in my earlier newsletter:
Screeners are quick, easy to use, and good at exactly one thing: Identifying individuals who are at risk of some problem (or already experiencing the problem). They're not substitutes for thorough assessments.
Screeners come with cut-off scores. Anyone who exceeds a cut-off score will need further attention, usually beginning with a more thorough assessment. Even if the possible range of scores on a screener is large, only the scores close to or exceeding the cut-off are meaningful. Screeners aren't appropriate for comparing other scores. In fact, screeners like the ASQ-3 don't even allow for a full range of possible scores. As noted in the ASQ-3 technical report, item selection "was restricted by allowing only items that targeted a skill that occurred at the middle to low end of the developmental range" for each age group.
In short, the researchers shouldn't have used the ASQ-3 to compare babies, because the measure isn't sensitive enough for that purpose.
To illustrate, here's a simple analogy. Suppose I make 20 requests, one at a time, and each request is for you to make a simple movement (e.g., "raise your hand", "touch your knee", etc.). Assuming that your range of motion is normal, requests like this are a simple way of screening for language comprehension problems. I might say, for example, that a score of 17 or lower indicates a possible problem that calls for further assessment. Methodologically speaking, this would be fine. What's not fine is to use my 20 requests as a test of comprehension skill. Because these are such simple requests, I wouldn't want to say that someone who carries out 19 requests correctly has better comprehension than someone who does 18 correctly. Rather, I would say that the person with the lower score must've gotten careless at some point. Or distracted. Or they misheard one of my requests. Or they just happened to not know one specific word I used ("touch your clavicle").
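The logic of the analogy can be sketched in a few lines of code (the cutoff of 17 is the invented one from the analogy, not anything from the ASQ-3). The point is that a screener's output is a binary "refer or don't refer" decision, not a ranking of ability:

```python
CUTOFF = 17  # invented cutoff from the analogy: 17 or lower suggests a possible problem

def screen(score):
    """A screener maps a raw score to a referral decision, nothing more."""
    return "refer for full assessment" if score <= CUTOFF else "no concern"

# Both of these scores pass the screen; treating 19 vs. 18 as a skill
# difference reads meaning into noise the instrument can't resolve.
result_a = screen(18)  # "no concern"
result_b = screen(19)  # "no concern"
```

In other words, once two scores land on the same side of the cutoff, the screener has told you everything it is designed to tell you.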
Although I didn't mention this in my commentary, I suspect that the pandemic babies' slightly lower ASQ-3 scores reflect greater anxiety and/or pessimism on their mothers' part. Studies have shown that anxiety, pessimism, and other mental health problems increased among the general public during the pandemic. Imagine being a new mother and feeling slightly freaked out about raising your baby in these difficult times. You might give your baby slightly lower ASQ-3 scores than you would've pre-pandemic, because you're more attentive now to behaviors that seem less than ideal, or you might be more likely to interpret otherwise appropriate behaviors as lacking in some way. (You might even hope that lower scores will signal the researchers that your baby needs further assessment.)
2. Even appropriate use of the ASQ-3 failed to support the researchers' conclusions.
Here's how I put it in my prior newsletter:
Since the ASQ-3 was the only measure used to identify developmental delay, only one type of analysis would've been appropriate: A comparison between pre-pandemic and pandemic groups in the proportions of babies who exceeded cut-off scores (i.e., showed signs of delay).
Buried deep among all the irrelevant statistical comparisons of ASQ-3 score means, there's a sentence – one sentence – in which the researchers report the right kinds of stats:
"Although not specifically powered to detect differences in the proportion of infants who met screening cutoffs for delay, a greater proportion of infants in the pandemic cohort met the gross motor cutoff".
[What] this tells us...is that gross motor skills is the only domain where pre-pandemic and pandemic babies differed. For the other four domains (communication, fine motor skills, problem-solving, personal-social development) there were no significant group differences in proportions of babies who showed signs of delay. No differences, in four out of the five domains tested.
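For readers who want to see what the appropriate analysis looks like in practice, here's a minimal sketch. The counts below are invented for illustration (they are not the study's data); the approach is a standard two-proportion z-test comparing how many infants in each cohort exceeded a cutoff:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented counts: 4 of 62 pre-pandemic infants flagged vs. 35 of 255 pandemic infants
z, p = two_proportion_ztest(4, 62, 35, 255)
```

This is the kind of comparison the ASQ-3 is actually built to support: proportions of infants at risk, not differences in raw mean scores.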
3. The sample is too small.
Although the small sample size made it impossible to run the appropriate statistics, the sampling concern mentioned in my JAMA Pediatrics commentary wasn't a technical one. Rather, I simply noted that ASQ-3 scores from 62 babies are far too few to justify broad conclusions about the impact of the pandemic on human development. (How many babies would you need? Thousands perhaps, so that you could take into account differences in gender, health, SES, and other key variables known to influence development.)
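To give a rough sense of the numbers involved, here's a textbook back-of-the-envelope power calculation for detecting a difference between two proportions. The 8% and 12% rates are illustrative assumptions, not figures from the study:

```python
import math

def n_per_group(p1, p2):
    """Approximate per-group n for a two-sided z-test, alpha = .05, power = .80."""
    z_alpha = 1.96  # two-sided 5% critical value
    z_beta = 0.84   # corresponds to 80% power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# Illustrative: detecting a rise in flagged infants from 8% to 12%
n = n_per_group(0.08, 0.12)  # roughly 900 infants per group
```

Even under these modest assumptions, you'd need hundreds of babies per group just to detect the difference reliably, let alone to control for the other variables mentioned above.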
The researchers' response and my rebuttal
(You can see part of the researchers' response on the JAMA Pediatrics website; here again, the rest is behind the JAMA paywall.)
1. In response to my concern about the use of a screener to compare groups, here's what the researchers wrote:
"We feel strongly that screening tools with high sensitivity and specificity, such as the ASQ-3, confer the greatest benefit for efforts toward identifying early neurobehavioral markers in the context of a novel disease or environment. The ASQ-3 is recommended by the American Academy of Pediatrics and Family Physicians and the US Preventive Services Task Force and is a measure of choice for large National Institutes of Health–funded efforts. The ASQ-3 is widely used by both researchers and practicing general pediatricians, thus allowing for comparison across many geographic and temporal contexts. Throughout the pandemic, researchers and clinicians have relied more on parent-report measures of infant development following the shift to telehealth. The choice to use a parent-report measure was made not only for practicality, but also to circumvent confounding factors associated with novel conditions to the testing environment (eg, researchers wearing masks during the assessment)."
Every sentence in this statement is both true and deeply misleading.
Regarding the first sentence, yes, the ASQ-3 has high sensitivity and specificity, but as a screener, not as a means of comparing babies in the general population. As for the second and third sentences, yes, it's recommended and used by prominent organizations and people, but, again, as a screener. This is like saying that when you have a headache, you should take anti-diarrheal medicine, because prominent organizations and people recommend anti-diarrheal medicine. Of course they recommend it, but not for headaches!
Regarding the final two sentences, yes, during the pandemic, researchers have relied more on parent-report measures of infant development. But that's not relevant here. The parents could've been asked to use a suitable measure instead of the ASQ-3.
2. Regarding my concern that even appropriate use of the ASQ-3 failed to support the conclusions, because group differences were only found for one of the five ASQ-3 domains, the researchers wrote the following:
"Our analysis of the proportion of infants who met the cutoffs for delay, which Springer considered to be more appropriate [than the main analyses], demonstrated a significant difference only with respect to the gross motor subdomain. Importantly, this does not detract from the differences on the continuous fine motor and personal-social scores. Gross motor skills develop at an earlier age than fine motor and personal-social skills; therefore greater variability is expected in this domain at this age. Finally, decrements in gross motor skills are one of the earliest indicators of risk for autism spectrum disorder. Given the lack of autism spectrum disorder screening and diagnostic tools at this early age, decrements in gross motor skills may be particularly meaningful."
Reading this passage, I felt aggravated all over again.
Regarding the first sentence, it's not just "Springer" who considers this analysis more appropriate. The makers of the ASQ-3, and anyone else with expertise in assessment, would feel this way.
The second sentence is plain ludicrous. It's true that group differences in proportions of delayed infants don’t "detract from" the differences found for the continuous scores. This is because, as I've said, the ASQ-3 shouldn't have been used to compare the continuous scores in the first place. In effect, what the researchers are saying here is that doing things the right way and finding results that don't support your conclusion “does not detract” from doing things the wrong way and finding results that do support your conclusion. (Indeed.)
As for the reference to greater variability in gross motor skills, the researchers were implying that no group differences were found for the other four domains because we don't see much variability from baby to baby in those domains. If that were true, they could've noted that their data indeed showed more variability in gross motor skill scores. It's suspicious that they didn't mention this. In any case, what they were implying here is simply wrong, given the content of the ASQ-3. For example, one of the gross motor questions is "Does your baby roll from his back to his tummy, getting both arms out from under him?" One of the communication questions is "When a loud noise occurs, does your baby turn to see where the sound came from?" (For all questions, the answer options are Yes, Sometimes, or Not Yet.) It's clear from developmental research that at 6 months of age, there won't be much variability in responses to either question. Because the ASQ-3 is a screener, and somewhat skewed to the lower end of developmental expectations, most babies who aren't delayed are going to get "Yes" or "Sometimes" responses to these questions.
Finally, the way the researchers linked gross motor differences to autism spectrum disorder is reprehensible. I realize that sounds harsh, but consider: Using a tiny sample, they examined five ASQ-3 domains and found significant group differences for one of them. This could reflect what's known as a Type 1 error (the more analyses you run, the greater the chances that one will turn up significant by chance, not because there's an actual difference). More importantly, interpreting the gross motor differences now, after the study is completed, is not how significance testing was intended to work. You're supposed to test hypotheses that you develop in advance, based on current research and theory. You're not supposed to run a bunch of analyses, and then, when one turns up significant, create a story that accounts for it. If you decide to speculate anyway, you're not supposed to trot out an incredibly alarmist story (the pandemic is increasing the risk of autism) when simpler ones are conceivable, and when your stats are flawed to begin with!
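The multiple-comparisons arithmetic behind the Type 1 error point is easy to verify directly. With five independent tests each run at a .05 significance level, the chance that at least one comes out "significant" purely by chance is well above .05:

```python
alpha = 0.05
k = 5  # the five ASQ-3 domains tested

# Probability of at least one false positive across k independent tests:
familywise = 1 - (1 - alpha) ** k   # about 0.226, i.e. roughly 23%

# The standard Bonferroni remedy: test each domain at alpha / k instead
bonferroni_alpha = alpha / k        # 0.01 per test
```

A roughly one-in-four chance of a spurious "significant" result is exactly why a single positive finding out of five untargeted tests calls for caution, not an autism narrative.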
Ok, I'll take a breath before continuing.
3. Regarding my concern about sample size, here's the researchers' final paragraph:
"We encourage readers to interpret our preliminary report as one would a screener—there is an early indication of potential developmental differences in infants born during the COVID-19 pandemic that should be carefully monitored. Given the small sample size, it is critical not only to replicate our findings, but also to expand on them using objective assessments of neurodevelopment, which we are actively pursuing as part of the longitudinal COVID-19 Mother Baby Outcomes (COMBO). While acknowledging the limitations of our analysis, we have a responsibility to disseminate the preliminary results of our study in an effort to shed light on how the COVID-19 pandemic may be impacting infants born during this unprecedented time."
As you might imagine, I found this passage semi-acceptable, because the researchers acknowledge the limitations I described. After all, the study is already published, so I wouldn't have expected them to just say: Oh, Springer was right, we take it all back. Even so, I would argue that as prominent researchers at an Ivy League institution, they have a responsibility to not disseminate preliminary results when the conclusions are so broad and scary ("the potential for a significant public health crisis") and grounded in such limited data. (They didn't even refer to the study as "preliminary", or something of that sort, in the original article.)
Final comment
Scientists are the first to admit that they're just as conventional as the rest of us. That is, they often think and do certain things simply because that's what others in their field think and do.
In some ways this is a good thing. Scientists don't have to reinvent the wheel (literally), any more than they need to depart from conventional ways of identifying atomic structure, analyzing biopsies, or running statistics. In short, relying on existing knowledge and techniques facilitates progress in science.
At the same time, scientific conformity impedes progress when mistakes are perpetuated. Consider geocentrism, the view that the sun and other planetary bodies revolve around the earth. This wasn't some quaint misconception that pre-modern astronomers held because they didn't have enough data. Some of the greatest thinkers of antiquity (Aristotle, Ptolemy, etc.) developed geocentric models that were mathematically complex and yielded accurate predictions about planetary motion. Scientists clung to these models for millennia, in spite of periodic dissent from respected members of their scientific communities.
I thought of this when I saw that the researchers' first response to my commentary was to mention that prominent organizations recommend the ASQ-3. As I said, it's primarily recommended as a screener, not as a comparative measure, but even so: Should we be doing something simply because others do it? It's not so bad if you walk around thinking that the sun revolves around the earth – after all, it looks like it does – but in many other cases, we're probably grateful that scientists broke with conventional wisdom in their field. I mean, when I visit my doctor, I'm delighted to know he's not going to apply leeches.
Sadly, the ASQ-3 is one of many screeners that are used inappropriately for individual and group comparisons, even though the instruction manuals for these screeners often warn against doing so. I've seen this mistake in other biomedical studies, for example. Presumably what happens is that data becomes available to medical researchers from routine screening procedures, and the researchers then try to glean something useful from the data. I have a hunch – no evidence for this, just a hunch – that the JAMA Pediatrics editor invited my formal commentary because he recognizes that misuse of screeners occasionally does occur in his field.
So, once again, if you're a new parent, or considering becoming one, or just a concerned citizen, there's no evidence that getting pregnant during the pandemic puts the infant at risk of developmental delay. Be fruitful and multiply, if that's what you want to do!
Appendix 1: JAMA Pediatrics commentary
Here is my entire commentary as it appeared in JAMA Pediatrics on Monday:
A recent JAMA Pediatrics cohort study1 showed that birth during the COVID-19 pandemic is associated with lower neurodevelopmental functioning at 6 months, even in the absence of maternal SARS-CoV-2 infection. However, methodological limitations prevent the data from being interpretable.
First, the only measure of neurodevelopmental functioning was the Ages & Stages Questionnaire (ASQ-3). Data analyses focused on ASQ-3 mean score differences between pre-pandemic and pandemic cohorts. However, the ASQ-3 is merely a screener used to identify risk for developmental delay. Screeners are not suitable for comparing individual performance within the general population. When used to identify cut-offs for at-risk status, the ASQ-3 has good reliability and validity2; psychometric data on its use as a comparative measure are not provided, because such usage would be inappropriate (for many reasons, including the fact that item selection for each age group is restricted to items representing middle to low levels of development for that group2). In sum, the main findings are uninterpretable owing to misuse of a screener.
The authors did report one analysis that treated ASQ-3 data appropriately. This analysis showed that compared to the pre-pandemic cohort, proportionally more infants in the pandemic cohort exceeded cut-offs for delay with respect to gross motor skills. No cohort differences were observed for the other four dimensions of the ASQ-3. In other words, even the appropriate analysis fails to support the authors' general summary of cohort differences. Moreover, sample size was insufficient for conducting this analysis in the first place, which renders even these results uninterpretable.
The sample was not only small from the perspective of statistical power. Data from only 62 pre-pandemic infants served as the standard for evaluating development among infants born during the pandemic. This is questionable, scientifically speaking, given the many known influences on neurodevelopment that could not have been controlled for.
Reports on this study have appeared in national news media and other venues, and such reports have spurred more than one lay person to approach me with concerns. Although the authors claim that their data suggest "the potential for a significant public health crisis", I believe the data should not be taken as suggesting anything, owing to the methodological limitations noted here.
Appendix 2: A technical note
For those of you with some training in stats, the researchers also noted in their reply to my commentary that "studies examining the latent factor structure of the ASQ-3 using continuous subdomain scores demonstrate they measure specific developmental constructs."
Of course, my first response is: We can't trust these studies, because it's not appropriate to analyze continuous scores in this way. As noted in the ASQ-3 manual, item selection for each age group "was restricted by allowing only items that targeted a skill that occurred at the middle to low end of the developmental range". In short, the ASQ-3 cannot have adequate construct validity if continuous scores are used, because these scores have an artificially restricted range. To put it concretely, some relatively high-scoring babies would score even higher, to a greater or lesser extent, if ASQ-3 questions had not been selected so that it functions as a screener.
Even if analyses of latent factor structure reveal that each subdomain of the ASQ-3 "measure specific developmental constructs", this only means that each subdomain is independent from the rest. Gross motor skills, for instance, develop independently from communication skills – progress in one isn't correlated with progress in the other. Statistically speaking, this can be demonstrated even with restricted-range data such as ASQ-3 continuous scores. To illustrate, imagine an IQ test that's too easy for the most intelligent test-takers, so that those who obtain the highest 10% of scores would actually show a much greater range of scores if the test had been more challenging. Analyses of latent factor structure might identify the usual factors (e.g., verbal reasoning, mathematical reasoning, etc.) and yet the test would still be unsuitable for the general population, because it doesn't accurately represent scores at the upper end of the distribution.
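The range-restriction point above can be demonstrated with a small simulation. Below, a latent "true skill" score is generated, and then an observed score is created by imposing a ceiling, mimicking a screener whose items top out at the middle-to-low end of the developmental range (the specific numbers are arbitrary, chosen only for illustration):

```python
import random
random.seed(0)

N = 10_000
MEAN, SD = 50, 10
CEILING = 55  # arbitrary: the instrument can't register skill above this level

# Latent "true skill" scores, normally distributed
true_scores = [random.gauss(MEAN, SD) for _ in range(N)]

# Screener-like observed scores: everything above the ceiling is
# recorded at the ceiling, compressing the top of the distribution
observed = [min(s, CEILING) for s in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# The ceiling shrinks the variance substantially (to roughly half, here)
shrinkage = variance(observed) / variance(true_scores)
```

This is the "artificially restricted range" in miniature: the high scorers are all squashed into the ceiling, so differences among them vanish from the data, which is precisely why continuous scores from such an instrument can't validly rank the general population.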