Pandemic babies
If you want to feel scared about the pandemic, statistics can help you. If you want to learn something, stats can help with that too.
For example, this week I learned how to distinguish between the terms used for different shades of red (scarlet, crimson, carmine, maroon, etc.). Why did I learn this? Because every day, the nytimes.com front page shows a map of COVID-19 hot spots in the U.S., and every day, as case rates have increased, the states on this map have turned darker shades of red. Most of the country is now somewhere between merlot and mahogany.
This week I also read about a study, published January 4th, which claims to show that babies born during the pandemic are developmentally delayed at 6 months, even if their mothers weren't exposed to COVID-19. That's quite scary – for parents, for all of us. I decided to take a closer look at the study. What I found is a team of Columbia University researchers, an eminent journal (JAMA Pediatrics), and yet, in spite of that pedigree, a statistical approach so flawed there's no reason to be alarmed by the results. However, the study was picked up by a variety of news media, and I imagine the results have been creating some anxiety among parents, or parents-to-be.
I found this aggravating. In fact, I was so aggravated, I did something I've never done before: I e-mailed the journal editor to air my concerns. Yesterday, the editor asked me to turn what I wrote into a formal "letter to the editor", which will be published in the journal along with the authors' reply. In a later newsletter I'll post what I wrote and how the authors responded. Here I'll describe the study and its statistical flaws, and why it needn't be added to our list of things to worry about this year.
Rationale for the study
Prior research has shown that viral infections during pregnancy cause developmental delays. Even when the babies themselves are uninfected, changes in maternal immune system functioning adversely affect the prenatal environment. And so, Dr. Dani Dumitriu and colleagues at Columbia asked a simple question: If a woman gets COVID-19 during pregnancy, will her baby show developmental delays?
Methods
Sadly, the pandemic made it easy for the researchers to address this question. They sampled 114 women who tested positive for COVID-19 during pregnancy, as well as 141 women whose serological data showed no infection during pregnancy. At 6 months of age, the babies of all 255 mothers were individually tested for developmental status using a measure called the Ages and Stages Questionnaire (ASQ-3), in which each mother answers simple questions about her baby's communication, gross and fine motor skills, problem solving, and personal-social development.
Findings
Happily, the researchers found no group differences. Having a COVID-19 infection during pregnancy did not impact development at 6 months. However, there was a surprising twist. The researchers also had access to data from 62 women who gave birth prior to the pandemic (between November 2017 and January 2020) and who had completed the ASQ-3 for their 6-month-olds. Compared to these 62 pre-pandemic babies, both groups of pandemic babies showed slightly lower ASQ-3 scores. (These were mean scores, not the percentages of babies identified as "delayed".)
In short, the researchers concluded that babies born during the pandemic show developmental delays at 6 months. Even if their mothers never had COVID-19.
Interpretation
The findings are shocking. The researchers referred to their study as suggesting "the potential for a significant public health crisis" that would affect hundreds of millions of children.
The findings are also mysterious. Why would the mere fact of being pregnant during the pandemic cause babies to be developmentally delayed? The researchers' hypothesis is that owing to the pandemic, pregnant women experience more stress than they did in pre-pandemic years.
Stress during pregnancy is known to adversely affect development, and people (generally speaking) have become more stressed during the pandemic, so on the face of it, the stress hypothesis makes sense. I don't think there's anything wrong with the logic. My concern is that the researchers didn't actually find what they claim to have found. Unfortunately, their findings have already been picked up by national media on both ends of the political spectrum (e.g., NBC and Fox), as well as by local media and high-circulation outlets for folks with some expertise (e.g., Science, The BMJ, MedPage Today). So we have a situation in which flawed, scary results are being publicly disseminated. Hence my message to the editor. And hence, I suspect, his desire to publish it, because JAMA journals are devoted to public health issues, and there's nothing particularly idiosyncratic about my concerns. Here they are:
Limited sampling
From a purely statistical perspective, simple comparisons between 62 pre-pandemic babies and 255 pandemic babies may be acceptable; from a scientific perspective, the sample is way too small to justify the conclusion that pandemic babies are delayed.
You can't draw big conclusions from small samples. In this case, because so many variables influence a baby's development, you can't be sure that the two groups (pre-pandemic vs. pandemic) were alike in every respect other than when the babies were born.
Small size exacerbates other sampling-related limitations. For example, only about a third of the pandemic mothers who received an invitation for the study actually agreed to participate. Who is this subset of mothers? Are they unusually sensitive to less-than-ideal behavior from their babies, because they're worried their babies have been exposed to COVID-19? Are they rating their babies less favorably on the ASQ-3 than pre-pandemic mothers did, because they wouldn't want researchers to miss any signs of delay? Speculative questions like this (which the researchers don't address) lead me to a second problem:
Questionable interpretation
Differences between the pre-pandemic and pandemic groups may have nothing to do with developmental delay. For example, the ASQ-3 is completed by each baby's mother. Studies show that the pandemic has increased anxiety, depression, and pessimism among the general public. So pandemic-era mothers might rate their babies less favorably than they otherwise would, simply because their general outlook is gloomier now.
I'm not presenting this as a plausible interpretation of the results. I'm only saying it can't be ruled out. Small sample size exacerbates the problem, as do some of the specific ASQ-3 items. For example: "Does your baby act differently toward strangers than he does with you and other familiar people?" The response options are "yes", "sometimes", and "not yet". During the pandemic, babies may receive a less favorable rating (e.g., "not yet") simply because they've had less exposure to strangers, not because they're delayed compared to pre-pandemic babies.
Misuse of a screener
The most egregious flaw in this study is the researchers' use of the ASQ-3. Specifically, their reliance on mean ASQ-3 scores. (Raw scores can range from 0 to 60 for each of the five domains measured – i.e., communication, gross motor skills, fine motor skills, problem solving, and personal-social development).
The ASQ-3 is a screener, and prior studies show that it works well in that capacity. However, screeners are not assessments. You can't use a screener in place of an assessment. Everyone knows that. (By "everyone" I mean experts in stats, assessment, and methodology, as well as clinicians, educators – and most researchers, but evidently not all of them.) Bear with me for a few paragraphs and the problem here will become clearer.
A screener is a tool used to identify individuals at risk for some adverse outcome. We have screeners for a variety of problems, ranging from cancer to depression to developmental delay, and screening is a routine prerequisite for entry to institutions ranging from the military to public school.
Screeners are quick, easy to use, and good at exactly one thing: Identifying individuals who are at risk of some problem (or already experiencing the problem). They're not substitutes for thorough assessments.
Screeners come with cut-off scores. Anyone who exceeds a cut-off score will need further attention, usually beginning with a more thorough assessment. Even if the possible range of scores on a screener is large, only the scores close to or exceeding the cut-off are meaningful. Screeners aren't appropriate for comparing other scores. In fact, screeners like the ASQ-3 don't even allow for a full range of possible scores. As noted in the ASQ-3 technical report, item selection "was restricted by allowing only items that targeted a skill that occurred at the middle to low end of the developmental range" for each age group.
In sum, the ASQ-3 shouldn't be used to compare babies who score better than cut-off values for possible delays, because the measure isn't sensitive enough to distinguish among those babies.
Here's an analogy: Imagine that I give some adults a test of reading comprehension designed for 6th graders. Anyone who has dyslexia will obtain a low score on this test. However, if your reading skills fall within the normal range for adults, your score will be high, and small differences in scores between you and another adult won't reflect differences in reading skills but rather in other factors (e.g., how bored you got with the test). This test, in effect, works as a screener for dyslexia among adults.
Now let's extend the analogy. Imagine that I give this 6th grade reading test to two groups of non-dyslexic adults, and I find that group A performs slightly more poorly than group B. Can I conclude that the members of group A are poorer readers? No. The test isn't sensitive enough to permit that inference. I have to look for some other cause for the difference. Perhaps group A consisted of a few more people who were bored or preoccupied while taking the test. Perhaps the room where group A took the test was colder and noisier. There are lots of possibilities. All we know for sure is that there's no basis for concluding that, on average, group A consists of poorer readers.
You can see now why the researchers' use of ASQ-3 score means was wrong. Since most babies aren’t delayed, this measure isn’t a suitable basis for comparing them. We can't be sure exactly why pandemic babies had slightly lower ASQ-3 means (although earlier I suggested some plausible reasons, such as greater anxiety or pessimism among new mothers during the pandemic). All we can know for sure is that inappropriate use of a screener ensures uninterpretable data. Garbage in, garbage out.
What should've been done?
Since the ASQ-3 was the only measure used to identify developmental delay, only one type of analysis would've been appropriate: A comparison between pre-pandemic and pandemic groups in the proportions of babies who exceeded cut-off scores (i.e., showed signs of delay).
Running the stats that way would've been informative, though the results would’ve still only been suggestive, because a screener is not an assessment (I know...I keep saying that). In most instances, a screener merely suggests a problem and calls for more thorough assessment.
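For readers who like to see the mechanics: a comparison of the proportions flagged by a cutoff is typically done with a two-proportion z-test (or, for small counts, Fisher's exact test). Here's a minimal sketch using only the Python standard library. The counts below are hypothetical, invented for illustration; they are not the study's data, and the function name is my own.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two independent proportions.

    x1/n1 and x2/n2 are the observed counts and group sizes.
    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Standard normal CDF via math.erf; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts (NOT from the study): babies flagged by an ASQ-3 cutoff
z, p = two_proportion_z_test(x1=8, n1=62,     # pre-pandemic: 8 of 62 flagged
                             x2=45, n2=255)   # pandemic: 45 of 255 flagged
print(f"z = {z:.2f}, p = {p:.3f}")
```

The point isn't the particular numbers; it's that the question "did more pandemic babies exceed the cutoff?" is a question about proportions, which is the only kind of question a screener can answer.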
Here's the punch line. Buried deep among all the irrelevant statistical comparisons of ASQ-3 score means, there's a sentence – one sentence – in which the researchers report the right kinds of stats. Here it is:
"Although not specifically powered to detect differences in the proportion of infants who met screening cutoffs for delay, a greater proportion of infants in the pandemic cohort met the gross motor cutoff ".
Let's break this down. "Powered" is a technical term. What it means here, in plain English, is that the sample wasn't large enough to answer the right question (i.e., did significantly more pandemic babies exceed ASQ-3 cut-offs for delay?). This analysis shouldn't have been run in the first place. Even if we could trust it, what it tells us (and this is reiterated in one of the researchers' tables) is that gross motor skills is the only domain where pre-pandemic and pandemic babies differed. For the other four domains (communication, fine motor skills, problem-solving, personal-social development) there were no significant group differences in proportions of babies who showed signs of delay. No differences, in four out of the five domains tested. Based on an analysis that shouldn’t have been run in the first place, because the sample was too small….
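To make "not powered" concrete, here's a rough power calculation for a two-proportion comparison, using Cohen's arcsine effect size and only the Python standard library. The rates below are hypothetical, chosen for illustration only (they are not the study's flagging rates), and the function name is mine.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test,
    using Cohen's arcsine effect size h."""
    h = abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    scale = math.sqrt(n1 * n2 / (n1 + n2))          # effective sample size term
    return NormalDist().cdf(h * scale - z_alpha)

# Hypothetical flagging rates (NOT from the study): 6% pre-pandemic vs 12% pandemic,
# with the study's group sizes of 62 and 255
print(power_two_proportions(0.06, 0.12, n1=62, n2=255))
```

With groups of 62 and 255, even a doubling of a modest flagging rate yields power well below the conventional 80% target, which is what "not specifically powered to detect differences" means in practice: a null result from such an analysis is close to uninformative.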
Summary
If your baby is born during the pandemic, you might worry about the risk of COVID-19 infection. You might worry about the relative lack of social stimulation. You might worry about supply-chain disruptions limiting access to baby products. But you needn't worry about developmental delay. The study I discussed here is the first of its kind, and there's zero evidence of increased risk. Zero.
Final words
When I read studies published by Ivy League scholars in prestigious journals, I tend to nod in appreciation of their brilliance and close my screen feeling that I've learned something. I may quibble about minor details, but I don't question the main findings. At the same time, some of these studies are deeply flawed. Flawed in ways that anyone with statistical expertise would recognize. I talk about some of these studies in this newsletter; there are others out there. Keep this in mind next time you read about some finding that sounds scary. Even Michael Jordan missed sometimes. So does LeBron. And sometimes, as in the case of this particular study, what you see is not just a miss. It's an air ball.
This newsletter is dedicated to my granddaughter Sylvie, born during the pandemic, and to my daughter Cecilia, who was born during a different pandemic (we called it MTV) and is working tirelessly and enthusiastically now on behalf of Sylvie's development.