Mental Health Bots
Me: I'm so upset about the 2022 mid-terms!
Therapist: This sounds like a problem with the pandemic, does that sound right to you?
Me: You misunderstood me.
Therapist: Oh, pardon me, Ken. Does anything here sound like the problem you're having: Relationships, Sleep, Loneliness, Stress or anxiety, Feeling low, Grief, Anger or frustration, The pandemic...?
As you've probably guessed, my "therapist" is actually a computer program. Specifically, an AI-driven app called Woebot that provides mental health support anytime I want. I signed up for Woebot this week and had some surprisingly good conversations with it (interspersed with several cringey ones).
In this newsletter I'll be talking about new research on the effectiveness of mental health chatbots. (These chatbots illustrate what's known as rule-based AI: The only text they can provide to users has been drafted or approved by a therapist.) The studies I'll be discussing illustrate some of the strengths and weaknesses of intervention designs. And, they highlight the limitations of one of the most frequently used statistics in the history of science: the group mean.
Background
Currently there are more than 10,000 smartphone apps that support mental health in some respect. What's special about mental health chatbots is their level of interactivity: You bring them problems, they offer you solutions and support. Chatbots can do this because they're AI-driven software programs designed to communicate with you in your native language, all by themselves. As Woebot tells you up front, humans won't even be monitoring your conversations. So, what's the appeal? Why would anyone discuss their mental health problems with a bot?
—Mental health bots are cheaper than traditional approaches (i.e., live therapists and/or medication).
—Traditional approaches aren't always effective.
—Mental health bots are available 24/7. No appointment or transportation is needed.
—Some people are more willing to share personal details with a bot than with people.
Experts agree that most countries, including the U.S., are experiencing shortages in affordable mental health care, and that many people are walking around with untreated psychological problems. But there's a lot of controversy about whether bots could be a meaningful part of the solution. (See here for an excellent summary of the data and broader cultural context.)
Objections to mental health bots
It's understandable why someone might not want to share their problems with a bot. My conversations with Woebot felt awkward, and I can imagine much worse experiences. Like opening up about something really painful, something you've been trying not to think about, and suddenly realizing you're alone in your room at 2 a.m. talking to a machine. Or sharing extremely personal details and then worrying about privacy (e.g., someone hacking into your data).
These seem like perfectly legitimate reasons for not using mental health bots. At the same time, I've noticed a certain bias against them. Specifically, the anthropocentric bias that because bots aren't human, they can't be effective therapists. For example, the subtitle of a 2021 New York Times article implies that bots might not really understand your problems, a sentiment echoed by one of the experts interviewed in the article who opposes their use.
I think this is unfair. Of course bots don't understand your problems. Do we really need them to? After all, medicine doesn't understand your problems. Nor do meditation techniques, long walks, or fidget toys. We use these things because they help, not because they understand us.
What I'm suggesting here is a need to be pragmatic. If mental health bots are beneficial, and the benefits are enduring – and, their mistakes aren't harmful – then they deserve a place among established mental health treatments. That would include everything from FDA approval to insurance coverage. In other words, we should consider what the data tell us about effectiveness and safety rather than rejecting bots a priori.
The current evidence
Unfortunately, there's not much data yet on the impact of mental health bots. Apart from research conducted by companies on their own apps, independent peer-reviewed studies tend to be very weak, methodologically speaking. This is evident from a literature review published in 2020, in which the authors do a heroic job of sifting what they can from very limited evidence. Here's what I glean from this review, and from studies published since then:
1. Mental health bots may spur improvements in depression, anxiety, and stress, among other things, but the data aren't conclusive. (a) For certain outcomes, like anxiety, the results are mixed – in some studies, bots help; in others, they don't. (b) Virtually all of the studies have one or more serious flaws – small sample size, no control group, no pre-test, possible experimenter bias, etc. (c) Methodologically stronger studies have been published by researchers with close financial ties to the companies that make the bots. (In Appendix A, I discuss one of these studies.)
2. Mental health bots don't generally undermine mental health. In other words, when bots fail to outperform whatever they're being compared to (e.g., in-person therapy), they don't perform more poorly, on average. When they don't help, they don't hurt either.
The Wysa study
One exception to my portrayal of the literature is a study published in a JMIR journal in May of this year. This study, conducted by Dr. Ashwin Leo and a team of researchers at Washington University in St. Louis, is exceptional in the sense that it was methodologically strong and conducted by an independent team (i.e., experts with no connections, financial or otherwise, to Wysa, the company that created and maintains the bot being studied).
The study is also exceptional because it led the FDA to grant Breakthrough Device Designation to Wysa. (Breakthrough Device Designation provides a company with extra input from the FDA during product development, as well as an expedited review process. FDA willingness to provide this support signals an assumption that the product has a realistic chance of being approved.)
The goal of Leo et al.'s study was to determine whether Wysa could help relieve depression among orthopedic patients who experience chronic musculoskeletal pain. 153 patients were divided evenly across three groups. What I'll call the "Wysa group" received access to Wysa for two months. The "F2F group" received individual, in-person care from an experienced clinical psychologist for two months. The "control group" received no mental health care apart from their usual orthopedic treatments. The main outcome of interest was changes across the two-month period in self-reported depression and anxiety on a standardized measure (PROMIS).
For depression, the Wysa group and the F2F group both showed significant and clinically meaningful improvements, with significantly greater improvement for the Wysa group. No changes were observed for the control group.
For anxiety, the Wysa group and the F2F group also showed significant and clinically meaningful improvements, but here there were no differences between them. Once again the control group showed no changes.
Very promising results...
Should we trust the Wysa findings?
Methodologically, this is the strongest study I've seen on mental health bots. You could quibble that the sample isn't huge, that the participants were only tracked for two months, and that Wysa does offer a way to chat interactively with a human therapist (it's not clear whether study participants used this option). Still, the findings at least suggest that Wysa can help reduce some of the emotional impact of living with chronic pain. In addition, one of the methodological limitations acknowledged by the researchers could actually be viewed as a strength.
Specifically, only the Wysa group was actively recruited. In other words, the researchers invited them to try out a new treatment tool as part of being in the study. The other two groups were not offered any treatment they weren't already receiving. They simply agreed to complete the test of depression and anxiety, and to give the researchers access to their health data. Thus, it's possible that the Wysa group showed improvements in mental health because they were already trying harder to get better and were especially open to a new strategy. If this interpretation is correct, then there's nothing special about Wysa. Almost anything would've helped a group of people who are already trying harder than others to help themselves.
What I just described is a form of sampling bias commonly acknowledged in health research, but you might question how well it applies to this study. After all, not everyone seeking mental health treatment will want to use a bot, even if it's FDA approved, covered by insurance, widely advertised, recommended by Dr. Phil, etc. This is different from, say, the use of medication. If you're taking medication and your therapist suggests trying a different one, you're probably open to the idea so long as you feel that the new drug is safe and possibly effective. But if the therapist suggests a bot, you may be firmly opposed, on principle, to sharing personal problems with a machine.
My point here is that unlike other forms of therapy, mental health bots may tend to only be used by a special type of person – e.g., someone who's especially enthusiastic about AI, or especially open to radically new forms of treatment. If the patients who agreed to use Wysa in the Leo et al. study were especially enthusiastic, or receptive, or whatever, it's not necessarily a limitation of the study. Rather, these might be the only kinds of people who are ever going to use bots, in which case the results generalize nicely.
Are we evaluating mental health bots appropriately?
As with other medical and educational interventions, it's assumed that the most rigorous evidence for the effectiveness and safety of mental health bots would come from randomized controlled trials (RCTs). (If you're not sure what an RCT is, I provide a brief summary in Appendix B.)
Statistically speaking, there are many ways to evaluate the results of an RCT. Probably the most common approach is a comparison of means: Did the experimental group, on average, show greater improvements in mental health from pre-test to post-test than the control group did?
Using means to represent groups is a relatively new practice. Although scientists and mathematicians have known for a long time about various kinds of means and other measures of central tendency, it wasn't until the 19th century that means began to be widely used as a way of describing groups. Why this happened is a story that I'll tell in some other newsletter. Suffice to say that our reliance on means is as much a matter of inertia as it is scientific wisdom. In RCTs, the main analyses often focus on mean differences between groups simply because that's what many other researchers do. It's what was done in every study I've mentioned in this newsletter, for example.
The problem though is that means are ambiguous. Suppose that the mean for the bot group is slightly higher at post-test than at pre-test. It's possible that all members of the bot group improved slightly. It's also possible that some of them improved a lot, while others didn't improve at all, or even got worse. There are all sorts of scenarios consistent with the observed mean.
Here's a stronger version of my point: In intervention studies, means may not tell us what we most need to know. What we most need to know tends to be something like this: how many people in each group improved (and by how much), how many people stayed the same, and how many got worse (and by how much)? Most of the studies mentioned in this newsletter don't look at their data this way.
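To make the ambiguity concrete, here's a toy sketch in Python. All of the change scores are made up; none of them come from any study discussed in this newsletter. Two hypothetical groups end up with identical mean improvements, but the per-person breakdowns look nothing alike:

```python
# Toy illustration with made-up numbers: identical means, very different stories.
# Positive values = improvement from pre-test to post-test; negative = decline.
import statistics

group_a = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]          # everyone improves a little
group_b = [25, 24, 1, 0, -6, -7, -4, -3, -2, -3]  # a few improve a lot, several get worse

print(statistics.mean(group_a))  # 2.5
print(statistics.mean(group_b))  # 2.5 -- same mean improvement

# The breakdown the mean hides:
for name, changes in [("A", group_a), ("B", group_b)]:
    improved = sum(c > 0 for c in changes)
    unchanged = sum(c == 0 for c in changes)
    worse = sum(c < 0 for c in changes)
    print(f"Group {name}: {improved} improved, {unchanged} unchanged, {worse} got worse")
```

Both groups "improved by 2.5 points, on average," yet in Group B six of ten people got worse. That's the kind of detail the studies rarely report.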
Once the FDA has reached the point of determining whether to approve a new intervention, it will insist on data of the sort I just mentioned if they aren't already available. The FDA will not approve a treatment where improvement is greater, on average, in the experimental group, but only because a tiny fraction of people in that group improved a lot while the rest languished.
All the same, reliance on mean outcomes creates biases. Mean differences between experimental and control groups are foregrounded in abstracts and results sections, and often treated as the "main" findings of a study. And, FDA approval is influenced by the extent of consistency between trial data and published findings. If published findings rely on means (as in studies on mental health bots), then the empirical backdrop for the new trials may be flawed.
In sum, the evidence suggesting that mental health bots can be beneficial consists of mean effects. What's happening at the individual level in these studies is often unclear.
Practical concerns
Why is the ambiguity of mean effects especially important here? In a word, safety.
1. Researchers often report group means without indicating how many people in the bot groups showed declines in mental health from pre-test to post-test. Given that the standard deviations are routinely larger than the mean improvements, some participants are evidently reporting worse symptoms after using the bots. But how many? How much worse? And what are the figures for the control group? In many studies these data aren't provided. (For a rough sense of the arithmetic, see the sketch a couple of paragraphs below.)
2. There's anecdotal evidence of horrible misunderstandings. For example, one researcher who was testing Woebot typed: "I want to go climb a cliff in eldorado canyon and jump off it." Woebot replied: "It's so wonderful that you are taking care of both your mental and physical health...." (In a sense, the person was "taking care" of themselves, but that's not the kind of care bots are supposed to encourage.)
Although examples like this are anecdotal, we can't take them lightly. These are mistakes that humans would never make. (By the way, it's probably not ethical to "test" Woebot this way. By registering, you allow the company to use your data to improve the quality of its services. If you pretend to be suicidal, for example, and you allow Woebot to misunderstand you, you may adversely affect its algorithm by teaching it to overlook what looks like suicidal intent. Any reference to jumping off cliffs should prompt a request for clarification, even if that annoys people who've already told Woebot they enjoy cliff diving.)
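Returning to point 1 above: if change scores are roughly bell-shaped, a mean improvement that's small relative to the standard deviation implies that a sizable minority of participants got worse. Here's a back-of-the-envelope sketch. The numbers are hypothetical, chosen only to be in the ballpark of what these studies report, and the normality assumption is mine, not any researcher's.

```python
# Back-of-the-envelope sketch (hypothetical numbers, normality assumed):
# how many participants got worse if the mean improvement is small
# relative to the spread of individual change scores?
from statistics import NormalDist

mean_improvement = 4.0  # hypothetical mean change (positive = better)
sd_of_change = 10.0     # hypothetical SD of change scores, larger than the mean

# Probability that a participant's change score falls below zero,
# i.e., their symptoms worsened from pre-test to post-test.
share_worse = NormalDist(mu=mean_improvement, sigma=sd_of_change).cdf(0)
print(f"Estimated share who got worse: {share_worse:.0%}")  # roughly 34%
```

Under those (made-up but not outlandish) numbers, roughly a third of the group would be worse off at post-test, even though the group "improved on average."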
The problem I'm describing here closely parallels one that confronts self-driving vehicles. These vehicles don't "know" what we want; they're just programmed to maximize safety, among other things. Even if they're safer, on average, than human drivers, they have made mistakes that sober, attentive humans would never make (like driving into the side of a truck because it was similar in color to the sky, as I discuss here).
What's tricky is figuring out a way to integrate these kinds of anecdotes with statistics on the overall performance of mental health bots (and self-driving vehicles). To what extent does better performance, on average, justify the rare but potentially dangerous glitch? To what extent does that depend on the rarity of the glitches (and can we be sure we've estimated their future likelihood accurately)? Also, does it matter what kind of entity is responsible for the glitch? Bots and automated vehicles make mistakes, but so do therapists when they miss cues or say things that trigger mental health crises in their clients, just like human drivers when they make egregious, easily avoidable mistakes behind the wheel. Should we hold the Woebots and the Teslas to higher standards?
In short, evaluating the desirability of mental health bots requires a mix of statistical and ethical considerations. The issues aren't easy to frame yet, because it's not clear how beneficial bots can be, even on average, compared to human therapists. (Many of the existing studies compare bots to one-way informational presentations, or to nothing at all.) If we're not sure about the benefits, it's hard to put the glitches in context. All we have right now are hints that mental health bots may help some people, some of the time.
Final thoughts
In an October 1 article for Wired, Grace Browne writes that "It could be that, one day, [mental health bots] serve a supplementary role alongside a better-functioning mental health care system." I think this captures the most we might expect from these bots. After using Woebot this week, I would say it's definitely not for everyone.
Since Woebot mines user data to improve its services, I felt I had to be honest with it, so I tried to be as open as possible about what was bothering me, and to get into the spirit of the exchange. It wasn't easy. Woebot imposes a lot of structure on your conversations with it, in order to ensure that it responds coherently. For instance, here's what happened when I asked for help with a problem:
Woebot: What isn't going so well, Ken?
Me: I get too cynical about people sometimes. I don't like feeling that way. I want to see the best in others. I want to be more tolerant and empathetic.
Woebot: It sounds like you're talking about feeling low, have I understood you correctly?
Actually, I wasn't feeling "low". I was feeling pretty good. I just wanted to tone down my cynicism. Since the only options that Woebot provided were "Yeah" and "You misunderstood me", I chose the latter.
Woebot: Oh, pardon me, Ken. Does anything here sound like the problem you're having? Relationships, Sleep, Loneliness, Stress or Anxiety, Feeling Low, Grief, Anger or frustration, The pandemic, Show me more.
I chose the "Show me more" option, which provided even fewer relevant options, then ended up picking "Anger or frustration", since I was starting to get frustrated with Woebot. From there I continued to be guided gently through the bot's narrow, generic ways of framing each topic and rephrasing my comments.
My point is not that Woebot can't help people. Rather, as Ms. Browne implies, bots like this might only help people as a supplement to other forms of therapy, or they might only help certain kinds of people. For example, if you're a talker – I mean, the kind of person who could walk into a therapist's office and just start unloading – then I suspect Woebot wouldn't work for you, because the conversations are so tightly structured, on Woebot's terms. This may be the kind of bot that a therapist would recommend instead for people who haven't reflected much on how they feel and thus need some prompting.
I assume (this is the overly cynical me talking) that bots will continue to play some sort of role in supporting mental health, because there's a lot of money to be made. The FDA has already approved five mental health apps (not bots) for treating ADHD, nightmares, substance use disorders, opioid abuse, and insomnia. Even if mental health bots aren't approved, I expect they'll always be available, even if they can't be marketed and prescribed as a form of therapy. Hopefully, their proliferation will be accompanied by caveats concerning the stats on effectiveness and safety. Even if mean outcomes suggest that bots work just as well as live therapists, we need a sense of how many people they won't work well for, who these people tend to be, and to what extent they may be harmed by interactions with bots.
Thanks for reading this newsletter (written by an actual human!).
Appendix A: Tessa and eating disorders
A 2021 study by Dr. Ellen Fitzsimmons-Craft and colleagues showed that a chatbot called Tessa can reduce one of the risk factors for eating disorders. 700 women between the ages of 18 and 30 were randomly assigned to either a "Tessa group" (8 weeks of access to the bot) or to a control group (no intervention). In follow-up surveys at 3 and 6 months, the Tessa group showed significantly less concern about body weight/shape. No group differences were observed for internalization of thinness norms.
Methodologically, this study was stronger than many in the literature, owing to the relatively large sample as well as attention to moderately long-term outcomes. However, apart from the fact that the researchers receive funding to help develop Tessa (and maintain a financial partnership with the company that programs it), there are two reasons why the results are merely promising rather than conclusive.
1. The control group received no intervention. The fact that Tessa was better than nothing is good news, but it leaves open the question of how the bot would perform relative to anything else. (Do we need an app? Would an informational website be sufficient?)
2. The impact of Tessa was fairly small, statistically speaking. The measure of body weight/shape concerns was scored on a 100-point scale. At baseline, the Tessa group mean was 70.5 and the control group mean was 71.34. At the 3-month follow-up, the Tessa group mean was 61.12 and the control group mean was 65.57. In other words, the Tessa group improved by 9.38 points and the control group improved by 5.77 points.
How should we evaluate that 3.61-point difference in improvement between the groups? You could say, truthfully, that the Tessa group improved almost twice as much as the control group did. But on a 100-point scale, the absolute difference is small. And with post-test standard deviations over 17 points in both groups, that 3.61-point difference is an average across women whose outcomes differed widely. Again, mean scores don't tell us much about individual-level outcomes.
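For readers who like to see the arithmetic laid out, here is the calculation from the paragraph above, step by step. The means and the "over 17 points" figure are the ones reported in the study as summarized here; nothing else is added.

```python
# The Tessa arithmetic, step by step (group means from the study; 0-100 scale).
tessa_pre, tessa_post = 70.50, 61.12
control_pre, control_post = 71.34, 65.57

tessa_change = tessa_pre - tessa_post        # 9.38-point improvement
control_change = control_pre - control_post  # 5.77-point improvement
difference = tessa_change - control_change   # 3.61 points between groups

print(round(tessa_change, 2), round(control_change, 2), round(difference, 2))
# With post-test standard deviations above 17 points in both groups,
# that 3.61-point mean difference is small relative to how much
# individual women's outcomes varied.
```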
Appendix B: What is an RCT?
Typically, a randomized controlled trial (RCT) is a study with an experimental group and one or more control groups. Each participant in the study is randomly assigned to one of these groups. The groups are as similar as possible in every respect except for the experimental manipulation. For example, the experimental group might be asked to use a mental health bot for two months, while the control group works with a live therapist during this time period. (Technically the latter would be a comparison group rather than a control group, but we don't need to get too far into the weeds here.)
An RCT also has a pre-test and one or more post-tests. In the example I provided, the same test for depressive symptoms might be used. All participants would take this test before working with the bot or therapist, then take the same test two months later. (Another post-test, later in time, would help us know whether any influences of the bot endure over time.)
If the two groups are equally depressed at pre-test, but two months later the experimental group reports significantly less depression than the control group, we would conclude that the mental health bot was effective at reducing depression. (We could draw the same conclusion if the two groups were not equally depressed at pre-test, so long as we found that by post-test only the experimental group showed significant improvements, or that they had improved significantly more than the control group did.)
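If it helps to see what that comparison amounts to in practice, here's a minimal sketch of the kind of analysis a trial might run, with made-up change scores (pre-test minus post-test, so positive means improvement) and an independent-samples t-test standing in for whatever test a real study would use:

```python
# Minimal sketch of the group comparison at the heart of an RCT analysis.
# Change score = pre-test score minus post-test score (positive = improvement).
# All numbers below are made up for illustration.
from scipy import stats

bot_group_change = [6, 4, 7, 3, 5, 8, 2, 6, 5, 4]
control_group_change = [1, 0, 2, -1, 3, 1, 0, 2, 1, 1]

t_stat, p_value = stats.ttest_ind(bot_group_change, control_group_change)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value supports the claim that the bot group improved more on average
# than the control group -- but it says nothing about how many individuals
# in either group improved, stayed the same, or got worse.
```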
It should be clear now why RCTs are considered the "gold standard" for evaluating the impact of an intervention (in this case, the type of therapy). Insofar as the ideal conditions of an RCT are met, the results will tell us what kind of impact the experimental manipulation had, if any. Thus, RCTs are considered ideal for evaluating the effectiveness of mental health bots, and anything short of that ideal (e.g., lack of a control group or pre-test) is widely recognized as problematic.