Dear Dr. Springer, thank you for your in-depth analysis of the Therabot study. Your thoughtful and detailed observations resonate with me (the Woebot cliff example is indeed shocking). I would love to hear more about how the Dartmouth team accumulated over 100,000 professional hours to contribute to the system.
As a therapist, I also wondered about the size of the effects and the tools used to assess depression and anxiety in this context. I condensed some of my thoughts into a shorter, less detailed article than your work, but I would welcome your feedback on it.
https://wfmai.substack.com/p/no-psychiatry-did-not-just-experience?r=3row1i
Thank you for your comments!
100,000+ professional hours is a strikingly high number. The Dartmouth team didn't provide details, but in 2024 they did note that they'd been accumulating training materials for 5 years.
(https://www.c4tbh.org/funded-pilot/use-of-a-generative-ai-gen-ai-chatbot-for-treating-anxiety-and-depression-among-persons-with-cannabis-use-disorder-cud/)
I love your article and completely agree with you that Gen AI chatbots are not psychiatry's next "moment". (Since Therabot is trained on CBT, one could view the future implementation of this bot as simply part of CBT's "moment".)
I believe that Gen AI chatbots could help some people, at least some of the time, with some mental health issues. For this reason, my article intentionally avoids a detailed discussion of measurement and effect sizes in the Dartmouth study. Here's my logic:
(a) Because this was a small-sample study on an evolving technology, the effect sizes aren't likely to generalize.
(b) The effect sizes are limited in meaningfulness anyway, because the study compared Therabot to a control group rather than to some other kind of mental health support (arguably, the best test would be a comparison with face-to-face CBT), and because there was no long-term follow-up.
(c) The study tells us very little about the kinds of people and conditions for which Therabot may be helpful.
For these reasons, I didn't want to overemphasize specific quantitative findings such as effect sizes.
Dear Dr. Springer, thank you very much for your response; I really appreciate it. I completely agree with the points you highlighted: the effect sizes are not generalizable, there is no comparison to an effective therapy, there is little information about the participants, and more details are needed about how the chatbot was used.
I agree that such an approach can make access easier in certain cases. I also believe that, given the pace of AI development, these systems will improve. However, they fundamentally lack the meaningful component of human interaction: the corrective experience of human connection and therapeutic alliance that face-to-face psychotherapy provides.
Thank you for providing more information about those ominous 100,000 hours. On X, I found a hint suggesting that Falcon and Llama were used as LLMs, and the input consisted of "synthetic" therapist-patient dialogue. This suggests that it was probably generated by AI and later reviewed and edited by a professional; at least, that's what I would assume.
Here's the post: https://x.com/amytongwu/status/1906788134525772170
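To make that assumption concrete, here is a very rough sketch of the kind of synthetic-dialogue pipeline I'm imagining; the model choice, prompt, and function names are entirely my own guesses for illustration, not anything the Dartmouth team has documented:

```python
# Purely illustrative sketch of my assumption: an open LLM drafts therapist-patient
# dialogue, and every draft is then handed to a licensed clinician for review and
# editing before it could ever be used as training material.
from transformers import pipeline  # Hugging Face transformers

# Falcon is one of the models hinted at in the X post; any instruct model would do here.
generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct")

PROMPT = (
    "Write a short CBT-style dialogue between a therapist and a patient "
    "with mild depressive symptoms. Label each turn 'Therapist:' or 'Patient:'."
)

def draft_synthetic_dialogues(n_samples: int = 3) -> list[str]:
    """Generate candidate dialogues intended for later clinician review, not direct use."""
    outputs = generator(
        PROMPT,
        max_new_tokens=400,
        num_return_sequences=n_samples,
        do_sample=True,
        temperature=0.9,
    )
    return [o["generated_text"] for o in outputs]

if __name__ == "__main__":
    for i, dialogue in enumerate(draft_synthetic_dialogues(), start=1):
        print(f"--- draft {i} (pending professional review) ---")
        print(dialogue)
```

Even in this optimistic reading, the human contribution would be editorial rather than generative.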
Yes... if the goal of psychotherapy is to improve the experiences of humans who interact with other humans, I would want a human therapist to support me on that journey.
I appreciate the link to that X post. I assume you're right about some of the content being AI-generated, as 100,000 hours over a 5-year period would require nearly 55 hours of dialogue per day, every day. I find this worrisome. I think we would like to see as much human input as possible into the content of LLM training materials, not to mention the finer details of the guardrails we build in.
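For transparency, the back-of-the-envelope arithmetic behind that figure (assuming roughly 365 days per year over the 5-year window) is:

\[
\frac{100{,}000\ \text{hours}}{5 \times 365\ \text{days}} \approx 54.8\ \text{hours of dialogue per day,}
\]

which is more than two full days of material for every calendar day.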
Strongly agree. Thank you!
Yes, never give up on someone who is struggling with their mental health. Thank you for ending on that note.