Study Examines the Quality and Empathy of AI Retina Consults

The internet has long provided answers, of varying accuracy, to patients’ many health-related queries, and now artificial intelligence models like ChatGPT are in the mix too. How good is this information, though? New research published in Ophthalmology Science suggests it has potential.¹ Researchers assessed the quality, safety and empathy of responses to common questions from retina patients, comparing responses written by human experts, responses generated by AI and AI-generated responses edited by human experts. They concluded that clinical settings might make good use of AI responses.

In the masked, multicenter study, researchers randomly assigned 21 common retina patient questions among 13 retina specialists. A few examples include the following:

• What causes age-related macular degeneration?

• How long do I need to keep getting anti-VEGF injections?

• Can I pass AMD to my children?

• How long can I go between eye injections?

• Is there a good treatment for floaters?

Each expert created a response and then edited a response generated by the large language model (LLM) ChatGPT-4. They timed themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing and Bard) also generated responses to each of the 21 questions. Other experts not involved in the initial response-writing process evaluated the responses and subjectively judged them for quality and empathy (very poor, poor, acceptable, good or very good) and for safety (incorrect information, likelihood to cause harm, extent of harm and missing content).

The researchers collected 4,008 grades (2,608 for quality and empathy and 1,400 for safety metrics). They reported significant differences in quality and empathy among the three groups: LLM alone, expert alone and expert+AI. The expert+AI group performed best overall in terms of quality, and ChatGPT-3.5 was the top-performing LLM. ChatGPT-3.5 had the highest mean empathy score, followed by expert+AI. Expert responses placed fourth out of seven for quality and sixth out of seven for empathy (mean score), according to the study. Expert+AI responses significantly exceeded expert responses for quality and empathy.

“Busy surgeons may respond to patient questions accurately and quickly, but may not respond with as much empathy as LLMs,” says study senior author Matthew R. Starr, MD, of the Mayo Clinic.

Fortunately, AI seems poised to help. In the study, the researchers reported time savings for expert-edited AI responses vs. expert-created responses. “AI is here—it’s not ‘coming’ anymore,” Dr. Starr says. “It’s part of what we do, and I think we need to continue to be at the forefront of incorporating AI into how we practice. We as physicians spend a lot of time responding to patient questions, and if we could harness LLMs to safely and appropriately respond to questions that would give us a lot more time back.”

Dr. Starr points out, however, that AI-generated responses still need oversight. “Many of the [AI-generated] responses were great, but there are still some inaccuracies and potential for harm, so they need to be edited and vetted appropriately. That will take time upfront. Hopefully as they improve over time, they’ll require less oversight for responses to basic questions.” He adds that in this case, it will be important to disclose to patients that some responses are AI generated but vetted by physicians.

Future LLMs for patient queries would need some modification. “These LLMs are open-source platforms, and not HIPAA compliant,” Dr. Starr says. “If we can make something that’s created specifically for patients that we created, then we may be able to actually use it and get it HIPAA compliant.”

One limitation of the study came about due to the time it took to write and edit responses. “We missed about 100 or so questions out of about 4,000,” Dr. Starr says. He notes that a Hawthorne effect, in which individuals modify their behavior in response to awareness of being observed, may also have occurred, though physicians did not grade their own responses.

Overall, the researchers conclude in their paper that LLM responses were comparable to those written by experts, and that an expert-LLM collaboration can result in responses with better quality and empathy than human experts alone while saving time, potentially reducing physician burnout and improving patient care. The authors write that a “natural next step would be testing an editable LLM-generated draft to patient messages.”

Another group of researchers set out to determine the accuracy of information patients get when they use ChatGPT.²

It’s no surprise that, today, patients are likely to know a good deal about the conditions affecting them, given the instant knowledge available at our fingertips. Despite the internet providing a plethora of reputable information, patients may not know where to look for trusted sources on medicine and health practices across specialties, leaving them vulnerable to accessing incorrect information.

With the emergence of AI chatbots, this problem may be starting to improve, as such services could in theory improve accuracy by weeding out spurious reports. ChatGPT, which was used in a recent study, may not do much to resolve this issue right now, but a continually learning and improving bot may eventually be more suitable for adjunctive patient education than aimlessly browsing search engines.

To assess the accuracy of ophthalmic information provided by ChatGPT, researchers from Wills Eye Hospital in Philadelphia selected five diseases from each of eight ophthalmologic subspecialties. For each disease, three questions were asked:

• What is [x]?

• How is [x] diagnosed?

• How is [x] treated?

Responses were scored with a range from -3 (unvalidated and potentially harmful to a patient’s health or well-being if they pursue said suggestion) to 2 (correct and complete). To make these assessments, information was graded against the American Academy of Ophthalmology’s guidelines for each disease.

A total of 120 questions were asked. Among the generated responses, 77.5 percent achieved a score of ≥1, while 61.7 percent were considered both correct and complete according to AAO guidelines. A significant 22.5 percent of replies scored ≤-1. Among those, 7.5 percent obtained a score of -3. ChatGPT was best at answering the first question and worst on the topic of treatment. Overall median scores across all subspecialties were 2 for “What is [x]?,” 1.5 for “How is [x] diagnosed?” and 1 for “How is [x] treated?”
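
For readers who want to see how the reported percentages translate into numbers of responses, the short Python sketch below works through the arithmetic. It is an illustration only: the variable and function names are hypothetical, and it assumes that five diseases were drawn from each of the eight subspecialties and that every percentage is taken over all 120 responses.

```python
# Illustrative arithmetic only; names are hypothetical, not from the study.
SUBSPECIALTIES = 8
DISEASES_PER_SUBSPECIALTY = 5   # assumption: five diseases in each subspecialty
QUESTIONS_PER_DISEASE = 3       # definition, diagnosis, treatment

total_questions = SUBSPECIALTIES * DISEASES_PER_SUBSPECIALTY * QUESTIONS_PER_DISEASE


def responses_for(percent: float, total: int = total_questions) -> int:
    """Convert a reported percentage into the nearest whole number of responses."""
    return round(percent / 100 * total)


# Assumption: each percentage below is a share of all responses.
counts = {
    "favorable": responses_for(77.5),             # higher-scoring share
    "correct_and_complete": responses_for(61.7),  # top score of 2
    "unfavorable": responses_for(22.5),           # lower-scoring share
    "score_minus_3": responses_for(7.5),          # potentially harmful
}

for label, n in counts.items():
    print(f"{label}: {n} of {total_questions}")
```

Under these assumptions, the favorable and unfavorable shares together account for every response, which is consistent with the two percentages summing to 100.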

Results were published in the journal Eye. The study authors surmise that the median scores were highest for the definition question and lowest for the treatment question because of the dataset of information ChatGPT drew from for training.

As the authors explained in their paper, “The definition of a common disease is usually standard and well-known, and thus the information the chatbot has received in its training regarding the definition of a disease should be very straightforward. When prompted about diagnosis and treatment, it’s more likely that the inputs contained conflicting information.”

The same hypothesis could be applied to the differences in median score across subspecialties. ChatGPT answered all the general subspecialty questions correctly, potentially because conditions in this category are better-known pathologies. As such, ChatGPT may have had a larger and more consistent body of information to learn from. Supporting this idea, the maximum scores within other subspecialties were obtained for well-known and common pathologies, including cataracts, glaucoma and diabetic retinopathy.

Of course, this research demonstrates that chatbots are nowhere near capable of robust use for disseminating medical information. However, the authors believe “it appears that artificial intelligence may be a valuable adjunct to patient education, but it is not sufficient without concomitant human medical supervision.”

Moving forward, they convey that “as the use of chatbots increases, human medical supervision of the reliability and accuracy of the information they provide will be essential to ensure patient’s proper understanding of their disease and prevent any potential harm to the patient’s health or well-being.”

1. Tailor PD, Dalvin LA, Chen JJ, et al. A comparative study of responses to retina questions from either experts, expert-edited large language models (LLMs) or LLMs alone. Ophthalmology Science 2024. [Epub ahead of print].

2. Cappellani F, Card KR, Shields CL, Pulido JS, Haller JA. Reliability and accuracy of artificial intelligence ChatGPT in providing information on ophthalmic diseases and management to patients. Eye. January 20, 2024. [Epub ahead of print].

TikTok “Challenges” Pose a Threat to Users’ Ocular Health

The social media platform TikTok has been used to share mostly harmless trends and challenges among its predominantly young audience, but some do pose serious risks to adolescents and teenagers who seek validation and attention from peers. Considering about 41 percent of the user base falls between the ages of 16 and 24, and a third are 14 or younger, it’s important to highlight those trends that pose potential harm. A recent research paper in the online journal Ophthalmology and Therapy cataloged a variety of reckless and foolhardy activities shared on TikTok that can endanger eye health.¹

Included in discussion of this new research were the “rubbing castor oil trend,” “bleach/bright-eye challenge,” “mucus fishing challenge,” “eggsplosions,” “beezin challenge,” “Orbeez challenge,” “blow-drying eyelashes,” “sun gazing” and “popping styes” TikTok trends/challenges. The number of views, likes and shares was documented for each video of the respective challenge with the highest like count.

The first on the list, rubbing castor oil onto the eyes, has the purported benefits of decreasing wrinkles and, somehow, improving vision. A few studies do show that castor oil can enhance the lipid component of the tear film and slow tear evaporation, but it can be dangerous to use without medical supervision because many over-the-counter versions contain irritating or harmful preservatives. In addition, excessive eye rubbing is linked with keratoconus.

Next is the bright-eye challenge, which involves placing on the eye a bag filled with jelly, hand sanitizer, bleach and shaving cream to lighten eye color. This can cause irritation and, because bleach denatures proteins, permanent cellular damage if the bag leaks and its contents reach the eye. The challenge has since been removed from the platform. It may also have begun on TikTok as a prank or parody and been largely received as such by users rather than as something to be acted upon. It’s also worth noting that the bleach eye challenge dates back to 2019, an eternity ago in the fast-paced world of social media, and TikTok in particular, and thus is likely to be long forgotten by today’s users.

Another challenge noted by the researchers involves forcing mucus out of an irritated eye using a Q-tip or finger. This can lead to “mucus fishing syndrome,” a cyclic condition, often triggered by ocular irritation, in which extracting mucous strands from the eye produces more mucous discharge, perpetuating the irritation and the cycle. Mucus fishing can also cause mechanical conjunctivitis.

“Eggsplosions” occur when hard-boiled eggs are microwaved and then cut open so that they intentionally burst. The concern is that the bursting egg can hit the eye, leading to direct trauma. Similarly, the “Orbeez challenge” involved using paintball guns to shoot gel pellets, which can also lead to ocular trauma. In fact, this challenge caused 19 serious ocular injuries, as reported in one 2022 review, with 11 of the 19 occurring in those younger than 18. Harm may also result from a trend that advocated blow-drying one’s eyelashes, since the eyes are not well-suited to endure such forceful air in close proximity. Attempting this could cause dryness, irritation or even long-term consequences such as corneal abrasions and infections.

Other trends of concern include sun gazing, in which viewers intentionally look at the sun for five to 10 seconds, which can lead to solar retinopathy and scotomas. “Stye popping” is yet another trend that can harm the eyelid, as people use their fingers, needles or tweezers to express lid lesions. The infection can spread or the stye can worsen, with possible complications including pigmented scars, scar tissue and pitting scars.

All of these videos have millions of views each, the paper notes, highlighting how pervasive these trends can become and how harmful they can be if followed uncritically by impressionable young viewers. The study authors warn that “encountering substandard medical information on social media platforms presents significant hazards to patients. It may lead them to make critical medical choices relying on potentially erroneous data. This could result in adverse consequences, such as applying over-the-counter castor oil to treat various medical conditions.” As such, doctors and parents should be vigilant for incorrect and just plain foolish medical content on social media platforms.

1. Hassan SA, Ghannam AB, Saade JS. An emerging ophthalmology challenge: a narrative review of TikTok trends impacting eye health among children and adolescents. Ophthalmol Ther. February 5, 2024. [Epub ahead of print].

Study May Describe New Dry AMD Variant

In patients with age-related macular degeneration, macular neovascular lesions are usually responsible for the presence of intraretinal fluid (IRF), but in some reports this fluid occurs in the absence of MNV lesions. To describe this new AMD variant, researchers conducted a retrospective study of patients with IRF and intermediate AMD. Their results, published in Retina, show that non-exudative IRF is a novel and distinct finding in intermediate AMD.¹

The study included 10 eyes of 10 patients (aged 68 to 60; mean BCVA 20/40) who demonstrated IRF in intermediate AMD. No macular neovascularization was seen on multimodal imaging, and optical coherence tomography-angiography didn’t detect any abnormal flow signal associated with IRF.

The researchers described two distinct phenotypes of patients in which IRF occurred: (1) those with serous/drusenoid pigment epithelium detachment (PED) and (2) those with an area of nascent geographic atrophy (nGA). They explained in their Retina paper that when seen on structural OCT, the “IRF associated with PED was usually found at the apex of the PED, that was surrounded by hyperreflective deposits,” while “in eyes with nGA, IRF appeared as hyporeflective cystoid spaces that follow the course of Henle’s fiber layer.”

Proposed causes and/or mechanisms for non-exudative IRF in intermediate AMD include:

• PED lesions “with considerable height,” causing mechanical stress and hydrostatic pressure;

• concomitant Müller cell loss and outer segment cell impairment, ultimately leading to cystoid IRF accumulation;

• blood-retinal barrier breakdown and protein deposit accumulation between the choriocapillaris and external limiting membrane (ELM), leading to increased osmotic pressure and hyperosmolar stress;

• local hypoxia resulting from increased distance between retinal pigment epithelium (RPE) and choriocapillaris; or

• outer retinal injury leading to RPE migration to the inner retinal layers.

“These findings are of paramount relevance in the clinical setting, highlighting that we need to discern between the presence of IRF due to MNV and that due to non-MNV causes,” the researchers emphasized in their paper. They concluded that larger cohorts are needed along with multimodal approaches to “improve the understanding of the mechanism at play causing the IRF” in intermediate AMD and to improve the management of patients in this subgroup.

1. Servillo A, Kesim C, Sacconi R, et al. Non-exudative intraretinal fluid in intermediate age-related macular degeneration. Retina 2024. [Epub ahead of print].

Teresa Horan, MD, Awarded Rick Bay Scholarship

This year’s recipient of The Rick Bay Excellence in Eyecare Education Scholarship is Teresa Horan, MD.

Dr. Horan is currently completing a glaucoma fellowship at Wills Eye Hospital in Philadelphia. She’s a native of Connecticut and graduated cum laude with a BA in Biology from St. Mary’s College of Maryland, going on to earn her medical degree from Sidney Kimmel Medical College at Thomas Jefferson University in Philadelphia. Dr. Horan completed her ophthalmology residency at Tufts Medical Center/New England Eye Center in Boston.

Lauren Hock, MD, a glaucoma specialist at Wills, says, “Dr. Horan was selected for the Rick Bay Excellence in Eyecare Education Scholarship for her compassionate care of glaucoma patients, her integrity as a physician, and her dedication to ophthalmology education as a future academic glaucoma specialist.”

Dr. Horan says she was honored to be selected. “I value the importance of training to further our field and am committed to training the next generation of ophthalmologists to care for our growing population of glaucoma patients,” she says. “I’m grateful to the Rick Bay Foundation for the support to continue my training and advance glaucoma care.”

Upon completion of her fellowship, Dr. Horan will continue her career at the University of Maryland.

The Rick Bay Foundation honors the legacy of its namesake, the former president and publisher of Review of Ophthalmology. Ophthalmology scholarships are awarded annually to fellows at Wills Eye Hospital who embody Bay’s qualities of integrity and compassion.
