Background: Triple-negative breast cancer (TNBC) is an aggressive subtype of breast cancer characterized by the lack of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) expression. The absence of these receptors reduces the effectiveness of targeted treatment approaches. With the increasing use of artificial intelligence (AI) in medical research and clinical decision-making, there is growing interest in evaluating the accuracy and reliability of large language models (LLMs), such as ChatGPT-4, in oncology-related applications.
Objective: This study systematically assesses the reliability of ChatGPT-4 in answering frequently asked questions about TNBC in four critical areas: diagnosis, treatment, prognosis and survival, and quality of life. Expert evaluations and statistical analyses are used to measure the accuracy of the model’s responses.
Methods: A set of 100 questions related to TNBC was gathered from credible medical sources, including peer-reviewed journals. Clinical oncology specialists evaluated the responses generated by ChatGPT-4 using a structured assessment framework, classifying each answer into one of four accuracy levels: completely inaccurate, partially accurate, accurate but lacking depth, and highly accurate. To evaluate the consistency among reviewers, Cohen’s kappa coefficient was calculated, and descriptive statistical analysis was conducted to identify overall accuracy patterns.
Results: The findings indicated that 73% of the responses were classified as either “Accurate” or “Highly Accurate”, suggesting the potential of ChatGPT-4 as a supplementary resource for obtaining information on TNBC. However, 27% of the responses were categorized as “Partially Accurate” or “Completely Inaccurate”, highlighting gaps in contextual understanding and instances of misinformation. Cohen’s kappa coefficient was 0.007, reflecting a weak level of agreement among evaluators and highlighting the impact of subjective interpretation. The model performed strongly in well-established areas such as chemotherapy protocols and diagnostic procedures but faced challenges with emerging research topics, personalized treatment recommendations, and fertility-related concerns.
Conclusion: ChatGPT-4 exhibits significant potential in summarizing information on TNBC; however, the accuracy of its responses varies with the complexity and specificity of the queries. Because of these inconsistencies and the low inter-rater reliability, AI-generated medical content requires verification by medical professionals before being applied to patient care or clinical decision-making. Future development of large language models should focus on reducing inaccuracies, incorporating the latest medical data, and improving adaptability to better support personalized medicine.
Key words: Triple Negative Breast Cancer (TNBC), Large Language Models (LLMs), ChatGPT-4, Oncology Decision Support, Cohen’s Kappa Coefficient, Medical AI Accuracy.
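As an illustration of the inter-rater statistic reported in the abstract, the following is a minimal sketch of unweighted Cohen's kappa for two raters using the four accuracy levels. It assumes exactly two evaluators (the abstract does not state the number), and the function name `cohen_kappa` and the example ratings are hypothetical; they are not the study's data or code.

```python
from collections import Counter

# The four accuracy levels named in the abstract.
LEVELS = (
    "completely inaccurate",
    "partially accurate",
    "accurate but lacking depth",
    "highly accurate",
)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        # Convention chosen for this sketch: both raters used one identical label.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings for eight answers (illustrative only).
H, A = LEVELS[3], LEVELS[2]
rater_a = [H, H, A, A, H, A, H, A]
rater_b = [H, A, H, A, A, H, H, A]

print(cohen_kappa(rater_a, rater_b))  # 0.0
```

In this made-up example the raters agree on half of the answers, yet kappa is 0 because that is exactly the agreement expected by chance given how often each rater uses each label. A near-zero kappa can therefore coexist with a high share of favourable ratings.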