Background
Cognitive impairment affects 20–80% of individuals across neurodegenerative diseases (ALS, Alzheimer’s, and Parkinson’s diseases) and roughly 20% of the general aging population, yet cognitive screening remains rater-dependent. Many assessment components, including verbal fluency, memory recall, and sentence completion, are influenced by administrator judgment, experience, and instruction adherence, limiting scalability.
Objective
We developed a human-in-the-loop AI copilot to assist administration and scoring of high-burden cognitive tasks, reducing administrator workload while improving standardization.
Methods
In a preliminary pilot of 22 remote and in-clinic administrations across 5 raters in ALS and AD clinical and research settings, we found that error rates varied substantially with task complexity: high-burden tasks (verbal fluency 32%, memory task 29%) showed error rates roughly sixfold higher than lower-burden tasks (naming 5%, digit span 5%). Workflow analysis identified key sources of administrator burden: manual transcription, rapid tracking of responses, application of task-specific scoring rules, and adherence to administration instructions.
These findings informed an AI-powered copilot targeting high-burden tasks through real-time decision support. The copilot captures participant responses via live speech-to-text transcription and applies task-specific logic using large language models to generate structured scoring suggestions. For fluency, it tracks responses, detects rule violations, and validates against dictionaries. For recall, it evaluates semantic consistency with predefined scoring elements, allowing for paraphrasing. For sentence completion, the copilot assesses semantic and contextual relevance, flagging ambiguous cases for administrator review. All AI-generated outputs require administrator confirmation. The system is modular, enabling adaptation across multiple cognitive batteries.
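The fluency-scoring logic described above (tracking responses, detecting repetitions and rule violations, and validating against a category dictionary) can be sketched as follows. This is a minimal illustrative example, not the copilot's actual implementation; the dictionary contents, flag labels, and function name are assumptions introduced here.

```python
# Illustrative sketch of fluency scoring: each transcribed response is
# checked for repetition and category-dictionary membership, and flagged
# for administrator confirmation. Names and rules are hypothetical.

ANIMAL_DICTIONARY = {"dog", "cat", "horse", "lion", "tiger", "zebra"}

def score_fluency(responses, valid_words=ANIMAL_DICTIONARY):
    """Return (score, flags), where flags pairs each response with
    'valid', 'repeat', or 'rule_violation'."""
    seen = set()
    flags = []
    for raw in responses:
        word = raw.strip().lower()
        if word in seen:
            flags.append((word, "repeat"))          # repetition: not scored
        elif word not in valid_words:
            flags.append((word, "rule_violation"))  # outside the category
        else:
            seen.add(word)
            flags.append((word, "valid"))
    score = sum(1 for _, f in flags if f == "valid")
    return score, flags
```

In a deployed system, each flag would be surfaced as a suggestion for the administrator to confirm or override, consistent with the human-in-the-loop design.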
Preliminary Results
Administrators reported reduced cognitive load, improved focus on participant interaction, faster administration, and robust speech recognition across native and non-native English speakers, supporting broad feasibility.
Next Steps
Ongoing work focuses on validating the AI copilot by measuring its effects on scoring accuracy, inter-rater reliability, administration time, and user satisfaction. We hypothesize that the copilot will reduce error rates on high-burden tasks to ≤5%, improving standardization and supporting clinical adoption.