EDUCATION AND INFORMATION TECHNOLOGIES, 2026 (SSCI, Scopus)
Building on cognitive theories, this study investigates interrater reliability (IRR) and interrater agreement (IRA) between human raters and artificial intelligence (AI) models (ChatGPT-4 and Gemini) in scoring open-ended (constructed response) and single best answer questions. The study involved 73 third-year undergraduate students from the Department of Primary School Education, enrolled in the Measurement and Evaluation course. Of these, 46 students answered the open-ended question and 69 answered the single best answer question. The analysis focused on comparing IRR and IRA between human raters and AI models, as well as between the AI models themselves. Results indicated that AI models demonstrated higher IRR and IRA than human raters for the open-ended question, which was characterized by less structured responses and multidimensional scoring criteria, while human raters performed slightly better for the single best answer question. AI models showed relatively consistent scoring performance across task types, whereas human raters exhibited greater variability, particularly for the open-ended task. These findings highlight task-dependent differences in scoring consistency that are interpretable through cognitive theories, contributing to ongoing debates about the appropriate role of AI in educational assessment.
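For readers unfamiliar with the two indices, the short Python sketch below illustrates how IRR and IRA might be computed for a pair of raters scoring the same set of responses. It is not the study's analysis code: the rater labels, the 0-10 score scale, and the choice of indices (Pearson r and quadratic-weighted kappa for IRR, exact-agreement rate for IRA) are illustrative assumptions, not the procedures reported in the paper.

# Illustrative sketch only: comparing two raters' scores on the same responses.
# Rater names, scores, and indices are hypothetical, not taken from the study.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores (0-10 rubric) assigned to the same ten open-ended answers
human_rater = np.array([8, 6, 9, 4, 7, 5, 10, 3, 6, 8])
ai_model    = np.array([8, 7, 9, 4, 6, 5, 10, 4, 6, 8])

# Interrater reliability: do the raters order and scale the responses consistently?
irr_pearson = np.corrcoef(human_rater, ai_model)[0, 1]
irr_kappa = cohen_kappa_score(human_rater, ai_model, weights="quadratic")

# Interrater agreement: how often do the raters assign the identical score?
ira_exact = np.mean(human_rater == ai_model)

print(f"IRR (Pearson r):            {irr_pearson:.2f}")
print(f"IRR (weighted kappa):       {irr_kappa:.2f}")
print(f"IRA (exact agreement rate): {ira_exact:.2f}")

In this reading, IRR captures the consistency of the rank ordering and spread of scores across raters, whereas IRA captures how closely the raters' absolute scores coincide; the two can diverge, which is why the abstract reports them separately.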