Bridging minds and machines: a comparative study of AI and human rater agreement and reliability in educational assessment


ACAR GÜVENDİR M., KILIÇ A. F., GÜVENDİR E., KAÇAK T.

EDUCATION AND INFORMATION TECHNOLOGIES, 2026 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1007/s10639-026-13949-7
  • Journal Name: EDUCATION AND INFORMATION TECHNOLOGIES
  • Journal Indexes: Social Sciences Citation Index (SSCI), Scopus, IBZ Online, Education Abstracts, Educational Research Abstracts (ERA), ERIC (Education Resources Information Center), INSPEC
  • Trakya Üniversitesi Affiliated: Yes

Abstract

Building on cognitive theories, this study investigates interrater reliability (IRR) and interrater agreement (IRA) between human raters and artificial intelligence (AI) models (ChatGPT-4 and Gemini) in scoring open-ended (constructed response) and single best answer questions. The study involved 73 third-year undergraduate students from the Department of Primary School Education, enrolled in the Measurement and Evaluation course. Of these, 46 students answered the open-ended question, and 69 answered the single best answer question. The analysis focused on comparing IRR and IRA between human raters and AI models, as well as between the AI models themselves. Results indicated that AI models demonstrated higher IRR and IRA than human raters for the open-ended question, characterized by less structured responses and multidimensional scoring criteria, while human raters performed slightly better for the single best answer question. AI models showed relatively consistent scoring performance across task types, whereas human raters exhibited greater variability, particularly for the open-ended task. These findings highlight task-dependent differences in scoring consistency that are interpretable through cognitive theories, contributing to ongoing debates about the appropriate role of AI in educational assessment.
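The abstract does not state which statistics the study used to quantify IRR and IRA, but the distinction between the two concepts is easy to illustrate. Below is a minimal Python sketch on hypothetical scores (not the study's data), in which IRR is proxied by a Pearson correlation (do two raters order examinees the same way?) and IRA by exact percent agreement (do they assign identical scores?):

```python
# Minimal sketch of the IRR-vs-IRA distinction; the score vectors
# are hypothetical and the choice of indices (Pearson r, exact
# agreement) is an assumption, not the paper's stated method.
import numpy as np

rater_human = np.array([4, 3, 5, 2, 4, 3, 5, 1])  # hypothetical human scores
rater_ai    = np.array([5, 4, 5, 3, 5, 4, 5, 2])  # hypothetical AI scores

# IRR: consistency of rank order across examinees
irr = np.corrcoef(rater_human, rater_ai)[0, 1]

# IRA: proportion of examinees receiving the identical score
ira = np.mean(rater_human == rater_ai)

print(f"IRR (Pearson r):       {irr:.3f}")  # high: same rank order
print(f"IRA (exact agreement): {ira:.3f}")  # low: systematic offset
```

In this toy example the AI rater is systematically about one point more lenient, so the correlation (IRR) is high while exact agreement (IRA) is low; this is why studies of rater behavior typically report both indices rather than either alone.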