Sunday, 23 January 2022

Assessment in English (Part 3: Reliability and Validity)

This is the third instalment of my blog series on assessment in English.  Following on from my last blog on some of the key terms when it comes to curriculum planning, I now want to take a brief look at the terms ‘reliability’ and ‘validity’, in order to consider the questions they raise when it comes to planning and implementing effective assessment.

Validity

Dylan Wiliam’s statement that ‘there is no such thing as a valid test’ (Wiliam, 2020) really challenged my thinking about how we use assessments (for both formative and summative purposes).  Instead, what matters is the purpose of the data we gather and the validity (or accuracy) of the inferences we make, whether the data be end-of-unit test scores, mock results or more qualitative evidence (such as students’ responses to a mini-whiteboard task).


As English teachers, we are often adept at unpicking underlying meanings within language and assessing their validity; it’s important that we apply the same thought process to the assessment data we gather.


Wiliam explains (in the brilliant ‘ResearchEd Guide to Assessment’- a must-read for anyone evaluating the efficacy of assessment in the classroom) how there are two main threats to validity:

  • Construct underrepresentation (meaning that the sample of knowledge being assessed is too small to make valid inferences about students’ learning)
  • Construct-irrelevant variance (where factors irrelevant to the knowledge being assessed, such as unfamiliar vocabulary in the question, prevent students from demonstrating what they have learnt)


Take, for example, this GCSE-style question:

Explore how Shakespeare presents Macduff as virtuous.


Construct underrepresentation is easily illustrated by literature mock exams used to assess students’ knowledge of a set text.  For example, a low mark for an essay that focuses on Macduff could lead a teacher to infer that the student’s knowledge of the play ‘Macbeth’ needs much work.  However, their knowledge of Macbeth, Lady Macbeth and other characters and themes might be much stronger: the inference, therefore, wouldn’t be valid.


Likewise, the word ‘virtuous’ might also pose a barrier to valid inferences, since it might generate construct-irrelevant variance.  If students aren’t confident with the meaning of this word, they are less likely to be able to communicate their knowledge of the text.  Whilst it is important to teach and promote a wide vocabulary, we do need to be aware that using it in assessment questions might impair our ability to make accurate inferences about what students do and don’t know.


Reliability

Reliability is the measure of how consistent an assessment result would be if the assessment were repeatedly administered over time.  Assuming that no learning took place, you’d expect a completely reliable assessment to generate the same mark for a given student, no matter when they took it or who marked it.


Again, I’m referring to Wiliam (the champion of evidence-informed assessment), who highlights that total reliability of an assessment isn’t logistically possible.  Instead, ‘we need to be aware of the limitations of our assessments so we do not place more weight on the result of an assessment than its reliability would warrant’.


I feel that a specific threat to reliability in English assessment is the subjective nature of much of the success criteria we use (whether on the mark schemes for the KS4 and KS5 exams, or on internal criteria).  You only need to look at the changes in grades after GCSE English re-marks have been submitted to see that even standardised tests cannot be wholly reliable.


Questions for English teachers and leaders

I’m definitely not advocating for exam-style questions to be banned from the English classroom; they are important for preparing students for the exams they will sit.  Nor is this where I start to explain my views on exam reform.


That being said, it is clear that any summative use of assessment needs to be planned carefully, both to maximise the validity of the inferences we make and to mitigate the impact of issues with reliability.


When we design these assessments, we need to consider:

  • What do we want to assess?  Does the assessment sample a wide enough range of this knowledge?
  • Which inferences do we want to make from the data?  How does the assessment set up these inferences to be valid?
  • Are there any barriers (especially gaps in knowledge or vocabulary) that prevent students from demonstrating the knowledge they have?  How can these be mitigated?
  • If the same student completed the assessment on different days, how consistent (or reliable) would their score be, assuming no new learning has taken place?
  • If different teachers marked the same assessment, how consistent (or reliable) would their marking be?
  • What actions could be taken to mitigate the risks to reliability?

To conclude, when it comes to reliability and validity, Wiliam’s advice at the end of his chapter ‘How to think about assessment’ is rock-solid.




References:

Wiliam, D. (2020) ‘How to think about assessment’, in The ResearchEd Guide to Assessment. John Catt Educational.
