3. Analysis of test results
When considering how well you and your students performed, you are frequently asked to report the percentage of students that passed the course. However, analysing their scores will reveal more detail and enable you to make informed decisions for improving the assessment and the course as a whole.
Test results analyses take place on three different levels (see Figure 4):
- At test/assessment level
- At item level
- At answer level
A test result analysis will give insight into:
- How well the students mastered the individual learning objectives of the course (test level)
- The quality of the individual test questions or assignment criteria (item & answer level, 3.2)
- Whether the answer model needs to be revised (3.3)
- The overall quality of the assessment (test level, 3.4)
- Whether the grading needs to be revised (3.5)
- How to adjust rubrics and criteria (3.6)
This chapter explains the steps that you can take to perform a test result analysis and to improve the grading, future assessments and future courses based on your findings.
Keep in mind that it is practically impossible to make flawless assessments (unless you had unlimited time). Therefore, be prepared to adjust the answer model or rubric grading after the test result analysis.
First, please take note of the following definitions:
- Test: any assessment, including projects, assignments, exams with open-ended questions and multiple-choice exams.
- Grade: the grade (usually on a scale from 1 to 10) that a student receives for the whole test
- Score: the number of points that a student obtained from this test, before it is transformed into a grade.
- Item: the smallest unit in a test. This can be a criterion or subcriterion for assignments/projects, or a subquestion or question for an exam.
In case your exam is a digital exam or a paper-scan exam, digital exam tools (such as Ans and Brightspace Quizzes) perform the test result analysis for you. Check the Teaching Support pages for an explanation of how to use this part of the assessment tool. For more information on how to interpret test result analyses, see sections 3.2 and 3.4.
If you are grading exams or projects/assignments with pen-and-paper, you will store the following data in a spreadsheet while (or after) scoring your students’ work:
- Scores per item per each student
- Total scores per student
- Grades per student
For smaller datasets (few students (<20) or few items), you may not be able to draw strong conclusions from your data. However, you are encouraged to run a test result analysis to check whether it matches your experience during the course and grading.
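If you keep these scores in a spreadsheet, a small script can compute the totals and grades for you. Below is a minimal sketch in Python/pandas, assuming a hypothetical scores.csv with one row per student and one column of points per item, a maximum of 50 points, and a linear 1-10 grading rule (grade = 1 + 9·score/maximum) as used later in this chapter; adjust the file name, maximum and grading rule to your own situation.

```python
import pandas as pd

# Assumed layout: scores.csv with a 'student' column and one column of points per item.
scores = pd.read_csv("scores.csv").set_index("student")

max_points = 50                               # assumed maximum total score of the test
total = scores.sum(axis=1)                    # total score per student
grade = 1 + 9 * total / max_points            # assumed linear 1-10 grading rule

print(pd.DataFrame({"total_score": total, "grade": grade.round(1)}))
```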
3.1. Analysis of achievement of learning objectives
The first question to ask yourself is how well your group of students masters the individual learning objectives. Are they performing better on certain learning objectives than on others? Did my new teaching approach for a certain learning objective work? These and other questions can be answered by grouping the (normalised) item scores per learning objective, as in Figure 5.
Graphically summarizing the scores of your students per learning objective will make it easier to interpret the results. Plot a measure of performance (average and/or median) and spread (standard deviation or boxplot), and if helpful, the individual data points.
When analysing the graph, think about what scores you as a teaching professional find acceptable for a particular course or learning objective. Also consider what caused the problems or success in LOs during the course, and how you can help your colleagues and students to work on (and prevent) knowledge gaps. You can use any graph of your choice, as long as it summarises the distribution of the scores per learning objective.
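As a sketch of such a plot (continuing from the hypothetical scores.csv above, and assuming you also record which learning objective each item belongs to and the maximum points per item), you could draw a boxplot of the normalised scores per learning objective with pandas and matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_csv("scores.csv").set_index("student")

# Hypothetical mappings: maximum points per item and the learning objective each item covers.
max_per_item = pd.Series({"q1a": 2, "q1b": 3, "q2": 5, "q3": 4})
item_to_lo = {"q1a": "LO1", "q1b": "LO1", "q2": "LO2", "q3": "LO3"}

normalised = scores / max_per_item                     # each item on a 0-1 scale
per_lo = normalised.T.groupby(item_to_lo).mean().T     # mean normalised score per LO per student

ax = per_lo.boxplot()                                  # spread per learning objective
ax.scatter(range(1, per_lo.shape[1] + 1), per_lo.mean(), marker="x", color="red", label="mean")
ax.set_ylabel("Normalised score")
ax.legend()
plt.show()
```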
Typically, problems in learning objective achievement are caused by a lack of practice at the level of the test (constructive alignment).
3.2. Analysis of the quality of the test items and answers
In this section, you will learn to analyse the quality of the individual items ((sub)questions, or assessment (sub)criteria) with the outcome of the item-specific analyses. Use this to pick the most worrying items and check them for errors or unclarities. This helps you to improve the scoring of these items, so that the students who just took the exam receive a fairer grade, and to select where to further improve next year’s test.
Test result analyses result in a number of variables. The most useful variables to check which items are (most) worrisome are the following:
- Maximum score: Did at least a few students answer this individual question correctly or get the full score for this criterion?
- Average score: Is the average score very high or very low (and did you expect this)? I.e. was the question / criterion very easy or very hard?
- Correlation with the other scores: How did the good-performing students do on this question or criterion?
Use these variables to pick, for example, four items to study in detail.
In the following sections, these values are first discussed individually. After you comprehend what kind of information the individual variables can reveal, read how you can use their combination to focus your attention on potential problems (and solutions).
Maximum score
The goal of your course is to facilitate your students to master the learning objectives, and the goal of the assessment is to measure whether you and they succeeded. For each individual item, you would expect that there are students who get the full score (provided a reasonable number of students took the test). Therefore, check, for example, the maximum score obtained (maxa in the TU Delft Excel), expressed in points.
If no (or too few) students got the full score for an item, there may be problems with the answer model, or with the course (learning activities):
- For exams: Will students who master the applicable learning objectives be able to give the model answer, after reading the question? Or could the question lead to other, valid answers that are currently not rewarded?
- For assignments/projects: is it feasible for good students who took your course (considering both the available time as well as the learning activities, supervision, feedback, material, assignment instructions and rubric/assessment sheet) to obtain the maximum level for the criterion?
Difficulty (p-value)
p is the average, normalised score and has a value between 0 (no points) and 1 (full score). The higher p, the higher your students scored on this item, and the easier the question or criterion. For closed questions, p equals the fraction of students who answered the question correctly. To summarise: p is an inverse measure of the difficulty of an item.
Note that p in test result analysis is not related to the p of probability in statistics: here, p stands for proportion, not probability. Confusingly, p is called the ‘difficulty’, although the higher the value of p, the ‘easier’ the item.
\(p = \frac{\text{Average score}}{\text{Maximum score}}\)
The complete formula for calculating the p-value is:
\(p_j = \frac{\sum_{i=1}^{N_{\text{stud}}} s_i}{N_{\text{stud}} \cdot S_j} \)
with \(p_j\) the p-value for subquestion \(j\), \(N_{stud}\) the total number of students, \(S_j\) the maximum score of subquestion \(j\), and \(s_i\) the score of student \(i\) on subquestion \(j\).
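As a minimal sketch of this calculation (continuing from the hypothetical scores.csv used earlier, with an assumed maximum score per item taken from your answer model):

```python
import pandas as pd

scores = pd.read_csv("scores.csv").set_index("student")
max_per_item = pd.Series({"q1a": 2, "q1b": 3, "q2": 5, "q3": 4})   # hypothetical answer-model maxima

p = scores.mean() / max_per_item        # average score divided by maximum score, per item
print(p.sort_values())                  # low p = difficult item, high p = easy item
```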
But how can you determine what p-value is acceptable? When designing an exam, you want to include questions that cover a wide range of difficulty, so that the test can distinguish between good and very good performing students, as well as between pass and fail students. Most important is to check whether the difficulty matches your expectations. Poor performing students are those who did poorly on the assessment overall, while good performing students are those who received a good grade for the entire assessment.
For open-ended questions, the ‘optimal’ p-value that distinguishes between pass and fail students lies in the range of 0.4 to 0.6 (see 2.2.e ‘How question difficulty influences the cut-off score’ for considerations to deviate from the ideal value of p). Although the ‘ideal’ value of p may be 0.5, you don’t want your students to get only 50% of the points on average: it would upset your students and be discouraging for you during the scoring process.
In case of MCQs, \(p\) is ideally halfway between the guessing score (1 / (number of options)) and 1 (see Table 7). Some programs like Ans also calculate a p that is corrected for guessing (p’), meaning that a p’ of 0 is defined as the guessing score.
Table 7. ‘Ideal’ p-values
| Number of options | Guessing score | Ideal p-value | Ideal p-value with correction for guessing |
|---|---|---|---|
| 2 | 0.5 | 0.75 | 0.5 |
| 3 | 0.33 | 0.67 | 0.5 |
| 4 | 0.25 | 0.63 | 0.5 |
| 5 | 0.20 | 0.6 | 0.5 |
p below guessing score: For closed-ended questions (MCQs), p-values below or around the guessing score (1 / number of options, see Table 7) might indeed have been caused by guessing, for example because the topic was not included in the course. If p is lower than the guessing score, there is either a misconception amongst students, or another option might be the correct answer instead.
See 2.2.e ‘How question difficulty influences the cut-off score’ for considerations to deviate from the ideal p-value.
Extreme p-value (either close to 0 or close to 1): This may indicate that the question is either too easy or too difficult.
Item discrimination (Rit, Rir and Rig)
Item discrimination is the ability of an item to distinguish between good and poor performing students. If the item discrimination is high, good performing students answer the question correctly and poor performing students answer the question incorrectly.
There are three item discrimination coefficients: \(R_{it}\), \(R_{ir}\) and \(R_{ig}\). You can always use \(R_{ir}\), but not always the other two.
Keep in mind that discrimination may be low if the item could be improved, but also if engaging in the learning activities did not contribute to getting a high score on this item. Either students already knew/mastered this before entering the course, or they did not get enough/effective learning activities during the course.
Terminology: The capital R stands for ‘correlation’ (referring to Pearson’s correlation coefficient), ‘it’ stands for item-test, ‘ir’ for item-rest, and ‘ig’ for item-grade. All three measure the correlation between the item score and a measure of the ‘true student score’:
If available, use the \(R_{ir}\) and ignore the \(R_{it}\). \(R_{it}\) measures the correlation of the item score with the entire test score. \(R_{ir}\) measures the correlation of the item score with the score on the entire test minus the item score itself. This matters when you have a test with fewer than 25 questions, or when not all questions have the same weight/number of points: in those cases, the \(R_{ir}\) is more reliable (less biased by e.g. outliers). In other cases, the difference between \(R_{it}\) and \(R_{ir}\) will be small.
In some projects/assignments, lecturers do not calculate the grade directly from the criteria scores. Instead, they determine the students’ grades separately and use the rubric/criteria scores to explain the grade. In these cases, the item-grade correlation helps to determine the (unconscious) importance of the individual criteria for determining the grade.
The \(R_{it}\) of item \(j\) is calculated using the following formula:
\(R_{it,j} = \frac{\sum_{i=1}^{N_{\text{stud}}} (x_i - \mu)(s_i - \mu_j)}{\sqrt{\sum_{i=1}^{N_{\text{stud}}} (x_i - \mu)^2 \sum_{i=1}^{N_{\text{stud}}} (s_i - \mu_j)^2}} \)
With \(N_{stud}\) the total number of students, \(x_i\) the final score of student \(i\), \(\mu\) the mean final score, \(s_i\) the item score of student \(i\) on item \(j\), and \(\mu_j\) the mean score on item \(j\).
The \(R_{ir}\) of subquestion \(j\) is calculated using the following formula:
\(R_{ir,j} = \frac{\sum_{i=1}^{N_{\text{stud}}} \left( (x_i - s_i) - \tilde{\mu}_j \right) (s_i - \mu_j)}{\sqrt{\sum_{i=1}^{N_{\text{stud}}} \left( (x_i - s_i) - \tilde{\mu}_j \right)^2 \sum_{i=1}^{N_{\text{stud}}} (s_i - \mu_j)^2}} \)
where \(\tilde{\mu}_j\) is the mean rest score, i.e. the mean over all students of the total score minus the score on item \(j\). The \(R_{ir}\)- and \(R_{it}\)-values are always between -1 and 1.
\(R_{ir}\) squared equals the percentage of variance in the final grade that is explained by the score for the item. So if \(R_{ir}\) of question 4b equals 0.5, it indicates that 25% of the variance of the final score (i.e. the grade) can be explained by the score of question 4b, under the assumption of a linear relation.
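A sketch of both correlations, again using the hypothetical scores.csv (pandas’ corr computes the Pearson correlation used in the formulas above):

```python
import pandas as pd

scores = pd.read_csv("scores.csv").set_index("student")
total = scores.sum(axis=1)                                   # total test score per student

rit = scores.apply(lambda item: item.corr(total))            # item vs. total score
rir = scores.apply(lambda item: item.corr(total - item))     # item vs. total score minus the item itself

print(pd.DataFrame({"Rit": rit, "Rir": rir}).sort_values("Rir"))
```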
These values can be interpreted as follows in case of closed-ended questions:
Ideal values: Items with an \(R_{it}\)/\(R_{ir}\) of at least 0.20 (see Table 8) are considered to distinguish sufficiently between poor and good performers. Note that these values are less reliable when fewer than 50 students took the test.
For open-ended questions, projects and assignments, the correlations tend to be much higher (see 3.6 on when \(R_{ir}\) values are too high and might require action). It is wise to always look at the lowest \(R_{ir}\)-values of a test.
Table 8. Interpretations of Rir and Rit values
| \(R_{ir}\) and \(R_{it}\) | Item discrimination quality |
|---|---|
| 0.40 and higher | very good |
| 0.30 - 0.39 | good |
| 0.20 - 0.29 | mediocre, the question should be improved |
| 0.19 and lower | bad, the question should not be used or should be altered completely |
| Negative values | bad, good students have answered the question incorrectly and vice versa |
Negative values: In case \(R_{ir}\) is clearly negative (i.e. not ~0), this indicates that overall well-performing students performed worse on this item. It might have been a trick question, which they overthought. Or, if \(p\) is low, only poorly performing students seem to have given the correct answer. A multiple-choice question with a low \(R_{ir}\) might be an indication that the answer key (answer model) is incorrect, or that there are multiple correct answers.
Value near zero: If the \(R_{ir}\) is near zero (below 0.2), the score for this item is not correlated with the overall score on the other items. In other words, the score on this item does not give information on how well students do in the course.
Quality of the distractors (a-value)
For MCQs only: determine the quality of the distractors (the incorrect answer options) by calculating the a-value. This gives the proportion of students who chose a particular distractor, and must be calculated for each distractor.
The formula for calculating the a-value is:
\(a_k = \frac{N_{\text{stud},k}}{N_{\text{stud}}} \)
With \(a_k\) the a-value for distractor \(k\), \(N_{stud,k}\) the number of students who chose distractor \(k\), and \(N_{stud}\) the total number of students.
For each item, the sum of the p-value (the proportion of students who picked the right answer) and the a-values (the proportions of students who picked each of the distractors) is equal to 1.
Ideal value: Ideally, the a-values should be about the same for each distractor, because distractors should be equally plausible.
Plausibility of distractors: If one of the a-values is much lower than the others, that option is not plausible for students, which increases the guessing score. The option could be rewritten or removed. Formulating plausible distractors is time-consuming and very difficult and should not be underestimated; setting MCQ tests is, therefore, not an ‘easy way out’.
Underlying issue: problems with the key: If an a-value is higher than the p-value, students might have chosen the distractor because it was the key (correct answer) after all, or because it was a trick question. A relatively low a-value (compared to the other a-values) indicates that a distractor was not attractive enough. Of course, when 90% of the students answer a question correctly, the a-values can never be high, and with a low number of students you cannot draw strong conclusions.
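For MCQs, the p- and a-values follow directly from the chosen options. A minimal sketch, assuming a hypothetical answers.csv with one row per student, one column per question, the chosen option letter in each cell, and a hypothetical answer key:

```python
import pandas as pd

answers = pd.read_csv("answers.csv").set_index("student")    # chosen option per student per question
key = {"q1": "B", "q2": "D"}                                 # hypothetical answer key

for question, correct in key.items():
    proportions = answers[question].value_counts(normalize=True)
    # the proportion of the correct option is p; the proportions of the other options are the a-values
    print(question, "p =", round(proportions.get(correct, 0.0), 2),
          "a-values:", proportions.drop(correct, errors="ignore").round(2).to_dict())
```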
Average (rest) score per answer option
Use the average total (rest) score of the students who chose each option to check whether the better performing students chose the correct answer, and not a specific distractor. You want this value to be higher for the correct answer than for each distractor. This check is possible in the TU Delft Excel for MC exams.
Underlying issue: There might be a distractor that lures otherwise good scoring students into overthinking a question. Check whether these possible trick distractors are also (partially) correct and consider giving students full or partial points. You might also have made a mistake in the answer key (happens to the best of us).
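A sketch of this check, assuming the hypothetical answers.csv above and a scores.csv that contains the points per MCQ (the rest score is the total score minus the points for the question under inspection):

```python
import pandas as pd

answers = pd.read_csv("answers.csv").set_index("student")
scores = pd.read_csv("scores.csv").set_index("student")

question = "q1"                                        # hypothetical question id
rest_score = scores.sum(axis=1) - scores[question]     # total score without this question

# Mean rest score of the students who chose each option; it should be highest for the key.
print(rest_score.groupby(answers[question]).mean())
```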
Answer-level analysis for open-ended questions
If you are grading open-ended questions using Ans, the tool keeps track of how often students get a step in a calculation right (or wrong), and how often they make specific mistakes. This gives you a good indication of what your students do and do not master, and if specific issues might be related to unclarities in the test assignment, in the answer model, in the course material, or during lectures/tutorials.
Combining the indicators to select the most worrying items
As discussed previously, the most important indicators that you might need to change the answer model are:
- (almost) no students got the maximum score,
- \(R_{ir}\)s are negative or relatively low
- p-values are low
- a-values are high (for closed-ended questions)
In order to select the most worrying items, you analyse the combination of these indicators, in the order of importance that is indicated below.
| Indicator for worrying item | Implication |
|---|---|
| 1) Maxa < max | None of your students got the maximum score. Was it possible for them to achieve the maximum score, judging from the question, the model answer, and the learning activities? You might conclude that you want to adjust the answer model. |
| 2) Rir < 0 (e.g. -0.2) | Good students did not perform well on this question, and/or not-so-good performing students performed well on it. This is always problematic. If p is small, this indicates that the few students who answered the question correctly were the poorly performing students. If p is large, this indicates that the students who answered the question incorrectly were the well-performing students. Maybe the question was a trick question that was overthought by the good students? |
| 3) Rir ~ 0.0 (<0.2), or for open questions: the lowest Rir-values | This question did not discriminate between well and poorly performing students. Assuming that performance depends on course participation, the item did not give information on whether or not students actively participated in the course, which is not ideal. |
| 4) a-value > p-value (MCQ) | This alternative was chosen more frequently than the correct answer. Especially if the Rir is negative, this might be an indication that the key is incorrect. |
| 5) p-value small | Only few students got this question correct. If the Rir is (relatively) high, it is ‘just’ a difficult question that was only answered correctly by well-performing students, which can be fine, unless the whole test has low p’s and many students failed. |
Whenever you have few students, you cannot draw strong conclusions. In general, whatever the grades tell you, you know what happened in class and might have ideas on what is going on.
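As a sketch of how you could combine these indicators to shortlist items for inspection (reusing the hypothetical scores.csv and answer-model maxima from the earlier sketches; the thresholds and weights below are illustrative assumptions, not fixed rules):

```python
import pandas as pd

scores = pd.read_csv("scores.csv").set_index("student")
max_per_item = pd.Series({"q1a": 2, "q1b": 3, "q2": 5, "q3": 4})      # hypothetical answer-model maxima

total = scores.sum(axis=1)
p = scores.mean() / max_per_item                                      # difficulty per item
rir = scores.apply(lambda item: item.corr(total - item))              # item-rest correlation

flags = pd.DataFrame({
    "no_max_score": scores.max() < max_per_item,   # 1) nobody obtained the maximum score
    "negative_rir": rir < 0,                       # 2) negative item-rest correlation
    "low_rir": (rir >= 0) & (rir < 0.2),           # 3) hardly discriminating
    "low_p": p < 0.4,                              # 5) difficult item (indicator 4 is MCQ-specific)
})
flags["priority"] = flags.astype(int).mul([8, 4, 2, 1]).sum(axis=1)   # illustrative weights, order as above
print(flags.sort_values("priority", ascending=False).head(4))
```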
3.3. How to adjust answer models based on test result analysis
The previous section discussed how to identify the most worrying items using a combination of indicators from the test result analysis, and gave some hints on what the underlying problems might be. This section discusses how you can adjust the scoring of the items for the students who took the test. Furthermore, if, after adjusting the scoring, the grades or passing rates are still low and do not represent the level of LO mastery, you will find out how you could change the grading.
General considerations
It is important to keep in mind that it is impossible to make perfect exams, even after thorough peer reviews. On the other hand, you are the expert of the course, and you may have perfectly good reasons not to take action, as long as you can justify your decisions.
For example, if your exam consists of calculation questions and essay questions, students who are good at calculating might not have good writing skills, and vice versa. This will decrease the Rirs and Cronbach’s alpha, without implying problems with the exam questions at all. However, you might consider offering extra exercises for students who are less skilled in calculating, and exercises for those who are less skilled in writing good essays.
When considering adjusting the grading, always start by considering adjusting the answer model at item level. Only if this does not have the desired effect, and if you consider it justifiable, do you adjust the calculation of the grade.
Adjusting the answer model at item level
The first thing you will do is to consider whether the answer model needs to be changed on item level. This can be justifiable if the question was unclear and does not lead to the current model answer, or when the question was too hard or was not aligned with the learning activities and you consider giving partial answers full points. In order to make this decision, you first need to find the cause of the problem. Ask yourself the following:
- Will the question lead to the model answer for students who master the applicable learning objective(s), or are there other, valid answers?
- Was the question clear to the students? Or was it a trick question or could the student interpret it as a trick question?
- Is the model answer correct?
- In case of closed questions: does the question assess only one learning objective at a time?
- Exams: Was the question part of the learning objectives and of the to-be-studied material?
- Assignments: Is the rubric evaluating students on skills that are not related to the learning objectives (i.e. writing/grammar)?
To a certain extent, any answer that answers the question correctly should be granted full points.
For example, if you asked ‘Explain whether theory B is applicable to the case?’, and the student came up with a plausible answer that you did not think of, you can add it to your answer model. Another example: if the question is ‘What is the length of beam A?’, and you expected your students to write down the whole, lengthy calculation, but did not ask for it, you should grant full points to the question, even if you are not sure whether this student used the correct calculation.
Excluding questions or criteria from the grade calculation
When considering excluding questions or criteria from the grade calculation, for example by giving full points to all students, you have to make a trade-off between the following factors:
- Validity: deleting a question or criterion (for assignments) will diminish how well your exam represents the learning objectives. Reflect on whether you have enough questions left per learning objective (and level) for the validity of your exam or assignment.
- Reliability: deleting a question or criterion that has a low or negative Rir-value will improve the reliability of the grade. That is, the grade is probably a better reflection of the level to which the students master the learning objectives that were measured in the course.
- Fairness: consider whether simply deleting it is fair for all students. Is it probable that students spent a lot of time on this question or criterion? Consider giving a bonus point to students who answered (or guessed?) correctly, or giving everybody full points, although both options will diminish the reliability of the grade.
- Transparency: in order to provide transparency, you will need to communicate the change in test grade calculation to the students. If you feel reluctant to do so, it might be because of fairness issues. Because of fairness and transparency, it is not advisable to change the weighting/division of points between questions/criteria afterwards: students who may have put a lot of time into a criterion/question with a high weight will be disadvantaged if the weight diminishes.
- Constructive alignment: Is this question/criterion part of a learning objective? Are you sure that your students had enough opportunity to practise with this type of question/criterion? Did the students get feedback on their performance level on this question/criterion during the course? If one of these questions results in a ‘no’, you could remove the question.
3.4. Reliability of the test (Cronbach’s alpha)
The reliability of a test is the same as the reliability of the grade. Does the student with a 6.0 really deserve to pass, or are you not so sure, due to measurement errors? And does the student with a 10.0 really master all learning objectives? One way to estimate the measurement error is to calculate the score reliability (reliability coefficient), like Cronbach’s alpha.
Assumptions of reliability: All reliability coefficients assume that the test intends to measure one single thing, namely how well a student masters the course. They also assume that each student should perform more or less equally well on all test items, considering that your job as a teacher is to help students master all learning objectives of a course. If your students participate in all learning activities of your constructively aligned course, you would find it unexpected and worrisome if the highest performing students had the lowest scores on the easiest questions, or the other way around.
Reliability coefficients are a measure of whether students perform consistently well on all test items (i.e. the internal consistency of the test). There are several methods for calculating the reliability of an assessment; Cronbach’s alpha is one of them. It estimates the test-retest reliability by considering each question in the test as a separate test and then calculating the correlation between the questions. A simplified version for multiple-choice exams is KR-20.
The value of \(\alpha\) lies between 0 and 1. The closer the value is to 1, the smaller the measurement error. A lower reliability can mean that a student whose ‘true score’ is just above the cut-off score may fail the test due to test inaccuracy. Test reliability is very important when the consequences of the test results are large; therefore the reliability coefficient should be higher for higher-stakes tests.
Grades can be considered reliable if Cronbach’s alpha is high enough. This depends on the importance of the assessment (van Berkel, 1999):
| Type of assessment | Cronbach's alpha |
|---|---|
| High-stake assessment (e.g. only assessment of the course) | α ≥ 0.8 |
| Medium/low-stake assessment (e.g. 50% of final grade) | α ≥ 0.7 |
| Formative assessment (e.g. 0% of final grade) | α ≥ 0.6 |
If your reliability is low, this may be due to the following factors (van Berkel, 1999):
- Test length: There may not be enough items in the test, which diminishes the reliability.
- Group composition: a more heterogeneous group of students leads to lower reliability, since some students might be good at e.g. the math part of the test, and other students might perform better at other questions. This can be an indication that you might want to tailor your course for these two groups and have your students practice on their weak points. This will increase Cronbach’s alpha, as well as the item correlations (see 3.2.c on page 28). This is frequently encountered in multidisciplinary master courses.
- Test heterogeneity: If the items represent very different topics or skills, this will lead to a lower reliability coefficient.
- Mostly low or high scoring items: the reliability coefficient will be lower if there are mostly items that result either in a low score in most students, or a high score in most students. Consider including items of average difficulty.
- Little difference between student levels: the reliability coefficient will be lower if students are at more or less the same level.
- Low item correlation: lower quality items (with low or negative Rir) decrease the reliability of the entire test (see 3.2 to analyse this in detail).
The formula for calculating the reliability coefficient Cronbach’s alpha is as follows:
\(\alpha = \frac{K}{K - 1} \cdot \frac{\sigma_x^2 - \sum_{j=1}^K \sigma_j^2}{\sigma_x^2} \)
With \(\alpha\) the reliability coefficient, \(K\) the total number of items, and \(\sigma_x^2\) the variance in the total scores of all students, i.e.:
\(\sigma_x^2 = \frac{1}{N_{\text{stud}}} \sum_{i=1}^{N_{\text{stud}}} (x_i - \mu)^2 \)
With \(N_{stud}\) the total number of students, \(x_i\) the final score of student \(i\), and \(\mu\) the mean final score.
The variance of the item scores \(\sigma_j^2\) is calculated equivalently:
\(\sigma_j^2 = \frac{1}{N_{\text{stud}}} \sum_{i=1}^{N_{\text{stud}}} (s_i - \mu_j)^2 \)
with \(s_i\) the sub-question score of student \(i\) on sub-question \(j\), and \(\mu_j\) the mean score on sub-question \(j\).
The reliability coefficient gives an indication of the reliability of the test as a whole by comparing the variance in the final test scores of all students with the sum of the variances in the sub-question scores. The reliability coefficient can have a value between 0 (unreliable) and 1 (reliable); in very rare cases it can be negative. In a reliable test, the variance in the final scores of the students (\(\sigma_x^2\)) is much larger than the sum of the variances in the sub-question scores (\(\sigma_j^2\)).
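A sketch of this calculation on the hypothetical scores.csv used earlier (population variances, i.e. ddof=0, to match the 1/N_stud factor in the formulas above):

```python
import pandas as pd

scores = pd.read_csv("scores.csv").set_index("student")

k = scores.shape[1]                                  # number of items K
sum_item_var = scores.var(axis=0, ddof=0).sum()      # sum of the item-score variances
total_var = scores.sum(axis=1).var(ddof=0)           # variance of the total scores

alpha = k / (k - 1) * (total_var - sum_item_var) / total_var
print(f"Cronbach's alpha = {alpha:.2f}")
```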
Measurement error and confidence intervals
The meaning of reliability will be illustrated by discussing how you can use Cronbach’s alpha to calculate the measurement error that was introduced by chance. Test theory assumes that every student has a true score, which reflects that student's actual capability in the area of expertise that an assessment is testing. If a student would take the same (unbiased) test an infinite amount of times, the average of all these scores would constitute the true score. Because this would not be practical to carry out, it is important to recognise that the score of a student taking a test once consists of the true score plus the measurement error, either systematic or accidental. To ensure that grades are correct and, more specifically, that students correctly pass or fail the course, it is important that the error of measurement is as small as possible.
You can calculate the measurement error as follows: First, you calculate the Standard Error of Measurement (SEM) from Cronbach’s alpha or KR-20: \(\text{SEM}(x) = \text{SD}(x) \sqrt{1 - \alpha} \)
in which \(x\) is the achieved test score, SD is the standard deviation, and \(\alpha\) is the reliability coefficient (Cronbach’s alpha or KR-20).
From here, you can calculate the 68% (most common) or 95% confidence intervals in which the ‘true score’ of the student lies:
Table 9. Confidence intervals of a test score, based upon the standard error of measurement (SEM)
| Certainty | Confidence interval |
|---|---|
| 68% (used most often) | [test_score – 1*SEM, test_score + 1*SEM] |
| 95% | [test_score – 2*SEM, test_score + 2*SEM] |
Meaning of the confidence interval: The confidence interval indicates that if the student were to repeat the test an infinite number of times under the same circumstances, the average grade (and hence the true grade) would lie within the 68% confidence interval in 68% of the cases, and within the 95% CI in 95% of the cases. That is, if the circumstances stay the same, i.e. the student does not get tired, anxious, bored, etc.
Example:
- a student scores 26 out of 50 points
- the cut-off score is 28 i.e. grade of 6.0 (grade = 1 + 9*score/50)
- SEM is 5 points
- the 68% confidence interval is 21 to 31 in points i.e. a grade between 4.8 and 6.6
- The student will get a 5.7 for the test, which is rounded to a 5.5 (fail) if the course consists of one test.
This means that the student has failed, but maybe should have passed based on their true score (actual capacity), and was not able to because of either systematic or accidental measurement error.
The 95% confidence interval is even wider:
- the 95% confidence interval is 16 to 36 points i.e. a grade between 3.9 and 7.5
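A sketch that reproduces the numbers of this example (SEM of 5 points as given above, and the stated grading rule grade = 1 + 9·score/50):

```python
def grade(points, max_points=50):
    return 1 + 9 * points / max_points     # grading rule from the example

score, sem = 26, 5                         # achieved score and SEM from the example

print(f"achieved grade: {grade(score):.1f}")                       # 5.7
for label, z in [("68%", 1), ("95%", 2)]:
    low, high = score - z * sem, score + z * sem
    print(f"{label} CI: {low}-{high} points, "
          f"i.e. grades {grade(low):.1f}-{grade(high):.1f}")       # 4.8-6.6 and 3.9-7.5
```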
Use of confidence intervals: The confidence intervals can be used to determine which students’ work might benefit from a second reviewer to (independently if feasible) rescore these students’ work. This could be the case for students whose confidence intervals contain the cut-off score.
Consequence: The uncertainty of grades is a reason to allow for compensation between partial grades within a course, and in some cases even between courses. The latter is up to the Board of Examiners to decide.
Example: The Board of Examiners could decide that students who received a 5 for Dynamical Systems 1 but got a 7 for Dynamical Systems 2 could still receive a ‘pass’ for Dynamical Systems 1 (or at least not have to take a resit for Dynamical Systems 1 in order to graduate), especially if the learning objectives of the second course build on the ones in the first course.
Analysing the grade distribution
You can represent a frequency distribution of the grades in a histogram, or as a cumulative percentage, like in Figure 6. Use this to decide whether or not to increase the grades, based on whether, from your experience during the course, the distribution is an accurate reflection of the level of the students. This might depend on the percentage of students that passed the course.
In Figure 6, if the cut-off score is set to 10 points (𝑔𝑟𝑎𝑑𝑒 = 1+9∗𝑠𝑐𝑜𝑟𝑒/18), only 22% of the students will pass. You could use the histogram to determine a new cut-off score.
In this case there is a strong indication that your test may have been too difficult and that there might be a problem with validity. If, after critically going through the entire analysis, this proves to be the case, you can use this table as a tool to reassess your pre-determined cut-off score. You could for example state that 56% of the students should pass the test; in that case, you could use 8 points as the cut-off score (44% of the students would fail the test).
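A sketch of such an exploration (using the total scores from the hypothetical scores.csv and the grading rule of the Figure 6 example, grade = 1 + 9·score/18): for each candidate cut-off score it prints the corresponding grade boundary and the percentage of students that would pass.

```python
import pandas as pd

total = pd.read_csv("scores.csv").set_index("student").sum(axis=1)
max_points = 18                                              # as in the Figure 6 example

for cutoff in range(6, 13):
    pass_rate = (total >= cutoff).mean() * 100               # percentage of students at or above the cut-off
    print(f"cut-off {cutoff:>2} points (grade {1 + 9 * cutoff / max_points:.1f}): "
          f"{pass_rate:.0f}% would pass")
```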
Putting the frequency distribution into a histogram will show you whether the distribution is normal or whether there is a ceiling or a floor effect. When you have a floor effect (see Figure 7), most students have a relatively low score, meaning the test was too difficult for a large group of students. When you have a ceiling effect (see Figure 8), most students have a relatively high score, meaning that the test was too easy for a large group of students. In courses that have a ‘steep learning slope’ (i.e. require a relatively large portion of the course to reach a basic level), you might see a combination (camel) effect with two bumps: one at a very low grade for the students who did not reach the basic level, and one centred around e.g. 7.5. You are the expert who knows what happened during the course and the expert at explaining your grade distribution.
Examples of both effects are shown in Figure 7 and Figure 8.
3.5. Adjusting the grades
When all grades are too low
It should be possible for at least some of your students to score a 10. So, what to do if all the grades are too low? If there was a mistake on the test, or if a question was too vague, you probably already adjusted the answer model. If you still think that the grades do not represent how well students master the learning objectives, you might want to adjust the grade calculation.
It might be a good idea to check in the assessment policy whether you should discuss changes in grade calculations that are based on the test result analysis with your Board of Examiners (since they have given you the mandate to grade students), your programme director and/or the educational advisor of your faculty.
There are several ways to adjust the grading. The simplest is to add a constant number to the grade. Another way is the Cohen-Schotanus adjustment, which is described below.
The Cohen-Schotanus method
Cohen-Schotanus (University of Groningen, Medical Faculty) explains that because lecturers can (and often do) make mistakes in their exams (and courses), it is possible to underestimate students’ abilities. In short, her method assumes that the top 5% of the students are supposed to get a 10. It therefore calculates the average score of the top 5% of students and assigns them a 10. The method then uses a knowledge percentage to find the cut-off score (after correcting for the guessing score).
The following example shows the procedure for a multiple-choice exam with 60 questions of 1 point each.
- Total number of points = 60
- Average score of the 5% best students = 55 (example)
- Correction for guessing = 60/4 = 15
- Average score of the top 5% – correction for guessing = 55 – 15 = 40 → students get a 10
- Knowledge percentage = 60% (example)
- Cut-off score = 15 + 0.6*(55 – 15) = 39 points. Students who have 39 points or more will get a pass.
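A sketch of this calculation (reusing the total MC scores from the hypothetical scores.csv; the number of questions, number of options and knowledge percentage follow the example above and are assumptions to adapt):

```python
import pandas as pd

total = pd.read_csv("scores.csv").set_index("student").sum(axis=1)   # total MC score per student

n_questions, n_options, knowledge_pct = 60, 4, 0.60
guessing_score = n_questions / n_options                             # 60 / 4 = 15 points

n_top = max(1, round(0.05 * len(total)))                             # size of the top 5%
top5_mean = total.nlargest(n_top).mean()                             # e.g. 55 in the example

cutoff = guessing_score + knowledge_pct * (top5_mean - guessing_score)
print(f"average score of top 5%: {top5_mean:.1f} points (this corresponds to a 10)")
print(f"cut-off score: {cutoff:.1f} points")                         # 39 points when top5_mean = 55
```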
The Cohen-Schotanus method is only meant to correct grades in large, ‘normal’ student populations. For retakes, you have a sample of students that is likely to score lower than the whole student population. Therefore, you cannot do a Cohen-Schotanus correction.
Determining the cut-off score
It is good to check in your regulations whether your faculty has specific advice on how to determine the cut-off score before and after delivering an exam to your students. For example, 3mE uses an Angoff method to determine the cut-off score before delivering the exam, by estimating how many points students who perform at the minimum pass level (the level of a 6) will get for each item. After analysing the exam results, the cut-off score is adjusted using the Hofstee method. After this, the examiner can decide to apply a version of the Cohen-Schotanus method to make sure that the student(s) with the highest score will get a 10.
If Cronbach's alpha remains low
If Cronbach’s alpha stays low after you have adjusted the answer model, the assessment most likely does not have enough (sub)questions for a valid analysis, and so you do not have enough information to reliably estimate the students’ grades.
Another explanation of a low reliability may be that your course assesses different skills, for example, writing skills and calculation skills. As mentioned previously, students who have good writing skills might not be performing well when doing calculations. Could you customize the learning activities to improve ‘writing skills’ for some students, and ‘calculation skills’ for other students?
3.6. How to use the correlation table to adjust criteria or rubrics
The previous sections already described some ideas on how to adjust or create a rubric based on the test result analysis. This section describes how you can use the correlation between criteria to update your assessment criteria and/or rubric (see Table 10).
What does it mean? The correlation between criteria indicates if two criteria increase and decrease together. If they do, the value is positive (between 0 and 1), and if they ‘anti-correlate’, the value is negative (between -1 and 0). This happens if students who do relatively well on one criterion, actually do worse on another criterion.
What can I do with this information?
If some criteria have a large positive correlation, this implies that they are scored pretty similarly.
- There may be too many criteria, causing the assessors to become ‘lazy’. Consider combining these two criteria, but only if it makes sense to combine them.
- The criteria may not be distinctive enough to assessors, although you think that there is a clear and meaningful distinction. Consider rephrasing the criteria, their description or their rubric descriptors, to clarify this.
- Example in Table 10: ‘correctness’ and ‘design’ have a large correlation. Assessors might think that they are the same, judging by their names. You might consider combining them, or, if you find their difference very important in this course, find a way to clarify their distinction to the assessors.
If some criteria have a ‘significant’ (e.g. smaller than -0.1) negative correlation, this implies that students who do well on one criterion, do relatively bad on the other one.
- Either or both of the criteria may not be trained during the course (i.e. there is a constructive alignment issue). Consider training students on this criterion and giving them feedback, or taking out the criterion.
- Example in Table 10: Students who do well on the summary do not do so well on the presentation, and vice versa. Maybe the lecturer did not train students or give them feedback on the presentation or summary? Or gave half of the students feedback on the summary and the other half on the presentation, because there was not enough time for them to hand in drafts for both?
If two criteria have no correlation (e.g. between -0.1 and 0.1), this implies that the criteria are independent. That is not necessarily a bad thing, since you want the criteria to measure different things: if they measure the same thing, one of them can be a waste of time. However, you would expect some positive correlation, because the assumption is that students learn all criteria in equal proportions (i.e. they all have the same ‘easiest’ and ‘hardest’ learning objective). Therefore, you may want to double-check pairs of criteria that you expected to correlate but did not. If you find any, check whether you provided sufficient training and feedback.
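A sketch for computing such a correlation table yourself (assuming a hypothetical rubric_scores.csv with one row per student and one column per criterion):

```python
import numpy as np
import pandas as pd

criteria = pd.read_csv("rubric_scores.csv").set_index("student")

corr = criteria.corr()                    # Pearson correlation between every pair of criteria
print(corr.round(2))

# List each pair once (upper triangle), sorted, to spot strongly positive or negative pairs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values().round(2))
```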
References
For a list of references used in creating this manual please visit this page.