
3. Analysis of test results

When considering how well you and your students performed, you are frequently asked to report the percentage of students that passed the course. However, analysing their scores will reveal more detail and enable you to make informed decisions about improving the assessment and the course as a whole.

Test results analyses take place on three different levels (see Figure 4):

  1. At test/assessment level
  2. At item level
  3. At answer level

Figure 4. Three levels of test result analysis.

A test result analysis will give insight into:

  1. How well the students mastered the individual learning objectives of the course (test level)
  2. The quality of the individual test questions or assignment criteria (item & answer level, 3.2)
  3. Whether the answer model needs to be revised (3.3)
  4. The overall quality of the assessment (test level, 3.4)
  5. Whether the grading needs to be revised (3.5)
  6. How to adjust rubrics and criteria (3.6)

This chapter explains the steps that you can take to perform a test result analysis and to improve the grading, future assessments and future courses based on your findings.

Keep in mind that it is practically impossible to make a flawless assessment (unless you have unlimited time). Therefore, be prepared to adjust the answer model, rubric or grading after the test result analysis.

First, please take note of the following definitions:

  • Test: any assessment, including projects, assignments, exams with open-ended questions and multiple-choice exams.
  • Grade: the grade (usually on a scale from 1 to 10) that a student receives for the whole test.
  • Score: the number of points that a student obtained from this test, before it is transformed into a grade.
  • Item: the smallest unit in a test. This can be a criterion or subcriterion for assignments/projects, or a subquestion or question for an exam.

If your exam is a digital exam or a paper-scan exam, the digital exam tools (such as Ans and Brightspace Quizzes) perform the test result analysis for you. Check the Teaching Support pages for an explanation of how to use this part of the assessment tool. For more information on how to interpret test result analyses, see the sections below, in particular 3.4.

If you are grading exams or projects/assignments with pen-and-paper, you will store the following data in a spreadsheet while (or after) scoring your students’ work:

  • Scores per item per each student
  • Total scores per student
  • Grades per student

For smaller datasets (few students (<20) or few items), you may not be able to draw strong conclusions from your data. However, you are encouraged to run a test result analysis to check if your experience during the course and grading matches the test result analysis.
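To make the later analyses concrete, here is a minimal sketch of how such a spreadsheet could be loaded and summarised with pandas. The item names, point values and the linear 1-10 grading formula are illustrative assumptions, not something prescribed by this manual.

```python
import pandas as pd

# Illustrative item scores: one row per student, one column per item.
scores = pd.DataFrame(
    {
        "Q1a": [2, 1, 2, 0],
        "Q1b": [3, 2, 3, 1],
        "Q2":  [4, 2, 5, 3],
    },
    index=["s01", "s02", "s03", "s04"],
)

max_points = 10                        # maximum obtainable score on this hypothetical test

totals = scores.sum(axis=1)            # total score per student
grades = 1 + 9 * totals / max_points   # example linear 1-10 grading (an assumption)

print(totals)
print(grades.round(1))
```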

3.1. Analysis of achievement of learning objectives

The first question you want to ask yourself is how well your group of students masters the individual learning objectives. Are they performing better on some learning objectives than on others? Did a new teaching approach for a certain learning objective work? These and other questions can be answered by grouping the (normalized) item scores per learning objective, as in Figure 5.

Graphically summarizing the scores of your students per learning objective will make it easier to interpret the results. Plot a measure of performance (average and/or median) and spread (standard deviation or boxplot), and if helpful, the individual data points.
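As an illustration, here is a minimal sketch of how normalized item scores could be grouped per learning objective and summarised in a boxplot, as in Figure 5. The item names, maximum points per item and the item-to-LO mapping are purely illustrative assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative item scores per student and the maximum score per item.
scores = pd.DataFrame(
    {"Q1a": [2, 1, 2, 0], "Q1b": [3, 2, 3, 1], "Q2": [4, 2, 5, 3], "Q3": [1, 1, 0, 2]},
    index=["s01", "s02", "s03", "s04"],
)
max_per_item = pd.Series({"Q1a": 2, "Q1b": 3, "Q2": 5, "Q3": 2})

# Hypothetical mapping of items to learning objectives.
item_to_lo = {"Q1a": "LO1", "Q1b": "LO1", "Q2": "LO2", "Q3": "LO3"}

# Normalize each item to a 0-1 scale, then average the items per learning objective.
normalized = scores / max_per_item
per_lo = normalized.T.groupby(item_to_lo).mean().T   # rows: students, columns: LOs

per_lo.boxplot()
plt.ylabel("Normalized score (0-1)")
plt.title("Score distribution per learning objective")
plt.show()
```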

When analysing the graph, think about what scores you as a teaching professional find acceptable for a particular course or learning objective. Also consider what caused the problems or success in LOs during the course, and how you can help your colleagues and students to work on (and prevent) knowledge gaps. You can use any graph of your choice, as long as it summarises the distribution of the scores per learning objective.

Typically, problems in learning objective achievement are caused by a lack of practice at the level of the test (constructive alignment).

Figure 5. Example graph indicating test scores and LOs in a boxplot

3.2. Analysis of the quality of the test items and answers

In this section, you will learn to analyse the quality of the individual items ((sub)questions, or assessment (sub)criteria) using the outcomes of the item-specific analyses. Use these outcomes to pick the most worrying items and check them for errors or unclarities. This helps you to improve the scoring of these items, so that the students who just took the exam receive a fairer grade, and to decide how you are going to further improve next year’s test.

A test result analysis produces a number of variables. The most useful variables for checking which items are (most) worrisome are the following:

  • Maximum score: Did at least a few students answer this individual question correctly or get the full score for this criterion?
  • Average score: Is the average score very high or very low (and did you expect this)? I.e. was the question / criterion very easy or very hard?
  • Correlation with the other scores: How did the good-performing students do on this question or criterion?

Use these variables to pick, for example, four items to study in detail.
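As a minimal sketch, these three indicators could be computed per item with pandas as follows. Here the ‘correlation with the other scores’ is taken to be the item-rest correlation (the correlation of an item with the total of the remaining items); the item names and scores are illustrative.

```python
import pandas as pd

# Illustrative item scores: rows are students, columns are items.
scores = pd.DataFrame(
    {"Q1": [2, 1, 2, 0, 1], "Q2": [3, 2, 3, 1, 2], "Q3": [0, 0, 1, 0, 0]},
    index=["s01", "s02", "s03", "s04", "s05"],
)

summary = pd.DataFrame(
    {
        "max_obtained": scores.max(),   # did at least a few students get the full score?
        "mean_score": scores.mean(),    # very easy or very hard?
        # Item-rest correlation: item score vs. total score on all other items.
        "item_rest_corr": {
            item: scores[item].corr(scores.drop(columns=item).sum(axis=1))
            for item in scores.columns
        },
    }
)
print(summary)
```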

In the following sections, these values are first discussed individually. Once you understand what kind of information the individual variables can reveal, read how you can use their combination to focus your attention on potential problems (and solutions).

3.3. How to adjust answer models based on test result analysis

The previous section discussed how you can identify the most worrying items using a combination of indicators from the test result analysis, and gave you some hints on what the underlying problems might be. This section discusses how you can adjust the scoring of the items for the students who took the test. Furthermore, if the grades or passing rates are low and do not represent the level of LO mastery even after adjusting the scoring, you will find out how you could change the grading.

3.4. Reliability of the test (Cronbach’s alpha)

Watch the video above on the difference between reliability and validity in test result analyses.

The reliability of a test is the same as the reliability of the grade. Does the student with a 6.0 really deserve to pass, or are you not so sure, due to measurement errors? And does the student with a 10.0 really master all learning objectives? One way to estimate the measurement error is to calculate the score reliability (reliability coefficient), like Cronbach’s alpha.

Assumptions of reliability: All reliability coefficients assume that the test intends to measure one single thing, namely how well a student masters the course. They also assume that each student should perform more or less equally well on all test items, considering that your job as a teacher is to help students master all learning objectives of a course. If your students participate in all learning activities of your constructively aligned course, you would find it unexpected and worrisome if the highest-performing students had the lowest scores on the easiest questions, or the other way around.

Reliability coefficients are a measure of whether students perform consistently well on all test items (i.e. the internal consistency of the test). There are several methods for calculating the reliability of an assessment; Cronbach’s alpha is one of them. It estimates the test-retest reliability by considering each question in the test as a separate test and then calculating the correlation between the questions. A simplified version for multiple-choice exams is KR-20.

The value of \(\alpha\) lies between 0 and 1. The closer the value is to 1, the smaller the measurement error. A lower reliability can mean that a student whose ‘true score’ is just above the cut-off score may fail the test due to test inaccuracy. Test reliability is very important when the consequences of the test results are large; therefore, the reliability coefficient should be higher for higher-stakes tests.

Grades can be considered reliable if Cronbach’s alpha is high enough. This depends on the importance of the assessment (van Berkel, 1999):

Types of assessment and the corresponding minimum Cronbach’s alpha:

  • High-stakes assessment (e.g. the only assessment of the course): α ≥ 0.8
  • Medium/low-stakes assessment (e.g. 50% of the final grade): α ≥ 0.7
  • Formative assessment (e.g. 0% of the final grade): α ≥ 0.6

If your reliability is low, this may be due to the following factors (van Berkel, 1999):

  • Test length: There may not be enough items in the test, which diminishes the reliability.
  • Group composition: a more heterogeneous group of students leads to lower reliability, since some students might be good at e.g. the math part of the test, and other students might perform better at other questions. This can be an indication that you might want to tailor your course for these two groups and have your students practice on their weak points. This will increase Cronbach’s alpha, as well as the item correlations (see 3.2.c on page 28). This is frequently encountered in multidisciplinary master courses.
  • Test heterogeneity: If the items represent very different topics or skills, this will lead to a lower reliability coefficient.
  • Mostly low or high scoring items: the reliability coefficient will be lower if there are mostly items that result either in a low score in most students, or a high score in most students. Consider including items of average difficulty.
  • Little difference between student levels: the reliability coefficient will be lower if students are at more or less the same level.
  • Low item correlation: lower-quality items (with a lower Rir) decrease the reliability of the entire test (see 3.2 to analyse this in detail).

The formula for calculating the reliability coefficient Cronbach’s alpha is as follows:

\(\alpha = \frac{K}{K - 1} \cdot \frac{\sigma_x^2 - \sum_{j=1}^K \sigma_j^2}{\sigma_x^2} \)

With \(\alpha\) the reliability coefficient, \(K\) the total number of items, and \(\sigma_x^2\) the variance in the total scores of all students, i.e.:

\(\sigma_x^2 = \frac{1}{N_{\text{stud}}} \sum_{i=1}^{N_{\text{stud}}} (x_i - \mu)^2 \)

With \(N_{\text{stud}}\) the total number of students, \(x_i\) the final score of student \(i\), and \(\mu\) the mean final score.

The variance of the item scores \(\sigma_j^2\) is calculated equivalently:

\(\sigma_j^2 = \frac{1}{N_{\text{stud}}} \sum_{i=1}^{N_{\text{stud}}} (s_{ij} - \mu_j)^2 \)

with \(s_{ij}\) the score of student \(i\) on sub-question \(j\), and \(\mu_j\) the mean score on sub-question \(j\).

The reliability coefficient gives an indication of the reliability of the test as a whole by comparing the variance in the final test scores of all students with the sum of the variances in the scores per sub-question. The reliability coefficient can have a value between 0 (unreliable) and 1 (reliable); in very rare cases it can be negative. In a reliable test, the variance in the final scores of the students (\(\sigma_x^2\)) is much larger than the sum of the variances in the sub-question scores (\(\sum_j \sigma_j^2\)).
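As a minimal sketch, the formula above can be implemented directly on a matrix of item scores (rows are students, columns are items). Following the formulas in this section, population variances are used (ddof=0 in NumPy); the scores are illustrative.

```python
import numpy as np

def cronbach_alpha(item_scores) -> float:
    """Cronbach's alpha for a score matrix (rows: students, columns: items)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                            # number of items K
    var_total = item_scores.sum(axis=1).var(ddof=0)     # variance of the total scores, sigma_x^2
    var_items = item_scores.var(axis=0, ddof=0).sum()   # sum of the item variances, sum of sigma_j^2
    return k / (k - 1) * (var_total - var_items) / var_total

# Illustrative scores of 5 students on 4 items.
scores = np.array(
    [[2, 3, 4, 1],
     [1, 2, 2, 0],
     [2, 3, 5, 2],
     [0, 1, 3, 1],
     [1, 2, 4, 1]]
)
print(round(cronbach_alpha(scores), 2))
```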

3.5. Adjusting the grades

3.6. How to use correlation table to adjust criteria or rubrics

In the previous sections, some ideas on how to adjust or create a rubric based on the test result analysis were already described. This section describes how you can use the correlation between criteria to update your assessment criteria and/or rubric (see Table 10).

What does it mean? The correlation between criteria indicates whether two criteria increase and decrease together. If they do, the value is positive (between 0 and 1); if they ‘anti-correlate’, the value is negative (between -1 and 0). The latter happens if students who do relatively well on one criterion actually do worse on another criterion.
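A correlation table like Table 10 can be produced directly from the per-criterion scores. Below is a minimal sketch with pandas; the criterion names and scores are illustrative and chosen so that ‘correctness’ and ‘design’ correlate strongly while ‘summary’ and ‘presentation’ anti-correlate, matching the examples discussed below.

```python
import pandas as pd

# Illustrative criterion scores: rows are students (or project groups), columns are criteria.
criteria = pd.DataFrame(
    {
        "correctness":  [8, 6, 9, 5, 7],
        "design":       [8, 5, 9, 6, 7],
        "summary":      [7, 8, 5, 9, 6],
        "presentation": [6, 5, 8, 4, 7],
    },
    index=["g1", "g2", "g3", "g4", "g5"],
)

# Pearson correlation between all pairs of criteria (cf. Table 10).
correlation_table = criteria.corr()
print(correlation_table.round(2))
```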

What can I do with this information?

If some criteria have a large positive correlation, this implies that they are scored pretty similarly.

  • There may be too many criteria, causing the assessors to become ‘lazy’. Consider combining these two criteria, but only if it makes sense to combine them.
  • The criteria may not be distinctive enough to assessors, although you think that there is a clear and meaningful distinction. Consider rephrasing the criteria, their description or their rubric descriptors, to clarify this.
  • Example in Table 10: ‘correctness’ and ‘design’ have a large correlation. Assessors might think that they are the same, judging by their names. You might consider combining them, or, if you find their difference very important in this course, find a way to clarify their distinction to the assessors.

If some criteria have a ‘significant’ (e.g. smaller than -0.1) negative correlation, this implies that students who do well on one criterion do relatively poorly on the other.

  • Either or both of the criteria may not be trained during the course (i.e. there is a constructive alignment issue). Consider training students on this criterion and giving them feedback, or taking out the criterion.
  • Example in Table 10: Students who do well on the summary do not do so well on the presentation, and vice versa. Maybe the lecturer did not train students or give them feedback on the presentation or summary? Or gave half the students feedback on the summary and the other half on the presentation, because there was not enough time for them to hand in drafts for both?

If two criteria have no correlation (e.g. between -0.1 and 0.1), this implies that the criteria are independent. That is not necessarily a bad thing, since you want the criteria to measure different things: if they measure the same thing, assessing both can be a waste of time. However, you would expect some positive correlation, because the assumption is that students learn all criteria in equal proportions (i.e. they all have the same ‘easiest’ and ‘hardest’ learning objective). Therefore, you may want to double-check pairs of criteria that you expected to correlate but did not. If you find any, check whether you provided sufficient training and feedback.

Table 10. Correlation between criteria in a project.
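Building on the sketch above, one way to flag the criterion pairs that deserve a closer look is shown below. The -0.1 and 0.1 thresholds come from this section; the 0.7 cut-off for a ‘large’ positive correlation is an assumed value, not one prescribed by this manual.

```python
import pandas as pd

# Illustrative criterion scores, as in the earlier sketch.
criteria = pd.DataFrame(
    {"correctness": [8, 6, 9, 5, 7], "design": [8, 5, 9, 6, 7],
     "summary": [7, 8, 5, 9, 6], "presentation": [6, 5, 8, 4, 7]}
)
corr = criteria.corr()

# Walk over each pair of criteria once and report the notable correlations.
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        r = corr.loc[a, b]
        if r < -0.1:
            print(f"{a} / {b}: r = {r:+.2f} -> negative; check constructive alignment")
        elif abs(r) <= 0.1:
            print(f"{a} / {b}: r = {r:+.2f} -> independent; did you expect these to correlate?")
        elif r > 0.7:                        # assumed cut-off for a 'large' positive correlation
            print(f"{a} / {b}: r = {r:+.2f} -> large positive; combine or clarify the criteria?")
```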

References

For a list of references used in creating this manual, please visit this page.
