AMBOSS Qbank: How we use psychometric analysis to create top-tier questions

Blaise Joseph, MBBS - Mar 07, 2023
A physician teaches her students as they prepare for standardized medical assessments.

Want to know how psychometric analysis is used to assess the quality of items on medical exams? Read on to find out! You'll also learn about the concepts of item difficulty and item discrimination, and how AMBOSS designs its items to be reliable and effective. 

If you're studying for a standardized medical assessment or helping to create one, you probably know how important high-quality questions are. After all, the better the questions, the more accurately the assessment measures knowledge. But how can we objectively ensure that the questions on a medical assessment are of the highest quality? Enter psychometric analysis.

First, let's define what we mean by "item" in the context of psychometric analysis.  An item is a single question or problem that is used to test an examinee's knowledge or skills. For example, a standardized medical assessment item might ask a student to interpret a blood test result or diagnose a patient based on a set of symptoms. The items used for the USMLE® exams are multiple-choice questions in a single-best answer format, which means that the item is designed so that there is only one option that is unambiguously correct (called the key).

 

An AMBOSS Qbank question illustrating the use of psychometric analysis

 

In psychometric analysis, item difficulty refers to how easy or difficult a particular item is to answer correctly. This is typically measured using the proportion of test takers who answer the item correctly. An item is considered to be difficult if a relatively small proportion of test takers answer it correctly, while an item is considered to be easy if a relatively large proportion of test takers answer it correctly.
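As a rough illustration of this definition, item difficulty can be computed as the share of responses that match the key. The function name `item_difficulty` and the response data below are hypothetical and are not taken from the AMBOSS Qbank.

```python
from typing import Sequence


def item_difficulty(responses: Sequence[str], key: str) -> float:
    """Return the proportion of test takers who chose the key."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if r == key) / len(responses)


# Hypothetical answers from 10 test takers to one item whose key is "A".
responses = ["A", "B", "A", "C", "D", "A", "B", "A", "C", "B"]
p = item_difficulty(responses, key="A")
print(f"proportion correct: {p:.2f}")  # prints "proportion correct: 0.40"
```

Note that the scale runs opposite to the name: a lower proportion correct means a more difficult item.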

On the other hand, item discrimination refers to the ability of an item to differentiate between test takers who have different levels of knowledge or ability. This is typically measured using the difference in the proportion of high-ability and low-ability test takers who answer the item correctly. For example, a question that only high-performing individuals can answer correctly would have high discrimination, while a question that both high- and low-performing individuals can answer correctly would have low discrimination. 

High-ability and low-ability test takers are typically identified based on their overall performance on the test. Test takers who score above a certain threshold are considered to have high ability, while those who score below a particular threshold are considered to have low ability. The exact methods for identifying high- and low-ability test takers may vary. Some tests may use more complex statistical analysis to determine ability levels. Ultimately, the goal is to identify test takers who have different levels of knowledge or ability in order to assess the discrimination of individual test items.
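One simple way to sketch this grouping is to rank test takers by total score and compare the top and bottom slices. The 27% cut-off used below is a common convention in classical item analysis, not something stated in this article, and the function names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_high_low(
    total_scores: Sequence[float], fraction: float = 0.27
) -> Tuple[List[int], List[int]]:
    """Return (high, low) index lists: the top and bottom `fraction` by total score."""
    n = len(total_scores)
    if n < 2 or not 0 < fraction <= 0.5:
        raise ValueError("need at least 2 test takers and 0 < fraction <= 0.5")
    k = max(1, round(n * fraction))  # group size (assumed convention: 27%)
    order = sorted(range(n), key=lambda i: total_scores[i])  # ascending by score
    return order[-k:], order[:k]


def discrimination_index(
    correct: Sequence[bool], total_scores: Sequence[float], fraction: float = 0.27
) -> float:
    """Proportion correct in the high group minus proportion correct in the low group."""
    high, low = split_high_low(total_scores, fraction)
    p_high = sum(correct[i] for i in high) / len(high)
    p_low = sum(correct[i] for i in low) / len(low)
    return p_high - p_low


# Hypothetical data: overall test scores and whether each person answered one item correctly.
total_scores = [95, 88, 82, 75, 70, 64, 58, 51, 43, 30]
correct = [True, True, True, False, True, False, False, True, False, False]
d = discrimination_index(correct, total_scores)
print(f"discrimination index: {d:.2f}")  # prints "discrimination index: 0.67"
```

A value near +1 means mostly high scorers answer correctly, a value near 0 means the item does not separate the groups, and a negative value suggests a problem with the item.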

To illustrate the difference between item difficulty and item discrimination, consider the example of a single-best answer item that asks students to diagnose a complex medical condition based on a set of symptoms. This item has four options with option A designed to be the key. After administering the item to a cohort of test takers, the proportion of test takers endorsing each of the four options is shown. 

 

Proportion of test takers endorsing each of the four answer options of a medical assessment question

 

Only 41% of students answered the question correctly, which means more than half of the test takers got it wrong. By the definition above, this is a difficult item.

Let us now compare the option endorsement rates in the high-performing group and the low-performing group for the same item.

 

Option endorsement rates in the high-performing group and the low-performing group for the same item

 

We see that 72% in the high group and 23% in the low group endorse the key. For the distractors, we are getting less endorsement from the high group and more endorsement from the low group. 

This is exactly what we want to see for a good-quality question. We want the key to be endorsed more often by the high-performing group, and every distractor to be endorsed more often by the low-performing group. If this is not the case, we have a problem and the item would have to be rewritten.

We can also quantify this difference in endorsement rate between the two groups with a number that is called the discrimination index.

 

The discrimination index of a medical assessment question

 

So the key, option A, has a discrimination index of 72 - 23 = +49 percentage points, and all the distractors have negative values. This shows that, despite being difficult, the question is good at discriminating high performers from low performers.
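The same subtraction can be applied to every option and then checked against the rule described earlier: a positive index for the key and a negative index for each distractor. In the sketch below, only the key's values (72% and 23%) come from the article. The distractor percentages are made up for illustration, and the function names are hypothetical.

```python
from typing import Dict


def option_discrimination(high: Dict[str, float], low: Dict[str, float]) -> Dict[str, float]:
    """Endorsement rate in the high group minus the low group, per option (percentage points)."""
    return {option: high[option] - low[option] for option in high}


def behaves_as_intended(indices: Dict[str, float], key: str) -> bool:
    """True if the key has a positive index and every distractor a negative one."""
    return indices[key] > 0 and all(
        value < 0 for option, value in indices.items() if option != key
    )


# Key "A": 72% (high group) and 23% (low group) are taken from the article.
# Values for B, C, and D are ILLUSTRATIVE ASSUMPTIONS, not the article's data.
high_group = {"A": 72, "B": 10, "C": 12, "D": 6}
low_group = {"A": 23, "B": 27, "C": 30, "D": 20}

indices = option_discrimination(high_group, low_group)
print(indices)                                # {'A': 49, 'B': -17, 'C': -18, 'D': -14}
print(behaves_as_intended(indices, key="A"))  # True
```

An item for which `behaves_as_intended` returns `False`, for example because a distractor attracts more high performers than low performers, would be a candidate for rewriting.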

The concepts of item difficulty and item discrimination come from classical test theory, which was developed in the early 20th century by psychologists who were interested in understanding the relationship between individual differences and overall performance on tests. While classical test theory has been very influential in the field of psychometric analysis, there have been many recent developments that have expanded upon and refined these concepts. For example, modern techniques such as item response theory allow for a more nuanced analysis of the relationship between an item and the ability of the test taker, but concepts such as item difficulty and item discrimination still form the foundation for these sophisticated techniques.
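To give a sense of how item response theory builds on these two concepts, here is a minimal sketch of the two-parameter logistic (2PL) model, one widely used IRT formulation. The article does not say which IRT model, if any, is applied to the Qbank, so treat the model choice and the parameter values as assumptions.

```python
import math


def prob_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct answer under the 2PL model.

    theta: ability of the test taker
    a:     item discrimination (how steeply the probability rises with ability)
    b:     item difficulty (the ability at which the probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: fairly discriminating (a = 1.5) and somewhat difficult (b = 0.5).
for theta in (-1.0, 0.5, 2.0):
    p = prob_correct_2pl(theta, a=1.5, b=0.5)
    print(f"ability {theta:+.1f} -> P(correct) = {p:.2f}")
```

Here difficulty and discrimination appear as parameters of a curve rather than as proportions from one cohort, which is what allows the more nuanced analysis mentioned above.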

We use psychometric analysis here at AMBOSS to review and revise all of the questions in our Qbank. This means that every single question, even the most difficult, has gone through a rigorous process to ensure that it can differentiate stronger test takers from weaker ones. In this way, we ensure that our Qbank provides a fair and accurate assessment of individuals' knowledge and skills.

In conclusion, item difficulty and item discrimination are two different measures used to assess the quality of items or questions. Item difficulty refers to how easy or difficult an item is, while item discrimination refers to the ability of an item to differentiate between test takers who have different levels of knowledge or ability. A difficult item is not necessarily a discriminating one, but a discriminating item can still be difficult. Understanding the difference between these two measures is crucial for ensuring the fairness and accuracy of evaluations and tests.

 

New to the AMBOSS Qbank? Start your free trial today.