Swedish SAT example

Here we provide a usage example of the IRTorch package using a sample of data from the Swedish scholastic aptitude test (SAT). The test is used for admission to university programs and consists solely of multiple choice items. It has a quantitative and a verbal part. In this example, we will analyze the quantitative part. We will not do an in-depth analysis of the data, but rather showcase some of the main features of IRTorch to enable the user to analyze their own data.

Users are referred to the API reference section of the documentation for more detailed information of different functions and classes. Note that the results might vary slightly from those presented here due to differences in the computing hardware (CPU or GPU), the operating system, and the versions of the packages used.

If you prefer to run the code directly, you can download this example Jupyter notebook from the GitHub repository.

Load the libraries

We start by loading the libraries and modules that we will use for this example.

[10]:

import copy
import torch
import irtorch
from irtorch.models import NestedLogit, ThreeParameterLogistic
from irtorch.estimation_algorithms import MML

[11]:

import plotly
plotly.offline.init_notebook_mode()

The test data

The package requires the data to be in a PyTorch tensor with test takers as rows and items as columns. For an example on how to convert a pandas dataframe to a PyTorch tensor, check out the National Mathematics example.

In this example, we will load the data directly from within the package. For multiple choice data, we need a list containing the correct response options for the items. Each list element should correspond to one item. Note that for a 4 option item, the responses are coded 0, 1, 2, 3, and thus if the first option is correct, the item corresponding element in the list should be 0. Below we read both the item responses and the correct response options directly from the package.

[12]:

data_sat, correct_responses = irtorch.load_dataset.swedish_sat_quantitative()

We recommend splitting the test data into a training and a test set. The training set will be used to fit the model, and the test set to ensure we are not overfitting the model to the training data.

[13]:

irtorch.set_seed(42)
train_data, test_data = irtorch.split_data(data_sat, 0.8)

Initializing and fitting models

Details on available IRT models can be found under the IRT Models section. We will start by showcasing how to use the package to analyze correct vs. incorrect responses using a three-parameter logistic (3PL) model, then we will move on to models that can handle each incorrect response category separately. We start by initializing a 3PL model instance by supplying the number of items.

[14]:

threepl = ThreeParameterLogistic(items = data_sat.shape[1])

For the 3PL, we need code the data as correct/incorrect:

[15]:

train_data_binary = (train_data == torch.tensor(correct_responses)).float()
test_data_binary = (test_data == torch.tensor(correct_responses)).float()

The model can now be fit using the fit() instance method. fit() requires the model training data as a tensor, as well as an instance of the fitting algorithm. The Estimation algorithms section contains all available fitting algorithms. Here we will use marginal maximum likelihood (MML). Since fitting can take some time, it can be useful to save the parameters of the fitted model to disk for later use using save_model() with the filename of choice.

Fitting will be done using the GPU if available, otherwise the CPU will be used. This can also be specified using the device argument, i.e. nr.fit(train_data=train_data, algorithm=AE(), device = "cpu").

[16]:

irtorch.set_seed(42)
threepl.fit(train_data=train_data_binary, algorithm=MML())
threepl.save_model("swesat_3pl.pt")

Epoch: 99. Loss: 181164.1562

INFO: Stopping training after 2 learning rate updates.
INFO: Best model found at iteration 99 with loss 181164.1562.

We can load the saved model parameters to a new model instance using load_model(). Note that the instance needs to be created in the same way as for the original model for load_model() to work.

[17]:

threepl = ThreeParameterLogistic(items = data_sat.shape[1])
threepl.load_model("swesat_3pl.pt")

The plot property of our model contains methods for visualization. We can plot the training history to check if we appear to have converged to a good solution. We see that the loss function appear to have stopped decreasing. In our experience, there is usually slight improvements from training for more iterations even when the loss appears relatively flat. This is especially true for more flexible models. Refer to the Plotting section for more information on the available plotting methods.

[18]:

threepl.plot.training_history().show()

Estimation of latent traits

We can estimate the latent trait scores \(\theta\) of the test takers using the latent_scores() method. If we set the standard_errors parameter to True, we can also obtain the standard errors of the latent trait scores. By default, the method uses maximum likelihood (ML) to estimate the latent trait scores, but other methods can be used as well by setting the theta_estimation argument.

[19]:

theta_train_threepl, theta_se_train_threepl = threepl.latent_scores(train_data_binary, standard_errors=True)
theta_test_threepl, theta_se_test_threepl = threepl.latent_scores(test_data_binary, standard_errors=True)

INFO: Optimizing theta scores...

Iteration 9: Current Loss = 176718.828125

INFO: Converged at iteration 9.
INFO: Computing Jacobian for all items and item categories...
INFO: Optimizing theta scores...

Iteration 7: Current Loss = 44074.78125

INFO: Converged at iteration 7.
INFO: Computing Jacobian for all items and item categories...

We can plot the latent variable distribution get an overview of the test takers’ abilities. Since MML makes the implicit assumption of normality, the overall distribution is close to normal. There is a large amount of test takers mostly guessing, and they appear to be spread out in the lower end of the distribution.

[20]:

threepl.plot.latent_score_distribution(theta_train_threepl).show()

If we are interested in the \(\theta\) score of a specific test taker, we of course print the estimated \(\theta\) score along with the standard error. We can also plot the likelihood function for a specific response pattern using the log_likelihood() method. In this case, we plot the likelihood function for the first test taker in the training data (note that python uses zero indexing so 0 is the first element, 1 is the second, etc.). This test taker did relatively well on the test, but there is a lot of uncertainty in the estimate.

In addition, we can use the expected_sum_score argument to plot the likelihood function against the expected sum score instead of \(\theta\) (the expected sum score is the expected number of correct responses given the estimated \(\theta\)). From this plot, we see that the expected is close to the observed one which we print using the expected_scores() method on the last line below.

One should note that a lack of overlap between expected and observed scores does not mean that the model performs poorly. As opposed to the observed sum score, the estimated \(\theta\) scores (and the expected sum scores) also consider the characteristics of the items; thus, it will tend to be a more accurate estimate of the ability. For example, getting an less discriminating item with a more flat item response function correctly tends to have relatively low impact on the estimated \(\theta\) score. More IRT flexible models tend to have less overlap between expected and observed scores.

[21]:

test_taker = 0
print(f"Test taker {test_taker} estimate: {round(theta_train_threepl[test_taker].item(), 3)}. Standard error: {round(theta_se_train_threepl[test_taker].item(), 3)}")
threepl.plot.log_likelihood(data=train_data_binary[test_taker]).show()
threepl.plot.log_likelihood(data=train_data_binary[test_taker], expected_sum_score=True).show()
print(f"Test taker {test_taker} expected sum score: {round(threepl.expected_scores(theta_train_threepl[test_taker:test_taker+1], return_item_scores=False).item(), 3)}")

Test taker 0 estimate: 1.67. Standard error: 0.277

Test taker 0 expected sum score: 71.821

To visualize item response functions, we use the item_probabilities() method. Here we plot the curves for item 56.

[22]:

threepl.plot.item_probabilities(item = 56).show()

For parametric IRT models such as the 3PL, we can get the estimated item parameters using the item_parameters() method. The irt_format=True argument returns the parameters in the traditional IRT format.

[23]:

threepl.item_parameters(irt_format=True)

[23]:

	a1	b	c
0	1.966666	0.828008	0.251554
1	0.916558	-0.139290	0.022015
2	1.679640	-0.469832	0.141211
3	0.872167	1.546278	0.208094
4	2.076665	0.829094	0.293600
...	...	...	...
75	1.380502	1.361326	0.151783
76	2.575020	0.050730	0.216261
77	1.108828	0.514778	0.078823
78	1.827466	0.557874	0.120996
79	0.897126	0.155884	0.217186

80 rows × 3 columns

Evaluate the model fit

There are various ways to evaluate which model fits the data best and many are available within the evaluate model property (see the Model evaluate section). For example, we can compute the log-likelihood on our test data or various predictive metrics such as accuracy, precision, recall and F1 score. These are illustrated below. Refer to the Model evaluate section for more details on a specific metric.

[24]:

print(threepl.evaluate.log_likelihood(data=test_data_binary, theta=theta_test_threepl))
print(threepl.evaluate.accuracy(data=test_data_binary, theta=theta_test_threepl))
print(threepl.evaluate.predictions(data=test_data_binary, theta=theta_test_threepl).mean(axis = 0))

tensor(-44074.7773)
tensor(0.7104)
precision      0.690388
recall         0.664427
f1             0.667016
w_precision    0.704528
w_recall       0.710362
w_f1           0.699610
dtype: float32

We can also add grouped residuals to our item response function figures to visualize model fit by supplying plot_group_fit=True. For these figures, the test takers are divided into groups based on their latent scores. Within each group, the proportion of test takers responding with each response option is plotted and compared to the proportion expected by the model. Thus, a good overlap of the data and model dots in the figures indicate a good fit.

To speed up computation, since we already estimated the \(\theta\) scores, we can provide them together with the data to the method call. Otherwise, calling threepl.plot.item_probabilities(item = 56, plot_group_fit=True).show() does the same computation internally.

[25]:

threepl.plot.item_probabilities(item = 56, plot_group_fit=True, group_fit_data=train_data_binary, group_fit_population_theta=theta_train_threepl).show()

Modeling incorrect responses

We proceed by creating two nested logit models to model the incorrect response options (distractors). The first model uses a nominal response model for the incorrect responses, and the second model uses more flexible B-splines. Explicitly modeling the distractors can enhance the precision of latent trait estimation by considering how likely someone is to pick each specific distractor based on their ability level. These models take a model for the correct responses, such as the 3PL model we created earlier, as an input. We can choose whether to use a pre-fitted correct response model or simply a fresh model instance. Since the pre-fitted 3PL should provide good starting parameters, we use that one.

In this case, we create copies of the 3PL model, otherwise the models would update each other’s parameters which we do not want. If one simply fits one model, copying is not needed. We fit these models in the same procedure as with the 3PL model.

[26]:

threepl_nominal = copy.deepcopy(threepl)
threepl_bspline = copy.deepcopy(threepl)
nested_nominal = NestedLogit(mc_correct = correct_responses, correct_response_model = threepl_nominal, incorrect_response_model = "nominal", data = train_data)
nested_bspline = NestedLogit(mc_correct = correct_responses, correct_response_model = threepl_bspline, incorrect_response_model = "bspline", data = train_data)
nested_nominal.fit(train_data=train_data, algorithm=MML())
nested_bspline.fit(train_data=train_data, algorithm=MML())
nested_nominal.save_model("swesat_nested_nominal.pt")
nested_bspline.save_model("swesat_nested_bspline.pt")

Epoch: 89. Loss: 308682.6250

INFO: Stopping training after 2 learning rate updates.
INFO: Best model found at iteration 89 with loss 308682.6250.

Epoch: 100. Loss: 307686.1875

INFO: Stopping training after 2 learning rate updates.
INFO: Best model found at iteration 100 with loss 307686.1875.

We proceed by estimating the \(\theta\) scores using the nested logit models.

[27]:

theta_train_nominal, theta_se_train_nominal = nested_nominal.latent_scores(train_data, standard_errors=True)
theta_train_bspline, theta_se_train_bspline = nested_bspline.latent_scores(train_data, standard_errors=True)
theta_test_nominal, theta_se_test_nominal = nested_nominal.latent_scores(test_data, standard_errors=True)
theta_test_bspline, theta_se_test_bspline = nested_bspline.latent_scores(test_data, standard_errors=True)

INFO: Optimizing theta scores...

Iteration 5: Current Loss = 303979.4375

INFO: Converged at iteration 5.
INFO: Computing Jacobian for all items and item categories...
INFO: Optimizing theta scores...

Iteration 5: Current Loss = 302579.46875

INFO: Converged at iteration 5.
INFO: Computing Jacobian for all items and item categories...
INFO: Optimizing theta scores...

Iteration 3: Current Loss = 75308.203125

INFO: Converged at iteration 3.
INFO: Computing Jacobian for all items and item categories...
INFO: Optimizing theta scores...

Iteration 5: Current Loss = 75252.1328125

INFO: Converged at iteration 5.
INFO: Computing Jacobian for all items and item categories...

If we can compare the \(\theta\) standard errors by computing the mean difference between the nested logit standard errors and the 3PL standard errors from before, we see that on average, there is less uncertainty in the \(\theta\) estimates from the nested logit models. The extra information from the incorrect responses appears to provide slightly more reliable \(\theta\) estimates in general. However, this is not the case for every test taker, as some may have unusual incorrect response patterns given their estimated \(\theta\), leading to more uncertainty. Modeling the distractors with splines appears to lower the standard errors the most in this case.

We excluded test takers with very large standard errors as a result from mostly guessing using [~(theta_se_train_threepl>2)].

[28]:

print((theta_se_train_threepl-theta_se_train_nominal)[~(theta_se_train_threepl>2)].mean())
print((theta_se_train_threepl-theta_se_train_bspline)[~(theta_se_train_threepl>2)].mean())
print((theta_se_train_nominal-theta_se_train_bspline)[~(theta_se_train_threepl>2)].mean())

tensor(0.0393)
tensor(0.0696)
tensor(0.0303)

If we evaluate model fit of our nested logit models on our test data, most metrics are close to the same. We see that the spline model has a slightly better fit than the nominal model in terms of log-likelihood, and the nominal nested logit model actually does better in many predictive metrics. Maybe the added flexibility of the splines do not add too much for this dataset.

[29]:

print(f"nominal log-likelihood: {round(nested_nominal.evaluate.log_likelihood(data=test_data, theta=theta_test_nominal).item())}")
print(f"bspline log-likelihood: {round(nested_bspline.evaluate.log_likelihood(data=test_data, theta=theta_test_bspline).item())}")
print(f"nominal accuracy: {round(nested_nominal.evaluate.accuracy(data=test_data, theta=theta_test_nominal).item(), 3)}")
print(f"bspline accuracy: {round(nested_bspline.evaluate.accuracy(data=test_data, theta=theta_test_bspline).item(), 3)}")
print(f"nominal metrics:\n{round(nested_nominal.evaluate.predictions(data=test_data, theta=theta_test_nominal).mean(axis=0), 3)}\n--------------------")
print(f"bspline metrics:\n{round(nested_bspline.evaluate.predictions(data=test_data, theta=theta_test_bspline).mean(axis=0), 3)}\n--------------------")

nominal log-likelihood: -75308
bspline log-likelihood: -75252
nominal accuracy: 0.607
bspline accuracy: 0.606
nominal metrics:
precision      0.264
recall         0.296
f1             0.260
w_precision    0.469
w_recall       0.607
w_f1           0.515
dtype: float32
--------------------
bspline metrics:
precision      0.267
recall         0.294
f1             0.258
w_precision    0.474
w_recall       0.606
w_f1           0.515
dtype: float32
--------------------

We can plot the item response functions for these models to visualize how the incorrect response probabilities change over the latent variable. Analyzing distractors gives test creators a way to check how well the incorrect options are working. For example, it can highlight distractors that are hardly ever chosen or those that might be confusingly attractive to high-ability test-takers, indicating potential issues.

For example, option 2 appears to be the most popular distractor for item 34, and is the most common response for most of the \(\theta\) scale. In fact, it is chosen quite often even for test takers performing above average. Maybe this distractor is even too confusing? It could be worth looking at.

[30]:

nested_nominal.plot.item_probabilities(item = 34, plot_group_fit=True, group_fit_population_theta=theta_train_nominal).show()
nested_bspline.plot.item_probabilities(item = 34, plot_group_fit=True, group_fit_population_theta=theta_train_bspline).show()

Infit and outfit statistics can be computed using the infit_outfit() method for both items and respndents. These statistics are traditional fit statistics in IRT and are typically computed on the training data. Here we compute both statistics for each item with the level="item" argument.

[31]:

infit_item_nr, outfit_item_nr = nested_nominal.evaluate.infit_outfit(data=train_data, theta=theta_train_nominal, level="item")
infit_item_mmcnn, outfit_item_mmcnn = nested_bspline.evaluate.infit_outfit(data=train_data, theta=theta_train_bspline, level="item")

Infit and outfit are based on the residuals, and a value of 1 indicates a perfect fit. Values significantly greater than 1 indicate misfit, with higher values suggesting responses were less predictable than expected (possibly due to guessing, cheating, or misunderstanding). Values significantly less than 1 suggest overfit, where responses are too predictable, indicating possible redundancy among items or that the items are not contributing additional useful information about the construct being measured. Below we see that are infit and outfit statistics stay close to one for both models, indicating good fit.

[32]:

print(f"NR infit: {infit_item_nr}")
print(f"MMCNN infit: {infit_item_mmcnn}")
print(f"NR outfit: {outfit_item_nr}")
print(f"MMCNN outfit: {outfit_item_mmcnn}")

NR infit: tensor([0.9971, 1.0008, 0.9916, 1.0005, 0.9979, 0.9864, 0.9910, 1.0004, 0.9816,
        1.0039, 0.9988, 0.9949, 0.9930, 1.0014, 0.9899, 1.0021, 0.9934, 1.0007,
        0.9990, 0.9981, 0.9998, 0.9947, 0.9869, 1.0023, 0.9884, 0.9998, 0.9955,
        0.9915, 0.9995, 0.9990, 1.0020, 0.9923, 0.9841, 0.9987, 0.9964, 0.9993,
        0.9921, 0.9824, 1.0012, 1.0006, 1.0018, 0.9996, 0.9984, 0.9923, 0.9928,
        0.9990, 0.9815, 0.9993, 0.9899, 0.9949, 0.9952, 0.9559, 1.0025, 1.0020,
        1.0018, 0.9970, 1.0007, 0.9969, 0.9993, 0.9999, 0.9949, 1.0015, 0.9924,
        1.0023, 0.9944, 1.0004, 1.0013, 1.0016, 0.9948, 1.0014, 0.9828, 1.0019,
        0.9999, 0.9861, 1.0027, 0.9979, 0.9880, 0.9987, 0.9903, 1.0023])
MMCNN infit: tensor([0.9986, 1.0011, 0.9911, 1.0012, 0.9992, 0.9956, 0.9963, 0.9992, 0.9932,
        1.0045, 0.9989, 0.9951, 0.9877, 1.0020, 0.9909, 1.0028, 0.9954, 1.0020,
        0.9958, 0.9994, 1.0007, 0.9952, 1.0007, 1.0027, 0.9879, 1.0033, 0.9977,
        0.9873, 0.9998, 0.9982, 1.0035, 0.9981, 0.9860, 0.9992, 0.9956, 0.9998,
        0.9936, 0.9847, 1.0007, 1.0013, 1.0020, 1.0016, 0.9969, 0.9910, 0.9949,
        0.9985, 0.9901, 1.0002, 0.9915, 0.9955, 0.9970, 0.9542, 1.0028, 1.0043,
        1.0029, 0.9954, 1.0011, 0.9976, 0.9975, 1.0011, 0.9932, 1.0003, 0.9902,
        1.0025, 0.9981, 1.0027, 1.0022, 1.0020, 0.9984, 1.0014, 0.9793, 1.0030,
        0.9997, 0.9877, 1.0034, 0.9982, 0.9876, 0.9966, 0.9904, 1.0015])
NR outfit: tensor([0.9888, 0.9997, 1.0142, 0.9990, 1.0185, 0.9856, 0.9624, 1.0040, 0.9785,
        0.9578, 1.0002, 1.0296, 1.0316, 0.9994, 1.0096, 0.9926, 1.0033, 1.0034,
        1.0162, 0.9930, 0.9909, 0.9892, 0.9862, 0.9646, 1.0274, 0.9769, 0.9859,
        1.0154, 0.9966, 0.9957, 0.9981, 0.9919, 0.9444, 0.9997, 0.9942, 0.9587,
        0.9678, 0.9777, 1.0013, 1.0066, 1.0087, 0.9781, 1.0153, 0.9227, 0.9860,
        0.9944, 1.0564, 0.9903, 0.9757, 0.9668, 1.0089, 1.1339, 0.9867, 0.9925,
        0.9808, 0.9978, 0.9991, 0.9870, 0.9770, 1.0553, 1.0218, 0.9961, 1.0832,
        0.9977, 0.9203, 0.9943, 1.0000, 1.0038, 1.0041, 1.0018, 1.0101, 0.9855,
        0.9740, 0.9928, 0.9960, 0.9967, 0.9667, 1.0206, 1.0250, 1.0026])
MMCNN outfit: tensor([1.0006, 0.9959, 1.0240, 0.9992, 1.0195, 1.0174, 0.9767, 0.9998, 0.9760,
        0.9651, 1.0018, 1.0018, 1.0860, 1.0006, 1.0340, 0.9951, 1.0221, 1.0040,
        1.0296, 1.0021, 0.9983, 1.0108, 1.0511, 0.9768, 1.0603, 0.9851, 0.9875,
        1.0505, 0.9992, 0.9994, 1.0014, 1.0014, 0.9457, 1.0018, 1.0345, 0.9735,
        0.9703, 0.9843, 1.0010, 1.0107, 1.0087, 1.0044, 1.0201, 0.9533, 1.0008,
        0.9994, 1.0609, 0.9929, 0.9800, 0.9877, 1.0117, 1.2565, 0.9948, 0.9972,
        0.9864, 1.0114, 0.9993, 1.0010, 0.9835, 1.0723, 1.0423, 1.0028, 1.0911,
        0.9994, 0.9792, 1.0123, 1.0026, 1.0080, 1.0099, 1.0011, 1.0190, 0.9883,
        0.9861, 1.0633, 0.9972, 0.9986, 1.0165, 1.0155, 1.0350, 1.0028])

Information

We can plot the Fisher information using the information() method for both individual items and the test as a whole. Below, all models suggest that the test provides the most information a bit around \(\theta=0\) (with the exception of the B-spline nested logit model at the lower end, but this is a result of erratic spline behavior where there is close to no data-points).

[33]:

threepl.plot.information().show()
nested_nominal.plot.information().show()
nested_bspline.plot.information().show()

INFO: Computing Jacobian for all items and item categories...

INFO: Computing Jacobian for all items and item categories...

INFO: Computing Jacobian for all items and item categories...