National mathematics example

Here we provide a usage example of the IRTorch package using data from the national mathematics test for Swedish high school students. We will not do an in-depth analysis of the data, but rather showcase some of the main features of IRTorch to enable the user to analyze their own data.

Users are referred to the API reference section of the documentation for more detailed information of different functions and classes. Note that the results might vary slightly from those presented here due to differences in the computing hardware (CPU or GPU), the operating system, and the versions of the packages used.

If you prefer to run the code directly, you can download this example Jupyter notebook from the GitHub repository.

Load the libraries

We start by loading the libraries and modules that we will use for this example.

[1]:

import pandas as pd
import numpy as np
import torch
import irtorch
from irtorch.models import GeneralizedPartialCredit, MonotoneNN
from irtorch.rescale import Bit
from irtorch.estimation_algorithms import MML, JML

[2]:

import plotly
plotly.offline.init_notebook_mode()

The test data

The package requires the data to be in a PyTorch tensor with test takers as rows and items as columns. The item responses should be coded 0, 1, … and missing responses coded as nan. If you have, for example, a pandas dataframe, you can convert it to a tensor using torch.tensor(df.to_numpy()). This is illustrated with some random data below.

[3]:

np.random.seed(42)
example_df = pd.DataFrame({f"Item {i}": np.random.randint(0, 5, 10) for i in range(1, 6)})
missing_indices = np.random.choice(example_df.index, size=10, replace=False)
for col in example_df.columns:
    example_df.loc[np.random.choice(missing_indices, size=2, replace=False), col] = np.nan

example_tensor = torch.tensor(example_df.to_numpy())
print(example_tensor)

tensor([[3., nan, 1., 2., 3.],
        [nan, nan, nan, 3., 0.],
        [2., 4., 3., 3., 3.],
        [4., 1., 0., 0., nan],
        [nan, 3., 0., 2., 1.],
        [1., 1., nan, nan, 0.],
        [2., 3., 2., 2., 1.],
        [2., 4., 1., 4., 4.],
        [2., 0., 3., nan, nan],
        [4., 3., 3., 1., 3.]], dtype=torch.float64)

In this example, we will use data from the Swedish national mathematics test. Two datasets from this test are provided within the package and we will use the second one. Feel free to explore the first one one you own, using irtorch.load_dataset.swedish_national_mathematics_1().

[4]:

data_math = irtorch.load_dataset.swedish_national_mathematics_2()

Construct the model

We start by creating a Generalized Partial Credit (GPC) model. It requires the number of possible responses/scores for each item but these can be automatically computed from the data when supplied.

[5]:

irtorch.set_seed(42)
gpc = GeneralizedPartialCredit(data=data_math)

In the same fashion, we initialize a more flexible Monotone Neural Network (MNN) model for comparison. For more details on these IRT models and other available models, see the IRT Models section.

[6]:

irtorch.set_seed(42)
mnn = MonotoneNN(data = data_math)

Fitting the model

Each model can now be fit using the fit() instance method. fit() requires the model training data as a tensor, together with an instance of the chosen fitting algorithm. In these examples, we will use the MML algorithm for the GPC model and the JML algorithm for the MNN model.

We recommend splitting your test data into a training and a test set to ensure we are not overfitting the model as below. Fitting will be done using the GPU if available, otherwise the CPU will be used. This can also be specified using the device argument, i.e., gpc.fit(train_data=data_math, algorithm=AE(), device = "cpu"). Currently available fitting algorithms can be found in the Estimation algorithms section with additional information on how to fine tune each algorithm.

[7]:

irtorch.set_seed(42)
train_data, test_data = irtorch.split_data(data_math, 0.8)
gpc.fit(train_data=train_data, algorithm=MML())
mnn.fit(train_data=train_data, algorithm=JML())

Epoch: 193. Loss: 20799.9922

INFO: Stopping training after 2 learning rate updates.
INFO: Best model found at iteration 193 with loss 20799.9922.

Epoch: 2017. Loss: 18708.7871

INFO: Decreased learning rate to: 0.05

Epoch: 2523. Loss: 18691.8711

INFO: Decreased learning rate to: 0.025

Epoch: 2607. Loss: 18690.8984

INFO: Best model found at iteration 2607 with loss 18690.8984.

The plot property of our models contains methods for visualization. We can plot the training history to check if we appear to have converged to a good solution. We see that the loss function appear to have converged for both models. In our experience, there is usually slight improvements from training for more iterations even when the loss appears relatively flat. This is especially true for more flexible models. Refer to the Plotting section for more information on the available plotting methods.

[8]:

gpc.plot.training_history().show()
mnn.plot.training_history().show()

Estimation of latent traits

We can estimate the latent trait scores of the test takers using the latent_scores() method. Here we use the ML method.

[9]:

theta_train_gpc = gpc.latent_scores(train_data, theta_estimation="ML")
theta_train_mnn = mnn.latent_scores(train_data, theta_estimation="ML")
theta_test_gpc = gpc.latent_scores(test_data, theta_estimation="ML")
theta_test_mnn = mnn.latent_scores(test_data, theta_estimation="ML")

INFO: Optimizing theta scores...

Iteration 4: Current Loss = 19599.94921875

INFO: Converged at iteration 4.
INFO: Finding training data theta scores to use as initial theta estimates...
INFO: Sampling training data theta scores to avoid running out of memory...
INFO: Optimizing theta scores...

Iteration 5: Current Loss = 18665.93359375

INFO: Converged at iteration 5.
INFO: Optimizing theta scores...

Iteration 6: Current Loss = 4951.67431640625

INFO: Converged at iteration 6.
INFO: Finding training data theta scores to use as initial theta estimates...
INFO: Optimizing theta scores...

Iteration 10: Current Loss = 4928.32080078125

INFO: Converged at iteration 10.

We can plot the latent score distributions to get an overview of the test taker abilities. Since JML trained models make no assumptions about the distributions of the latent scores, these distribution will not be standard normal, and we can see we also have some extreme outliers.

[10]:

gpc.plot.latent_score_distribution(theta_train_gpc).show()
mnn.plot.latent_score_distribution(theta_train_mnn).show()

The latent scales are arbitrary and can thus be transformed to any other scale as long as the transformation maintains the ordering of the scores. This can be done by providing a scale transformation object to the add_scale_transformation model method. For these models, we suggest converting the latent trait scales to bit scales instead. Bit scales are ratio scales, and make scores from different models comparable as long as they are fitted to the same test data. We create Bit class instances and supply them to the add_scale_transformation method for each model.

The models will now default to using the bit scale for all subsequent operations. We use this below to compute the bit scores with the latent_scores method. Note that we can still compute the latent scores on the original scale by setting the rescale argument to False to methods where this is applicable, i.e., gpc.latent_scores(train_data, theta_estimation="ML", rescale=False). Refer to the Rescaling section for more information on bit scales and other available scale transformations.

[11]:

gpc_bit_transformation = Bit(gpc, population_theta=theta_train_gpc)
mnn_bit_transformation = Bit(mnn, population_theta=theta_train_mnn)
gpc.add_scale_transformation(gpc_bit_transformation)
mnn.add_scale_transformation(mnn_bit_transformation)
bit_scores_train_gpc = gpc.latent_scores(train_data, theta_estimation="ML")
bit_scores_train_mnn = mnn.latent_scores(train_data, theta_estimation="ML")

INFO: Optimizing theta scores...

Iteration 4: Current Loss = 19599.94921875

INFO: Converged at iteration 4.
INFO: Finding training data theta scores to use as initial theta estimates...
INFO: Sampling training data theta scores to avoid running out of memory...
INFO: Optimizing theta scores...

Iteration 6: Current Loss = 18666.03515625

INFO: Converged at iteration 6.

If we plot the bit score distribution for each model, we can see that they are more similar compared to the distributions of the original latent variables, both scales ranging from 0 to around 80 bits.

[12]:

gpc.plot.latent_score_distribution(bit_scores_train_gpc).show()
mnn.plot.latent_score_distribution(bit_scores_train_mnn).show()

Item response functions

To show the item response functions, we can use the item_probabilities() method in the plot model property. Here we plot the curves for item 24 on the bit scale. We can see that the probability of answering the item correctly increases with the latent variable. We see that only test takers with a relatively high bit score have a high probability of responding with a score of 2 on this item.

[13]:

gpc.plot.item_probabilities(item = 24).show()
mnn.plot.item_probabilities(item = 24).show()

Item parameters

For parametric IRT models such as the GPC model, we can get the estimated item parameters using the item_parameters() method. We use the irt_format argument to specify that we want to return the item parameters in traditional IRT format (and not as slopes and intercepts). Note that zeros are expected for items with fewer than 4 points.

[14]:

gpc.item_parameters(irt_format=True)

[14]:

	a1	b1	b2	b3	b4
0	2.529953	-2.310085	-0.000000	-0.000000	-0.000000
1	1.188545	-1.011769	-0.000000	-0.000000	-0.000000
2	1.860119	-1.501075	-0.000000	-0.000000	-0.000000
3	1.052285	-0.625713	-0.000000	-0.000000	-0.000000
4	1.358516	-0.085676	-0.000000	-0.000000	-0.000000
5	0.729730	-0.871447	-0.000000	-0.000000	-0.000000
6	1.213311	0.198134	-0.000000	-0.000000	-0.000000
7	1.190133	-2.031975	-0.000000	-0.000000	-0.000000
8	0.979525	-0.088714	-0.000000	-0.000000	-0.000000
9	1.132489	0.237387	-0.309951	-0.000000	-0.000000
10	1.306803	-0.260316	-0.684906	-0.000000	-0.000000
11	1.598211	-1.347207	-0.697698	0.439783	-0.000000
12	1.484667	-1.209232	-1.526546	0.777015	-0.000000
13	1.117311	0.654208	1.539641	-0.000000	-0.000000
14	1.553362	0.235789	2.699686	-0.000000	-0.000000
15	0.807890	-0.351218	-0.873549	-0.795557	-1.548087
16	1.265904	0.852304	0.748650	-0.000000	-0.000000
17	1.582821	-0.245806	-0.336996	-0.000000	-0.000000
18	1.607533	0.016159	0.909183	2.093414	-0.000000
19	1.376165	2.389322	3.884833	4.636078	4.936493
20	1.596748	-1.485482	-2.106120	-0.000000	-0.000000
21	1.107210	-0.354539	-1.929718	-0.000000	-0.000000
22	1.057379	-0.732737	-0.312110	-0.000000	-0.000000
23	1.893040	0.048051	1.340048	-0.000000	-0.000000
24	0.665695	0.618374	1.081607	1.269795	0.737873
25	2.557212	1.350395	2.525016	-0.000000	-0.000000
26	1.396081	1.874109	2.792426	-0.000000	-0.000000
27	1.275226	2.305732	4.095278	5.066998	6.079602

Evaluate the model fit

There are various ways to evaluate which model fits the data best and many are available from within the evaluate model property. For example, we can compare the log-likelihood of each model on our test data. A larger log-likelihood indicates a better fit. We see that in this case, the MNN model fits the data better than the GPC model. Please refer to the Model evaluation section for more information on evaluation methods.

[15]:

print(gpc.evaluate.log_likelihood(data=test_data, theta=theta_test_gpc))
print(mnn.evaluate.log_likelihood(data=test_data, theta=theta_test_mnn))

tensor(-4951.6743)
tensor(-4928.3208)

An assumption of IRT models, commonly referred to as local independence, is that the item responses are conditionally independent given the latent variable. We can check this assumption by computing Q3 statistics for each pair of items using the q3() method. Q3 is the correlation between the residuals of two items, and a value close to zero indicates that the item responses are conditionally independent given the latent variable. The function will print the item pair with the highest Q3 value. In this case the Q3 value between item 21 and item 22 is quite large for both models, suggesting that the local independence assumption is violated for these items.

A tuple is also returned in which the first element is a square matrix with Q3 statistics. As an example, we print all item pairs with the first 7 items. Their Q3 statistics are all close to zero, indicating that the assumption of conditional independence is reasonable for these items.

[16]:

q3_gpc, _ = gpc.evaluate.q3(data=data_math)
q3_mnn, _ = mnn.evaluate.q3(data=data_math)
print(q3_gpc.iloc[:7, :7])
print(q3_mnn.iloc[:7, :7])

INFO: Optimizing theta scores...

Iteration 4: Current Loss = 24551.634765625

INFO: Converged at iteration 4.
INFO: Largest Q3 is 0.68 between Item 21 and Item 22.
INFO: Finding training data theta scores to use as initial theta estimates...
INFO: Sampling training data theta scores to avoid running out of memory...
INFO: Optimizing theta scores...

Iteration 5: Current Loss = 23594.39453125

INFO: Converged at iteration 5.
INFO: Largest Q3 is 0.60 between Item 21 and Item 22.

        Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7
Item 1     NaN  -0.029  -0.030  -0.048  -0.037  -0.003  -0.007
Item 2  -0.029     NaN   0.040   0.012  -0.024   0.005  -0.064
Item 3  -0.030   0.040     NaN  -0.042  -0.003  -0.044  -0.001
Item 4  -0.048   0.012  -0.042     NaN  -0.027  -0.015  -0.007
Item 5  -0.037  -0.024  -0.003  -0.027     NaN   0.001   0.029
Item 6  -0.003   0.005  -0.044  -0.015   0.001     NaN  -0.019
Item 7  -0.007  -0.064  -0.001  -0.007   0.029  -0.019     NaN
        Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7
Item 1     NaN   0.014  -0.048  -0.036  -0.038   0.023  -0.003
Item 2   0.014     NaN   0.060   0.025  -0.000   0.014  -0.053
Item 3  -0.048   0.060     NaN  -0.015   0.023  -0.023   0.015
Item 4  -0.036   0.025  -0.015     NaN  -0.000  -0.006   0.006
Item 5  -0.038  -0.000   0.023  -0.000     NaN   0.004   0.027
Item 6   0.023   0.014  -0.023  -0.006   0.004     NaN  -0.016
Item 7  -0.003  -0.053   0.015   0.006   0.027  -0.016     NaN

If we want to also perform a statistical test, we can set the sample_hypothesis_test argument to True. This will sample data from the model under the null hypothesis of conditional independence and compute the Q3 statistics for these samples. The p-values are then computed based on the distribution of these statistics. The result is a square matrix with p-values and a small p-value indicates that the assumption of conditional independence is violated. By looking at the first 7 items, we see that on a 5% significance level, the assumption is violated for item pair 2 and 7 for the GPC model. One should note that even under the null hypothesis, one would expect some p-values to be small by chance. Thus one should interpret these results with caution.

The sampling procedure can be computationally expensive for large datasets, and the number of samples can be controlled using the samples argument.

[17]:

_, p_values_gpc = gpc.evaluate.q3(data=data_math, sample_hypothesis_test=True)
_, p_values_mnn = mnn.evaluate.q3(data=data_math, sample_hypothesis_test=True)
print(f"\ngpc: \n{p_values_gpc.iloc[:7, :7]}\n")
print(f"mnn: \n{p_values_mnn.iloc[:7, :7]}")

INFO: Optimizing theta scores...

Iteration 4: Current Loss = 24551.634765625

INFO: Converged at iteration 4.
INFO: Largest Q3 is 0.68 between Item 21 and Item 22.

Computing Q3 p-values by null distribution sampling. Sample: 1000

INFO: Finding training data theta scores to use as initial theta estimates...
INFO: Sampling training data theta scores to avoid running out of memory...
INFO: Optimizing theta scores...

Iteration 4: Current Loss = 23594.384765625

INFO: Converged at iteration 4.
INFO: Largest Q3 is 0.60 between Item 21 and Item 22.

Computing Q3 p-values by null distribution sampling. Sample: 1000
gpc:
        Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7
Item 1     NaN   0.314   0.484   0.074   0.086   0.900   0.700
Item 2   0.314     NaN   0.238   0.674   0.360   0.878   0.024
Item 3   0.484   0.238     NaN   0.120   0.924   0.138   0.978
Item 4   0.074   0.674   0.120     NaN   0.338   0.574   0.752
Item 5   0.086   0.360   0.924   0.338     NaN   0.998   0.278
Item 6   0.900   0.878   0.138   0.574   0.998     NaN   0.438
Item 7   0.700   0.024   0.978   0.752   0.278   0.438     NaN

mnn:
        Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7
Item 1     NaN   0.638   0.222   0.180   0.194   0.438   0.808
Item 2   0.638     NaN   0.040   0.358   0.994   0.642   0.052
Item 3   0.222   0.040     NaN   0.578   0.400   0.444   0.552
Item 4   0.180   0.358   0.578     NaN   0.942   0.842   0.820
Item 5   0.194   0.994   0.400   0.942     NaN   0.952   0.274
Item 6   0.438   0.642   0.444   0.842   0.952     NaN   0.612
Item 7   0.808   0.052   0.552   0.820   0.274   0.612     NaN

An alternative method to check the assumption of conditional independence is to use mutual information difference. We leave this for the user to explore on their own.

We can evaluate the predictive accuracy of each model on our test data using the accuracy() method. We can also use the predictions() method to compute precision, recall, and F1-score for each item. Here we average the results over all items. We see that the MNN model does better with all metrics, the largest difference being in the precision.

[18]:

print(f"GPC accuracy: {gpc.evaluate.accuracy(data=test_data, theta=theta_test_gpc)}")
print(f"MNN accuracy: {mnn.evaluate.accuracy(data=test_data, theta=theta_test_mnn)}")
print(gpc.evaluate.predictions(data=test_data, theta=theta_test_gpc).mean(axis=0))
print(mnn.evaluate.predictions(data=test_data, theta=theta_test_mnn).mean(axis=0))

GPC accuracy: 0.7219115495681763
MNN accuracy: 0.7295373678207397
precision      0.555751
recall         0.550781
f1             0.537068
w_precision    0.671989
w_recall       0.721912
w_f1           0.685628
dtype: float32
precision      0.611714
recall         0.556414
f1             0.547949
w_precision    0.699868
w_recall       0.729537
w_f1           0.692096
dtype: float32

Furthermore, we can add grouped residuals to our item response function figures to visualize model fit. For these figures, the test takers are divided into groups based on their latent variable scores. Within each group, the proportion of test takers responding with each score is plotted and compared to the proportion expected by the model. Thus, a good overlap of the data and model dots in the figures indicate a good fit. Both models appear to fit the data fairly well for this item.

[19]:

gpc.plot.item_probabilities(item = 24, plot_group_fit=True, group_fit_population_theta=theta_train_gpc).show()
mnn.plot.item_probabilities(item = 24, plot_group_fit=True, group_fit_population_theta=theta_train_mnn).show()