Model evaluation

class irtorch.Evaluator(model: BaseIRTModel)

Bases: object

Class for evaluating IRT model performance using various metrics. A fitted model typically holds an instance of this class in its evaluate property. Thus the methods can be accessed through model.evaluate.method_name().

Parameters: model (BaseIRTModel) – The IRT model to evaluate.

accuracy(data: Tensor = None, theta: Tensor = None, level: str = 'all', **kwargs)

Calculate the prediction accuracy of the model for the supplied data. The response with the highest probability is considered the prediction.

Parameters

data (torch.Tensor) – The input data.
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
level (str = "all", optional) – Specifies the level at which the accuracy is calculated. Can be ‘all’, ‘item’ or ‘respondent’. For example, for ‘item’ the accuracy is calculated for each item. (default is ‘all’)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

The accuracy.

Return type

torch.Tensor
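The prediction rule and the three levels can be illustrated with a minimal sketch in plain Python. It assumes the model's category probabilities are already available as nested lists; the names `argmax`, `accuracy`, `probs` and `data` below are illustrative and not part of the irtorch API.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(probs, data, level="all"):
    """probs[i][j] holds the category probabilities of respondent i on item j,
    data[i][j] holds the observed category."""
    # 1 where the most probable category equals the observed one, else 0.
    hits = [[int(argmax(p) == x) for p, x in zip(p_row, x_row)]
            for p_row, x_row in zip(probs, data)]
    if level == "item":
        return [sum(col) / len(col) for col in zip(*hits)]
    if level == "respondent":
        return [sum(row) / len(row) for row in hits]
    flat = [h for row in hits for h in row]
    return sum(flat) / len(flat)


# Hypothetical probabilities for 3 respondents and 2 dichotomous items.
probs = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.4, 0.6], [0.9, 0.1]],
    [[0.6, 0.4], [0.3, 0.7]],
]
data = [[0, 1], [0, 0], [0, 1]]

overall = accuracy(probs, data)                       # 5 of 6 responses predicted
per_item = accuracy(probs, data, level="item")
per_respondent = accuracy(probs, data, level="respondent")
```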

approximate_latent_density(theta_scores: Tensor, approximation: str = 'qmvn', cv_n_components: list[int] = None) → None

Approximate the latent space density.

Parameters

theta_scores (torch.Tensor) – A 2D tensor with theta scores. Each row represents one respondent, and each column a latent variable.
approximation (str, optional) –
The approximation method to use. (default is ‘qmvn’) Possible options are:
- ‘qmvn’ for quantile multivariate normal approximation of a multivariate joint density function (QuantileMVNormal class).
- ‘gmm’ for a gaussian mixture model.
cv_n_components (list[int], optional) – The number of components to use for cross-validation with Gaussian Mixture Models. (default is [2, 3, 4, 5, 10])

Return type

None

group_fit_log_likelihood(data: Tensor = None, theta: Tensor = None, groups: int = 10, latent_variable: int = 1, **kwargs)

Group the respondents based on their ordered latent variable scores. Calculate the average log-likelihood of the data within each group.

If ‘data’ is not supplied, the function defaults to using the model’s training data.

Parameters

data (torch.Tensor, optional) – A 2D tensor containing test data. Each row corresponds to one respondent and each column represents an item. (default is None)
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
groups (int) – The number of groups. (default is 10)
latent_variable (int, optional) – Specifies the latent variable along which ordering and grouping should be performed. (default is 1)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

The average log-likelihood for each group.

Return type

torch.Tensor
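The grouping step can be sketched as follows. This is a plain-Python illustration that assumes per-respondent log-likelihoods are already computed and that groups are formed as near-equal slices of the sorted scores; how irtorch splits the groups internally may differ, and the function name is hypothetical.

```python
def group_average_log_likelihood(theta, log_lik, groups):
    """Sort respondents by theta, split them into near-equal groups and
    average the per-respondent log-likelihood within each group.
    Assumes groups <= number of respondents."""
    order = sorted(range(len(theta)), key=lambda i: theta[i])
    n = len(order)
    averages = []
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        averages.append(sum(log_lik[i] for i in members) / len(members))
    return averages


# Hypothetical theta scores and log-likelihoods for 6 respondents.
theta = [0.5, -1.2, 2.0, -0.3, 1.1, 0.0]
log_lik = [-3.0, -5.0, -2.0, -4.0, -2.5, -3.5]
group_means = group_average_log_likelihood(theta, log_lik, groups=3)
```

A group whose average is much lower than the others points to a region of the latent scale where the model fits poorly.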

group_fit_residuals(data: Tensor = None, theta: Tensor = None, latent_variable: int = 1, standardize: bool = True, groups: int = 10, theta_estimation: str = 'ML', rescale: bool = True, **kwargs) → tuple[torch.Tensor, torch.Tensor]

Group the respondents based on their ordered latent variable scores. Calculate the residuals between the model estimated and observed data within each group. See Van der Linden [17], Chapter 20 for more details.

If ‘data’ is not supplied, the function defaults to using the model’s training data.

Parameters

data (torch.Tensor, optional) – A 2D tensor containing test data. Each row corresponds to one respondent and each column represents an item. (default is None)
theta (torch.Tensor, optional) – A 2D tensor containing the pre-estimated theta scores for each respondent in the data. Each row corresponds to one respondent and each column represents a latent variable. (default is None)
latent_variable (int, optional) – Specifies the latent variable along which ordering and grouping should be performed. (default is 1)
standardize (bool, optional) – Specifies whether the residuals should be standardized. (default is True)
groups (int) – The number of groups. (default is 10)
theta_estimation (str, optional) – Method used to obtain the theta scores. Can be ‘NN’, ‘ML’, ‘EAP’ or ‘MAP’ for neural network, maximum likelihood, expected a posteriori or maximum a posteriori respectively. (default is ‘ML’)
rescale (bool, optional) – Whether to compute the latent scores on the theta transformation scale if it exists. (default is True)
**kwargs (dict, optional) – Additional keyword arguments used for scale computation. Refer to documentation for the chosen scale in the Rescaling documentation section for additional details.

Returns

A tuple with two torch tensors. The first holds the residuals for each group and has dimensions (groups, items, item categories). The second is a 1D tensor holding the midpoints of the groups.

Return type

tuple[torch.Tensor, torch.Tensor]

infit_outfit(data: Tensor = None, theta: Tensor = None, level: str = 'item', **kwargs)

Calculate person or item infit and outfit statistics. These statistics help identify items that do not behave as expected according to the model, or respondents with unusual response patterns. Items that do not behave as expected can be reviewed for possible revision or removal to improve the overall test quality and reliability. Respondents with unusual response patterns can be reviewed for possible cheating or other issues.

Parameters

data (torch.Tensor) – The input data.
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
level (str = "item", optional) – Specifies whether to compute item or respondent statistics. Can be ‘item’ or ‘respondent’. (default is ‘item’)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

The infit statistics.

Return type

torch.Tensor

Notes

Infit and outfit are computed as follows:

\[\begin{split}\begin{aligned} \text{Item j infit} = \frac{\sum_{i=1}^{n} (O_{ij} - E_{ij})^2}{\sum_{i=1}^{n} W_{ij}} \\ \text{Respondent i infit} = \frac{\sum_{j=1}^{J} (O_{ij} - E_{ij})^2}{\sum_{j=1}^{J} W_{ij}} \\ \text{Item j outfit} = \frac{\sum_{i=1}^{n} (O_{ij} - E_{ij})^2/W_{ij}}{n} \\ \text{Respondent i outfit} = \frac{\sum_{j=1}^{J} (O_{ij} - E_{ij})^2/W_{ij}}{J} \end{aligned}\end{split}\]

Where:

\(J\) is the number of items,
\(n\) is the number of respondents,
\(O_{ij}\) is the observed score on the \(j\)-th item from the \(i\)-th respondent.
\(E_{ij}\) is the expected score on the \(j\)-th item from the \(i\)-th respondent, calculated from the IRT model.
\(W_{ij}\) is the weight on the \(j\)-th item from the \(i\)-th respondent. This is the variance of the item score \(W_{ij}=\sum^{M_j}_{m=0}(m-E_{ij})^2P_{ijm}\) where \(M_j\) is the maximum item score and \(P_{ijm}\) is the model probability of a score \(m\) on the \(j\)-th item from the \(i\)-th respondent.
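The item-level formulas can be traced with a small plain-Python sketch. It assumes the model probabilities for every score are given as nested lists; `item_fit`, `observed` and `probs` are illustrative names, not irtorch functions.

```python
def item_fit(observed, probs):
    """Return (infit, outfit) lists with one value per item.

    observed[i][j] is the score of respondent i on item j and
    probs[i][j][m] is the model probability of score m."""
    n, n_items = len(observed), len(observed[0])
    infit, outfit = [], []
    for j in range(n_items):
        sq_sum = w_sum = ratio_sum = 0.0
        for i in range(n):
            p = probs[i][j]
            e = sum(m * pm for m, pm in enumerate(p))             # E_ij
            w = sum((m - e) ** 2 * pm for m, pm in enumerate(p))  # W_ij
            sq = (observed[i][j] - e) ** 2                        # (O_ij - E_ij)^2
            sq_sum += sq
            w_sum += w
            ratio_sum += sq / w
        infit.append(sq_sum / w_sum)
        outfit.append(ratio_sum / n)
    return infit, outfit


# Two respondents, two dichotomous items with hypothetical probabilities.
observed = [[1, 1], [0, 1]]
probs = [
    [[0.5, 0.5], [0.2, 0.8]],
    [[0.5, 0.5], [0.2, 0.8]],
]
infit, outfit = item_fit(observed, probs)
```

Values near 1 indicate that the squared residuals are about as large as the model variance predicts. The second item here gives 0.25 for both statistics because its responses are more predictable than the model expects.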

latent_group_probabilities(data: Tensor = None, theta: Tensor = None, latent_variable: int = 1, groups: int = 10, theta_estimation: str = 'ML', rescale: bool = True) → tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Group the respondents based on their ordered latent variable scores. Calculate both the observed and IRT model probabilities for each possible item response, within each group.

If ‘data’ is not supplied, the function defaults to using the model’s training data.

Parameters

data (torch.Tensor, optional) – A 2D tensor containing test data. Each row corresponds to one respondent and each column represents an item. (default is None and uses the model’s training data)
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data. If not provided, they will be computed using theta_estimation. (default is None)
latent_variable (int, optional) – Specifies the latent variable along which ordering and grouping should be performed. (default is 1)
groups (int) – The number of groups. (default is 10)
theta_estimation (str, optional) – Method used to obtain the theta scores. Can be ‘NN’, ‘ML’, ‘EAP’ or ‘MAP’ for neural network, maximum likelihood, expected a posteriori or maximum a posteriori respectively. (default is ‘ML’)
rescale (bool, optional) – Whether to group the latent scores on the theta transformation scale if it exists. Note: for uni-dimensional models, all monotone scale transformations are equivalent in this case. (default is True)

Returns

A 3D torch tensor with data group averages. The first dimension represents the groups, the second dimension represents the items and the third dimension represents the item categories.

A 3D torch tensor with model group averages. The first dimension represents the groups, the second dimension represents the items and the third dimension represents the item categories.

The third tensor contains the average latent variable values within each group along the specified latent_variable.

Return type

tuple[torch.Tensor, torch.Tensor, torch.Tensor]
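A rough sketch of the observed side of this computation, in plain Python, is given here. It assumes near-equal groups of sorted respondents and integer-coded responses starting at 0; the function name and the `n_categories` argument are hypothetical. The model group averages would be obtained the same way by averaging the model probabilities instead of the response indicators.

```python
def group_observed_proportions(theta, data, n_categories, groups):
    """Return (proportions, group_means).

    proportions[g][j][m] is the share of respondents in group g who
    answered category m on item j; group_means[g] is their mean theta."""
    order = sorted(range(len(theta)), key=lambda i: theta[i])
    n, n_items = len(order), len(data[0])
    proportions, group_means = [], []
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        group_means.append(sum(theta[i] for i in members) / len(members))
        per_item = []
        for j in range(n_items):
            counts = [0] * n_categories[j]
            for i in members:
                counts[data[i][j]] += 1
            per_item.append([c / len(members) for c in counts])
        proportions.append(per_item)
    return proportions, group_means


# Four respondents, two dichotomous items, two groups (hypothetical values).
theta = [-1.0, 1.0, 0.0, 2.0]
data = [[0, 0], [1, 1], [0, 1], [1, 1]]
proportions, group_means = group_observed_proportions(theta, data, [2, 2], groups=2)
```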

log_likelihood(data: Tensor = None, theta: Tensor = None, reduction: str = 'sum', level: str = 'all', **kwargs)

Calculate the log-likelihood for the provided data.

If ‘data’ is not supplied, the function defaults to using the model’s training data.

Parameters

data (torch.Tensor, optional) – A 2D tensor containing test data. Each row corresponds to one respondent and each column represents an item. (default is None and uses the model’s training data)
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
reduction (str, optional) – Specifies the reduction method for the log-likelihood. Can be ‘sum’, ‘none’ or ‘mean’. (default is ‘sum’)
level (str, optional) – For reductions other than ‘none’, specifies the level at which the log-likelihood is summed/averaged. Can be ‘all’, ‘item’ or ‘respondent’. For example, for ‘item’ the log-likelihood is summed/averaged for each item over the respondents. (default is ‘all’)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

The log-likelihood for the provided data.

Return type

torch.Tensor
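How `reduction` and `level` interact can be sketched in plain Python. The sketch assumes the category probabilities of each response are given as nested lists; it only illustrates the reduction logic described above and is not the library's implementation.

```python
import math


def log_likelihood(probs, data, reduction="sum", level="all"):
    """probs[i][j][m] is the model probability of category m,
    data[i][j] is the observed category."""
    ll = [[math.log(p[x]) for p, x in zip(p_row, x_row)]
          for p_row, x_row in zip(probs, data)]
    if reduction == "none":
        return ll  # one value per respondent and item
    agg = sum if reduction == "sum" else (lambda v: sum(v) / len(v))
    if level == "item":
        return [agg(col) for col in zip(*ll)]   # reduce over respondents
    if level == "respondent":
        return [agg(row) for row in ll]         # reduce over items
    return agg([v for row in ll for v in row])


# Hypothetical probabilities for 2 respondents and 2 dichotomous items.
probs = [
    [[0.5, 0.5], [0.25, 0.75]],
    [[0.5, 0.5], [0.25, 0.75]],
]
data = [[0, 1], [1, 0]]

total = log_likelihood(probs, data)
per_item = log_likelihood(probs, data, level="item")
mean_per_respondent = log_likelihood(probs, data, reduction="mean", level="respondent")
```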

marginal_reliability(latent_density_method: str = 'data', population_data: Tensor = None, trapezoidal_segments: int = 1000, sample_size: int = 100000, degrees: list[int] = None, rescale: bool = True) → Tensor

Computes the marginal reliability for the test over the population latent space density (see e.g. Cheng et al. [5]). For ‘qmvn’ and ‘gmm’ densities, the trapezoidal rule is used for integral approximation.

Parameters

latent_density_method (str, optional) –
Specifies the method used to approximate the latent space density. Possible options are:
- ‘data’ averages over the theta scores from the population data.
- ‘encoder sampling’ samples theta scores from the encoder. Only available for VariationalAutoencoderIRT models.
- ‘qmvn’ for quantile multivariate normal approximation of a multivariate joint density function (QuantileMVNormal class).
- ‘gmm’ for a gaussian mixture model.
- ‘standard normal’ assumes a standard normal distribution for the latent variables with identity covariance matrix.
population_data (torch.Tensor, optional) – The population data used for approximating sum score probabilities. Default is None and uses the training data.
trapezoidal_segments (int, optional) – The number of integration approximation intervals for each theta dimension. (Default is 1000)
sample_size (int, optional) – Sample size for the ‘encoder sampling’ method. (Default is 100000)
degrees (list[int], optional) – For multidimensional models. A list of angles in degrees between 0 and 90, one for each latent variable. Specifies the direction in which to compute the reliability. (default is None and computes reliability for each dimension separately)
rescale (bool, optional) – Whether to compute the reliability on the rescaled latent scale if it exists. (default is True)

Returns

A 1D tensor containing the marginal reliability for each dimension, or a single value if degrees is specified.

Return type

torch.Tensor

Notes

Marginal reliability estimates the reliability of the test across the population defined by the latent density function \(f(\boldsymbol{\theta})\). It is computed using numerical integration over the latent space:

\[\rho = \int_{-\infty}^{\infty} \frac{I(\boldsymbol{\theta})}{I(\boldsymbol{\theta}) + 1/\sigma^2_\theta} f(\boldsymbol{\theta}) \, d\boldsymbol{\theta}\]

where \(I(\boldsymbol{\theta})\) is the test information and \(\sigma^2_\theta\) is the variance of the latent variable in the population.
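For a unidimensional model the integral can be approximated with the trapezoidal rule, as in this plain-Python sketch. It makes several assumptions that are not taken from the library: the items follow a two-parameter logistic model with hypothetical (a, b) parameters, the latent density is standard normal (so \(\sigma^2_\theta = 1\)), and the integration range is truncated to [-6, 6].

```python
import math


def information_2pl(theta, items):
    """Test information for 2PL items given as (a, b) pairs:
    the sum of a^2 * P * (1 - P) over the items."""
    total = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total


def marginal_reliability(items, segments=1000, lower=-6.0, upper=6.0):
    """Trapezoidal approximation of the integral of
    I / (I + 1 / sigma^2) * f(theta) with f standard normal, sigma^2 = 1."""
    step = (upper - lower) / segments
    total = 0.0
    for k in range(segments + 1):
        theta = lower + k * step
        info = information_2pl(theta, items)
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        weight = 0.5 if k in (0, segments) else 1.0  # half weight at the ends
        total += weight * info / (info + 1.0) * density
    return total * step


# Hypothetical tests with 5 and 20 identical items.
short_form = [(1.0, 0.0)] * 5
long_form = [(1.0, 0.0)] * 20
rho_short = marginal_reliability(short_form)
rho_long = marginal_reliability(long_form)
```

Because the integrand increases with the test information at every theta, the longer test has the higher marginal reliability.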

mutual_information_difference(data: Tensor = None, theta: Tensor = None, sample_hypothesis_test: bool = False, samples: int = 1000, log_base: float = 2.0, **kwargs) → tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]

Compute the mutual information difference (MID) and the absolute value of mutual information difference (AMID) statistic [9] for the provided data to test for conditional independence among items given \(\theta\) (local independence).

Parameters

data (torch.Tensor, optional) – The data used to compute the MID and AMID statistics. Uses the model’s training data if not provided.
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
sample_hypothesis_test (bool, optional) – Whether to sample from the null hypothesis distribution for the AMID statistic and perform a statistical test for each item pair. (default is False)
samples (int, optional) – The number of samples to draw from the null hypothesis distribution. (default is 1000)
log_base (float, optional) – The base of the logarithm used to compute the entropy. (default is 2.0)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

A tuple with three data frames. The first two are the MID and AMID statistics for each item pair. The third data frame contains the p-values of the AMID tests if sample_hypothesis_test is True.

Return type

tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]

Examples

>>> import irtorch
>>> from irtorch.models import GeneralizedPartialCredit
>>> from irtorch.estimation_algorithms import MML
>>> data = irtorch.load_dataset.swedish_national_mathematics_1()
>>> model = GeneralizedPartialCredit(data)
>>> model.fit(train_data=data, algorithm=MML())
>>> mid, amid, p_value = model.evaluate.mutual_information_difference(data, sample_hypothesis_test=True, samples=300)

predictions(data: Tensor = None, theta: Tensor = None, **kwargs) → DataFrame

Compute precision, recall, and F1 score for each item using the model’s predictions.

This method works by first getting the predicted response (i.e. the one with the highest probability) for each item, then comparing it to the true responses. For every response category of each item, it calculates the following metrics:

\[\begin{split}\text{precision} =& \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \\ \text{recall} =& \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \\ \text{F1} =& \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\end{split}\]

where true positives are the number of correct predictions, false positives are the number of times the response was predicted but was incorrect, and false negatives are the number of times the response was not predicted when it should have been. Each metric is then averaged across all categories for each item. Weighted versions based on the proportion of responses in each category are also computed. Missing responses are automatically ignored.

Parameters

data (torch.Tensor, optional) – The input data. If not provided, the model’s training data is used.
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

A pandas dataframe with the precision, recall and F1 score (and their weighted variants) for each item. Each row corresponds to one item.

Return type

pd.DataFrame
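The per-item metrics can be reproduced by hand with a short plain-Python sketch for a single item. It assumes there are no missing responses and returns 0.0 when a denominator is zero; that convention and the function name are assumptions, not documented library behaviour.

```python
def item_prf(true, pred):
    """Return (macro, weighted) tuples of (precision, recall, F1) for one item."""
    categories = sorted(set(true) | set(pred))
    rows = []
    for c in categories:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        share = (tp + fn) / len(true)  # proportion of responses in category c
        rows.append((precision, recall, f1, share))
    # Unweighted average across categories.
    macro = tuple(sum(r[k] for r in rows) / len(rows) for k in range(3))
    # Average weighted by the proportion of responses in each category.
    weighted = tuple(sum(r[k] * r[3] for r in rows) for k in range(3))
    return macro, weighted


# Observed and predicted responses of six respondents on one three-category item.
true = [0, 0, 1, 1, 1, 2]
pred = [0, 1, 1, 1, 2, 2]
macro, weighted = item_prf(true, pred)
```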

q3(data: Tensor = None, theta: Tensor = None, sample_hypothesis_test: bool = False, samples: int = 1000, **kwargs) → tuple[pandas.DataFrame, pandas.DataFrame]

Compute the Q3 statistic [9] for the provided data to test for conditional independence among items given \(\theta\) (local independence).

Parameters

data (torch.Tensor, optional) – The data used to compute the Q3 statistic. Uses the model’s training data if not provided.
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
sample_hypothesis_test (bool, optional) – Whether to sample from the null hypothesis distribution for the Q3 statistic and perform a statistical test for each item pair. (default is False)
samples (int, optional) – The number of samples to draw from the null hypothesis distribution. (default is 1000)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

A tuple with the Q3 statistic for each item pair and the corresponding p-values of the Q3 tests if sample_hypothesis_test is True.

Return type

tuple[pd.DataFrame, pd.DataFrame]

Examples

>>> from irtorch.models import GeneralizedPartialCredit
>>> from irtorch.estimation_algorithms import MML
>>> from irtorch.load_dataset import swedish_national_mathematics_1
>>> data = swedish_national_mathematics_1()
>>> model = GeneralizedPartialCredit(data)
>>> model.fit(train_data=data, algorithm=MML())
>>> q3, p_value = model.evaluate.q3(data, sample_hypothesis_test=True, samples=300)
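For intuition, the Q3 statistic for an item pair is the Pearson correlation between the two items' residuals (observed minus expected scores) across respondents. The plain-Python sketch below works on a hypothetical residual matrix and does not reproduce the library's hypothesis test sampling; `pearson` and `q3_matrix` are illustrative names.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def q3_matrix(residuals):
    """residuals[i][j] is the observed minus expected score of respondent i on item j."""
    columns = list(zip(*residuals))
    n_items = len(columns)
    return [[pearson(columns[j], columns[k]) for k in range(n_items)]
            for j in range(n_items)]


# Hypothetical residuals for 4 respondents and 3 items.
residuals = [
    [0.5, 0.4, -0.5],
    [-0.5, -0.6, 0.5],
    [0.3, 0.4, -0.3],
    [-0.3, -0.2, 0.3],
]
q3_values = q3_matrix(residuals)
```

Large absolute values off the diagonal suggest that a pair of items shares dependence that the latent variables do not account for.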

residuals(data: Tensor = None, theta: Tensor = None, average_over: str = 'none', **kwargs) → Tensor

Compute model residuals using the supplied data.

For multiple choice models, the residuals are computed as 1 - the probability of the selected response option. For other models, the residuals are computed as the difference between the observed and model expected item scores.

Parameters

data (torch.Tensor) – The input data.
theta (torch.Tensor, optional) – The latent variable theta scores for the provided data on the original theta scale. If not provided, they will be computed using irtorch.models.BaseIRTModel.latent_scores().
average_over (str = "none", optional) – Whether to average the residuals and over which level. Can be ‘everything’, ‘items’, ‘respondents’ or ‘none’. Use ‘none’ for no averaging. For example, with ‘respondents’ the residuals are averaged over all respondents, giving one average per item. (default is ‘none’)
**kwargs (dict, optional) – Additional keyword arguments used for theta estimation. Refer to irtorch.models.BaseIRTModel.latent_scores() for additional details.

Returns

The residuals.

Return type

torch.Tensor
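The two residual definitions can be compared with a small plain-Python sketch. It assumes the model probabilities are given as nested lists; the `multiple_choice` flag and the function name are illustrative, not irtorch arguments.

```python
def residuals(data, probs, multiple_choice=False):
    """data[i][j] is the observed response and probs[i][j][m] is the
    model probability of response m."""
    out = []
    for x_row, p_row in zip(data, probs):
        row = []
        for x, p in zip(x_row, p_row):
            if multiple_choice:
                # 1 minus the probability of the selected option.
                row.append(1.0 - p[x])
            else:
                # Observed score minus model expected score.
                expected = sum(m * pm for m, pm in enumerate(p))
                row.append(x - expected)
        out.append(row)
    return out


# Two respondents; item 1 has scores 0-2, item 2 has scores 0-1 (hypothetical).
data = [[2, 0], [1, 1]]
probs = [
    [[0.2, 0.3, 0.5], [0.6, 0.4]],
    [[0.2, 0.3, 0.5], [0.6, 0.4]],
]
score_residuals = residuals(data, probs)
choice_residuals = residuals(data, probs, multiple_choice=True)

# Averaging over respondents leaves one value per item.
per_item_average = [sum(col) / len(col) for col in zip(*score_residuals)]
```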

sum_score_probabilities(latent_density_method: str = 'data', population_data: Tensor = None, trapezoidal_segments: int = 1000, sample_size: int = 100000)

Computes the marginal probabilities for each sum score, averaged over the latent space density. For ‘qmvn’ and ‘gmm’ densities, the trapezoidal rule is used for integral approximation.

Parameters

latent_density_method (str, optional) –
Specifies the method used to approximate the latent space density. Possible options are:
- ‘data’ averages over the theta scores from the population data.
- ‘encoder sampling’ samples theta scores from the encoder. Only available for VariationalAutoencoderIRT models.
- ‘qmvn’ for quantile multivariate normal approximation of a multivariate joint density function (QuantileMVNormal class).
- ‘gmm’ for a gaussian mixture model.
- ‘standard normal’ assumes a standard normal distribution for the latent variables.
population_data (torch.Tensor, optional) – The population data used for approximating sum score probabilities. Default is None and uses the training data.
trapezoidal_segments (int, optional) – The number of integration approximation intervals for each theta dimension. (Default is 1000)
sample_size (int, optional) – Sample size for the ‘encoder sampling’ method. (Default is 100000)

Returns

A 1D tensor with the probability for each total score.

Return type

torch.Tensor
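One way to picture the ‘data’ method is sketched below in plain Python. For each respondent's theta score, the conditional sum score distribution is built item by item under local independence, and the distributions are then averaged over respondents. The recursion and the function names are assumptions made for illustration and are not necessarily how irtorch computes the result.

```python
def sum_score_distribution(item_probs):
    """Distribution of the total score for independent items, where
    item_probs[j][m] is the probability of score m on item j."""
    dist = [1.0]  # before any item, the total score is 0 with certainty
    for p in item_probs:
        new = [0.0] * (len(dist) + len(p) - 1)
        for s, ds in enumerate(dist):
            for m, pm in enumerate(p):
                new[s + m] += ds * pm
        dist = new
    return dist


def marginal_sum_score_probabilities(probs):
    """probs[i][j][m] is the model probability of score m on item j
    evaluated at the theta score of respondent i."""
    per_person = [sum_score_distribution(p_row) for p_row in probs]
    n = len(per_person)
    return [sum(col) / n for col in zip(*per_person)]


# Hypothetical probabilities at the theta scores of two respondents.
probs = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.2, 0.8], [0.5, 0.5]],
]
marginal = marginal_sum_score_probabilities(probs)
```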