Factor Forest (FF) Powered by a Tuned XGBoost Model for Determining the Number of Factors

This function invokes a tuned XGBoost model (Goretzko & Buhner, 2020; Goretzko, 2022; Goretzko & Ruscio, 2024) that can reliably determine the number of factors. The maximum number of factors that the model can identify is 8.

FF(
  response,
  cor.type = "pearson",
  use = "pairwise.complete.obs",
  vis = TRUE,
  plot = TRUE
)

Arguments

response: A required N × I matrix or data.frame consisting of the responses of N individuals to I items.
cor.type: A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See cor.
use: An optional character string giving a method for computing covariances in the presence of missing values. This must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See cor.
vis: A Boolean variable. If TRUE, the factor retention results are printed; if FALSE, they are not. (Default = TRUE)
plot: A Boolean variable. If TRUE, the FF plot is drawn; if FALSE, it is not. See plot.FF. (Default = TRUE)

Value

An object of class FF, which is a list containing the following components:

nfact: The number of factors to be retained.
probability: A 1×8 matrix containing the probabilities for factor numbers ranging from 1 to 8, where the value in the f-th column is the probability that the number of factors underlying the response data is f.
features: A 1×184 matrix containing all the features used by the tuned XGBoost model to determine the number of factors.
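As a minimal sketch of how nfact and probability plausibly relate, the snippet below picks the column with the highest probability. The probability values are made up for illustration, and the assumption that nfact equals the most probable column is not confirmed by this page.

```r
# Made-up 1x8 probability matrix, for illustration only (not real FF output).
probability <- matrix(
  c(0.02, 0.05, 0.10, 0.08, 0.60, 0.10, 0.03, 0.02),
  nrow = 1
)

# Assumption: the retained number of factors is the most probable column.
nfact <- unname(which.max(probability))
print(nfact)
```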

Details

A total of 500,000 datasets were simulated to extract features for training the tuned XGBoost model (Goretzko & Buhner, 2020; Goretzko, 2022). Each dataset was generated according to the following specifications:

Factor number: F ~ U[1,8]
Sample size: N ~ U[200,1000]
Number of variables per factor: vpf ~ U[3,10]
Factor correlation: fc ~ U[0.0,0.4]
Primary loadings: pl ~ U[0.35,0.80]
Cross-loadings: cl ~ U[0.0,0.2]
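The specifications above can be sketched in base R as follows. draw_condition is a hypothetical helper, not part of EFAfactors. It assumes that F, N, and vpf are discrete uniform over integers and that one value is drawn per dataset for each quantity; the original simulation may instead draw loadings per item.

```r
# Hypothetical helper (not part of EFAfactors): draw one simulation condition.
draw_condition <- function() {
  list(
    F   = sample(1:8, 1),        # number of factors (assumed discrete uniform)
    N   = sample(200:1000, 1),   # sample size (assumed discrete uniform)
    vpf = sample(3:10, 1),       # variables per factor (assumed discrete uniform)
    fc  = runif(1, 0.0, 0.4),    # factor correlation
    pl  = runif(1, 0.35, 0.80),  # primary loading
    cl  = runif(1, 0.0, 0.2)     # cross-loading
  )
}

set.seed(1)
cond <- draw_condition()
str(cond)
```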

A population correlation matrix was created for each dataset based on the following decomposition: $$\mathbf{\Sigma} = \mathbf{\Lambda} \mathbf{\Phi} \mathbf{\Lambda}^T + \mathbf{\Delta}$$ where $\mathbf{\Lambda}$ is the loading matrix, $\mathbf{\Phi}$ is the factor correlation matrix, and $\mathbf{\Delta}$ is a diagonal matrix with $\text{diag}(\mathbf{\Delta}) = 1 - \text{diag}(\mathbf{\Lambda} \mathbf{\Phi} \mathbf{\Lambda}^T)$. The purpose of $\mathbf{\Delta}$ is to ensure that the diagonal elements of $\mathbf{\Sigma}$ are 1.
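The decomposition can be illustrated with a small base R sketch. The dimensions, the common factor correlation fc, and the simple-structure layout (each item has one primary loading, with cross-loadings elsewhere) are assumptions chosen for illustration, not the exact procedure used to train the model.

```r
set.seed(42)
n_fact  <- 3                 # illustrative number of factors
vpf     <- 4                 # illustrative variables per factor
n_items <- n_fact * vpf

# Loading matrix: start with cross-loadings, then overwrite the primary loadings.
Lambda  <- matrix(runif(n_items * n_fact, 0.0, 0.2), n_items, n_fact)
primary <- rep(seq_len(n_fact), each = vpf)
Lambda[cbind(seq_len(n_items), primary)] <- runif(n_items, 0.35, 0.80)

# Factor correlation matrix with a common correlation (assumed value).
fc  <- 0.3
Phi <- matrix(fc, n_fact, n_fact)
diag(Phi) <- 1

# Sigma = Lambda Phi Lambda^T + Delta, with Delta forcing a unit diagonal.
common <- Lambda %*% Phi %*% t(Lambda)
Delta  <- diag(1 - diag(common))
Sigma  <- common + Delta

round(diag(Sigma), 10)       # all ones
```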

The response data for each subject were simulated using the following formula: $$X_i = L_i + \epsilon_i, \quad 1 \leq i \leq I$$ where $L_i$ follows a normal distribution $N(0, \sigma)$, representing the contribution of latent factors, and $\epsilon_i$ is the residual term following a standard normal distribution. $L_i$ and $\epsilon_i$ are uncorrelated, and $\epsilon_i$ and $\epsilon_j$ are also uncorrelated.
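One way to read this formula is as a common factor model: the common part is a linear combination of correlated factor scores, and a residual is added to each item. The sketch below follows that reading under an explicit assumption that differs from the literal text: residuals are scaled by each item's uniqueness so that the implied covariance has a unit diagonal. The loading values are made up, and this is not necessarily the authors' exact procedure.

```r
set.seed(7)
n <- 500                                   # illustrative sample size

# Made-up 6 x 2 loading matrix and factor correlation matrix.
Lambda <- matrix(c(0.7, 0.6, 0.5, 0.1, 0.0, 0.1,
                   0.1, 0.0, 0.1, 0.7, 0.6, 0.5), ncol = 2)
Phi    <- matrix(c(1, 0.3, 0.3, 1), 2, 2)

# Uniquenesses: what remains of each item's unit variance.
uniq <- 1 - diag(Lambda %*% Phi %*% t(Lambda))

# Correlated factor scores via the Cholesky factor of Phi.
eta <- matrix(rnorm(n * 2), n, 2) %*% chol(Phi)

L <- eta %*% t(Lambda)                                  # common part
E <- matrix(rnorm(n * 6), n, 6) %*% diag(sqrt(uniq))    # residual part (assumed scaling)
X <- L + E

round(cor(X), 2)
```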

For each simulated dataset, a total of 184 features are extracted and compiled into a feature vector. These features include:

1. - Number of examinees
2. - Number of items
3. - Number of eigenvalues greater than 1
4. - Proportion of variance explained by the 1st eigenvalue
5. - Proportion of variance explained by the 2nd eigenvalue
6. - Proportion of variance explained by the 3rd eigenvalue
7. - Number of eigenvalues greater than 0.7
8. - Standard deviation of the eigenvalues
9. - Number of eigenvalues accounting for 50% of the cumulative variance
10. - Number of eigenvalues accounting for 75% of the cumulative variance
11. - L1-norm of the correlation matrix
12. - Frobenius-norm of the correlation matrix
13. - Maximum-norm of the correlation matrix
14. - Average of the off-diagonal correlations
15. - Spectral-norm of the correlation matrix
16. - Number of correlations smaller than or equal to 0.1
17. - Average of the initial communality estimates
18. - Determinant of the correlation matrix
19. - Measure of sampling adequacy (MSA after Kaiser, 1970)
20. - Gini coefficient (Gini, 1921) of the correlation matrix
21. - Kolm measure of inequality (Kolm, 1999) of the correlation matrix
22. - Number of factors retained by the PA method (see PA)
23. - Number of factors retained by the EKC method (see EKC)
24. - Number of factors retained by the CD method (see CD)
25-104. - Eigenvalues from Principal Component Analysis (PCA), padded with -1000 if fewer than 80 are available
105-184. - Eigenvalues from Factor Analysis (FA), fixed at 1 factor, padded with -1000 if fewer than 80 are available
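A few of the eigenvalue-based and norm-based features in this list can be computed with base R as sketched below. The data are random placeholders, the feature names are hypothetical, and reading "L1-norm" as the maximum absolute column sum (norm type "O") is an assumption. The package's own feature extraction may differ in detail.

```r
set.seed(123)
X  <- matrix(rnorm(300 * 6), 300, 6)       # placeholder data: 300 examinees, 6 items
R  <- cor(X)
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values

features <- c(
  n_obs        = nrow(X),                                  # feature 1
  n_items      = ncol(X),                                  # feature 2
  n_eig_gt1    = sum(ev > 1),                              # feature 3
  prop_var1    = ev[1] / sum(ev),                          # feature 4
  n_eig_gt07   = sum(ev > 0.7),                            # feature 7
  sd_eig       = sd(ev),                                   # feature 8
  n_eig_50     = which(cumsum(ev) / sum(ev) >= 0.50)[1],   # feature 9
  l1_norm      = norm(R, type = "O"),                      # feature 11 (assumed definition)
  frob_norm    = norm(R, type = "F"),                      # feature 12
  max_norm     = norm(R, type = "M"),                      # feature 13
  mean_offdiag = mean(R[lower.tri(R)]),                    # feature 14
  spec_norm    = norm(R, type = "2"),                      # feature 15
  det_R        = det(R)                                    # feature 18
)

# Features 25-104: PCA eigenvalues padded with -1000 up to length 80.
ev_padded <- c(ev, rep(-1000, 80 - length(ev)))

print(round(features, 3))
```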

The FF function is implemented based on the publicly available code by Goretzko & Buhner (2020) (https://osf.io/mvrau/). The tuned XGBoost model was also obtained from this site. However, to meet the size requirements for a streamlined R package, the package stores only the core components of the tuned XGBoost model. The omitted non-core parts do not affect performance, but they contain a lot of information about the model itself, such as the number of features, subsets of samples, and data from the training process. For the complete tuned XGBoost model, please download it from https://osf.io/mvrau/.

References

Goretzko, D., & Buhner, M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychol Methods, 25(6), 776-786. https://doi.org/10.1037/met0000262.

Goretzko, D. (2022). Factor Retention in Exploratory Factor Analysis With Missing Data. Educ Psychol Meas, 82(3), 444-464. https://doi.org/10.1177/00131644211022031.

Examples

library(EFAfactors)
set.seed(123)

## Take the data.bfi dataset as an example.
data(data.bfi)

response <- as.matrix(data.bfi[, 1:25]) ## loading data
response <- na.omit(response) ## Remove samples with NA/missing values

## Transform the scores of reverse-scored items to normal scoring
response[, c(1, 9, 10, 11, 12, 22, 25)] <- 6 - response[, c(1, 9, 10, 11, 12, 22, 25)] + 1


## Run FF function with default parameters.
if (FALSE) { # \dontrun{
FF.obj <- FF(response)

print(FF.obj)

plot(FF.obj)

## Get the probability and nfact results.
probability <- FF.obj$probability
nfact <- FF.obj$nfact

print(probability)
print(nfact)

} # }