This function runs the comparison data (CD) approach of Ruscio & Roche (2012).

CD(
  response,
  nfact.max = 10,
  N.pop = 10000,
  N.Samples = 500,
  Alpha = 0.3,
  cor.type = "pearson",
  use = "pairwise.complete.obs",
  vis = TRUE,
  plot = TRUE
)

Arguments

response

A required N × I matrix or data.frame consisting of the responses of N individuals to I items.

nfact.max

The maximum number of factors considered by the CD approach. (default = 10)

N.pop

Size of each simulated finite population. (default = 10,000)

N.Samples

Number of samples drawn from each population. (default = 500)

Alpha

Alpha level used when testing the statistical significance (Wilcoxon Rank Sum and Signed Rank Tests) of the improvement from an additional factor. (default = .30)

cor.type

A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See cor.

use

An optional character string giving a method for computing covariances in the presence of missing values. This must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See cor.
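As a minimal sketch of what these two arguments control, assuming CD hands them to stats::cor unchanged (this page only points to cor, so that is an assumption), the base R calls below compute a Spearman correlation matrix under two missing-value rules. The toy matrix `x` is made up for illustration.

```r
# Toy response matrix (hypothetical data): 6 respondents, 3 items, one missing value
x <- cbind(
  item1 = c(1, 2, 3, 4, 5, 6),
  item2 = c(2, 1, 4, 3, 6, 5),
  item3 = c(6, 5, NA, 3, 2, 1)
)

# cor.type corresponds to `method`; `use` decides how missing values are handled
r_pairwise   <- cor(x, method = "spearman", use = "pairwise.complete.obs")
r_everything <- cor(x, method = "spearman", use = "everything")

print(round(r_pairwise, 3))
print(r_everything)  # correlations involving item3 are NA under "everything"
```

Pairwise deletion keeps every respondent who answered both items in a pair, which is why it yields a complete matrix here while "everything" does not.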

vis

A Boolean variable that will print the factor retention results when set to TRUE, and will not print when set to FALSE. (default = TRUE)

plot

A Boolean variable that will draw the CD plot when set to TRUE, and will not draw it when set to FALSE. See plot.CD. (default = TRUE)

Value

An object of class CD, which is a list containing the following components:

nfact

The number of factors to be retained.

RMSE.Eigs

A matrix containing the root mean square error (RMSE) of the eigenvalues produced by each simulation for every number of factors considered.

Sig

A boolean variable indicating whether the significance level of the Wilcoxon Rank Sum and Signed Rank Tests has reached Alpha.
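A sketch of how a matrix shaped like RMSE.Eigs (one row per simulated sample, one column per number of factors) can be summarised in base R. The matrix below is filled with made-up numbers purely for illustration; it is not output from CD, and the names `rmse_eigs`, `col_medians` and `p_12` are hypothetical.

```r
set.seed(7)

# Hypothetical stand-in for CD.obj$RMSE.Eigs: 500 samples x 4 candidate factor counts
rmse_eigs <- sapply(c(2.5, 1.7, 0.6, 0.6), function(m) rnorm(500, mean = m, sd = 0.05))
colnames(rmse_eigs) <- paste0("nfact", 1:4)

# Median RMSE for each candidate number of factors
col_medians <- apply(rmse_eigs, 2, median)
print(round(col_medians, 2))

# One-sided test: is the RMSE with 2 factors lower than with 1 factor?
p_12 <- wilcox.test(rmse_eigs[, 2], rmse_eigs[, 1], alternative = "less")$p.value
print(p_12 < 0.30)
```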

Details

Ruscio and Roche (2012) proposed determining the number of factors through comparison data (CD). The method identifies the appropriate number of factors by finding the solution that best reproduces the pattern of empirical eigenvalues. CD uses an iterative procedure: it generates comparison data with a known factor structure, taking previous factors into account. Initially, CD tests whether simulated comparison data with one latent factor (j = 1) reproduce the empirical eigenvalue pattern significantly worse than data with two latent factors (j + 1). If so, CD increases j until further improvements are no longer significant or a preset maximum number of factors is reached. Specifically, CD involves five steps:
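The "pattern of eigenvalues" referred to here is the set of eigenvalues of the item correlation matrix. A minimal base R sketch on simulated toy data follows; the two-factor loading matrix, the sample size and all object names are assumptions chosen for illustration, not part of the package.

```r
set.seed(1)

# Hypothetical two-factor structure: 6 items, 3 per factor, loadings of 0.7
loadings <- rbind(
  cbind(rep(0.7, 3), 0),
  cbind(0, rep(0.7, 3))
)
n <- 500
factors <- matrix(rnorm(n * 2), n, 2)                    # uncorrelated factor scores
noise   <- matrix(rnorm(n * 6), n, 6) * sqrt(1 - 0.7^2)  # unique parts
x <- factors %*% t(loadings) + noise

# Eigenvalues of the correlation matrix: the target that CD tries to reproduce
eigs_emp <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
print(round(eigs_emp, 2))
```

With this structure the first two eigenvalues stand clearly above the rest, which is the kind of pattern the comparison data are asked to match.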

1. Generate random data with either j or j+1 latent factors and calculate the eigenvalues of the respective correlation matrices.

2. Compute the root mean square error (RMSE) of the difference between the empirical and simulated eigenvalues using the formula $$ RMSE = \sqrt{\sum_{i=1}^{p} (\lambda_{emp,i} - \lambda_{sim,i})^2} $$ where:

  • \(\lambda_{emp,i}\): The i-th empirical eigenvalue.

  • \(\lambda_{sim,i}\): The i-th simulated eigenvalue.

  • \(p\): The number of items or eigenvalues.

  This step produces two RMSEs, corresponding to the different numbers of latent factors.

3. Repeat steps 1 and 2 a total of 500 times (the package default).

4. Use a one-sided Wilcoxon test (alpha = 0.30 by default) to assess whether the RMSE is significantly reduced under the (j+1)-factor condition.

5. If the difference in RMSE is not significant, CD suggests selecting j factors. Otherwise, j is increased by 1, and steps 1 to 4 are repeated.
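The five steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's C++ implementation: the population generator `make_population` is a simplified stand-in (normal data with loadings taken from the first j principal components), not the data-generation routine of Ruscio and Roche (2012), and the toy data, the reduced simulation sizes and every helper name are hypothetical.

```r
set.seed(42)

# Toy "empirical" data (hypothetical): 2 factors, 8 items, N = 300
p <- 8; n <- 300
true_load <- rbind(cbind(rep(0.7, 4), 0), cbind(0, rep(0.7, 4)))
x <- matrix(rnorm(n * 2), n, 2) %*% t(true_load) +
  matrix(rnorm(n * p), n, p) * sqrt(1 - 0.7^2)

eigenvalues <- function(d) eigen(cor(d), symmetric = TRUE, only.values = TRUE)$values
eigs_emp <- eigenvalues(x)

# Stand-in generator (an assumption, NOT the routine of Ruscio & Roche):
# a normal population whose j-factor loadings come from the first j
# principal components of the empirical correlation matrix.
make_population <- function(r_emp, j, n_pop) {
  e <- eigen(r_emp, symmetric = TRUE)
  load <- e$vectors[, 1:j, drop = FALSE] %*% diag(sqrt(e$values[1:j]), j)
  uniq <- pmax(1 - rowSums(load^2), 0)
  matrix(rnorm(n_pop * j), n_pop, j) %*% t(load) +
    sweep(matrix(rnorm(n_pop * ncol(r_emp)), n_pop), 2, sqrt(uniq), `*`)
}

# Steps 1-3: RMSE between empirical and simulated eigenvalues, once per sample
rmse_for <- function(j, n_pop = 2000, n_samples = 50) {
  pop <- make_population(cor(x), j, n_pop)
  replicate(n_samples, {
    s <- pop[sample.int(n_pop, n), , drop = FALSE]
    sqrt(sum((eigs_emp - eigenvalues(s))^2))
  })
}

# Steps 4-5: add factors while the one-sided Wilcoxon test stays significant
alpha <- 0.30; nfact_max <- 4
rmse <- matrix(NA_real_, 50, nfact_max)
rmse[, 1] <- rmse_for(1)
nfact <- 1
for (j in 2:nfact_max) {
  rmse[, j] <- rmse_for(j)
  p_val <- wilcox.test(rmse[, j], rmse[, j - 1], alternative = "less")$p.value
  if (p_val >= alpha) break  # no significant improvement: keep j - 1 factors
  nfact <- j
}
print(nfact)
```

Columns of `rmse` beyond the stopping point stay NA because the loop exits as soon as an extra factor no longer lowers the RMSE significantly.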

The code is implemented based on the resources available at:

Since the CD approach requires extensive data simulation and computation, C++ code is used to speed up the process.

References

Auerswald, M., & Moshagen, M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods, 24(4), 468-491. https://doi.org/10.1037/met0000200.

Goretzko, D., & Bühner, M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychological Methods, 25(6), 776-786. https://doi.org/10.1037/met0000262.

Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282–292. http://dx.doi.org/10.1037/a0025697.

Author

Haijiang Qin <Haijiang133@outlook.com>

Examples

library(EFAfactors)
set.seed(123)

##Take the data.bfi dataset as an example.
data(data.bfi)

response <- as.matrix(data.bfi[, 1:25]) ## loading data
response <- na.omit(response) ## Remove samples with NA/missing values

## Transform the scores of reverse-scored items to normal scoring
response[, c(1, 9, 10, 11, 12, 22, 25)] <- 6 - response[, c(1, 9, 10, 11, 12, 22, 25)] + 1


## Run CD function with default parameters.
# \donttest{
CD.obj <- CD(response)
#> CD is simulating data: nfact= 1/10
#> CD is simulating data: nfact= 2/10
#> CD is simulating data: nfact= 3/10
#> CD is simulating data: nfact= 4/10
#> CD is simulating data: nfact= 5/10
#> CD is simulating data: nfact= 6/10
#> CD is simulating data: nfact= 7/10
#> CD is simulating data: nfact= 8/10
#> CD is simulating data: nfact= 9/10
#> CD is simulating data: nfact=10/10
#> The number of factors suggested by CD is 9 .

print(CD.obj)
#> The number of factors suggested by CD is 9 .

## CD plot
plot(CD.obj)


## Get the RMSE.Eigs and nfact results.
RMSE.Eigs <- CD.obj$RMSE.Eigs
nfact <- CD.obj$nfact

head(RMSE.Eigs)
#>          [,1]     [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
#> [1,] 2.520280 1.710999 1.255704 0.9008949 0.5638721 0.5898673 0.4457931
#> [2,] 2.549383 1.698597 1.290050 1.0082324 0.6244967 0.5933479 0.5111748
#> [3,] 2.520387 1.695946 1.226892 1.0184523 0.6424078 0.5002352 0.6655834
#> [4,] 2.526061 1.707069 1.302093 0.9434590 0.5660613 0.5842731 0.5756487
#> [5,] 2.543030 1.711862 1.254901 0.8425360 0.6434615 0.4238260 0.5750875
#> [6,] 2.525968 1.754083 1.260790 0.8460263 0.5829216 0.5356443 0.6111790
#>           [,8]      [,9]     [,10]
#> [1,] 0.4053720 0.4021450 0.4806636
#> [2,] 0.3931302 0.6218749 0.6481592
#> [3,] 0.5173863 0.5196232 0.5485825
#> [4,] 0.5204452 0.5940384 0.5079340
#> [5,] 0.7413088 0.4815181 0.8191780
#> [6,] 0.4192279 0.4085366 0.6641017
print(nfact)
#> [1] 9

# }

## Limit the maximum number of factors to 8, with populations set to 5000.
# \donttest{
CD.obj <- CD(response, nfact.max=8, N.pop = 5000)
#> CD is simulating data: nfact= 1/8
#> CD is simulating data: nfact= 2/8
#> CD is simulating data: nfact= 3/8
#> CD is simulating data: nfact= 4/8
#> CD is simulating data: nfact= 5/8
#> CD is simulating data: nfact= 6/8
#> CD is simulating data: nfact= 7/8
#> CD is simulating data: nfact= 8/8
#> The number of factors suggested by CD is 7 .

print(CD.obj)
#> The number of factors suggested by CD is 7 .

## CD plot
plot(CD.obj)


## Get the RMSE.Eigs and nfact results.
RMSE.Eigs <- CD.obj$RMSE.Eigs
nfact <- CD.obj$nfact

head(RMSE.Eigs)
#>          [,1]     [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
#> [1,] 2.521017 1.709313 1.230259 0.8147750 0.5689392 0.6984562 0.4752621
#> [2,] 2.519007 1.723487 1.250668 0.8746175 0.4971828 0.6623962 0.6283442
#> [3,] 2.501260 1.657773 1.222843 0.8867012 0.5930649 0.4928996 0.4729325
#> [4,] 2.479979 1.706934 1.354569 0.8922971 0.7559234 0.5312592 0.4094892
#> [5,] 2.478935 1.701781 1.231123 0.8908225 0.5852438 0.5587063 0.4731640
#> [6,] 2.574056 1.656444 1.242689 0.7983774 0.5217779 0.6862998 0.6282767
#>           [,8]
#> [1,] 0.4959593
#> [2,] 0.5966837
#> [3,] 0.6403932
#> [4,] 0.4527898
#> [5,] 0.6303497
#> [6,] 0.4665600
print(nfact)
#> [1] 7

# }