Simulating Data Following John Ruscio's RGenData

This function simulates data with $nfact$ factors based on empirical data. It implements the data-simulation step used by the CD and CDF functions, and improves on GenDataPopulation by using C++ code for faster simulation.

GenData(
  response,
  nfact = 1,
  N.pop = 10000,
  Max.Trials = 5,
  lr = 1,
  cor.type = "pearson",
  use = "pairwise.complete.obs",
  isSort = FALSE
)

Arguments

response: A required N × I matrix or data.frame consisting of the responses of N individuals to I items.
nfact: The number of factors to extract in factor analysis. (default = 1)
N.pop: Size of the finite population to simulate. (default = 10,000)
Max.Trials: The maximum number of consecutive trials without obtaining a lower RMSR. (default = 5)
lr: The learning rate for updating the correlation matrix during iteration. (default = 1)
cor.type: A character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman". See cor.
use: An optional character string specifying a method for computing covariances in the presence of missing values. This must be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). See cor.
isSort: Logical; whether to sort the simulated data in descending order. (default = FALSE)

Value

An N.pop × I matrix containing the simulated data.

Details

The core idea of GenData is to start from the correlation matrix of the empirical data and iteratively adjust it until data generated from nfact factors reproduce that matrix as closely as possible. Every value in the simulated data is drawn from the empirical data. The specific steps of GenData are as follows:

(1): Use the empirical data ($\mathbf{Y}_{emp}$) correlation matrix as the target, $\mathbf{R}_{targ}$.
(2): Simulate scores for $N.pop$ examinees on $nfact$ factors using a multivariate standard normal distribution: $$\mathbf{S}_{(N.pop \times nfact)} \sim \mathcal{N}(0, 1)$$ Simulate noise for $N.pop$ examinees on $I$ items: $$\mathbf{U}_{(N.pop \times I)} \sim \mathcal{N}(0, 1)$$
(3): Initialize $\mathbf{R}_{temp} = \mathbf{R}_{targ}$, and set the minimum Root Mean Square Residual $RMSR_{min} = \text{Inf}$. Start the iteration process.
(4): Extract $nfact$ factors from $\mathbf{R}_{temp}$, and obtain the factor loading matrix $\mathbf{L}_{shar}$. Ensure that the first element of $\mathbf{L}_{shar}$ is positive to fix the sign of the loadings.
(5): Calculate the unique loading vector $\mathbf{L}_{uniq, (I \times 1)}$, where $i$ indexes items: $$L_{uniq,i} = \sqrt{1 - \sum_{j=1}^{nfact} L_{shar, i, j}^2}$$
(6): Calculate the simulated data $\mathbf{Y}_{sim}$ for examinee $n$ and item $i$, where $\mathbf{S}_{n}$ is the $n$-th row of $\mathbf{S}$ and $\mathbf{L}_{shar, i}$ is the $i$-th row of $\mathbf{L}_{shar}$: $$Y_{sim, n, i} = \mathbf{S}_{n} \mathbf{L}_{shar, i}^T + U_{n, i} L_{uniq,i}$$
(7): Compute the correlation matrix of the simulated data, $\mathbf{R}_{simu}$.
(8): Calculate the residual correlation matrix $\mathbf{R}_{resi}$ between the target matrix $\mathbf{R}_{targ}$ and the simulated data's correlation matrix $\mathbf{R}_{simu}$: $$\mathbf{R}_{resi} = \mathbf{R}_{targ} - \mathbf{R}_{simu}$$
(9): Calculate the current RMSR: $$RMSR_{cur} = \sqrt{\frac{\sum_{i < j} \mathbf{R}_{resi, i, j}^2}{0.5 \times (I^2 - I)}}$$
(10): If $RMSR_{cur} < RMSR_{min}$, update $\mathbf{R}_{temp} = \mathbf{R}_{temp} + lr \times \mathbf{R}_{resi}$, $RMSR_{min} = RMSR_{cur}$, set $\mathbf{R}_{min, resi} = \mathbf{R}_{resi}$, and reset the count of consecutive trials without improvement $cou = 0$. If $RMSR_{cur} \geq RMSR_{min}$, update $\mathbf{R}_{temp} = \mathbf{R}_{temp} + 0.5 \times cou \times lr \times \mathbf{R}_{min, resi}$ and increment $cou = cou + 1$.
(11): Repeat steps (4) through (10) until $cou \geq Max.Trials$.
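
The steps above can be sketched in base R. This is only an illustration of the documented iteration, not the package's C++ implementation. The names extract_loadings and gen_data_sketch are hypothetical, loadings are extracted by a simple eigendecomposition (an assumption; the package may extract factors differently), communalities above 1 are truncated so the square root stays defined, and returning the data from the best iteration is likewise an assumption.

```r
# Hypothetical helper: principal-component style loadings for nfact factors.
extract_loadings <- function(R, nfact) {
  e <- eigen(R, symmetric = TRUE)
  vals <- pmax(e$values[seq_len(nfact)], 0)          # guard against negative eigenvalues
  L <- sweep(e$vectors[, seq_len(nfact), drop = FALSE], 2, sqrt(vals), "*")
  if (L[1, 1] < 0) L <- -L                           # step (4): first element positive
  L
}

# Hypothetical sketch of steps (2) to (11) for a given target correlation matrix.
gen_data_sketch <- function(R_targ, nfact = 1, N.pop = 10000,
                            Max.Trials = 5, lr = 1) {
  I <- ncol(R_targ)
  S <- matrix(rnorm(N.pop * nfact), N.pop, nfact)    # step (2): factor scores
  U <- matrix(rnorm(N.pop * I), N.pop, I)            # step (2): unique noise
  R_temp <- R_targ                                   # step (3)
  rmsr_min <- Inf
  R_min_resi <- matrix(0, I, I)
  Y_best <- NULL
  cou <- 0
  while (cou < Max.Trials) {                         # step (11)
    L_shar <- extract_loadings(R_temp, nfact)        # step (4)
    L_uniq <- sqrt(pmax(1 - rowSums(L_shar^2), 0))   # step (5), truncated at 0
    Y_sim <- S %*% t(L_shar) + sweep(U, 2, L_uniq, "*")   # step (6)
    R_resi <- R_targ - cor(Y_sim)                    # steps (7) and (8)
    rmsr_cur <- sqrt(mean(R_resi[upper.tri(R_resi)]^2))   # step (9)
    if (rmsr_cur < rmsr_min) {                       # step (10): improvement
      R_temp <- R_temp + lr * R_resi
      rmsr_min <- rmsr_cur
      R_min_resi <- R_resi
      Y_best <- Y_sim                                # assumption: keep the best data
      cou <- 0
    } else {                                         # step (10): no improvement
      R_temp <- R_temp + 0.5 * cou * lr * R_min_resi
      cou <- cou + 1
    }
  }
  list(data = Y_best, rmsr = rmsr_min)
}

# Toy target: four items with all correlations equal to 0.3.
set.seed(1)
R_targ <- matrix(0.3, 4, 4)
diag(R_targ) <- 1
out <- gen_data_sketch(R_targ, nfact = 1, N.pop = 2000)
dim(out$data)
out$rmsr
```

Because $\mathbf{S}$ and $\mathbf{U}$ are drawn once and held fixed, only the loadings change between iterations, so a drop in RMSR reflects a better $\mathbf{R}_{temp}$ rather than a new random draw.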

The iteration is implemented in C++ for speed.
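
The numbered steps do not spell out how simulated values are restricted to values found in the empirical data. One common way to do this, in the spirit of the comparison-data approach of Ruscio and Roche (2012), is rank matching: resample each empirical item, sort it, and assign values according to the rank order of the simulated scores. The sketch below assumes that approach; map_to_empirical is a hypothetical name, and whether GenData performs this mapping in exactly this way, or inside the iteration, is not stated here.

```r
# Hypothetical helper: replace each simulated continuous column with values
# resampled from the matching empirical column, keeping the simulated rank order.
map_to_empirical <- function(Y_sim, Y_emp) {
  stopifnot(ncol(Y_sim) == ncol(Y_emp))
  out <- matrix(NA_real_, nrow(Y_sim), ncol(Y_sim))
  for (j in seq_len(ncol(Y_sim))) {
    # Resample the empirical item with replacement and sort ascending
    pool <- sort(sample(Y_emp[, j], nrow(Y_sim), replace = TRUE))
    # The k-th smallest simulated score receives the k-th smallest pooled value
    out[, j] <- pool[rank(Y_sim[, j], ties.method = "first")]
  }
  out
}

# Toy data: 200 respondents, 3 items scored 1 to 6; 1000 simulated examinees.
set.seed(42)
Y_emp <- matrix(sample(1:6, 200 * 3, replace = TRUE), 200, 3)
Y_sim <- matrix(rnorm(1000 * 3), 1000, 3)
Y_map <- map_to_empirical(Y_sim, Y_emp)
all(Y_map %in% 1:6)   # every mapped value is an observed response category
```

The mapping is monotone within each column, so the rank order of examinees on each item is preserved while the marginal distribution follows the empirical item.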

References

Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282–292. http://dx.doi.org/10.1037/a0025697.

Examples

library(EFAfactors)
set.seed(123)

##Take the data.bfi dataset as an example.
data(data.bfi)

response <- as.matrix(data.bfi[, 1:25]) ## loading data
response <- na.omit(response) ## Remove samples with NA/missing values

## Reverse-score the reverse-keyed items on the 1-6 scale (x becomes 7 - x)
response[, c(1, 9, 10, 11, 12, 22, 25)] <- 7 - response[, c(1, 9, 10, 11, 12, 22, 25)]
# \donttest{
  data.simulated <- GenData(response, nfact = 1, N.pop = 10000)
  head(data.simulated)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]    5    4    4    5    5    6    3    4    6     2     6     5     3     4
#> [2,]    5    2    2    2    1    4    2    4    3     6     2     1     1     3
#> [3,]    5    5    5    1    2    5    3    6    6     2     1     1     2     4
#> [4,]    5    6    6    6    5    5    4    5    5     3     3     4     2     2
#> [5,]    1    2    4    3    5    2    4    5    2     3     6     1     2     5
#> [6,]    1    5    6    4    2    5    2    3    5     1     1     2     1     2
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
#> [1,]     4     1     1     1     6     4     4     6     5     4     1
#> [2,]     4     2     6     4     4     5     3     1     4     5     1
#> [3,]     6     5     2     2     5     6     5     5     5     6     1
#> [4,]     2     1     2     1     2     4     5     6     5     5     1
#> [5,]     2     6     6     6     4     5     5     3     3     6     1
#> [6,]     3     5     5     4     6     5     4     2     3     5     1
# }
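
As a follow-up to the example, the documented properties of the result (an N.pop × I matrix whose values come from the empirical data) can be checked directly. This is a minimal sketch: values_from_empirical is a hypothetical helper, and the call to GenData only runs when the EFAfactors package is installed.

```r
# Hypothetical helper: TRUE for each column whose simulated values all occur
# in the matching empirical column.
values_from_empirical <- function(sim, emp) {
  vapply(seq_len(ncol(sim)),
         function(j) all(sim[, j] %in% emp[, j]),
         logical(1))
}

if (requireNamespace("EFAfactors", quietly = TRUE)) {
  library(EFAfactors)
  set.seed(123)
  data(data.bfi)
  response <- na.omit(as.matrix(data.bfi[, 1:25]))
  data.simulated <- GenData(response, nfact = 1, N.pop = 10000)
  print(dim(data.simulated))   # documented shape: N.pop rows, I columns
  print(all(values_from_empirical(data.simulated, response)))
}
```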