5 Functions

Below is the documentation of the functions developed for this project, which are used throughout the openSIMD procedure.

5.1 normalScores

This function calculates the normal scores for each indicator. The normal score is defined as follows:

\[\begin{equation*} y_{i} = \phi^{-1}\left(\frac{r_{i}}{n + 1}\right) \end{equation*}\]

where: \(\phi^{-1}\) is the inverse cumulative normal (probit) function, \(r_{i}\) is the rank of the i’th observation and \(n\) is the number of non-missing observations for the ranking variable. This is the inverse cumulative normal probability density of proportional ranks. The resulting variable should appear normally distributed regardless of the input data. We translated this approach using the SAS documentation as a guide resulting in the following R function.

normalScores <- function(
  v,                  # a numeric vector as the input variable
  ties = "average",   # passed to ties.method argument in rank()
  forwards = TRUE     # smallest numerical value on left? default is TRUE 
) {
  
  r <- rank(v, ties.method = ties)
  n <- length(na.omit(v))
  
  rn <- r / (n + 1)
  
  y <- qnorm(rn, mean = 0, sd = 1, lower.tail = forwards)
  
  return(y)
  
}

The function takes a numeric vector as its input v. It first ranks this input r and then calculates the proportional rank rn. The final step is to apply the cumulative normal probability using the qnorm function. The return value is a numeric vector of the same length as the input v.

5.2 replaceMissing

This function replaces missing values, once normalised indicator scores have been calculated. The function finds missing values (as well as Inf and -Inf) in a vector and replaces them with 0. This is used in openSIMD where a data zone has zero population, or has a missing value for an indicator, and we want it to sit in the centre of the distribution and so we assign a value of 0.

replaceMissing <- function(v) replace(v, is.na(v) | v == Inf | v == -Inf, 0)

The function takes a numeric vector v and returns a vector of equal length with missing values replaced with 0.

5.3 getFAWeights

This function performs a factor analysis using the psych::fa function. It then extracts the loadings on the first resulting factor, converts them to proportions of the sum of loadings, the weights, and then returns them as individual elements of a list. This is designed to be equivalent to the SAS procedure in previous SIMD calculations, however the results are only comparable up to two decimal places, likely due to differences in the implementation of factor analysis in the two packages (see section on Quality Assurance).

getFAWeights <- function(dat, ...) {
  
  fact <- psych::fa(dat, nfactors = 1, fm = "ml", rotate = "none", ...)
  
  f1_scores <- as.data.frame(fact$weights) %>% select(ML1)
  
  f1_weights <- f1_scores / sum(f1_scores)
  
  # This is just to make each weight an individual element of a list
  # For compatibility with purrr::map2() in the next step, combineWeightsAndNorms()
  return(lapply(seq_along(f1_weights$ML1), function(i) f1_weights$ML1[i]))

  }

The function takes a data.frame dat which contains all of the variables for factor analysis. The return value is a list with individual elements corresponding to the weights of variables in the same order as the collumns in the input data. A list is returned here for compatibility with the next function combineWeightsAndNorms; however, this can be easily converted to a vector with unlist.

5.4 combineWeightsAndNorms

This function takes the normalised indicator scores and the weights derived from factor analysis, multiplies them out and then takes the sum of these weighted indicator scores to get the final score for that domain. Weights and indicators need to be in the same order.

combineWeightsAndNorms <- function(weights, norms) {
  
  combined <- purrr::map2(weights, norms, ~ .x * .y)
  
  combined %>% data.frame %>% rowSums
  
}

The function takes a list of weights (generated by getFAWeights) and a data.frame of normalised scores. The function returns a numeric vector containing the combined domain score.

5.5 expoTransform

This function exponentially transforms the (inverted) domain ranks.

expoTransform <- function(ranks) {
  
  prop_ranks <- ranks / max(ranks)

  expo <- -23 * log(1 - prop_ranks * (1 - exp( -100 / 23)))
  
  return(expo)
}

The function takes a numeric vector of the (inverted) ranks and returns a numeric vector of equal length containing the transformed values.

5.6 reassignRank

This function manually reassigns ranks to individual data zones. It can be used, for example, when the domain ranking needs to reflect a population of zero in a data zone.

reassignRank <- function(data, domain, data_zone, end = "max", offset = 0) {
  
  if(end == "max") {
    data[data$data_zone == data_zone, domain] <- 
      max(data[, domain], na.rm = TRUE) - (offset + 0.1)
  }else
    if(end == "min") {
      data[data$data_zone == data_zone, domain] <- 
        min(data[, domain], na.rm = TRUE) + (offset + 0.1)
    }
  
  data[, domain] <- rank(data[, domain])
  
  return(data)
}

The function takes a data.frame containing a column named ‘data_zone’ and a ranked variable. The function changes the rank of the data zone in question (with an optional offset), and it then re-ranks the whole variable. The function returns the corrected data.frame.