3 Calculating Domains

In this section, I will explain the code used to calculate the individual domain scores.

3.1 Setting up

Here is a complete list of the files in openSIMD_analysis that we need for openSIMD:

  • scripts/calculations/domains.R to calculate the domains
  • scripts/calculations/openSIMD.R to calculate SIMD
  • scripts/utils/helpers.R, some utility functions (see Functions for documentation)
  • data/SIMD16 indicator data.xlsx
  • data/SIMD16 ranks and domain ranks.xlsx

If you use RStudio, you can open openSIMD_analysis.Rproj to load the project.

Starting in domains.R, the first things to do are:

  • load a few packages
  • source the utility functions
  • read in the data

You will need to make sure that the file paths in your code correspond to where your files are. If you are using RStudio and have opened openSIMD_analysis.Rproj, the directories of interest will be scripts/utils and data.

You may experience an error when reading the .xlsx files with read_excel. If this happens, you have two options. You can open the file in Excel, save it, close it, and try again; the file should then read into R without an error. Alternatively, you can save the file as a .csv file (make sure you select the correct sheet) and read it into R using the read.csv function.
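As a sketch of the .csv fallback (using a temporary file here rather than the project data), read.csv with na.strings = "*" reproduces the na = "*" behaviour of read_excel:

```r
# Illustrative only: write a tiny CSV to a temp file and read it back,
# treating "*" as missing, as the SIMD data files do.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Data_Zone,Income_rate",
             "S01006506,0.07",
             "S01006507,*"), tmp)

d_csv <- read.csv(tmp, na.strings = "*", stringsAsFactors = FALSE)
d_csv$Income_rate  # the "*" entry is now NA
```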

library(readxl)
library(dplyr)
source("../openSIMD_analysis/scripts/utils/helpers.R")
d <- read_excel("../openSIMD_analysis/data/SIMD16 indicator data.xlsx", sheet = 3, na = "*")

The data here contains the published indicators, in this case from SIMD 2016.

str(d)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6976 obs. of  36 variables:
##  $ Data_Zone                     : chr  "S01006506" "S01006507" "S01006508" "S01006509" ...
##  $ Intermediate_Zone             : chr  "Culter" "Culter" "Culter" "Culter" ...
##  $ Council_area                  : chr  "Aberdeen City" "Aberdeen City" "Aberdeen City" "Aberdeen City" ...
##  $ Total_population              : num  904 830 694 573 676 ...
##  $ Working_age_population_revised: num  605 491 519 354 414 464 322 354 710 491 ...
##  $ Income_rate                   : num  0.07 0.07 0.05 0.05 0.1 0.03 0.02 0.03 0.01 0.01 ...
##  $ Income_count                  : num  60 60 30 30 70 25 10 20 15 10 ...
##  $ Employment_rate               : num  0.07 0.05 0.03 0.06 0.07 0.03 0.02 0.03 0.01 0.01 ...
##  $ Employment_count              : num  40 25 15 20 30 15 5 10 10 5 ...
##  $ CIF                           : num  60 40 45 65 75 50 35 55 25 35 ...
##  $ ALCOHOL                       : num  103.4 33 54.3 141.9 52.3 ...
##  $ DRUG                          : num  27.6 39.9 32.9 49.2 234.9 ...
##  $ SMR                           : num  67 86 43 88 149 68 77 137 86 61 ...
##  $ DEPRESS                       : num  0.123 0.148 0.127 0.154 0.197 ...
##  $ LBWT                          : num  0 0.0256 0.0556 0.037 0 ...
##  $ EMERG                         : num  80 97.7 70 87.9 105.1 ...
##  $ Attendance                    : num  0.863 0.825 0.877 0.942 0.805 ...
##  $ Attainment                    : num  5.93 5.57 5.8 5.62 5.33 ...
##  $ Noquals                       : num  52.8 95.9 38.6 80.1 77.2 ...
##  $ NEET                          : num  0.03 0.01 0 0.04 0.04 0.03 0.08 0.02 0.03 0.03 ...
##  $ HESA                          : num  0.1304 0.1014 0.1486 0.0816 0.0857 ...
##  $ drive_petrol                  : num  2.42 3.68 3.24 2.51 2.08 ...
##  $ drive_GP                      : num  2.92 4.02 3.6 2.47 2.17 ...
##  $ drive_PO                      : num  1.52 2.74 2.09 1.96 1.67 ...
##  $ drive_primary                 : num  2.03 3.13 2.66 1.44 1.51 ...
##  $ drive_retail                  : num  1.5 2.65 1.88 2.22 1.93 ...
##  $ drive_secondary               : num  10.8 11.5 11.5 10.8 10.6 ...
##  $ PT_GP                         : num  8.44 8.33 7.85 7.43 5.14 ...
##  $ PT_Post                       : num  5.99 7.26 5.83 8.31 6.63 ...
##  $ PT_retail                     : num  5.71 6.79 5.25 8.44 6.62 ...
##  $ crime_count                   : num  8.01 4 4 NA 12.01 ...
##  $ crime_rate                    : num  88.6 48.2 57.7 NA 177.7 ...
##  $ overcrowded_count             : num  87 85 31 42 50 27 27 15 10 29 ...
##  $ nocentralheat_count           : num  10 4 8 6 7 8 9 4 3 1 ...
##  $ overcrowded_rate              : num  0.1021 0.1017 0.0482 0.0724 0.0867 ...
##  $ nocentralheat_rate            : num  0.01174 0.00478 0.01244 0.01034 0.01213 ...

3.2 The recipe

For each domain, the process has several steps, many of which are repeated across domains. To make this recipe easier to follow, I have defined some openSIMD utility functions in scripts/utils/helpers.R, see Functions. The steps of the recipe and the associated functions are as follows:

  • Select the indicators (columns) that are relevant to that domain, using the function dplyr::select
  • Calculate normalised ranks for each indicator, using the function normalScores
  • Replace missing values, using the function replaceMissing
  • Derive the factor analysis weights for each indicator, using the function getFAWeights
  • Combine the normalised ranks and weights to generate the indicator score, using the function combineWeightsAndNorms
  • Rank the indicator score, using the function base::rank.

Not all steps are used for every domain.

3.3 dplyr

Throughout this project, I make use of the tools in the dplyr package for data manipulation including the %>% notation for forwards piping. I won’t introduce these tools but if they are unfamiliar, I recommend reading this article. See the section on chaining for an explanation of the %>% pipe.
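As a minimal illustration, the %>% pipe passes its left-hand side as the first argument to the function on its right, so these two calls are equivalent:

```r
library(dplyr)  # provides the %>% pipe

x <- c(3, 1, 2)
sort(x)       # without the pipe
x %>% sort()  # with the pipe: same result
```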

3.4 An example: Education

As an example of the process, here we will calculate the domain rank for the education domain. In this first chunk of code, I will create the normalised ranks for the education domain as follows:

normalised_education <- d %>% # start with the raw data
  select(Attendance, Attainment, Noquals, NEET, HESA) %>% # select relevant columns 
  mutate(Attendance = normalScores(Attendance, forwards = FALSE)) %>% # replace each column with its normalised values
  mutate(Attainment = normalScores(Attainment, forwards = FALSE)) %>%
  mutate(Noquals    = normalScores(Noquals, forwards = TRUE)) %>%
  mutate(NEET       = normalScores(NEET, forwards = TRUE)) %>%
  mutate(HESA       = normalScores(HESA, forwards = FALSE)) %>%
  mutate_all(funs(replaceMissing)) # replace missing values
## Warning in qnorm(rn, mean = 0, sd = 1, lower.tail = forwards): NaNs
## produced

## Warning in qnorm(rn, mean = 0, sd = 1, lower.tail = forwards): NaNs
## produced

## Warning in qnorm(rn, mean = 0, sd = 1, lower.tail = forwards): NaNs
## produced

## Warning in qnorm(rn, mean = 0, sd = 1, lower.tail = forwards): NaNs
## produced

Note: you may see warnings here because normalScores can generate missing values from missing values. This is fine; those missing values will be replaced with 0 when you call replaceMissing.

The only decisions to make are (a) which columns to select using select and (b) the orientation in which to rank (and then normalise) each indicator. The orientation is determined by the forwards argument to normalScores; see normalScores and Variations for further information.
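To illustrate the orientation, here is a toy normalised-rank function (a sketch based on the qnorm call visible in the warnings above, not the actual helper in helpers.R):

```r
# Sketch: convert values to rank proportions, then to normal quantiles.
# forwards = TRUE maps high raw values to high scores; FALSE reverses this.
ns_sketch <- function(x, forwards = TRUE) {
  qnorm(rank(x) / (length(x) + 1), mean = 0, sd = 1, lower.tail = forwards)
}

x <- c(10, 20, 30)
ns_sketch(x, forwards = TRUE)   # scores increase with x
ns_sketch(x, forwards = FALSE)  # scores decrease with x
```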

When combining the indicators, we need to apply a different weight to each. The weights are derived through factor analysis of the normalised indicator scores, and the proportional loadings on factor 1 serve as the weightings. We extract the loadings using the getFAWeights function as follows:

education_weights <- getFAWeights(normalised_education)
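For intuition, one way to derive such weights (a sketch on toy data, not the actual getFAWeights implementation) is to fit a one-factor model with stats::factanal and rescale the factor-1 loadings so they sum to 1:

```r
# Sketch: proportional factor-1 loadings as weights, on toy correlated data.
set.seed(42)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.5)
toy$c <- toy$a + rnorm(100, sd = 0.5)

fa <- factanal(toy, factors = 1)       # one-factor model
loadings1 <- abs(fa$loadings[, 1])     # factor-1 loadings
weights <- loadings1 / sum(loadings1)  # rescale so the weights sum to 1
weights
```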

Now that we have the normalised indicator scores and weights, we can combine them with the utility function combineWeightsAndNorms. Each normalised indicator variable is multiplied by its weight derived from factor analysis, as follows:

education_score <- combineWeightsAndNorms(education_weights, normalised_education)
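Conceptually (a sketch, not the helper itself), this combination amounts to a weighted sum across the indicator columns:

```r
# Toy normalised scores (rows = data zones) and weights summing to 1.
norms   <- data.frame(a = c(0.5, -0.2, 0.1), b = c(1.0, 0.0, -1.0))
weights <- c(a = 0.7, b = 0.3)

# Multiply each column by its weight and sum across columns.
score <- as.vector(as.matrix(norms) %*% weights)
score  # one combined score per row: 0.65 -0.14 -0.23
```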

Finally we rank these weighted scores to generate the domain rank (1 = most deprived).

education_rank <- rank(-education_score)
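Negating the score before ranking ensures the largest (most deprived) score receives rank 1; for example:

```r
score <- c(0.2, 0.9, 0.5)
rank(-score)  # 3 1 2 -- the highest score gets rank 1
```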

3.5 Variations

The remaining domains are calculated in a similar way with some variations. Rather than explaining each domain, I will explain the possible variations.

For the housing domain, the sum of the overcrowding rate and non-central heating rate is ranked.
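Using the rate columns shown in the data summary above, the housing calculation can be sketched on toy values as:

```r
# Toy values for the two housing rates; no normalisation or factor
# analysis is involved, the rates are simply summed and ranked.
housing <- data.frame(
  overcrowded_rate   = c(0.10, 0.05, 0.08),
  nocentralheat_rate = c(0.01, 0.02, 0.01)
)

housing_sum  <- housing$overcrowded_rate + housing$nocentralheat_rate
housing_rank <- rank(-housing_sum)  # 1 = most deprived
housing_rank
```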

For the crime, income and employment domains, we simply use the published ranks. The reason for this is that the published indicator data for these domains is rounded, whereas the published domain ranks are based on unrounded data and therefore more precise.

In the education example above, when applying the normalScores function, we needed to pay attention to the forwards argument to orient the variables (decide whether a high value was good or bad). In the other domains this is not the case and each indicator can take the default value forwards = TRUE (high value = deprived). This means we can use the dplyr::mutate_all() function instead of mutating each variable independently. In addition, if indicator column names have something in common, we can select them with dplyr::select(contains("some_common_text")).
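For example (toy data, with column names echoing the drive-time indicators above; x * 2 stands in for normalScores just to show the pattern):

```r
library(dplyr)

d_toy <- data.frame(drive_GP   = c(2.9, 4.0),
                    drive_PO   = c(1.5, 2.7),
                    crime_rate = c(88.6, 48.2))

d_toy %>%
  select(contains("drive")) %>%    # all columns whose name contains "drive"
  mutate_all(function(x) x * 2)    # apply one function to every column
```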

The access domain is unique in that it has two sub-domains (drive time and public transport time) which are processed separately in the normal way (normalise -> weight -> rank). Then, each sub-domain is exponentially transformed (covered in the next section on calculating SIMD) and the resulting scores are summed in a 2:1 ratio before the final ranking for the domain.
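The final combination step can be sketched as follows (the toy scores stand in for the exponentially transformed sub-domain scores, and the weighting reflects the 2:1 ratio described above):

```r
# Toy exponentially transformed sub-domain scores (higher = more deprived).
drive_score <- c(0.2, 0.8, 0.5)
pt_score    <- c(0.1, 0.4, 0.9)

access_score <- 2 * drive_score + 1 * pt_score  # 2:1 drive-to-PT ratio
access_rank  <- rank(-access_score)             # 1 = most deprived
access_rank
```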

3.6 Re-assigning ranks

We have included some functionality to manually re-assign ranks to allow for certain exceptions. This is done via the reassignRank function.