This study utilizes a unique data set of self-reported progression surveys collected every three months from disadvantaged social assistance recipients in 10 job centers across Denmark, along with their approximately 300 attached caseworkers. Disadvantaged recipients are those assessed by their caseworkers as not immediately ready for work but ready for activation. The project and data collection were conceived and organized by Væksthuset (the Greenhouse) and Væksthusets Research Centre. The data span a four-year period, from 2013 to 2016, with almost all clients entering the project in 2013. An essential aspect of this study is the ability to merge these survey responses with comprehensive administrative register data for each recipient, including detailed geographic and demographic information, very detailed weekly information on labor market status (employment, unemployment, other income transfers, etc.), detailed educational information, and historical health and criminal records.
Integrating the self-reported surveys with administrative data allows a comprehensive view of recipients’ progression and of the factors shaping their transition into employment. Combining the two data sources yields insights into the dynamics of recipients’ trajectories and the correlates of successful employment outcomes, providing a foundation for evidence-based policy recommendations and interventions.
Sample description
The predictive model was analyzed using data collected from social assistance recipients assessed to be not ready for work in 10 municipalities across Denmark. The initial data set encompassed 15,818 unique responses from 5512 clients. Each response ideally comprised two questionnaires: 11 questions posed to the client (the client questionnaire) and 11 questions posed to the caseworker (the caseworker questionnaire). These questionnaires were answered in connection with compulsory meetings held between caseworkers and clients at the PES. To ensure the reliability and accuracy of the data, we filtered out responses where either the client or the caseworker had not answered the questionnaire. Additionally, we excluded observations with a gap of more than 6 months between the completion of two questionnaires. These data-cleaning steps resulted in a data set comprising 11,268 unique responses from 3697 clients.
To focus specifically on the study of progression, we further eliminated 1105 responses from clients who had only completed the survey once. As a result, the final population included in the statistical analysis consisted of 10,163 observations from 2599 clients. It is important to note that whenever we analyze progression towards employment, the client’s initial answers as well as the change in the answers from the first survey to the current one are included in our assessments. Since each client’s first response thus serves only as a baseline, the final analyses use 10,163 − 2599 = 7564 unique responses from 2599 clients.
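The filtering steps above can be sketched in Python with pandas (the project's actual pipeline is not published; column names and the toy data below are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one row per survey response.
df = pd.DataFrame({
    "client_id":       [1, 1, 1, 2, 2, 3],
    "date":            pd.to_datetime(["2013-01-01", "2013-04-01", "2014-01-15",
                                       "2013-02-01", "2013-05-01", "2013-03-01"]),
    "client_done":     [True, True, True, True, True, True],
    "caseworker_done": [True, True, True, True, False, True],
})

# Step 1: keep only responses where both questionnaires were answered.
df = df[df["client_done"] & df["caseworker_done"]]

# Step 2: drop responses following a gap of more than 6 months (~183 days)
# since the client's previous retained response.
df = df.sort_values(["client_id", "date"])
gap = df.groupby("client_id")["date"].diff()
df = df[gap.isna() | (gap <= pd.Timedelta(days=183))]

# Step 3: for the progression analysis, drop clients with a single response.
counts = df.groupby("client_id")["date"].transform("size")
df = df[counts > 1]
```

In this toy example only client 1's first two responses survive: client 1's third response follows a gap of over nine months, client 2 loses one response to the completeness filter and the other to the single-response rule, and client 3 has only one response.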
Supplementary Table A.1 presents the characteristics of the 2599 clients included in the analysis, measured at the time of the first meeting between client and caseworker. The clients were on average 39 years old, women were slightly over-represented, and only 20 percent were married or stably cohabiting with a partner. Educational attainment was generally low, with 71 percent having high school or less as their highest degree. Their employment history was very unfavorable: they were employed on average only 2% of the time over the past two years and 10% over the past five years. They had a substantial history of social assistance receipt, and they were receiving social assistance at the time of measurement as a condition for inclusion in the study. In line with our expectations, clients also had high usage of prescription medication (especially painkillers, lifestyle medication, and antidepressants) and generally many contacts with the healthcare sector in terms of somatic and mental health diagnoses.
As is common in the literature, we further divide the complete set of ERIQ responses into two separate samples for modeling purposes: a training sample representing 75% of the data, utilized for model development, and a test sample encompassing the remaining 25%, used to evaluate model performance. To avoid data leakage between the samples, we randomize individuals based on their (anonymized) personal ID numbers, guaranteeing that no individual appears in both the training and test samples. This method yields a training sample comprising 5675 responses from 1930 distinct participants, while the test sample contains 1889 responses from 672 different participants.
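A person-level split of this kind can be sketched in Python with scikit-learn (the paper's analysis uses R; the data below are synthetic, and `GroupShuffleSplit` plays the role of randomizing on personal ID numbers):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 200
person_id = rng.integers(0, 50, size=n)   # hypothetical anonymized IDs
X = rng.normal(size=(n, 5))               # synthetic features
y = rng.integers(0, 2, size=n)            # synthetic binary outcome

# Split 75/25 at the person level, so repeated responses from the same
# individual never end up on both sides of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=person_id))
```

Because the split is over ID groups rather than rows, the realized row proportions deviate slightly from 75/25 whenever individuals contribute different numbers of responses, as in the paper's own sample sizes.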
Outcomes and feature sets
Outcomes
The primary objective of this project is to predict the transition into employment within a year of answering the questionnaire, as well as the initiation of active job search. As secondary outcomes, we also consider transitions into programs in the ordinary educational system within a year (although ERIQ was not developed with this transition in mind), since for social assistance recipients below 30 without a qualifying education, it is a major aim of the PES to help them into the educational system. We also consider the transition into either employment or education within a year as a secondary outcome.
Table 2 illustrates that only 9% of the sample successfully made a transition into either employment or education, with 6% entering employment and the remainder entering education. These figures underscore the main challenge faced when investigating the progression from social assistance towards employment, as a significant portion of the recipients are very distant from the labor market.
Job search is captured by a dummy variable, taken from ERIQ, indicating whether the client is actively applying for jobs. Table 2 demonstrates that job application prevalence is substantially higher than the transition rates, with 27% of the sample actively searching for jobs. We approach this measure from two perspectives. First, we examine whether applying for jobs serves as a viable intermediate goal toward the long-term objective of leaving unemployment altogether by including this dummy in the model for the transition into employment (and education). Second, we explore whether ERIQ can predict the likelihood of applying for jobs and, consequently, of enhancing the probability of successful reemployment in the long run. By investigating both angles, we hope to uncover valuable insights to support individuals in their progression from social assistance towards employment.
Feature sets
We construct two distinct feature sets for our analysis. The first feature set (referred to as the “Admin” feature set) comprises a comprehensive range of characteristics of social assistance recipients, extracted from Statistics Denmark’s administrative registers, which integrate population-wide data from all public databases. This data set includes demographic variables such as sex, age, ethnicity, cohabitation status, municipality of residence, and educational level, alongside detailed records on employment history, income, health status, medical diagnoses, and criminal history. Additionally, it captures information on social benefits, disability support, and housing conditions, offering a robust foundation for analyzing labor market trajectories. An exhaustive list of variables included in the Admin feature set is provided in Panel B of Supplementary Table A.1 in the Supplementary Information. The Admin feature set provides information on the participants receiving social assistance, representing characteristics that are often challenging, if not impossible, to change.
The second feature set is the ERIQ (referred to as the “ERIQ” feature set). It contains all the information obtained from the two questionnaires (one for clients and one for caseworkers). Social assistance recipients participating in ERIQ are queried approximately every three months during compulsory meetings at the PES, where they respond to a set of questions about their personal experiences. These questions cover various aspects, including social networks, coping strategies in daily life, health management, and knowledge about opportunities in the labor market, as well as job search strategies. Additionally, the caseworkers are asked to evaluate the same social assistance recipients at the same meetings, using a set of indicators, some of which overlap with the participants’ indicators, while others explore additional dimensions, such as concentration ability and the caseworker’s belief in the participant’s potential for employment. The selection of questions for both the participants and caseworkers was based on a comprehensive literature review (Væksthuset and NewInsight, 2012), aiming to identify employment readiness indicators that are malleable. The selected indicators are summarized in Table 3, while the full set of questions is available in Supplementary Table A.2. For descriptive statistics, please refer to Panel A in Supplementary Table A.1.
Finally, we combine the two feature sets into a third feature set (“Admin + ERIQ”) to investigate whether the information contained in both sets complements each other, resulting in improved predictions. Alternatively, if no significant improvement is observed, it may suggest that one of the sets is more influential in the prediction process.
Prediction models
Following Rosholm et al. (2024), we employ four different machine learning methods of varying complexity to predict the primary and secondary outcomes. Importantly, all four models are implemented using the same sample splits and data, ensuring the model predictions are directly comparable.
Linear probability model
First, we consider a linear probability model (LPM) estimated using ordinary least squares. This model offers the advantage of being straightforward and interpretable, allowing us to determine the influence of each variable by examining the regression coefficients. However, the disadvantage of the LPM lies in its simplicity, as it only captures linear relationships in the data and assigns non-zero weight to all variables in the feature set, which increases the risk of overfitting.
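The mechanics of the LPM can be illustrated in a few lines of Python with synthetic data (the paper's estimation is done in R): the 0/1 outcome is regressed on the features by OLS, and the fitted values are read as probabilities, even though OLS does not constrain them to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(float)  # synthetic 0/1 outcome

# Linear probability model: OLS of the binary outcome on the features.
X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
p_hat = X1 @ beta                            # fitted "probabilities"
```

With an intercept included, OLS residuals sum to zero, so the average fitted probability equals the sample share of positives; individual fitted values, however, can fall outside [0, 1], which is one of the model's known limitations.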
Logistic regression model with LASSO
The second model combines the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996) with a logistic regression framework. This hybrid approach is well-suited for handling binary outcome variables and provides both variable selection and regularization, enhancing the precision of the predictions. To determine the optimal size of the regularization parameter λ, we employ five-fold cross-validation. Specifically, we select the value of λ that maximizes the cross-validated AUC-ROC (see below).
To implement this model, we utilize the glmnet R package, and following the authors’ recommendations, we standardize all variables to have a mean of zero and a standard deviation of one. This standardization helps ensure comparability and stability in the model’s performance across different variables.
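A Python analogue of this glmnet workflow, using scikit-learn on synthetic data (standardization, L1 penalty, and five-fold cross-validation scored by AUC-ROC), looks as follows; the exact penalty grid and solver here are illustrative choices, not the paper's settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# L1-penalized logistic regression; the penalty strength is chosen by
# five-fold cross-validation to maximize AUC-ROC (scoring="roc_auc"),
# with all features standardized first to mean zero and unit variance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear",
                         scoring="roc_auc", max_iter=5000),
)
model.fit(X, y)
coefs = model.named_steps["logisticregressioncv"].coef_.ravel()
auc_train = roc_auc_score(y, model.predict_proba(X)[:, 1])
```

The L1 penalty shrinks many coefficients exactly to zero, which is what provides the variable selection described above.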
Random forest model
The third model is a random forest, initially introduced by Breiman (2001), which employs bagging as an ensemble learning technique. Bagging involves training many individual decision trees in parallel on bootstrap samples of the data. Additionally, random forests consider only a random subset of the explanatory variables at each split of each decision tree, significantly reducing the risk of overfitting the model.
For the implementation of the random forest algorithm, we utilize the ranger R package (Wright and Ziegler, 2017). To optimize predictive performance, we tune two critical hyperparameters: the number of variables considered at each node (mtry) and the minimal node size (min.node.size). We employ a Bayesian optimization approach to identify the hyperparameter configuration that maximizes the AUC-ROC under five-fold cross-validation, and we grow 1,000 independent trees.
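This tuning loop can be sketched in Python with scikit-learn on synthetic data; `max_features` and `min_samples_leaf` are rough analogues of ranger's mtry and min.node.size, and a small grid search stands in here for the Bayesian optimization used in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Tune the two forest hyperparameters by five-fold cross-validated AUC-ROC.
# (Grid search is used here for simplicity instead of Bayesian optimization.)
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [2, 3, 5], "min_samples_leaf": [1, 5, 20]},
    scoring="roc_auc", cv=5,
)
search.fit(X, y)
best = search.best_params_
```

The selected configuration would then be refit on the full training sample with the final number of trees (1,000 in the paper) before evaluation on the held-out test sample.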
Extreme gradient boosting model
The final and most complex predictive model is the extreme gradient boosting (XGBoost) model (Chen and Guestrin, 2016). This method uses boosting as an ensemble learning technique. Boosting combines weak models iteratively, focusing on correcting errors made by previous models, to create a strong predictive model. The XGBoost algorithm effectively handles nonlinear relationships in the data and mitigates overfitting through regularization and pruning.
To estimate the XGBoost model, we utilize the xgboost R package and fine-tune its performance by optimizing seven hyperparameters through Bayesian optimization. In accordance with the xgboost package’s terminology, we explore the following hyperparameters: max.depth, eta, gamma, subsample, colsample_bytree, colsample_bynode, and min_child_weight. Specifically, we search for the hyperparameter configurations that yield the highest AUC-ROC in the training sample.
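The boosting logic can be illustrated in Python with scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (synthetic data; `max_depth`, `learning_rate`, and `subsample` loosely mirror max.depth, eta, and subsample, while XGBoost's remaining regularization parameters have no direct equivalent here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosting fits shallow trees sequentially, each one correcting the
# errors of the ensemble so far; shrinkage (learning_rate) and row
# subsampling act as regularization against overfitting.
gb = GradientBoostingClassifier(max_depth=3, learning_rate=0.1,
                                subsample=0.8, n_estimators=200,
                                random_state=0)
gb.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, gb.predict_proba(X_te)[:, 1])
```

As in the paper, the hyperparameters would in practice be chosen to maximize cross-validated AUC-ROC on the training sample rather than fixed a priori as here.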
Performance metrics
The predictive models we consider yield the probability of the transition into employment (or one of the other outcomes). To assess their performance across the different feature sets, we employ the AUC-ROC and the AUC-PR as performance metrics.
The ROC curve plots the true positive rate of the predictive model against its false positive rate for each decision threshold from 0 to 1. A higher AUC-ROC indicates that the model is more likely to assign a higher predicted probability of transition into employment to a randomly chosen true positive (i.e., an individual actually finding employment) than to a randomly chosen true negative (i.e., an individual not finding employment). It is essential to note that a fully random prediction would yield an AUC-ROC of 50%.
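This rank interpretation can be verified numerically. The short Python sketch below (synthetic scores; the paper's analysis is in R) computes the fraction of positive–negative pairs in which the positive receives the higher score and checks that it coincides with scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)          # synthetic labels
scores = y * 1.0 + rng.normal(size=200)   # noisy but informative scores

# AUC-ROC equals the probability that a randomly chosen positive is
# ranked above a randomly chosen negative (ties counting one half).
pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean() + \
           0.5 * (pos[:, None] == neg[None, :]).mean()

assert np.isclose(pairwise, roc_auc_score(y, scores))
```

With fully random scores the pairwise probability, and hence the AUC-ROC, would hover around 0.5, matching the benchmark stated above.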
In binary classification, the precision of a classifier is the ratio of true positives to the total number of predicted positives (true positives plus false positives), while recall corresponds to the true positive rate (true positives divided by the sum of true positives and false negatives). By adjusting the threshold between zero and one for a given prediction model, we can plot the precision-recall curve, and the area under this curve (AUC-PR) can be calculated. An optimal model would have an AUC-PR value of one, indicating perfect precision and recall, while random guessing yields a score equal to the proportion of positives in the data (in our case, 5.8% for employment). Higher AUC-PR values indicate better model performance for a specific data set, but it is crucial to compare them to the prevalence of the outcome in the data. Therefore, direct comparison of AUC-PRs between different data sets or outcomes should be avoided, as their interpretation is specific to the characteristics of each data set. However, it is valid for comparison between different feature sets and model specifications.
The AUC-PR has a particular advantage in the context of highly imbalanced data, as in the present case, where negatives vastly outnumber positives (Saito and Rehmsmeier, 2015). The ROC approach gives equal importance to correctly predicting negative and positive instances, which can yield a high AUC-ROC even when the model produces a substantial number of false positives; this is especially likely in severely imbalanced data sets. Because the AUC-PR focuses on how well the model predicts the positives (i.e., movement into employment), the fraction of correctly predicted negatives becomes irrelevant.
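The baseline behavior of the AUC-PR under imbalance can be checked directly; in the Python sketch below (synthetic data with roughly the positive share seen in our employment outcome), an uninformative classifier's average precision, a standard AUC-PR estimate, lands near the positive prevalence rather than near 0.5:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.06).astype(int)   # ~6% positives, as in our data
random_scores = rng.random(5000)            # uninformative classifier

# For random scores, average precision is close to the positive
# prevalence -- the relevant chance baseline for imbalanced data.
ap = average_precision_score(y, random_scores)
baseline = y.mean()
```

This is why AUC-PR values from data sets with different outcome prevalences should not be compared directly, while comparisons across feature sets and models on the same data remain valid.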
In the context of transitions from social assistance to employment, it is crucial to study how well a predictive model can identify positive outcomes. Therefore, focusing on precision and recall allows us to address this aspect effectively, ensuring that the model’s performance is assessed based on its ability to predict positive outcomes accurately.
Explaining predictions
To elucidate the influence of different variables, including interactions between them, on the outcomes of interest, we employ Shapley additive explanation (SHAP) values (Lundberg et al., 2020; Lundberg and Lee, 2017). SHAP values offer a model-agnostic approach to unravel the underlying factors shaping the predicted probabilities of the transition out of unemployment.
By utilizing SHAP values, we can gain insights into how predictive models make specific predictions for each individual in the dataset. These values provide a measure of the contribution of each variable in each feature set to the final prediction. A SHAP value for a variable expresses how much its information alters the model’s prediction. In other words, SHAP values illustrate how the values of individual variables pull the prediction away from the average prediction of the outcome while accounting for correlations between variables. For comparison, in a linear regression model with uncorrelated variables and no interactions, the SHAP value of a variable equals its regression coefficient multiplied by the variable’s deviation from its mean.
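The linear-model benchmark can be verified by brute force. The Python sketch below (toy coefficients; not the paper's model) computes exact Shapley values for a linear model with independent features, where a coalition's value is the model evaluated with absent features replaced by their means, and confirms they equal coefficient times deviation from the mean:

```python
import numpy as np
from itertools import combinations
from math import factorial

# Toy linear model f(x) = b0 + b @ x with feature means mu.
b0, b = 0.5, np.array([2.0, -1.0, 0.3])
mu = np.array([0.1, 0.4, -0.2])
x = np.array([1.0, 0.0, 2.0])   # the individual being explained

def value(S):
    # Model evaluated with features outside coalition S set to their means.
    z = mu.copy()
    z[list(S)] = x[list(S)]
    return b0 + b @ z

# Exact Shapley values via the weighted average of marginal contributions.
n = len(b)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi[i] += w * (value(S + (i,)) - value(S))

assert np.allclose(phi, b * (x - mu))   # the linear-model benchmark
```

The computation also exhibits the additivity property used in practice: the SHAP values sum to the difference between the individual's prediction and the average prediction.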
The adoption of SHAP values enhances the interpretability and transparency of the predictive models, enabling a deeper understanding of the factors influencing the outcome of interest. The insights gleaned from SHAP values facilitate tailored interventions and evidence-based policy decisions targeting malleable variables, thus potentially contributing to higher employment rates in the long run and increased well-being among social assistance recipients not ready for work.