Random effects regression trees for the analysis of INVALSI data

Mixed or multilevel models exploit random effects to deal with hierarchical data, where statistical units are clustered in groups and cannot be assumed to be independent. Sometimes, the assumption of linear dependence of a response on a set of explanatory variables is not plausible, and model specification becomes a challenging task. Regression trees can be helpful to capture non-linear effects of the predictors. This method was extended to clustered data by modelling the fixed effects with a decision tree while accounting for the random effects with a linear mixed model in a separate step (Hajjem & Larocque, 2011; Sela & Simonoff, 2012). Random effect regression trees have been shown to be less sensitive to parametric assumptions and to provide improved predictive power compared to linear models with random effects and to regression trees without random effects. We propose a new random effect model, called Tree embedded linear mixed model, where the regression function is piecewise-linear, consisting of the sum of a tree component and a linear component. This model can deal with non-linear effects, interaction effects, and cluster mean dependencies. The proposal is the mixed effect version of the semi-linear regression trees (Vannucci, 2019; Vannucci & Gottard, 2019). Model fitting is obtained by an iterative two-stage estimation procedure, in which the fixed and the random effects are jointly estimated. The proposed model allows a decomposition of the effect of a given predictor within and between clusters. We show, via a simulation study and an application to INVALSI data, that these extensions improve the predictive performance of the model in the presence of quasi-linear relationships, while avoiding overfitting and facilitating interpretability.


A tree embedded linear mixed model
We propose a random effect model, called Tree Embedded Linear Mixed (TELM) model, able to handle non-linear effects, interaction effects, and cluster mean dependencies. Motivated by the application of interest, we consider in particular a two-level random effect model. Hence, we will refer to the statistical units (e.g. students) as level 1 units and to the groups (e.g. schools) as level 2 units.
The model is a piecewise-linear regression function, consisting of the sum of a tree component and a mixed effect linear component. The proposal is the mixed effect version of the semi-linear regression trees (Vannucci, 2019). It can be ideally divided into three parts: a fixed effect linear part, a fixed effect non-linear part based on a tree, and a random effect part. The resulting model can be formulated as

Y_{ij} = \beta_0 + X_{ij}^\top \beta + Z_j^\top \gamma + T(X_{ij}, Z_j) + U_j + \varepsilon_{ij},    (1)

where Y_{ij} is the response variable for level 1 unit i belonging to level 2 unit j, \beta_0 is the (fixed effect) regression intercept, X_{ij} is the vector of the level 1 covariates with associated fixed effect coefficients \beta, and Z_j is the vector of the level 2 covariates with associated fixed effect coefficients \gamma. Here, T(X_{ij}, Z_j) is the tree-based component, depending on some or all of the level 1 and level 2 explanatory variables. Finally, U_j \sim N(0, \sigma_u^2) is the random intercept for level 2 unit j and \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2) are the regression errors.
The model is additive in its components, with the tree component acting as a region-specific categorical variable. This can be seen in the following alternative specification:

Y_{ij} = \beta_0 + X_{ij}^\top \beta + Z_j^\top \gamma + \sum_{m=1}^{M} \mu_m I\{(X_{ij}, Z_j) \in R_m\} + U_j + \varepsilon_{ij},    (2)

where R_1, \ldots, R_M is the partition of the predictor space corresponding to the tree component and \mu_m is the leaf node parameter associated with region R_m. When the unknown regression function can be assumed to be quasi-linear (Wermuth and Cox, 1998), the number of leaf nodes M can be kept small to avoid overfitting.
To account for the contextual effects of level 1 predictors, we add the cluster mean \bar{W}_j = (1/n_j) \sum_{i=1}^{n_j} W_{ij} to the set of level 2 predictors Z_j (Snijders and Bosker, 2012). Model fitting is obtained by an iterative, backfitting-like procedure. First, the tree component is initialised at the mean of the response variable, and the partial residuals Y* are computed by subtracting the tree predictions from Y. Secondly, a linear random intercept model is fitted on Y* with the explanatory variables at the individual and group level. The corresponding partial residuals Y** are obtained by subtracting the model predictions from Y. These partial residuals Y** are then used in the next step to fit a new tree of short depth, using the CART algorithm (Breiman et al., 1984). The two fitting steps are alternated until convergence is reached. At the end of the procedure, model (2) is fitted as a linear random effect model using the partition associated with the tree selected at convergence. The leaf node parameters \mu_m are estimated jointly with the other model parameters \beta_0, \beta, \gamma, \sigma_u^2, \sigma_\varepsilon^2. The main difference of our procedure with respect to previous proposals (Hajjem & Larocque, 2011; Sela & Simonoff, 2012) is the inclusion of the linear component X_{ij}^\top \beta + Z_j^\top \gamma in the random effect model (2). In the presence of quasi-linear relationships, this inclusion avoids overfitting and helps interpretation. Moreover, since the \mu_m are jointly estimated in the final step, standard hypothesis tests and confidence intervals can be used for model selection and evaluation, together with the mean squared error computed on a test data set for assessing prediction accuracy.
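The alternating fitting steps can be sketched in a few lines of code. The following is a simplified illustration on simulated data, not the authors' implementation: the random intercepts are approximated by shrunken group means with variance components treated as known, a fixed iteration budget replaces the convergence check, and all variable names, sizes, and effect values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Simulated two-level data: 40 groups of 25 level-1 units, one level-1
# covariate x and one level-2 covariate z (all settings are illustrative).
n_groups, n_per = 40, 25
g = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=g.size)
z = rng.normal(size=n_groups)[g]
u = rng.normal(size=n_groups)[g]          # random intercepts, variance 1
# Quasi-linear truth: a linear part plus a step (tree-like) effect.
y = 2.0 + 1.5 * x + 0.8 * z + 2.0 * (x > 0.5) + u + rng.normal(scale=0.5, size=g.size)
X = np.column_stack([x, z])

def shrunken_intercepts(resid, groups, var_u=1.0, var_e=0.25):
    """Shrunken group means of residuals, a crude stand-in for BLUP random
    intercepts (variance components assumed known here for simplicity)."""
    u_hat = np.zeros(groups.max() + 1)
    for j in np.unique(groups):
        r = resid[groups == j]
        u_hat[j] = r.sum() / (r.size + var_e / var_u)
    return u_hat

tree_pred = np.full(y.size, y.mean())     # initialise the tree at the mean
for _ in range(20):                       # fixed budget instead of a convergence check
    y_star = y - tree_pred                # partial residuals Y*
    lin = LinearRegression().fit(X, y_star)
    u_hat = shrunken_intercepts(y_star - lin.predict(X), g)
    y_star2 = y - lin.predict(X) - u_hat[g]   # partial residuals Y**
    tree = DecisionTreeRegressor(max_depth=2,  # short-depth CART step
                                 min_samples_leaf=50).fit(X, y_star2)
    tree_pred = tree.predict(X)

fitted = lin.predict(X) + u_hat[g] + tree_pred
mse = float(np.mean((y - fitted) ** 2))
```

In the actual procedure, the final step would refit model (2) as a single linear random effect model on the dummies of the selected partition, so that the \mu_m are estimated jointly with the other parameters.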

Application: Invalsi tests in Italian schools
We apply the TELM model outlined in the previous section to data on students' achievement collected by INVALSI. Every year, the Institute carries out standardised tests to assess students' achievement in mathematics and reading and to evaluate the overall quality of the educational offering of schools and vocational training institutes. See Arpino et al. (2019) for a discussion of these data.
As an illustration, we focus here on data on students who participated in the Maths tests at the 5th and 8th grades. Specifically, the dataset is obtained by linking data on students who attended the 5th grade in 2013-2014 with data on students who attended the 8th grade in 2016-2017. The number of students who took the Maths test on both occasions is 409 528. They are grouped into 5 773 schools. We aim to predict the Maths test score, while understanding which of the included variables may be associated with the final score. Table 1 lists the considered explanatory variables. As shown in the table, we include both student-level and school-level covariates, denoted in (1) as X_{ij} and Z_j respectively. Among the school-level variables, we additionally consider, for each school, the average of the 5th grade Maths test score and the average of the socio-economic status index, denoting these variables CM MATH5 and CM SES. The proposed model takes into account both linear and non-linear effects and can detect the presence of both within-level and cross-level interaction effects. In particular, the tree component T(X_{ij}, Z_j) in (1) models non-linearities and interactions at once via a piecewise-linear function. Estimates for the model parameters are reported in Table 2, while the tree component is also illustrated in Figure 1. The two terminal nodes without label in the plot have been automatically set as the reference category.
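Contextual covariates like CM MATH5 and CM SES are simply school-level means of level 1 variables, which amounts to a one-line group transform in a tabular toolkit. A minimal sketch with made-up values (column names are illustrative, not those of the INVALSI files):

```python
import pandas as pd

# Hypothetical student-level frame: MATH5 and SES are level-1 variables,
# school_id identifies the level-2 unit.
df = pd.DataFrame({
    "school_id": [1, 1, 1, 2, 2],
    "MATH5": [60.0, 70.0, 80.0, 40.0, 50.0],
    "SES": [0.2, -0.1, 0.5, -0.4, 0.0],
})

# Contextual (level-2) covariates: school means of the level-1 predictors,
# broadcast back to the student rows.
df["CM_MATH5"] = df.groupby("school_id")["MATH5"].transform("mean")
df["CM_SES"] = df.groupby("school_id")["SES"].transform("mean")
```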
Individual and school-level covariates not selected by the algorithm in the tree component have the usual interpretation. For example, controlling for the model covariates, females score, on average, around 1.5 points less than males on the 8th grade Maths test.
Besides the usual interpretation of the coefficients of the linear component, it is interesting to focus on the covariates selected in the tree component of the model, namely the Maths score at grade 5 (MATH5) and the geographical area of the school (AREA). In particular, the tree component splits the values of MATH5 into three intervals: below 33 (2% of the observations), between 33 and 72 (55%) and above 72 (43%). Moreover, the algorithm splits the schools into two groups depending on AREA: schools located in Northern or Central Italy, and schools located in Southern Italy and the Islands. Thus, the algorithm suggests the presence of an interaction effect between these two variables, with the effect of AREA depending on the interval of MATH5 and vice versa. For example, for a pupil living in the North-West of Italy, the expected difference with respect to a pupil with the same characteristics living in the North-East (baseline) is 2.4386 if MATH5 < 33; it decreases to 0.4675 if 33 < MATH5 < 73, and it rises to 4.9494 if MATH5 ≥ 73.
Note that the ordinary mixed effect regression model, whose parameter estimates are reported in Table 3, is nested within the TELM model. The likelihood ratio test comparing these two models yields a test statistic of 10168 on 4 degrees of freedom, in favour of the TELM model. The differences between the estimates in the two models are due to the inclusion of the tree component, which relaxes the linearity assumption and includes interaction effects. An interesting difference concerns the AREA coefficient estimates: ignoring the AREA and MATH5 interaction, and the non-linearity of MATH5, completely reverses the main effect of AREA for South and Islands.
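Given the reported statistic and degrees of freedom, the p-value of the likelihood ratio test is vanishingly small under the usual chi-square reference distribution, as a quick check confirms (the figures below are those reported in the text; the chi-square approximation for nested fixed effects is assumed to apply):

```python
from scipy.stats import chi2

# LRT comparing the ordinary mixed model with the TELM model, as reported:
# statistic 10168 with 4 degrees of freedom (the extra leaf-node parameters).
lrt_stat, df_extra = 10168.0, 4
p_value = chi2.sf(lrt_stat, df=df_extra)  # survival function: P(X > stat)
```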

Conclusions
Tree Embedded Linear Mixed (TELM) models extend random effect models by including both a linear component and a tree component in the regression function. The proposal increases the flexibility and the predictive ability of ordinary random effects models by simultaneously handling linear and non-linear associations and interactions.
A TELM model has the following characteristics: (1) it can handle clusters with different numbers of observations (unbalanced clusters); (2) it allows the inclusion of level 1 and level 2 covariates in the splitting process; (3) it allows observation-level covariates to have random effects. In addition, our proposal extends random effect regression trees in two directions: (i) incorporating a linear component in the final random effect model, and (ii) allowing contextual effects of level 1 covariates to be taken into account.
The application to INVALSI data is an illustrative example of TELM models that shows how the inclusion of a tree component helps highlight cross-level interactions.