Nonparametric methods for stratified C-sample designs: a case study


Introduction
The analysis of C-sample designs in the presence of stratification is a problem frequently faced by practitioners.
In the industrial field a variety of stratified analysis scenarios present themselves. Take, for example, a company that wishes to assess the performance of three different formulas for a new dishwasher detergent. Multiple dishwashers are used and multiple washes are carried out. At the end of each wash, an expert provides an evaluation of the cleaning performance of the formula. When analyzing the resulting data, the effect of using one dishwasher instead of another cannot be ignored, so each dishwasher is considered to be a separate stratum. Likewise, in the healthcare field it is quite common for multiple drugs to be tested on patients of different age groups. Each age group is again considered to be a stratum.
In this paper we focus on a scenario from the field of education. We are interested in assessing how the performance of students from different degree programs at the University of Padova changes, in terms of university credits and grades, when compared with their entrance exam results. In other words, we want to assess whether people who achieved the best results in this exam perform best during their academic career.
The entrance exam can have three possible outcomes (i.e. it is an ordinal variable). This is therefore a typical stochastic ordering problem (Basso et al., 2009; Basso and Salmaso, 2011; Bonnini et al., 2014), that is, a problem in which the main interest lies in testing the null hypothesis that the C distributions are equal against an ordered alternative expressed in terms of ψ(·), where at least one inequality is strict and ψ(·) is an increasing function (Pesarin and Salmaso, 2010). Our aim is in fact to assess whether, when moving from the worst to the best entrance exam outcome, the C = 3 corresponding distributions of the students' performance measure Y are stochastically ordered.
A few nonparametric methods have been proposed in the literature to address these problems. Among them, Jonckheere's test (Jonckheere, 1954; Terpstra, 1952) is one of the first nonparametric solutions for testing ordered alternatives; it uses the Mann-Whitney test (Mann and Whitney, 1947) to perform all the C(C − 1)/2 possible pairwise comparisons between the C groups. Neuhäuser et al. (1998) proposed a modification of this test that appears to be more powerful than the original with small sample sizes (Shan et al., 2014). Additionally, permutation-based solutions involving the Non-Parametric Combination (NPC) technique (Pesarin and Salmaso, 2010; Klingenberg et al., 2009; Finos et al., 2007, 2008) have been introduced.
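The pairwise structure of Jonckheere's test can be illustrated with a minimal sketch in Python (used here for illustration only; the analysis in this paper was carried out in R). The function names and the toy data are hypothetical, and counting ties as 1/2 is a common convention assumed here rather than a detail taken from the cited papers.

```python
from itertools import combinations


def mann_whitney_count(x, y):
    """Count pairs (a, b), a in x and b in y, with a < b; ties count 1/2."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in x for b in y)


def jonckheere_statistic(groups):
    """Sum the Mann-Whitney counts over all C(C-1)/2 ordered pairs of groups."""
    pairs = list(combinations(range(len(groups)), 2))
    return sum(mann_whitney_count(groups[i], groups[j]) for i, j in pairs)


# Hypothetical toy data: three groups listed in the hypothesised order.
groups = [[1, 2, 3], [2, 4, 5], [6, 7, 8]]
jt = jonckheere_statistic(groups)
print(jt)
```

Large values of the statistic indicate that later groups tend to take larger values than earlier ones, which is the direction of the ordered alternative.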
We propose a further extension of the NPC technique to address stochastic ordering problems in the presence of stratification. Indeed, the impact of the student's choice of degree program cannot be ignored, so stratification must be considered in the testing procedure.
Section 2 describes the proposed permutation-based approach. Section 3 applies it to the case study on university education. Finally, Section 4 provides the results and conclusions.

Methodology
First, let us describe the stochastic ordering problem in more detail. The main interest lies in testing the null hypothesis that the C populations are equal in distribution (denoted by the symbol $\overset{d}{=}$) against the alternative that they are stochastically ordered. In terms of the cumulative distribution functions $F_j$, $j = 1, \ldots, C$, the system of hypotheses can be written as:

$$H_0: F_1 = F_2 = \cdots = F_{C-1} = F_C \quad \text{vs} \quad H_1: F_1 \geq F_2 \geq \cdots \geq F_{C-1} \geq F_C \ \text{and at least one strict inequality.} \qquad (1)$$

NPC-based solutions generally consider a particular decomposition. The hypotheses are split in order to recreate the conditions of a set of two-sample problems:

$$H_0: \bigcap_{i=1}^{C-1} H_{i0} \quad \text{vs} \quad H_1: \bigcup_{i=1}^{C-1} H_{i1},$$

where the null hypothesis $H_0$ is the intersection of $C - 1$ partial hypotheses and the alternative hypothesis $H_1$ is the union of $C - 1$ sub-hypotheses. For each pair of sub-hypotheses $H_{i0}$ and $H_{i1}$, the first $i$ and the last $C - i$ samples are pooled, so that two new samples $X_1$ and $X_2$ are obtained, with sizes $N$ and $M$. The sub-problem can therefore be rewritten as a two-sample problem in which $H_{i0}: X_1 \overset{d}{=} X_2$ is tested against the one-sided alternative $H_{i1}$ that $X_1$ is stochastically smaller than $X_2$, in agreement with the direction of the ordering in (1).

Each sub-hypothesis is then tested separately, using appropriate permutation tests. The adopted test statistic can differ according to the nature of the data, but a common and versatile choice is the modified version of the Anderson-Darling test statistic (Equation 2).

According to the NPC algorithm (Pesarin and Salmaso, 2010), $B$ random permutations of the data are generated and, for each sub-problem, the corresponding values $T^*_b$, $b = 1, \ldots, B$, of the test statistic are calculated to simulate the null distribution of $T$. The partial p-values $\lambda_i$, together with the values $\lambda^*_{ib}$, $b = 1, \ldots, B$, that estimate their distributions, can then be obtained. It is worth noting that the same permutations are used for every sub-problem, which implicitly takes into account the dependence among sub-problems.
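To make the decomposition and the shared permutations concrete, the following is a minimal Python sketch under stated assumptions, not the implementation used in this paper. The statistic in `ad_statistic` is one Anderson-Darling-type form found in the permutation testing literature and is assumed here, since the exact expression is not reproduced above. The p-value estimate (1 + count)/(B + 1), the function names and the toy data are likewise illustrative choices.

```python
import random
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of `sample`."""
    s = sorted(sample)
    return lambda t: bisect_right(s, t) / len(s)


def ad_statistic(x1, x2):
    """Assumed Anderson-Darling-type statistic: large when x1 tends to be
    smaller than x2, i.e. when the ECDF of x1 lies above that of x2."""
    pooled = sorted(x1 + x2)
    f1, f2, fbar = ecdf(x1), ecdf(x2), ecdf(pooled)
    total = 0.0
    for t in pooled:
        p = fbar(t)
        if 0.0 < p < 1.0:  # skip points where the weight is undefined
            total += (f1(t) - f2(t)) / (p * (1.0 - p)) ** 0.5
    return total


def partial_tests(groups, B=1000, seed=0):
    """Partial statistics and p-values for the C - 1 pooled sub-problems."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    C = len(groups)

    def split_statistics(data):
        # Sub-problem i: first i groups pooled versus last C - i groups pooled.
        return [ad_statistic(data[:sum(sizes[:i])], data[sum(sizes[:i]):])
                for i in range(1, C)]

    t_obs = split_statistics(pooled)
    t_perm = []
    for _ in range(B):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        # The same permuted dataset feeds every sub-problem.
        t_perm.append(split_statistics(shuffled))
    p_obs = [(1 + sum(row[i] >= t_obs[i] for row in t_perm)) / (B + 1)
             for i in range(C - 1)]
    return t_obs, t_perm, p_obs


# Hypothetical toy data with C = 3 groups, ordered from worst to best.
toy = [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [9, 10, 11, 12, 13]]
t_obs, t_perm, p_obs = partial_tests(toy, B=200, seed=1)
print(p_obs)
```

Reusing each shuffled dataset for all splits is what keeps the rows of `t_perm` jointly distributed as the sub-problem statistics are under the null hypothesis.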
A combination step now needs to be performed. The partial p-values $\lambda_i$, $i = 1, \ldots, C - 1$, related to the $C - 1$ sub-problems $\{H_{i0}$ vs $H_{i1}\}$ are combined using a suitable combining function, such as Fisher's combining function $T_F = -2 \sum_{i=1}^{C-1} \log(\lambda_i)$. The same is done for each of the $B$ vectors $\lambda^*_{ib}$, $i = 1, \ldots, C - 1$. The elements of the resulting vector represent the second-order test statistics, from which it is finally possible to obtain the global p-value $\lambda$ to assess the system of hypotheses (1).
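The combination step can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the observed statistics are treated as one additional row of the permutation matrix, so that every p-value is the proportion of rows with a statistic at least as large. This convention is an assumption that keeps all p-values strictly positive, and the function names and input values are hypothetical.

```python
import math
from bisect import bisect_left


def to_p_values(column):
    """p-value of each entry: share of entries in `column` that are >= it."""
    s = sorted(column)
    n = len(s)
    return [(n - bisect_left(s, t)) / n for t in column]


def fisher(p_values):
    """Fisher's combining function: -2 times the sum of the log p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)


def npc_global_p_value(t_obs, t_perm):
    """Return the partial p-values and the global p-value.

    t_obs:  list of K observed partial statistics (large = evidence for H1).
    t_perm: B rows, each holding the K statistics of one permuted dataset.
    """
    rows = [list(t_obs)] + [list(r) for r in t_perm]  # row 0 is the observed one
    k = len(t_obs)
    # One column of p-values per sub-problem, for observed and permuted rows.
    p_cols = [to_p_values([r[i] for r in rows]) for i in range(k)]
    # Second-order statistic for every row.
    combined = [fisher([p_cols[i][b] for i in range(k)])
                for b in range(len(rows))]
    partial = [p_cols[i][0] for i in range(k)]
    return partial, to_p_values(combined)[0]


# Hypothetical inputs: K = 2 sub-problems and B = 3 permutations.
partial, global_p = npc_global_p_value([3.0, 3.0],
                                       [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
print(partial, global_p)
```

Because the partial p-values of each permuted row are combined in exactly the same way as the observed ones, the dependence among sub-problems is carried over into the null distribution of the second-order statistic.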
Given that stratification needs to be included, we propose first applying this procedure to each of the $S$ strata, testing $S$ systems of hypotheses:

$$H_{0s}: F_{1s} = F_{2s} = \cdots = F_{(C-1)s} = F_{Cs} \quad \text{vs} \quad H_{1s}: F_{1s} \geq F_{2s} \geq \cdots \geq F_{(C-1)s} \geq F_{Cs} \ \text{and at least one strict inequality.} \qquad (3)$$

After applying the aforementioned NPC-based approach to each stratum, the global p-values $\lambda_s$, $s = 1, \ldots, S$ (and the $\lambda^*_{sb}$ estimating their distributions) are retained. A further combination step, using Fisher's combining function, then yields a final p-value $\lambda$. By comparing $\lambda$ with the desired significance level $\alpha$, we are able to solve the global stochastic ordering problem $H_0$ vs $H_1$.
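The combination across strata can be illustrated with a short Python sketch. It assumes that the stratum-level p-values and their permutation counterparts are already available and that the b-th permutation values are paired across strata. The function name, the (1 + count)/(B + 1) estimate and the input values are illustrative assumptions, not results from the case study.

```python
import math


def combine_strata(lambda_obs, lambda_perm):
    """Final p-value from S stratum-level p-values via Fisher's function.

    lambda_obs:  list of S observed stratum p-values.
    lambda_perm: B rows, each with the S permutation p-values of one draw.
    """
    def fisher(p_values):
        return -2.0 * sum(math.log(p) for p in p_values)

    t_obs = fisher(lambda_obs)
    t_null = [fisher(row) for row in lambda_perm]
    # Share of permutation statistics at least as large as the observed one.
    return (1 + sum(t >= t_obs for t in t_null)) / (len(t_null) + 1)


# Hypothetical values for S = 4 strata and B = 3 permutations.
lambda_obs = [0.01, 0.02, 0.03, 0.04]
lambda_perm = [[0.5, 0.5, 0.5, 0.5],
               [0.9, 0.1, 0.3, 0.7],
               [0.2, 0.8, 0.6, 0.4]]
final_p = combine_strata(lambda_obs, lambda_perm)
print(final_p)
```

With so few permutations the smallest attainable value is 1/(B + 1), which is why a large B is needed in practice.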
Given that multiple systems of hypotheses $H_{0s}$ vs $H_{1s}$, $s = 1, \ldots, S$, are assessed, we then apply a multiplicity correction to the stratum-level p-values in order to control the false discovery rate (FDR). Our choice is the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995).
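For reference, the Benjamini-Hochberg adjustment can be written in a few lines of Python. This is a generic sketch of the step-up procedure; the function name and the p-values in the example are hypothetical and are not taken from the case study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical stratum-level p-values for S = 4 strata.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(adjusted)
```

A stratum is then declared significant at FDR level $\alpha$ when its adjusted p-value does not exceed $\alpha$.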

A case study
Let us now focus on the real stratified C-sample problem at hand. As mentioned before, we are interested in evaluating the performance of students from different degree programs at the University of Padova. In particular, we want to understand whether the university credits gained by the end of the first year ($Y_a$), the credits gained by the end of the third year ($Y_b$) and the final average grade ($Y_c$) depend on the results achieved by the student in the entrance exam. In other words, we try to indirectly assess the efficacy of this exam in evaluating and selecting future students. The analysis is performed using R (R Core Team, 2020).
Let us briefly describe the data. The total sample size is 3083 students. First, the degree programs are grouped into 4 classes, identified by their Italian subject titles; these include ING CIVILE AMBIENTALE L7 (S2) and ING INFORMAZIONE L8 (S3). The different classes represent different strata (i.e. S = 4) and have different sample sizes (see Figure 1). The variable reporting the outcome of the entrance exam has three levels (i.e. C = 3), namely INSUFFICIENTE, SUFFICIENTE and PIU' CHE SUFFICIENTE (Insufficient, Sufficient and More Than Sufficient). For the sake of simplicity, we refer to them as INS, SUF and PIU in our notation. In Figure 1, the possible outcomes are ordered from worst to best. For each of the three response variables, we therefore test the null hypothesis that the distributions for INS, SUF and PIU are equal against the alternative $F_{INS} \geq F_{SUF} \geq F_{PIU}$ with at least one strict inequality, taking into account the effect of the degree program class.
Looking at the credits gained by the end of the first year, a first descriptive analysis (see Figure 2) appears to support the alternative hypothesis. Indeed, in all strata, students who obtained INS in the entrance exam appear to perform worse than students who obtained SUF, and students who obtained PIU tend to perform better than students who obtained SUF. Similar conclusions can be drawn about both the credits gained by the end of the third year (see Figure 3) and the average grade at the end of the academic career (see Figure 4).
The testing procedure confirmed these indications. We set B = 10000 and used the test statistic in Equation 2 and Fisher's combining function. When looking at $Y_a$ (see Table 1), all the partial p-values and the global p-value were well below 1%. The only exceptions were ING CIVILE AMBIENTALE L7 (S2) and ING INFORMAZIONE L8 (S3), for which the descriptive analysis shows that the order among entrance exam outcomes is less evident.

Conclusions
In this paper we presented a new solution to C-sample stochastic ordering problems in the presence of stratification, focusing on its application to a case study from the field of education.
Our proposal takes advantage of the Non-Parametric Combination (NPC) procedure (Pesarin and Salmaso, 2010), a versatile permutation-based methodology allowing us to solve several different complex problems, such as stochastic ordering. We apply this technique to evaluate the presence of stochastic ordering in each of the S existing strata and then use an appropriate combining function to assess the stochastic ordering in all the samples.
The application of this procedure allowed us to assess the efficacy of the University of Padova's entrance exams in evaluating and selecting future students. Indeed, it emerged that students with the worst results in the entrance exam tended to perform the worst during their academic career, in terms of both the university credits achieved by the end of the first and third years and the final average grade, regardless of the chosen degree program. The only exceptions were students from ING CIVILE AMBIENTALE L7 and ING INFORMAZIONE L8. For these two strata, when the credits at the end of the third year were considered, there was not enough evidence in favor of the stochastic ordering hypothesis.
Overall, this approach appears promising, and a simulation study has been planned to further explore its performance.