Profiling visitors of a national park in Italy through unsupervised classification of mixed data



Introduction
The success of a tourism destination relies, among other things, on the implementation of a strategic marketing plan. Since identifying and understanding customers' features and needs is essential for correct market segmentation, the use of inappropriate techniques could result in missed strategic marketing opportunities (Bloom, 2004; Thompson & Schofield, 2009). Furthermore, any subsequent marketing activity would risk disappointing customers' expectations and causing dissatisfaction. Moreover, segmenting markets on the basis of visitor features and motivations makes it possible to identify the strengths and opportunities of a market (Lee & Lee, 2001).
The main benefit of market segmentation lies in knowledge acquisition. Profiling visitors makes it possible to identify current consumers' travel behaviour and to forecast future behaviour (Suleiman & Mohamed, 2011), and thus to acquire a competitive advantage (Hsu & Kang, 2003; Bui & Le, 2016; Koshy et al., 2019).
The aim of our study is to determine visitors' characteristics and their satisfaction with the facilities of the Majella National Park, in Italy. The outcome of our analysis is expected to serve as a guide for tourism operators, helping them formulate robust marketing strategies aimed at enhancing visitor satisfaction. Our data were collected on-site from a sample of park visitors and include both continuous and categorical features. To cluster this kind of data, we used unsupervised classification methods specific to mixed data.
The paper is organized as follows: in Section 2 we describe our data and consider the main clustering approaches for mixed variables, whereas in Section 3 we show the results obtained by applying these methods to our dataset and evaluate the clustering results by means of internal and external validity indexes. Finally, in Section 4, we draw some conclusions and discuss suggestions for future research.

Data and method
Our dataset comes from a questionnaire administered on-site to a sample of visitors to the Park between July 16 and October 27, 2020. A total of 523 tourists were interviewed.
The Majella National Park is in Abruzzo, central Italy, and spans the provinces of Chieti, L'Aquila and Pescara, including 39 municipalities characterized by high spatial heterogeneity. This natural area is crucial for the protection of the natural ecosystem and for the socio-economic development of the region.
These data make it possible to perform a qualitative analysis of the visitors to the Majella National Park and, consequently, to assess their level of satisfaction with the Park's services.
We analysed 16 variables (9 numerical and 7 categorical) over 523 records. The numerical variables concern the quality perceived by visitors (measured on a 5-point Likert scale) with respect to the following aspects: the web site, the conservation of the naturalistic heritage, and the adequacy of signage, public transport, children's amenities, footpath maintenance, accommodation facilities, restaurant services, and food and wine products. The qualitative variables are: visitors' expectations, the aim of their trip, the chosen location, how they came to know of it, the number of overnight stays, the type of accommodation chosen and, finally, the average daily expenditure per person.
In the literature, most clustering approaches are limited to either numerical or categorical data. When dealing with both quantitative and qualitative variables, the traditional approach is to convert the latter into numerical values and then apply clustering methods designed for quantitative data (Foss et al., 2016; Ichino et al., 1994; Caruso et al., 2018). However, this approach ignores the similarity information contained in the qualitative attributes, producing a loss of knowledge (Ahmad & Dey, 2007). Finding a unified similarity metric for both kinds of data would instead remove the metric gap between them. Therefore, in order to detect clusters, we compared two of the most widely used clustering methods for mixed data, namely those of Huang (Huang, 1997) and of Cheung & Jia (Cheung & Jia, 2013).
For the sake of brevity, we do not describe in detail the methods adopted to analyse the variables; the reader may consult our previous works for details (Caruso et al., 2018; Caruso et al., 2019).
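As a brief orientation only, the idea behind a unified dissimilarity for mixed data can be sketched as follows. The function below follows the general form used in Huang's k-prototypes approach: a squared Euclidean distance on the numerical variables plus a weighted count of mismatches on the categorical ones. The function name, the weight `gamma` and the toy visitor record are illustrative assumptions and are not taken from our dataset or from the implementation used in this study.

```python
from typing import Sequence


def mixed_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Dissimilarity between one record and one cluster prototype (k-prototypes form)."""
    # Numerical part: squared Euclidean distance to the prototype values.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: simple matching, i.e. the number of mismatched categories.
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    # gamma balances the two parts; the default value is an assumption.
    return numeric_part + gamma * categorical_part


# Hypothetical visitor: two Likert scores and two categorical answers.
visitor_num, visitor_cat = [4.0, 3.0], ["Hotel", "1-3 nights"]
prototype_num, prototype_cat = [3.0, 3.0], ["Hotel", "4-7 nights"]
d = mixed_dissimilarity(visitor_num, visitor_cat, prototype_num, prototype_cat, gamma=0.5)
```

In this toy case the numerical part contributes 1.0 and the single categorical mismatch contributes 0.5. A larger `gamma` gives the qualitative variables more influence on the cluster assignment.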

Results
We ran the cluster analysis with the number of clusters set to 3. Table 1 displays, for each cluster, the mean value of the 9 quantitative attributes and shows that the patterns produced by the two methods are quite similar. The Huang method, in particular, yields a slightly stronger clustering structure, meaning that the dissimilarity between clusters is higher. Figures 1 and 2 show the boxplots of the variables "Signage" and "Footpaths" in each cluster. Visual inspection highlights different median values in each group. Similar behaviour was observed for the remaining quantitative variables. Table 2 reports the results for the variable "Overnight stays". The mode of the marginal distribution is "1-3 nights stays" (42%). The clusters identified by the Cheung method have three different modes: "1-3 nights stays" (Cluster 2), "4-7 nights stays" (Cluster 3) and "more than 7 nights stays" (Cluster 1).
The Huang method produced a slightly different result, with two clusters out of three having the mode "1-3 nights stays". A similar pattern can be observed for the variable "Accommodation" (Table 3). The clusters identified by the Cheung method have different modes, i.e. "Other" (Cluster 3), "Second house" (Cluster 1) and "Hotel" (Cluster 2), while Clusters 2 and 3 of the Huang method share the mode "Second house". With regard to the variable "Expenditure" (Table 4), the mode of the marginal distribution is "10-30 Euros" (36%). The same mode is observed in two out of three clusters for both methods.

[Table 4: Expenditure by cluster, Huang and Cheung methods]

With regard to the variable "Expectation" (Table 5), most tourists visited the park in order to take "guided tours for environmental education" (45%). This result is in line with all clusters produced by the Huang method and with two of the clusters obtained by the Cheung method.

[Table 5: Expectation by cluster, Huang and Cheung methods]

In summary, with the Huang method, cluster 1 differs from the others in that it consists of tourists who stay in hotels for 1 to 3 nights, with an average daily expenditure of 50 Euros. Cluster 2 instead includes visitors who choose a B&B or rented rooms for a period of 1 to 3 nights and whose average daily expenditure ranges from 10 to 30 Euros. Visitors belonging to cluster 3 stay in their second house for more than 7 nights, with an average daily expenditure ranging from 10 to 30 Euros.
With the Cheung method, cluster 1 includes tourists who stay in their second houses for more than 7 nights and whose daily expenditure ranges from 10 to 30 Euros. The aim of their visit is to take guided tours for environmental education, and their final goal is relaxation. Tourists in cluster 2 instead choose to stay in hotels for 1 to 3 nights and spend more than 50 Euros per day. For both expectation and motivation they selected the option "other". The tourists in cluster 3 choose an alternative kind of accommodation and stay from 4 to 7 nights. Their daily expenditure ranges from 10 to 30 Euros. Their expectation is to take guided tours for environmental education, and their aim is to relax.
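Cluster profiles of this kind (a mean for each quantitative variable, a mode for each qualitative one) can be derived from the cluster labels in a few lines. The following is a minimal sketch using only the Python standard library; the function name and the six toy records are hypothetical and do not reproduce our survey data.

```python
from collections import Counter, defaultdict
from statistics import mean


def profile_clusters(labels, scores, categories):
    """Return {cluster: (mean score, modal category)} for one numeric and one categorical variable."""
    grouped_scores = defaultdict(list)
    grouped_categories = defaultdict(list)
    for label, score, category in zip(labels, scores, categories):
        grouped_scores[label].append(score)
        grouped_categories[label].append(category)
    return {
        label: (
            mean(grouped_scores[label]),
            # most_common(1) returns the most frequent category and its count.
            Counter(grouped_categories[label]).most_common(1)[0][0],
        )
        for label in sorted(grouped_scores)
    }


# Toy data (assumed): cluster label, a Likert score, and an accommodation type.
labels = [1, 1, 1, 2, 2, 2]
signage = [4, 5, 3, 2, 2, 5]
accommodation = ["Hotel", "Hotel", "B&B", "Second house", "Second house", "Hotel"]
profiles = profile_clusters(labels, signage, accommodation)
```

Here `profiles[1]` pairs a mean signage score of 4 with the mode "Hotel", and `profiles[2]` pairs a mean of 3 with the mode "Second house".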
Internal validity indexes were computed in order to evaluate the quality of the cluster solutions. Results are shown in Table 6.
For numerical variables, the Calinski-Harabasz and Silhouette indexes are reported. Higher values correspond to better results; thus, the Huang method performs better on the quantitative variables. As an internal index for categorical variables, we used the Entropy index, H. In this case a lower value of H corresponds to a better clustering result. The lowest Entropy is obtained with the Cheung method.
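To make the two kinds of index concrete, the sketch below computes the Calinski-Harabasz index in its standard form (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) and one common form of an entropy index for a categorical variable (the size-weighted average of the Shannon entropy within each cluster). The entropy definition, the function names and the toy data are assumptions made for illustration; the exact formula behind Table 6 may differ.

```python
import math
from collections import Counter, defaultdict


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)]. Higher is better."""
    n, dims = len(points), len(points[0])
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    k = len(groups)
    overall = [sum(p[j] for p in points) / n for j in range(dims)]
    between = within = 0.0
    for members in groups.values():
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dims)]
        # Between-cluster dispersion: size-weighted squared distance to the overall mean.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        # Within-cluster dispersion: squared distances of the members to their centroid.
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    return (between / (k - 1)) / (within / (n - k))


def cluster_entropy(categories, labels):
    """Size-weighted mean Shannon entropy of a categorical variable. Lower is better."""
    n = len(categories)
    groups = defaultdict(list)
    for category, label in zip(categories, labels):
        groups[label].append(category)
    total = 0.0
    for members in groups.values():
        size = len(members)
        h = -sum((c / size) * math.log(c / size) for c in Counter(members).values())
        total += (size / n) * h
    return total


# Toy data (assumed): one numeric variable, one categorical variable, two clusters.
toy_labels = [1, 1, 2, 2]
ch = calinski_harabasz([[1.0], [2.0], [4.0], [5.0]], toy_labels)
h = cluster_entropy(["Hotel", "Hotel", "B&B", "Hotel"], toy_labels)
```

In the toy example the first cluster is pure with respect to the categorical variable, so only the second cluster contributes to the entropy.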

Conclusions
To detect clusters more effectively, it is very useful to have qualitative variables available as well. Our main aim was to observe the results of each method and to determine which one performs better. Our analysis clearly shows that the Huang method performs better on the numerical variables, whereas the Cheung method gives better results on the qualitative ones.
Our objective for future research is to develop new clustering techniques for mixed data that build on an insight from the work of Diday & Govaert, who proposed an adaptive dynamic clustering procedure useful for calibrating the weights between qualitative and quantitative variables.