An evaluation of variable selection methods using Southern Africa solar irradiation data

Dimensionality poses a challenge in developing quality predictive models. When modelling solar irradiance (SI), many covariates are often considered, and training on all of them has several disadvantages. This study sought to identify the best embedded variable selection method for different location and time horizon combinations from Southern Africa solar irradiance data. It introduced new variable selection methods into solar irradiation studies, namely penalised quantile regression (PQR), regularised random forests (RRF) and quantile regression forests (QRF). Stability analysis and performance and accuracy metric evaluations were used to compare them with the common lasso, elastic net and ridge regression methods. The QRF model performed best in all locations on hourly data, followed by the shrinkage methods. However, QRF was found to be insensitive to associations through correlations, thereby ignoring the relevance of variables while focusing on importance. Among the shrinkage methods, the lasso performed best in only one location. On the 24-hour horizon, elastic net dominated among the shrinkage methods, but QRF was best in three of the six locations considered. Results confirmed that variable selection methods perform differently on different situational data sets. Depending on the strengths of the methods, results were combined to identify the most important variables. Day, total rainfall and wind direction were superfluous features in all situations. The study concluded that shrinkage methods are best in cases of extreme multicollinearity, while QRF is best on data sets with outliers and/or heavy tails.


Introduction
It is inevitable that when studying solar irradiance (SI) one has to consider many variables that may influence the radiation of the sun's energy on the earth's surface. Considering all variables in forecasting models introduces the challenge of the curse of dimensionality, a phenomenon where models are negatively affected by an increase in the number of covariates. Some of these covariates may give redundant information which has no influence on prediction processes. These superfluous variables must be excluded when training the data. The question becomes: which variables are paramount to consider when building a model? It is necessary to identify those insignificant features and exclude them in model development before training the model. In addition, a high dimensional training data set can negatively affect a predictive model in several ways: (1) prediction accuracy is reduced; (2) models do not learn well in the presence of a large number of irrelevant variables; (3) some important variables may not be picked due to interference from irrelevant variables; (4) the model becomes complex to interpret; (5) the algorithm processing time is increased; (6) too many resources are used in the prediction process; (7) maintenance is difficult. As proved by Hossain et al. (2013), including an optimal feature subset provides better prediction accuracy in forecasting solar power, and the selection of a small (possibly minimal) feature set giving the best possible classification results is desirable for practical reasons. The subset fits the data well because it contains the most paramount variables. An optimal subset reduces overfitting in the model-building process. Different variable selection methods have been used to find this optimal subset of features, but the methods were developed to suit different data conditions. As a result, they focus on different aspects of the data sets for which they were developed, thereby exhibiting different variable selection performances. That is, on different situational data sets the methods would give different optimal subsets. Therefore, this study is motivated by the need to establish the situational SI data sets to which different embedded variable selection methods are best applicable. A comparative investigation of existing variable selection methods in SI studies is made here, and new variable selection methods are introduced to achieve this objective.

Rationale and contribution of the study
Several methods are applicable for solving the curse of dimensionality in different situational data sets. However, to the best of our knowledge, all studies that included variable selection on Southern Africa SI data applied lasso and/or its extensions (shrinkage methods only) without any comparative analysis. These methods might not have been the best approaches for those different situational data sets.
A comparison of the variable selection methods is necessary before application.
The main contribution of the current study is to demonstrate to the solar energy industry, meteorologists and the body of knowledge at large that different variable selection approaches perform differently on different situational data sets. The study showed that, although lasso may appear to be the ideal variable selection method for Southern Africa SI data sets, this depends on location, time horizon and the nature of the data. To complement regularisation in its sensitivity to outliers, we introduced penalised quantile regression (PQR); that is, adding the robustness of quantile regression (QR) to outliers and/or heavy-tailed data to shrinkage methods. Since shrinkage methods only measure the relevance of a variable, random forests (RFs), which measure feature importance, were also proposed. RFs were developed to improve learning performance by use of a voting system which enables them to measure the importance of variables as well as predict the response variable. That is, the current study proposed evaluating the importance of variables in addition to their relevance when finding an optimal feature subset. It suggested that a subset without corresponding variable importance measures is only a local minimum; a globally minimal subset of variables should consider variables of both best relevance and importance. In addition to proposing RFs, the study also checked whether regularising them improves their performance. It further proposed hybridising the RF method with QR modelling. Apart from being robust to outliers and/or heavy-tailed data, QR gives unique insights into the predictor-response relationship through percentiles. Though this concept of hybridising models has become popular in machine learning, the study investigated whether hybridisation improves the embedded variable selection methods used to solve the curse of dimensionality when modelling SI. Hybridisation may not be necessary in some situations.

Research highlights
This study shows that, although lasso has been popular in Southern Africa SI modelling, it is not always the best shrinkage method for variable selection. The root mean squared error (RMSE) was used to measure the goodness-of-fit of the shrinkage methods. However, shrinkage methods concentrate on relevance, so the evaluation introduced RFs separately. RMSEs were also used to evaluate the goodness-of-fit of the RFs and the PQR model. The PQR model was included to check whether the weakness of shrinkage methods in handling data with outliers and/or noise can be remedied by hybridising with QR. The adjusted R² and the mean absolute scaled error (MASE) were used as measures of performance and accuracy, respectively. RFs would not perform better than shrinkage methods amid multicollinearity; as a result, a regularised RF (RRF) was also considered. Noting that SI data are heavy-tailed and sometimes contain outliers, it was checked whether hybridising both shrinkage methods and RFs separately with QR would improve their performances. The variables selected by the different models were analysed, and the results (together with the R²) were used to check the stability of the methods through sensitivity analysis.

Related literature
Amongst the several studies done on solar irradiation in the Southern Africa region so far, only five considered the selection of variables, to the best of our knowledge. Four of them applied lasso, but none compared variable selection methods to check which one would best apply to their different situational data sets. However, outside Southern Africa, the latest variable selection method comparison in SI studies was done by Muller (2021). Their clustering, nested modelling and regression (CNR) model performed the best under high sensitivity when compared to lasso, lasso least angle regression and elastic net. The study restricted the number of features, and CNR identified relevant information better than any other method. In contrast, El Motaki and El Fengour (2021) used meteorological and geographical data with no correlated features as conditional variants in comparing different filter, wrapper and embedded variable selection methods; conclusions were drawn from comparisons of the reduced number of features, stability and regression accuracy. Lasso was considered by Tang et al. (2017) as a solar power generation forecasting tool. Their comparison focused on forecasting accuracy rather than optimal variable selection, and found lasso capable of optimising feature selection as a trade-off between complexity and model forecasting accuracy. Outside the solar energy industry, Yilmaz and Kuvat (2023) used the R-squared metric to investigate the effect of nine feature selection methods (lasso and elastic net included) on the success of overall equipment effectiveness prediction. Omoruyi et al. (2019) applied mean rank to optimise model selection among direct search, forward selection, backward and stepwise selection on gross domestic product data. Sanchez-Pinto et al. (2018) concluded that variable selection method performance is associated with sample size after comparing regression-based and tree-based algorithms. The elastic net was found to be the most superior among the six algorithms compared by Williams et al. (2015) when applied to frequency and severity models of homeowner insurance claims. Simulation has been found useful in variable selection method comparative studies (Kipruto and Sauerbrei, 2021; Mehmood et al., 2020; Celeux et al., 2015). Surveys and reviews alike can be approaches for investigating the strengths and weaknesses of variable selection methods (Li et al., 2017; Wang, Wang and Chang, 2016; Khalid et al., 2014). All these previous studies focused on the properties of variable selection methods as the basis for comparison, except El Motaki and El Fengour (2021), who considered variants in the data properties. Therefore, the current study extends the comparison of variable selection methods to data sets with separate and combined existence of multicollinearity, heavy tails and outliers. It also adds time horizon variants to meteorological and geographical variants, which has not been done before. Among the existing embedded variable selection methods used in previous studies, the current research introduces the PQR, RRF and QRF models.
Lasso being the most common variable selection method in SI studies using Southern Africa data, Mpfumali et al. (2019) and Chandiwana et al. (2021) extended its implementation via hierarchical interactions between predictor variables. They claim that consideration of interactions greatly expands the understanding of the relationships among variables. Mpfumali et al. (2019) highlighted that the lasso is useful especially when dealing with highly correlated predictors, but they mixed up relevance and importance in interpreting their results. Mutavhatsindi et al. (2020) presented a bar chart of variable coefficient values and also misinterpreted them as important features. A feature coefficient may be shrunk to zero by a regularisation process, but that does not necessarily mean the feature is unimportant. Ratshilengo et al. (2021) agreed with many other researchers who used lasso that the method tackles issues of model overfitting. The many advantages of lasso over other regularisation methods have made it popular. However, the question is whether the method would be the best in all SI cases. Leng et al. (2006) highlighted that when superfluous variables exist in a regression model and the design matrix is correlated, then the probability of a regularisation method identifying the true set of important variables is less than a constant not depending on the sample size. Thus, prediction-accuracy-based criteria are not suitable in such a situation. As a result, some researchers have suggested adjustments to the lasso, including Alhamzawi and Ali (2018), who extended the adjustment to the Bayesian adaptive lasso. We agree with Belloni and Chernozhukov (2011) that running the selection procedure at quantile levels may help. They developed PQR, a hybrid feature selection of regularisation and quantile regression. Randa et al. (2022) stipulated that penalisation removes at least nearly all covariates whose population coefficients are zero in a QR process. Since penalised least squares regression is not robust to outliers or heavy-tailed errors (Su and Wang, 2021), combining the technique with QR (a useful method under heteroscedasticity) has been found to improve regularisation. QR estimates the response feature at different quantile levels, which gives precise insight into the relationship at the upper and lower tails of the distribution. Gu et al. (2017) added that the asymptotic behaviour of quantiles can be directly investigated amid increasing data dimensions. PQR is very competitive, accurate and efficient. However, QR is generally inconsistent in high dimensional settings, while shrinkage deals with high multicollinearity among covariates. The method has not been applied to SI data from Southern Africa.
The other weakness of shrinkage methods is that they only focus on associations of covariates with the response through correlation. That is, they lack a measure of the importance of a variable to the response, and yet lack of correlation does not necessarily mean lack of importance. RFs classify features as important or rejected. Leo Breiman in 2001 advocated that RFs are the best classifiers for high dimensional data. They form an ensemble of weak unbiased classifiers which combine their results during final classification. No tuning is necessary since trees are grown until each leaf contains just a few elements. Munshi and Moharil (2022) added that RFs can handle missing values with no overfitting. They are less affected by noise in the data, robust to outliers, and stable. Villegas-Mier et al. (2022) found that RFs gave a robust performance with similar results in two different scenarios. RFs delivered accurate and precise results when mapping SI at high latitudes (Babar et al., 2020). Lee et al. (2020) also found that lagged solar irradiance features contribute significantly to the ensemble model; in their RF model, SI at a one-hour lag and relative humidity showed high importance scores on USA data from six stations. Zeng et al. (2020) concluded that the RF model had high performance under different climatic and geographic conditions. The importance scores computed by Ibrahim and Khatib (2017) showed that sunshine, hour and temperature were the most important features. We appreciate that assessing importance scores guides the feature selection process. However, the concept of ignoring correlations makes RFs insensitive to interaction effects. Therefore, Deng and Runger (2012) proposed a tree regularisation framework based on the random forest method, called the regularised random forest (RRF). The model was developed to improve the performance of RFs on data sets with significant multicollinearity. The quantile regression forest (QRF) model, which includes percentiles, adds the non-parametric property of QR to give valuable information about the dispersion of observations. Freeman et al. (2023) claimed that it is possible to map the upper and lower bounds of predictions with QRF because predicted medians and quantiles can be mapped. These predictions from individual trees in the model can follow any probability distribution. Vaysse and Lagacherie (2017) used the extended ensemble method on soil properties, and it performed better than the common regression kriging. Another strong property of QRF is that it does not assume any prior distribution or stationarity of the response variable, so it describes variability in the real world better than linear QR. This is probably one of the reasons why Vantas et al. (2020) compared QRF with the state-of-the-art kriging method; QRF compared very well on rainfall erosivity data in Greece. Asnaghi et al. (2017) used QRF as a novel approach for coastal management of harmful algal blooms. They found the methodology flexible in such a way that it could be extended to other ecological phenomena that depend on meteorological features. Maxwell et al. (2021) also addressed weaknesses in geostatistical methods for modelling coal properties by proposing a QRF algorithm. The algorithm performed better than the most popular regression kriging method in the field of geostatistics. However, they found that the algorithm is less intuitive and computationally demanding. To the best of our knowledge, the approach has not yet been applied to Southern Africa SI data, nor to any SI variable selection study globally.

Data and variables
Radiometric stations in the Southern Africa region are geographically located as shown in Table 1.
Data uploaded from the radiometric stations by the Southern African Universities Radiometric Network (SAURAN) into their database can be accessed at their website (https://sauran.ac.za). The Meteorological Services Department in Zimbabwe supplied daily averaged insolation data from their Goetz observatory in Bulawayo. At least nineteen features were considered from each SAURAN station on the hourly recorded SI data sets, with Windhoek having the highest number at twenty-one variables. The daily recorded data sets had more variables, with Windhoek again having the highest number at thirty-three. Solar irradiance from the SAURAN database was measured as hourly averaged global horizontal irradiation (GHI) in watts per square metre. At the Goetz observatory, however, solar irradiation was measured as daily total insolation, also in watts per square metre. The features considered from each location are shown in Table 2.

Main assumption
It is assumed that all radiometric stations in the Southern Africa region experience similar climatic conditions.As a result, though variable selection methods may select different variables in different locations, conclusions would apply to any other location within the Southern Africa region.That is, a change of location within the same climatic region does not significantly affect features that influence the amount of solar energy received on Earth.

Regularisation methods
A shrinkage method is a general procedure for improving a least squares estimator, and comprises reducing variance by adding constraints on the values of the coefficients. Shrinkage methods remove irrelevant variables by reducing their fitted coefficients to zero.
Theorem 1. A variable Xi ∈ X is irrelevant to Y with respect to X if, for all A ⊆ X, the conditional mutual information I(Xi; Y | A) of Xi and Y given the variables in A is equal to zero.
Thus, lasso and elastic net (ridge regression being a special case of the elastic net) were developed to identify the variables in Theorem 1 by shrinking the coefficients of irrelevant variables to zero. The algorithm developed by Friedman et al. (2010) generalises naturally to unstandardised observations, with the elastic net penalty given in Equation 2 as Pα(β) = Σj [(1 − α)βj²/2 + α|βj|].
The lasso penalty is the special case of the elastic net penalty with α = 1 in Equation 2. When α = 1 − ε for some small ε > 0, the elastic net performs much like the lasso but removes any degeneracies and wild behaviour caused by extreme correlations. When α = 0 the penalty reduces to the ridge regression penalty. Ridge regression was proposed to minimise the residual sum of squares subject to the constraint in Equation 3.
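The role of α in Equation 2 can be illustrated with a short sketch. This is a hedged illustration in Python with our own names, not code from the study; the penalty follows the Friedman et al. (2010) form, with lasso and ridge recovered at α = 1 and α = 0.

```python
# Illustrative sketch (not from the paper): the elastic net penalty of
# Equation 2, P_alpha(beta) = sum_j [(1 - alpha) * beta_j**2 / 2 + alpha * |beta_j|],
# with lasso (alpha = 1) and ridge (alpha = 0) as the two limiting cases.

def elastic_net_penalty(beta, alpha):
    """Elastic net penalty for coefficient vector `beta` and mixing weight
    `alpha` in [0, 1]; alpha = 1 gives the lasso penalty, alpha = 0 the
    ridge penalty (with the conventional factor 1/2 on the squared term)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return sum((1 - alpha) * b * b / 2 + alpha * abs(b) for b in beta)

beta = [0.0, -1.5, 2.0]                          # hypothetical coefficients
lasso = elastic_net_penalty(beta, alpha=1.0)     # sum of |beta_j| = 3.5
ridge = elastic_net_penalty(beta, alpha=0.0)     # sum of beta_j**2 / 2 = 3.125
```

Intermediate values of α blend the two penalties linearly, which is how the combined `glmnet` structure described later in the paper exposes all three regularisation methods through a single parameter.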
The solution to the problem in Equation 1 is unique given the condition in Theorem 2, which is fulfilled when p ≤ n, that is, assuming that X has full rank. However, even when the solution is not unique the problem is still convex, and neighbourhood search results can be used. If cross-validation is applied when solving the problem in Equation 1 for a good value of the regularisation parameter, then λ is chosen to minimise the prediction error (PE) in Equation 4, where Y ∈ R^(N×1) is the vector of response variable observations and X ∈ R^(N×p) is the matrix of predictor variables. Different model structures are obtained from several choices of λ that give the same PE. However, the choice of the parameter using n-fold cross-validation is unstable in many cases (Park and Casella, 2008).
The main objective is to achieve better prediction in the face of multicollinearity while preventing overfitting. Regularisation shrinks the coefficients of correlated variables towards each other, which allows them to borrow strength from each other. Lasso, however, shrinks all regression coefficients towards zero, and yet some variables may be important. On the other hand, lasso ensures that only relevant variables are selected. Lasso also tends to pick one of a group of highly correlated variables and ignore the rest; the selected coefficient has a high variance because this collinearity makes the corresponding coefficient standard error unstable. Regularisation is achieved by continuously shrinking the coefficients such that, if λ is sufficiently large, some coefficients are shrunk exactly to zero. A little bias is accepted to reduce the variance of the predicted values, with the overall benefit of improving prediction accuracy. This makes regularisation a relevant technique where feature selection is included in the statistical modelling process and the results are used for further analyses. The feature selection and regularisation performed by lasso enhance model prediction accuracy by tackling the problem of overfitting. Lasso does so on finite samples and performs well in cases where p may grow faster than n. It is a novel approach to problems of high dimensional non-linear modelling where the structure of the model has to be detected. Most other methods are inadequate in that they do not deal well with large numbers of irrelevant explanatory variables; this makes lasso an advantageous variable selection procedure in difficult forecasting problems. However, lasso selects at most n variables before saturating on data sets with small n but a large number of variables. It is also a challenge to apply shrinkage methods when some variables are recorded through calculations, because there would be no common reference point for choosing which features should be selected.

Penalised quantile regression
When cross-validation is repeated, shrinkage methods tend to be unstable. In addition, these regularised least squares regression methods are not robust to outliers or heavy-tailed error distributions, whereas QR is robust and sparse, and gives unique insights into the relationship between features and the response variable. Penalisation in QR removes at least nearly all features whose population coefficients are zero, whilst (in comparison to lasso in particular) the method of percentiles deals with the instability of its coefficients under repeated cross-validation. Penalised QR therefore offers sparse solutions and performs automatic feature selection. The ℓ1-penalised QR estimator β̂(τ) is the solution to the optimisation problem shown in Equation 5.
The overall penalty level λ√(τ(1 − τ)) depends on each quantile τ, whilst λ depends on T.
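A minimal sketch of the building blocks of the PQR objective in Equation 5 may help; the function names are ours, not the paper's notation.

```python
# Sketch (assumed notation, not the paper's) of the two parts of the l1
# penalised quantile regression objective: the quantile (pinball) check loss
# rho_tau(u) = u * (tau - 1{u < 0}), plus an l1 penalty on the coefficients.

def check_loss(u, tau):
    """Check loss of a residual u at quantile level tau in (0, 1)."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def pqr_objective(residuals, beta, tau, lam):
    """Mean check loss plus lam times the l1 norm of the coefficients."""
    fit = sum(check_loss(u, tau) for u in residuals) / len(residuals)
    return fit + lam * sum(abs(b) for b in beta)

# At tau = 0.5 the check loss halves the absolute error, so robust median
# regression is recovered as a special case.
```

Note the kink at u = 0: the loss is continuous but not differentiable there, which is the non-smoothness of the check loss flagged in Table 3.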
Table 3 gives a summarised comparison of the lasso, elastic net, ridge and PQR as a group of shrinkage variable selection methods.

Tree-based algorithms

Random forests
RFs are popular learning models for solving a variety of classification and regression problems. They are based on a multitude of decision trees which are independently developed on different sample bags taken from the training set. RFs are an ensemble method in which classification is performed by the voting of multiple unbiased decision trees. Different subsets of attributes are randomly selected at each step of tree construction, and the trees are grown on different bootstrap samples of the training set. Each bootstrap sample is drawn with replacement and has the same size as the original set. Trees trained on different parts of the same training set are averaged to reduce variance, although this increases bias and the models may lose interpretability. This teamwork of many trees pools the efforts of the individual tree algorithms, so the performance of the final model is greatly boosted relative to a single random tree. Trees that are grown very deep can learn highly irregular patterns in the data and have a low bias, at the expense of overfitting the training sets.

[Table 3 excerpt continues:]

Ridge: Achieves good prediction when covariates are correlated while preventing overfitting. The coefficients of some important variables may be reduced towards zero, but the penalty term cannot force the coefficients to be exactly zero. Coefficient estimates can change substantially when a given predictor is multiplied by a constant.

PQR: Robust to outliers. Gives unique insights into the relationship between features and the response variable at all quantile levels of the distribution. Deals with its coefficients' instability under repeated cross-validation. Lacks the ability to reveal grouping information, and the check loss function is not smooth.

The original data set is extended by adding so-called shadow features, whose values are randomly permuted among the training cases to remove their correlations with the decision variable.
Then importance estimation of a feature is calculated as the loss of classification accuracy caused by a random permutation of feature values of cases.
That is, the importance of a variable Xi is evaluated as the mean decrease impurity (MDI) importance shown in Equation 7, where p(j) is the proportion nj/n of samples reaching node j, Δi(sj) is the weighted impurity decrease, and v(sj) is the variable used in split sj. Now, if X−i denotes the subset X \ {Xi} and Pk(X−i) is the set of subsets of X−i of cardinality k, then we can use Theorem 3 to compute the MDI.
Theorem 3. MDI importance of Xi ∈ X for Y as computed with an infinite ensemble of fully developed randomised trees and an infinitely large training sample is as shown in Equation 8.
Theorem 4. Xi ∈ X is irrelevant to Y with respect to X if and only if its infinite sample size importance, as computed with an infinite ensemble of fully developed randomised trees built on X for Y, is 0.
We deduce from Theorem 4 that an irrelevant variable has no importance, but among relevant variables there can be strongly relevant ones, which are classified as confirmed important, while weakly relevant variables are classified as tentative.
RFs improve learning performance through a voting system over a set number of decision trees. As new objects come in, all trees in the forest classify them, and the final decision is made through this voting system. Each tree also votes on the objects that were not involved in its construction, and the votes for the correct class are recorded for each tree. Then the values of variables are randomly permuted across objects and the classification is repeated. In summary, RFs exhibit the characteristics of random feature selection, bootstrap sampling, out-of-bag (OOB) error estimation and full-depth decision tree growing. That is, an RF model first extracts samples by bootstrap sampling and then randomly selects features of those samples, as shown in Figure 1.
These two steps of random sampling make RFs more tolerant to noise in the data and reduce the possibility of overfitting. However, when data contain random correlations among a large number of variables, it is difficult for RFs to distinguish truly important variables from those that appear important only by chance. An RRF may be considered on such data.
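The two layers of randomness just described, bootstrap sampling with an out-of-bag remainder and random feature sub-selection at each split, can be sketched as follows. This is illustrative Python with hypothetical feature names; the tree-growing itself is abstracted away.

```python
# Toy sketch of the sampling machinery of an RF: bootstrap sampling of
# observations (with replacement) leaving an out-of-bag (OOB) set for each
# tree, and random selection of m < p candidate features at a split.

import random

def bootstrap_sample(n, rng):
    """Draw a bootstrap sample of row indices and return it together with
    the out-of-bag indices that this tree never sees during training."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def random_feature_subset(features, m, rng):
    """Randomly choose m candidate features to consider at one split."""
    return rng.sample(features, m)

rng = random.Random(42)
in_bag, oob = bootstrap_sample(10, rng)
# Hypothetical covariate names, loosely echoing irradiance data sets:
subset = random_feature_subset(["Temp", "RH", "WS", "WD", "Rain"], 2, rng)
```

Because each tree sees only its in-bag rows, the OOB rows provide the built-in error estimate mentioned above without a separate validation set.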

Regularised random forests
The RRF framework is a general feature selection framework in which the process is greedy and features are evaluated on a sub-sample at each node; that is, in each of those nodes all available observations are analysed. Feature selection is done by avoiding features not belonging to F (the feature set used in previous splits in the tree model) unless the regularised information gain defined by Equation 9 is significantly larger than the maximum gain of the features already in F. Here λ ∈ [0, 1] is the penalty charged on gain(Xi, v) for a feature that does not belong to F. The assumption is that maximising gain(Xi, v) selects the splitting feature at any tree node.
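A minimal sketch of the gating rule implied by Equation 9 (our own function names; λ and F as in the text): a feature outside F has its gain discounted by λ, so a new feature is selected only when its discounted gain still wins.

```python
# Sketch (after Deng and Runger, 2012; our own naming) of the regularised
# information gain: features already in the used set F keep their raw gain,
# while features outside F are penalised by a factor lam in [0, 1].

def regularised_gain(gain, feature, used_features, lam):
    """Return lam * gain if `feature` has not been used in earlier splits,
    otherwise the raw gain."""
    return gain if feature in used_features else lam * gain

F = {"temperature", "humidity"}                       # hypothetical used set
assert regularised_gain(0.8, "temperature", F, 0.5) == 0.8  # already in F
assert regularised_gain(0.8, "wind_speed", F, 0.5) == 0.4   # penalised
```

Smaller λ penalises new features more heavily, which is what steers the RRF towards compact feature subsets on multicollinear data.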

Quantile random forests
QRFs assess the conditional distribution of the response: in each tree, every leaf keeps all of the relevant observations. Even if the observations are unequally weighted, a good approximation of the full conditional distribution can be delivered. Once the weights Wi(x) are calculated, the conditional distribution of the response is given by Equation 10.
That is, for any τ-quantile level defined by Equation 12, we can consider the spread of the response in the form of a quantile, as in Equation 13.
Now, this conditional distribution model can be solved as an optimisation problem, by Equation 14.
where λ is the tuning parameter.
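The prediction step in Equations 10 to 13 amounts to reading a quantile off a weighted empirical distribution of the training responses. A hedged sketch, with our own names and equal weights purely for illustration:

```python
# Illustrative sketch of the QRF prediction step: given the weights w_i(x)
# that the forest assigns to the training responses y_i, the conditional CDF
# is F(y|x) = sum_i w_i(x) * 1{y_i <= y}, and a tau-quantile is the smallest
# y_i at which this weighted CDF reaches tau.

def weighted_quantile(y, weights, tau):
    """tau-quantile of the weighted empirical distribution of responses y."""
    pairs = sorted(zip(y, weights))
    cum = 0.0
    for value, w in pairs:
        cum += w
        if cum >= tau:
            return value
    return pairs[-1][0]

y = [100.0, 250.0, 400.0, 900.0]        # e.g. GHI values in W/m^2
w = [0.25, 0.25, 0.25, 0.25]            # equal weights for illustration
median = weighted_quantile(y, w, 0.5)   # -> 250.0
upper = weighted_quantile(y, w, 0.9)    # -> 900.0
```

Evaluating this at two levels, say τ = 0.05 and τ = 0.95, gives the upper and lower prediction bounds that Freeman et al. (2023) describe mapping.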

Methodology
The evaluation of variable selection methods considered in this study took a comparative investigation approach across different locations and time horizons. The study focused on the performance comparison of embedded variable selection methods, which determine the best feature subset while building the statistical learning model itself; this makes them well suited to solar irradiance modelling. Embedded methods are very relevant to our study because they select variable subsets in the course of training the data (El Motaki and El Fengour, 2021), and their results can further be used to determine how solar irradiation forecasting can be improved. It was deduced from the literature that solar irradiation data contain some outliers and exhibit significant multicollinearity among covariates. The presence of outliers and correlated covariates can significantly influence the performance of variable selection algorithms. Thus, the variance inflation factor (VIF) was used to assess the collinearity of the variables in each data set, and multicollinearity was measured as the proportion of variables with VIFs greater than 1. A Grubbs' test was also conducted on each data set to check for the presence of at least one outlier. Because the normality of the response has a bearing on the properties of variable selection algorithms, the distribution of solar irradiance was explored through the skewness statistic and through box plot construction and interpretation. As a result, there was a need to apply robust and sparse variable selection algorithms, such as the embedded methods. Apart from giving diverse samples that are good enough to establish significant associations between radiometric stations, the multi-site forecasting approach enhances the statistical power of an algorithm (Sigauke et al., 2023). A cross-validation process was then implemented on each multi-site data set, splitting it into 80% training and 20% test samples. Models were fitted on the training data frames and their variable selection capabilities were analysed on the testing data frames. This partitioning strategy avoids overfitting, improves prediction in terms of bias and/or variance, and assesses how effectively the model will perform on data it has not seen before (Yilmaz and Kuvat, 2023).

The embedded methods were classified into two groups, namely the shrinkage methods (lasso, elastic net, ridge, PQR) and the tree-based methods (RF, RRF, QRF). Shrinkage algorithms were developed for better performance while avoiding overfitting on multicollinear covariates through the assessment of variable relevance. Tree-based algorithms, on the other hand, may overfit the data but can learn highly irregular patterns in solar irradiation data with low bias; the ensemble of decision trees selects variables by classifying them as important or not. QR hybrids were also introduced among the algorithms because QR is robust to outliers, whose presence in solar irradiation data has been indicated in the literature. All tree-based algorithms are non-parametric models, which are best adapted to modelling response variables like solar irradiance, whose relationship structures with its covariates are not yet known.

The RMSE metric was used to find the best among lasso, elastic net and ridge. These three shrinkage methods are called regularisation algorithms and share the same model structure, i.e. Equation 1. As a result, we presumed that, for any sample data, if the best of the three (in terms of the RMSE) is inferior to any other model compared in this study, then the remaining regularisation methods can never outperform that other model. For example, suppose that on the Venda data set the elastic net has the lowest RMSE among the regularisation algorithms but is outperformed by the RF algorithm; we would not expect lasso or ridge to outperform the RF algorithm either. In that context, the best regularisation algorithm is the one we compared with PQR and the tree-based algorithms. We combined the regularisation algorithms through the 'glmnet' R package developed by Hastie et al. (2023). The package uses one penalty, λ, but adds another parameter α ∈ [0, 1], so that the general structure of the regularisation model can be written as Equation 15:

$$L(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^{T}\boldsymbol{\beta}\right)^2 + \lambda\left[\alpha\sum_{j=1}^{p}|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\right], \quad (15)$$

where L(β) is minimised. The parameter λ controls how much tuning is done in the penalised least squares regression process. The software has an in-built k-fold cross-validation (CV) algorithm, where the value of k (model size) is automatically selected as the one with the smallest CV error. While the three regularisation algorithms use multiple linear regression to learn the data, PQR uses multiple linear QR and the method of quantiles to estimate the regression coefficients. The Boruta algorithm was used to train the RF and RRF models, while the QRF was trained using non-parametric quantile regression.
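Since the objective in Equation 15 is fully specified by λ and α, it can be written out directly. The study's models were fitted with the 'glmnet' R package; the function below is an illustrative Python re-implementation of the objective only (the function name and data are hypothetical):

```python
def elastic_net_loss(y, X, beta, lam, alpha):
    """Penalised least-squares objective of Equation 15 (illustrative sketch).

    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty,
    and values in between give the elastic net, as in 'glmnet'.
    """
    n = len(y)
    # Residual sum of squares of the linear fit (no intercept for brevity).
    rss = sum((y[i] - sum(b * x for b, x in zip(beta, row))) ** 2
              for i, row in enumerate(X))
    l1 = sum(abs(b) for b in beta)      # lasso part
    l2 = sum(b * b for b in beta)       # ridge part
    return rss / (2 * n) + lam * (alpha * l1 + (1 - alpha) * l2 / 2)
```

Setting α = 1 recovers the lasso penalty, α = 0 the ridge penalty, and intermediate values the elastic net, which is how the three regularisation algorithms were combined in a single framework.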
To prevent the researchers' judgment from biasing the results, a forward selection technique was implemented on each multi-site data set. That is, all variables in each data set were fed into the model at once, and each algorithm was then assessed according to the extent to which it provided an optimal subset resulting in a highly predictive model. The assessments were done by analysing the experimental error according to the RMSE, adjusted R² and MASE evaluation metrics. The RMSE was used as the basis for comparing the goodness-of-fit of the models, while the adjusted R² compared their predictive performances. Since solar irradiation data include zero values of GHI, MASE was used as the basis metric for predictive accuracy comparisons. Performance scores on each metric were generated and a system of ranking the models was introduced to finalise the comparative investigation. Other comparative considerations were the ease of setting up each model in software and the speed of processing results when running the coded algorithm.
Through listing, common variables with coefficients shrunk to zero, and commonly rejected variables, were identified by inspection; they were determined from the relevance and importance scores calculated using the respective fitted models. The flow chart in Figure 2 summarises the comparative investigation approach adopted in this study. The stability of the models was checked through a sensitivity analysis, in which it was observed how model performances changed when sample sizes were changed. That is, we checked for consistency in R² values as we varied sample sizes, and for consistency of general performance across different locations. We also checked whether the algorithms selected the same variables in these different locations.

Goodness-of-fit evaluation
The goodness-of-fit of the models was evaluated by measuring the deviations of the fitted values from the actuals by computing the RMSE, as given in Equation 17:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \quad (17)$$

where $y_i$ is the actual observed SI and $\hat{y}_i$ is the corresponding model fitted value. That is, the RMSE is the standard deviation of the residuals; the smaller the RMSE, the better the model fits the data.
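Equation 17 is straightforward to compute; the following is a minimal illustrative sketch in Python (the study's own code was written in R):

```python
import math

def rmse(actual, fitted):
    """Root mean square error (Equation 17): the standard deviation of
    the residuals between observed and fitted SI values."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / n)
```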

Performance evaluation
Performance is the universal metric for evaluating a learning model. Performance was measured by calculating the proportion of the total variation in solar irradiation that could be explained by the covariates. That proportion was then expressed as a percentage through the coefficient of multiple determination, R². Difficulties in interpreting it are avoided by considering the adjusted R². For a p-dimensional data set of n observations, the adjusted R² can be computed as in Equation 18:

$$\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}. \quad (18)$$

The adjusted R² does not necessarily increase as more and more covariates are introduced, and the best model is the one that gives the maximum value of the metric.
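Equation 18 can be sketched directly from R², n and p; the function name below is ours, not the authors' (an illustrative Python version of the R computation):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared (Equation 18) for n observations and p covariates.
    Penalises R-squared for adding covariates that do not improve the fit."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```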

Accuracy evaluation
The MASE provides an interpretable measure of accuracy in predictive modelling. It is a good metric to use when comparing models trained on different data sets, and is one of the most appropriate metrics when the response has zero or near-zero values. The metric is computed by dividing the mean absolute error (MAE) of the trained model by the MAE of the corresponding naïve model, which predicts the value at a time point as the previous historical value. That is, MASE indicates the effectiveness of a forecasting model relative to a naïve model. A MASE greater than 1 means the forecasting model performs worse than the naïve benchmark; otherwise it is better, and of two models the one with the lower MASE is the better.
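The MASE computation described above can be sketched as follows; this is an illustrative Python version, with the naïve benchmark taken as the one-step-behind forecast exactly as in the text:

```python
def mase(actual, forecast):
    """Mean absolute scaled error: the MAE of the model divided by the
    MAE of the naive one-step forecast (previous historical value)."""
    mae_model = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # Naive benchmark: predict each value as the previous observation.
    mae_naive = (sum(abs(actual[i] - actual[i - 1]) for i in range(1, len(actual)))
                 / (len(actual) - 1))
    return mae_model / mae_naive
```

A value below 1 indicates the model beats the naïve benchmark, which is what Table 9 reports for every model in the study.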

Sensitivity analysis
A sensitivity analysis was done to compare the conclusions of the analysis carried out against those of another analysis in which some aspect of the approach is changed, for example changing parameters or assumptions of the modelling process (such as support, confidence, or lift) to observe how the model changes and to find the optimal values. The sensitivity analysis attempts to assess the appropriateness of a particular model specification and to appreciate the strength of the conclusions being drawn from such a model. The process involves a series of methods for quantifying how the uncertainty in the output of a model is related to the uncertainty in its inputs. In this way it assesses how "sensitive" the model is to fluctuations in the parameters and/or data on which it is built.
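One simple way to carry out the sample-size sensitivity check used in this study is to refit a model on nested subsamples and compare the resulting R² values. The sketch below does this for a synthetic one-predictor least-squares fit; all data, names and sample sizes here are hypothetical and purely illustrative:

```python
import random

def r2_simple_fit(xs, ys):
    """R-squared of a one-predictor least-squares fit (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = my - beta * mx
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic data: a linear signal plus Gaussian noise.
random.seed(1)
x = [random.uniform(0, 10) for _ in range(5000)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Refit on increasing sample sizes and inspect how stable R-squared is.
scores = {n: r2_simple_fit(x[:n], y[:n]) for n in (200, 1000, 5000)}
```

A stable method yields nearly equal scores across the subsamples, which is the pattern the study reports for QRF in Figure 6.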

Data exploration
Multicollinearity was quite high in the hourly data sets from all locations except Cape Town, which had 24%; the rest had more than 30%, as shown in Table 4. Windhoek had extreme multicollinearity of 72% on daily recorded variables, but Bulawayo had a very low multicollinearity of 13%. A Grubbs' test for outliers shows that Cape Town and Durban hourly SI had outliers, as their p-values were less than 0.05. Bulawayo was the only location with outliers on the 24-hour horizon (p-value = 1.07×10⁻⁶). Hourly SI is positively skewed, since all of the skewness values were more than 1.0, while 24-hour SI is not skewed: the skewness values of 24-hour SI can be approximated to zero, as shown in Table 4. Hourly SI is also heavily right-tailed at all locations, as shown by the box plots in Figure 3, whilst daily SI is symmetrically distributed (Figure 4). All of the box plots in Figure 3 have right-hand-side whiskers and no left-hand-side whiskers.
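The two screening diagnostics used above can be sketched as follows. This is an illustrative Python version only: for the VIF it shows the two-covariate special case where VIF = 1/(1 − r²), whereas the study computed VIFs in the usual multi-covariate way, and a full Grubbs' test would additionally compare the statistic against a critical value to obtain a p-value:

```python
import math

def vif_two_covariates(x1, x2):
    """VIF of x1 given a single other covariate x2 (two-variable special
    case): 1 / (1 - r^2), with r the Pearson correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    den = math.sqrt(sum((a - m1) ** 2 for a in x1)
                    * sum((b - m2) ** 2 for b in x2))
    r = num / den
    return 1 / (1 - r * r)

def grubbs_statistic(xs):
    """Grubbs test statistic G = max|x - mean| / sd; a large G flags a
    potential outlier (compare against the critical value for alpha)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return max(abs(x - mean) for x in xs) / sd
```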

Evaluation of regularisation algorithms
The first model comparisons were made amongst the regularisation algorithms because they share the model structure shown in Equation 1. To determine the best regularisation algorithm, RMSEs from the models of Equation 14 trained for each value of α were calculated and are given in Table 5. The best regularisation algorithm for a particular location is the one with its RMSE in bold. The results show that lasso was the best shrinkage method in only one location, Windhoek, on hourly SI data; that data set had a high multicollinearity percentage. Either elastic net or ridge regression was the best for the other locational data sets. It can also be observed that ridge regression was the best in Venda, where the data set had the highest multicollinearity percentage. A 24-hour time horizon was also considered, in order to check how the time horizon influences the variable selection methods. Results show that lasso was not the best in any of the locations considered, as shown in Table 6; instead, the elastic net was dominant for 24-hour SI in the variable selection context. Bulawayo was the only location where the elastic net was inferior to ridge regression. We observe that the Bulawayo 24-hour data had the lowest multicollinearity percentage of 13, but outliers were present in the data set, which was also one of only two that were negatively skewed.

Comparisons of variable selection methods
The best regularisation algorithm for each location was compared with the PQR, RF, RRF and QRF variable selection methods. All algorithms were coded and run in the R programming software, where the 'quantreg' (Koenker, 2018) and 'rqPen' packages were used to fit the PQR model, the 'Boruta' (Kursa and Rudnicki, 2022) package fitted the RF model, the 'randomForest' and 'RRF' (Deng et al., 2022) packages were used to fit the RRF model, and the 'quantregForest' (Meinshausen, 2022) package fitted the QRF model.

Goodness-of-fit evaluations
The RMSE for the best regularisation algorithm at each location was also compared against PQR, RF, RRF and QRF. QRF had the lowest RMSE at all locations for hourly SI, as shown in Table 5 (the RMSE in bold is the smallest for that particular location). Results show that PQR had larger RMSEs than the regularisation algorithms at all locations except Durban, where there was an improvement over the elastic net. We suspect the reason is that more than 90% of the features considered at all locations were important, as evidenced by the importance scores from the RF, RRF and QRF models; that is, the data sets considered did not have superfluous features, so hybridising a regularisation model with quantile regression would not improve the model. The results thus demonstrate that shrinkage methods have been developed to overcome the challenges of multicollinearity in modelling, because they could handle the high multicollinearity in the data sets considered. At all locations the shrinkage methods performed markedly better than the RFs. Even regularising the RF did not improve the selection process, except for the Gaborone 24-hour data, as shown in Table 7: the RMSEs of all corresponding RRFs in both Table 7 and Table 8 were higher than those of the RFs, worsening the performance of the RFs. Surprisingly, adding the non-parametric property of QR to the RF model significantly improved the feature selection performance of the RF at all locations except for 24-hour SI in Gaborone. The QRF hybrid model became the best method for all hourly SI. The non-parametric property of QR and its several other advantages in regression modelling could be credited with this significant improvement. It is noted that both QR and RFs are robust to and tolerant of outliers; in the locations where there were outliers (Cape Town for the hourly time horizon and Bulawayo for the daily time horizon), the performance of QRF was the best among the locations. The median conditional distribution (i.e. at τ = 0.5) gave the best description of solar irradiation in almost all locations. However, the model cannot measure the association of features with the response through correlations. Though QRF was better than any of the RFs, it performed best in only three locations, and the elastic net was the best in Windhoek for the daily time horizon (it is noted that Windhoek has extreme multicollinearity). Likewise, hybridising a shrinkage method with QR, as on the hourly time horizon data sets, did not improve the selection method on daily recorded SI except for Pretoria, where PQR was the best selection method.
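The role of τ can be made concrete through the pinball (quantile) loss that quantile regression minimises: at τ = 0.5 the loss is half the absolute error, so the fit targets the conditional median, which is why it is robust to outliers. Below is a minimal illustrative sketch, not the 'quantregForest' internals:

```python
def pinball_loss(actual, predicted, tau):
    """Mean quantile (pinball) loss minimised by quantile regression.
    Over-predictions are weighted by (1 - tau), under-predictions by tau."""
    total = 0.0
    for y, q in zip(actual, predicted):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(actual)
```

For right-skewed data like hourly SI, the median (τ = 0.5) achieves a lower pinball loss than the outlier-inflated mean, illustrating the robustness the QRF inherits from QR.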

Adjusted R-square comparisons
The adjusted R-square was used as the performance indicator in this study; Figure 5 shows the adjusted R-squared scores. All of the adjusted R-squared scores were at least 90%, except for the shrinkage methods and PQR on the Gaborone data set, where the shrinkage methods scored 78.25% and PQR 81.33%. That data set had a relatively high multicollinearity percentage of 42, no outliers, and was comparably skewed to the other data sets used in this study; further investigation may be required to find the reason for the comparatively low adjusted R-squared scores. The new QRF model had the highest adjusted R-squared scores on all data sets, followed by RRF and then RF, showing that the hybridisation of an RF with QR performed better than the RF on its own. Results show that RF-based algorithms had better predictive performance than the regularisation methods; that is, RF-based algorithms could explain the total variation in solar irradiation caused by the covariates considered in this study better than the shrinkage methods and PQR. Results also show that introducing QR into the regularisation methods did not improve their performance, because the PQR model had smaller or almost equal adjusted R-squared scores compared with the shrinkage methods at all locations.

Accuracy evaluations
Results in Table 9 show that all of the models performed better than their corresponding naïve models when trained on the data from all locations under study: all of the MASE values were less than one. The results also show that QRF had the smallest MASE value in each location (the locational MASE values in bold), meaning that it was the most accurate variable selection method for each location. It is also the method with the smallest value (MASE = 0.026) among all computed MASE values. It can therefore be deduced that QRF was the most accurate variable selection method.

Model rankings
The average rankings of the variable selection methods over all of the locations are shown in Table 10. Results show that QRF was ranked first on all of the metrics; that is, QRF is the overall superior variable selection method among the methods considered in this study. The shrinkage methods were ranked second on RMSE but last on both adjusted R-square and MASE. Thus, although the RFs ranked better on performance and accuracy, the shrinkage methods fitted the data better.

Feature selection evaluation
Features with coefficients shrunk to zero, or rejected, differed by selection method, location and time horizon. Table 11 shows that total rain, wind speed, maximum wind speed, wind direction, wind direction standard deviation, 12V battery, 12V battery minimum and 24V battery can be excluded from hourly SI prediction modelling; these features were either rejected or had coefficients shrunk to zero in at least two locations or by two methods. On 24-hour SI, the features that were not selected in at least two locations were day, maximum barometric pressure, wind speed, wind direction, wind velocity magnitude, 12V battery, 24V battery minimum and averaged DNI (see Table 12). Table 13 shows the features with coefficients shrunk to zero when applying the better method between shrinkage and PQR. The coefficient of wind direction was shrunk to zero at all locations except Pretoria on hourly SI, and the day coefficient was shrunk to zero at Pretoria and two other locations. On 24-hour SI, day, wind speed, 12V battery average, wind vector magnitude and wind direction can be removed. Rejected variables, or those with importance scores below 1.5%, were extracted from the best selection method among RF, RRF and QRF. Table 13 shows that wind direction, wind standard deviation and total rainfall are not important for hourly SI; day is likewise not important for 24-hour SI.
Year, month, temperature, DHI, wind speed, 12V battery and 24V battery were found to have the most significant relevance for hourly SI, whilst month and minimum temperature were the most relevant variables for 24-hour SI. Hour, DHI, DNI, temperature, relative humidity and barometric pressure were the most important features for hourly SI; month and DHI were the most important features for 24-hour SI.

Sensitivity analysis
Results from Section 4.4 show that hour, DNI, DHI, temperature, relative humidity, barometric pressure and wind speed should always be considered as covariates when modelling SI. These results agree with previous studies that included variable selection when modelling SI. Since RFs are non-parametric models, the stability of the models was checked through variations in sample size. Results given in Figure 6 show that the QRF algorithm was not sensitive to any sample size changes, demonstrating it to be the most stable variable selection method. The regularisation, PQR, RF and RRF models were sensitive to changes at smaller sample sizes but not at large ones; that is, the algorithms became more stable as the sample size increased. All of the algorithms selected the same variables across all of the different sample sizes considered, as shown in Table 15. However, it has to be noted that all shrinkage methods shrunk the coefficient of wind direction to zero, while all tree-based algorithms selected all of the variables (considered for the sensitivity analysis) as important features when modelling SI.

Discussion
Although lasso is the most common variable selection method among previous SI studies in Southern Africa, the results from this study show that, among the regularisation algorithms considered, ridge was the best in most locations. The focus of this study was on comparing the variable selection capabilities of different algorithms. The literature confirms that ridge is good only for selecting relevant variables in the presence of multicollinearity, whereas lasso is good when the selection method is further applied as a forecasting model. Here, however, lasso was the best in only one location, Windhoek. Ridge prevents overfitting when covariates are correlated, which is why it was the best in Venda and Gaborone, where multicollinearity was highest; ridge regression is a very good algorithm when focusing on dimension reduction only, as in this study. The combination of ridge and lasso, the elastic net (Sigauke et al., 2023), which is expected to overcome their limitations, was the best in Durban, where there were outliers, and in Pretoria, whose data had neither outliers nor multicollinearity. Though outliers do not cause serious problems for lasso and ridge, the two algorithms do not perform well in the presence of many outliers. The performance of the elastic net in the presence of several outliers is suspected to be attributable to its ability to remove degeneracies and wild behaviours in the data. Since the elastic net provides a compromise between lasso and ridge regression, it can be expected to perform better than both on a data set like Pretoria's, with no outliers and low multicollinearity.
The new QRF algorithm outperformed all variable selection methods considered in this study on all metric evaluations when training all of the different SI data in Table 2. This superiority can be attributed to the ability of QRF to learn any pattern of any response for any provided data (Dega et al., 2023). SI data are non-Gaussian, and this study shows that the algorithm can be applied to both deterministic correction and probabilistic calibration of a skewed distribution of meteorological features, as claimed by Evin et al. (2021). Apart from exhibiting the strengths of both RF and QR modelling, the hybrid algorithm demonstrated that it handled well the weaknesses of QR and RF modelling taken separately. Though it is difficult to discern truly important variables from those that gain importance when applying a tree-based algorithm to high-dimensional data with random correlations, the hybrid algorithm handled the very high multicollinearity in Gaborone, Venda and Windhoek excellently. QRF was robust to the outliers in the Cape Town and Durban SI data, although QR depends on the completeness of the meteorological data (Ayodele et al., 2016) and its prediction errors for the next value in the series are often large in short-term forecasting. As a result, the noise in the data did not affect the algorithm as much as it affected the other algorithms. Those are the powerful properties the hybrid algorithm inherited from QR: robustness to outliers and tolerance of noise in the data (Diez-Olivan et al., 2018; Vantas et al., 2020). In addition, QRF infers conditional quantiles and gives a non-parametric and accurate way of estimating conditional quantiles in high-dimensional cases (Gostkowski and Gajowniczek, 2020).
Although the shrinkage algorithms were outright inferior to the proposed QRF, they had better RMSE values than the rest of the algorithms. This showed that the in-built k-fold cross-validation and regularisation in shrinkage algorithms tackled the problem of overfitting quite well. RMSE measures the deviations of the fitted values from the actuals as a goodness-of-fit metric, and the shrinkage methods were developed to achieve both feature selection and regularisation in the presence of multicollinearity. Though the variance reduction done by shrinkage introduces a little bias, the continued shrinking of the coefficients improves the minimisation of the prediction error, thus enhancing prediction accuracy by tackling overfitting (Fonti and Belitser, 2017). Moreover, the shrinkage algorithms work well on small sample sizes and high-dimensional data, which is why lasso in particular performs well in finite-sample cases where p may grow faster than n (Brink-Jensen and Ekstrom, 2021); the smallest sample size in this study was 5000 data points and the smallest variable dimension was 10. In addition, the process of shrinking coefficients focuses on feature relevance while ignoring importance. The relevance of a feature is determined by measuring associations through correlation, thereby indirectly applying Gaussian and parametric modelling assumptions. The data exploration results here showed that SI data are non-Gaussian, and the relationship structures between SI and its covariates are not yet known; when dealing with SI data it is always best to work with non-linearity assumptions, and non-parametric assumptions are even better. Consequently, these results show the superiority of the tree-based algorithms over the regularisation algorithms on the R² and MASE metrics. Tree-based algorithms are non-parametric models with a very good ability to learn highly irregular patterns in non-linear data (Ibrahim and Khatib, 2017), which is why both the RF and RRF algorithms had better R² and adjusted R² values than the shrinkage methods. By pulling together a forest of trees, RFs greatly boost the performance of single trees. The better MASE values of the tree-based algorithms compared with the shrinkage methods can be accounted for by the OOB estimation they perform through randomly selecting features from bootstrap samples. The RF and RRF algorithms have similar feature selection processes, but RRF selects features by avoiding features that do not belong to the feature set used in previous splits of the tree model. This avoidance did not lead to any superiority of RRF over RF in our SI study, because all metric values of the two algorithms were approximately equal. However, the present results agree with those of Deng and Runger (2012), in that the RRF was developed to improve the performance of RFs on data sets with significant multicollinearity: the adjusted R-squared scores and MASE values from the RRF algorithm were slightly higher than those from the RF algorithm in Gaborone, Venda and Windhoek. The literature highlights the possibility of overfitting in tree-based algorithms, and the packages used in this study to fit them did not have in-built k-fold cross-validation; we therefore attribute the reduced overfitting of the tree-based algorithms to the partitioning strategy employed on all data sets. The very high adjusted R-squared scores and very low MASE values of all algorithms compared in this study indicate that our results agree with the finding of Ludwig et al. (2015) that both shrinkage and tree-based algorithms can select the right variables. The literature also specifies that shrinkage and tree-based algorithms are stable variable selection methods, and the results in this study demonstrated this: the algorithms selected the same variables on different situational data sets and had approximately equal R-squared scores at different sample sizes.
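The coefficient shrinkage discussed above can be seen most simply in the one-predictor ridge estimator, which has a closed form. The sketch below is a toy illustration of the bias-variance trade-off, not the study's model: as λ grows, the coefficient is pulled toward zero, trading a little bias for lower variance:

```python
def ridge_coefficient(xs, ys, lam):
    """Closed-form ridge coefficient for one centred predictor:
    beta = sum(x*y) / (sum(x^2) + lambda).
    lam = 0 recovers ordinary least squares; larger lam shrinks beta."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```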

Conclusion
This study introduced PQR for feature subset selection in SI studies and compared it with other shrinkage methods. Though lasso is the most popular model among the regularisation algorithms, it was not the best in some locations; therefore, a comparison of the regularised methods should be made before applying a selected variable selection method. Tree-based variable selection methods were also included in the study and compared with the shrinkage methods. It emerged that there are data situations in which one or other of the techniques is best. The study focused on situations of multicollinearity, outlier existence, skewness and heavy-tailed data. The hybrid model between QR and RFs, that is, the QRF model, performed best in most of these situations, and the conclusion was drawn that the QRF model is the best method for SI data sets, in which outliers are often present. However, the RF in the hybrid model is not sensitive to associations through correlations, while QR offers a way of exploring sources of heterogeneity in covariates. As a result, it would not be advisable to conclude a feature selection exercise using results from the QRF model only: as features are classified as important or not, it is also paramount to measure their relevance. It would therefore be prudent to run the feature selection process in two stages, starting with the relevance of features to associations through correlations and then classifying their importance. We conclude that a variable's coefficient is shrunk to zero if and only if it is not important, but a relevant variable may not be important; that is, relevant variables can be classified as strongly or weakly relevant through importance scores. Further studies can be done on how regularisation can be applied to QRF modelling, that is, developing a model that is stable and accurate under multicollinearity, outlier existence and heavy-tailed data situations. If there exist groups of highly correlated features, then group regularisation should be considered; further studies can also include interactions among the features. It can also be concluded that hourly or monthly time and temperature are paramount variables in SI modelling in Southern Africa; time can be recorded in hourly or monthly units depending on the study. Day, recorded as a variable, is neither relevant nor important when modelling SI. This study also leads to the conclusion that, apart from location and time horizon, the covariates paramount for predicting SI may vary depending on the context of the study or application.

Figure 5: Bar chart exhibiting the R-squared scores.