Outlier detection in ground-measured solar resource data using statistical classification models

Authors

Chantelle Clohessy, Warren Brettenny, Waldo Abrahams

DOI:

https://doi.org/10.17159/2413-3051/2025/v36i1a20742

Keywords:

outlier detection, solar energy resource assessment, statistical learning

Abstract

Ground-based solar resource measurements are generally preferred to synthetic or simulated data for a given location, but outliers in these data can significantly reduce the accuracy of predictions used in viability assessments. For solar energy installations to be self-sustaining and viable, accurate ground-based solar resource data for the installation site are essential for decision-making and planning. Conventional outlier detection techniques for solar resource data, which range from graphical plots to complex numerical approaches, often fail to identify these outliers to a satisfactory degree. This study proposes adding simulated outliers to synthetic data in order to train and compare traditional outlier detection methods and several statistical learning methods, including NN, naïve Bayes, support vector machines and advanced tree-based models. The results indicate that the advanced tree-based models identify outliers accurately in the simulation step, and they are demonstrated to be effective on a real-world ground-based data set collected in Gqeberha, South Africa. The proposed approach can help to reduce the uncertainty in measured solar resource data and, as a result, promote the use of solar energy solutions in areas with unreliable solar resource data.
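As a rough illustration of the workflow the abstract describes (inject labelled outliers into synthetic data, train a classifier, then check it on held-out data), the sketch below may help. It is not the authors' method: the half-sine `clear_sky` curve, the spike sizes, the fixed injection positions and the depth-one decision tree (a "stump", standing in for the far richer tree-based models in the study) are all assumptions chosen to keep the example short.

```python
import math
import random


def clear_sky(hour):
    """Idealised half-sine daytime curve in W/m^2 (an assumption, not a physical model)."""
    return max(0.0, 1000.0 * math.sin(math.pi * (hour - 6) / 12))


def simulate(n_days, seed):
    """Return (feature, label) pairs for hourly synthetic data.

    The feature is |measured - clear-sky reference|; label 1 marks an injected outlier.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_days * 24):
        hour = i % 24
        measured = clear_sky(hour) * rng.uniform(0.9, 1.0)  # ordinary attenuation
        label = 0
        if i % 20 == 7:  # fixed positions, so every split contains some outliers
            measured += rng.uniform(500.0, 600.0)  # hypothetical sensor spike
            label = 1
        rows.append((abs(measured - clear_sky(hour)), label))
    return rows


def fit_stump(rows):
    """Depth-one decision tree: choose the threshold with the fewest training errors."""
    values = sorted({x for x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in rows)

    return min(candidates, key=errors)


def confusion(rows, threshold):
    """Count true/false positives and negatives for the rule 'feature > threshold'."""
    tp = sum(1 for x, y in rows if x > threshold and y == 1)
    fp = sum(1 for x, y in rows if x > threshold and y == 0)
    fn = sum(1 for x, y in rows if x <= threshold and y == 1)
    tn = sum(1 for x, y in rows if x <= threshold and y == 0)
    return tp, fp, fn, tn


train = simulate(n_days=20, seed=1)   # data used to fit the threshold
test = simulate(n_days=10, seed=2)    # held-out data with its own injected outliers
threshold = fit_stump(train)
tp, fp, fn, tn = confusion(test, threshold)
print(f"threshold={threshold:.1f} tp={tp} fp={fp} fn={fn} tn={tn}")
```

The two classes here are separable by construction, so the stump should recover every injected spike. Measured data will not be this clean, which is where comparing several classifiers, as the study does, becomes informative.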

Author Biographies

  • Chantelle Clohessy, Nelson Mandela University

    Department of Statistics, Nelson Mandela University, Gqeberha (Port Elizabeth), South Africa

    Senior lecturer, Head of Department

    https://statistics.mandela.ac.za/Staff

    https://www.linkedin.com/in/chantelle-clohessy-72456092/

  • Warren Brettenny, Nelson Mandela University

    Department of Statistics, Nelson Mandela University, Gqeberha (Port Elizabeth), South Africa

    Research Associate 

    https://www.linkedin.com/in/warren-brettenny/

    https://statistics.mandela.ac.za/Research/Research-Associates

  • Waldo Abrahams, former student at Nelson Mandela University

    Department of Statistics, Nelson Mandela University, Gqeberha (Port Elizabeth), South Africa

    Former MSc student

    Currently a Machine Learning Engineer at Spatialedge.

    https://www.linkedin.com/in/waldo-abrahams-95018a147/?originalSubdomain=za

Published

2025-05-10

How to Cite

Clohessy, C., Brettenny, W., & Abrahams, W. (2025). Outlier detection in ground-measured solar resource data using statistical classification models. Journal of Energy in Southern Africa, 36(1). https://doi.org/10.17159/2413-3051/2025/v36i1a20742