Evaluation of the Impact of Gap Filling Technology in Precipitation Series on the Estimation of Climate Trends, the Case of the Souss Massa Watershed

Accurate climatic data, especially precipitation measurements, play a critical role in studies of the water cycle, particularly in modeling flood and drought risks. Unfortunately, these datasets often suffer from temporary gaps that are randomly dispersed over time. This study assesses the effectiveness of three imputation methods (KNN, MICE, and missForest) in imputing missing values in climate series. The evaluation is conducted in two distinct rainfall regimes: the Moulouya basin and the Souss Massa basin. The performance analysis considers the percentage of missing data across the entire dataset. The imputed datasets are used to estimate annual precipitation series, which are then subjected to statistical tests to identify potential trends and detect change points. The analysis focuses on the precipitation series within the Souss Massa watershed, encompassing 27 rainfall stations. Results indicate that data imputation has a highly positive impact on the study of rainfall series trends and change point detection. The study found that studying trends without data imputation could lead to questionable conclusions. The most significant breakpoints detected in the analyzed rainfall series were in the years 1988, 1991, 1997, 2007, and 2010. The decrease in precipitation at stations showing a downward trend varies between -60 mm and -137 mm using the MICE method, and between -40 mm and -186 mm using the missForest method.


INTRODUCTION
Understanding the evolution of climatic parameters, such as temperature and precipitation, requires sufficiently long and complete datasets, which are often unavailable in most places around the world. We often encounter either short chronological series or gaps within these series. [Melki et al., 2020] examined the consequences of missing precipitation data on hydrological modeling, highlighting possible distortions in forecasts and hydrological analyses. Complete climatological data are crucial for obtaining accurate results [Evin et al., 2021]. Furthermore, the study by [Zhao et al., 2018] emphasized the importance of completing missing precipitation data for forecasting streamflow in ungauged basins, highlighting the impact on the reliability of these forecasts. In this work, we emphasize the importance of filling gaps in precipitation series before analyzing climate trends and detecting change points in the Souss Massa watershed.
Rapid advancements in computer science and scientific research have led to numerous imputation techniques, some of which require significant computational capabilities while others do not. This situation presents advantages but also raises two crucial questions: which imputation technique is the best to use, and what impact these techniques have on the study of trends and the detection of change points in climate series.
This study focuses on two main aspects. First, we evaluate the performance of three imputation techniques (KNN, MICE, and missForest) under two different rainfall regimes (Moulouya basin and Souss Massa basin). We selected these three techniques due to their flexibility and wide range of applications [Van Buuren et al., 2011]. Second, we examine the impact of each imputation technique on the study of climate trends using the Mann-Kendall test and on the detection of change points in the Souss Massa basin. As in any gap-filling operation, calculating the percentage of missing values is a crucial step. We computed this percentage per station for the Moulouya basin.
An analysis of the results obtained in the table below shows that only 16 of the 59 stations had no missing values, representing 27% of the total. To obtain good results when modeling a phenomenon that relies on rainfall data, it is essential to have a clear knowledge of the variability of precipitation in different localities, which requires a significant number of stations. Consequently, it is necessary to fill in the gaps in the other stations.

The study area
Rainfall patterns can vary significantly from one region to another, making it crucial to test different imputation techniques across various rainfall regimes to ensure their applicability in different geographical situations. With this aim, we

Methodology
The methodology adopted is based on five stages, which are summarized in Figure 3.

Data
The primary input data consists of daily precipitation measurements collected by rain gauge stations placed at various locations within the two watersheds.

Materials
R software was used to perform all the required calculations, given its extensive documentation and packages, as well as its efficiency in carrying out complex and repetitive operations.

Performance criteria
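To evaluate the performance of the imputation methods, we used the following three statistical indicators: MAE, RMSE, and CV RMSE. The method with the lowest values for these indicators is considered the best.
• MAE: mean absolute error:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_{obs,i} - X_{imp,i}\right|$$
• RMSE: root mean square error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{obs,i} - X_{imp,i}\right)^{2}}$$
• CV RMSE: coefficient of variation of the root mean square error:
$$\mathrm{CV\,RMSE} = \frac{\mathrm{RMSE}}{\bar{X}_{obs}}$$
where: $X_{obs,i}$ is the i-th observed value of the variable X, $X_{imp,i}$ is the corresponding imputed value, n is the number of values compared, and $\bar{X}_{obs}$ is the average of the values of the variable X observed on all the data studied.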
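As a minimal illustration of the three indicators, the sketch below computes them with the Python standard library. The study itself used R; the function names and the sample values are assumptions made only for this example.

```python
import math

def mae(observed, imputed):
    """Mean absolute error between observed and imputed values."""
    return sum(abs(o - p) for o, p in zip(observed, imputed)) / len(observed)

def rmse(observed, imputed):
    """Root mean square error between observed and imputed values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, imputed))
    return math.sqrt(squared / len(observed))

def cv_rmse(observed, imputed):
    """RMSE divided by the mean of the observed values."""
    mean_observed = sum(observed) / len(observed)
    return rmse(observed, imputed) / mean_observed

# Hypothetical daily rainfall values (mm), for illustration only.
observed = [10.0, 0.0, 5.0, 25.0]
imputed = [12.0, 1.0, 5.0, 20.0]
print(mae(observed, imputed), rmse(observed, imputed), cv_rmse(observed, imputed))
```

Because CV RMSE divides by the mean of the observed values, it is undefined when that mean is zero, for example on a sample made up only of dry days.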

Imputation techniques
The treatment of missing data has been widely studied in the statistical literature [Imbert et al., 2018, Niass et al., 2015, Rousseau et al., 2012].
• KNN method (k-nearest neighbors) - for a record i containing a missing value:
− Step 1 - calculation of the distances between record i and the n-1 other records.
− Step 2 - the average of the k nearest neighbors.
• MICE method - multiple imputation by chained equations (MICE) is based on a Markov chain Monte Carlo algorithm. In this imputation technique, many regression models are run such that the variable with missing data is modeled in terms of the other variables in the data set [Bousri et al., 2021].
The steps for applying the method are as follows:
• Step 1 - imputation by the mean
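The two KNN steps listed in this section can be sketched as follows. This is a simplified illustration in Python, not the R routine used in the study. It assumes that each record is one day, that each column is one station, that missing values are stored as `None`, and that distances are computed only on the columns observed in both records; `knn_impute` is a hypothetical name.

```python
import math

def knn_impute(rows, k=5):
    """Fill None entries with the mean of the k nearest records."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor must be another record with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Step 1: distance to the other record (root mean square
                # difference over the columns both records have observed).
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if candidates:
                # Step 2: average of the k nearest neighbors.
                candidates.sort(key=lambda item: item[0])
                nearest = candidates[:k]
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Hypothetical records: 4 days x 3 stations, one gap on day 2 at station 3.
rows = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [10.0, 20.0, 30.0],
    [1.2, 2.1, 3.4],
]
print(knn_impute(rows, k=2)[1][2])
```

A gap stays `None` when no other record offers both a value for that station and at least one shared observed column, so the output should be checked for remaining gaps.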

RESULT AND DISCUSSION
The results of this study are presented in three parts. We first present the evaluation of the performance of the missing data imputation techniques. We then analyze the trends at each station before and after filling the gaps using the various techniques. Finally, we discuss the results of change point detection, that is, the dates at which rainfall patterns changed.

Evaluation of imputation techniques' performance
The first step was to calculate the percentage of gaps within each dataset across all rainfall stations in both watersheds. The percentage of missing data across all stations in the Souss Massa watershed is 41.7%, while the available data represent 58.3% of the total. For the Moulouya watershed, the overall percentage of missing data is 32.4%, with 62.7% of data available. The percentages of missing data per station are depicted in Figures 4 and 5.

Trend study
Imputing the missing data produced four (04) databases: a database without imputation, a database imputed by the KNN technique, a database imputed by the MICE technique, and a database imputed by the missForest technique. We examined the trends at each rainfall station within these four databases. The trend study results are summarized in Table 4:
• the symbol 0 indicates no significant trend;
• the symbol + indicates an upward trend;
• the symbol - indicates a downward trend.
The analysis of the impact of filling gaps using three imputation techniques on precipitation series trends in the Souss Massa watershed revealed the following:
• trend analysis on raw data showed that 14 stations had no significant trend, 13 stations exhibited an upward trend, and no station showed a downward trend.
• imputation using the KNN method with K = 5 maintained the same distribution of stations as the raw data, except for one station displaying a downward trend.
• on the other hand, imputation using the MICE method identified 5 stations with a downward trend, 22 stations with no significant trend, and no stations with an upward trend.
• imputation using the missForest method detected a downward trend for 3 additional stations compared to the MICE method, totaling 8 stations with a downward trend. Additionally, one station showed an upward trend.
Table 5 summarizes the results of the trend analysis obtained.
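To make the trend classification concrete, the following sketch implements a basic two-sided Mann-Kendall test in Python and returns the same 0, +, and - symbols used in the trend tables. It is an illustration under stated assumptions rather than the study's R code: the non-seasonal form of the test is used, and the 5% significance level and the function name `mann_kendall` are assumptions.

```python
import math
from collections import Counter

def mann_kendall(series, alpha=0.05):
    """Return (S, Z, p_value, symbol) for a two-sided Mann-Kendall test."""
    n = len(series)
    # S counts increasing pairs minus decreasing pairs over all i < j.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S under the no-trend hypothesis, with the tie correction.
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(series).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18
    # Continuity-corrected standard normal score.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    if p_value >= alpha:
        symbol = "0"
    else:
        symbol = "+" if z > 0 else "-"
    return s, z, p_value, symbol

# Hypothetical annual totals (mm) with a steady decline, for illustration.
annual = [420.0, 410.0, 395.0, 380.0, 371.0, 350.0, 342.0, 330.0, 310.0, 300.0]
print(mann_kendall(annual)[3])
```

Running the same function on the raw series and on each imputed series for a station gives the per-station symbols that the trend tables summarize.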

Change points study
The Pettitt test results indicate that data imputation enhances the detection of change points in rainfall series. When calculations were performed on raw data, we identified ten (10) groups of stations, as shown in Table 6; these groups vary in terms of homogeneity. For instance, in group 3 (G3), some stations had change points at K = 14, others at K = 15, and others at K = 16. Note - a group represents a set of rainfall stations that share the same change point and exhibit the same trend, either upward or downward, after the change point. However, after imputing missing data using KNN and MICE, the number of groups decreased from ten (10) to only four (04) highly homogeneous groups regarding change points. Imputation using the missForest technique added another group compared to KNN and MICE, with a change point at K = 28, as seen in Table 6. Additionally, slight modifications were observed in station assignments to different groups and in the detected change point values.
In terms of the number of stations per group, group G3 contains the highest number of stations (15 stations), with a change point at K = 21, corresponding to the year 1997. Group G2 follows with 07 stations with K = 14 (and station S1 with K = 15), corresponding to the year 1991. Group G1 contains only 03 stations. The rainfall for station S13, seen in Figure 10a, clearly demonstrates a downward trend from the detected change point (year 1995). The decrease relative to the average is -137 mm when using the MICE technique and -186 mm when using the missForest technique. The change points in rainfall series, particularly in stations displaying significant decreases, are illustrated in Figure 10.
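The change point position and the size of the rainfall decrease discussed above can be illustrated with a short Python sketch of the Pettitt test. It is a sketch under stated assumptions, not the study's R code: the p-value uses the usual approximation 2·exp(-6K²/(n³ + n²)), and the names `pettitt` and `mean_shift` as well as the sample series are hypothetical. Here `position` plays the role of the index K quoted in the text, while `k_stat` is the test statistic.

```python
import math

def pettitt(series):
    """Return (position, k_stat, p_value) for the Pettitt change point test.

    position is the number of values before the change (0 if none is found).
    """
    n = len(series)
    position, k_stat = 0, 0
    for t in range(1, n):  # split into series[:t] and series[t:]
        u = 0
        for a in series[:t]:
            for b in series[t:]:
                u += (a > b) - (a < b)  # sign of (a - b)
        if abs(u) > k_stat:
            position, k_stat = t, abs(u)
    # Approximate two-sided p-value, capped at 1.
    p_value = min(1.0, 2.0 * math.exp(-6.0 * k_stat ** 2 / (n ** 3 + n ** 2)))
    return position, k_stat, p_value

def mean_shift(series, position):
    """Mean after the change point minus mean before it (needs 0 < position < n)."""
    before, after = series[:position], series[position:]
    return sum(after) / len(after) - sum(before) / len(before)

# Hypothetical annual series with a drop after the fifth value.
annual = [10, 11, 9, 10, 12, 3, 2, 4, 3, 1]
position, k_stat, p_value = pettitt(annual)
print(position, k_stat, round(p_value, 3), mean_shift(annual, position))
```

The nested loops are cubic in the series length, which is acceptable for annual series of a few decades but would need a rank-based formulation for long daily series.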

CONCLUSIONS
In general, the missForest method proves to be the most effective, followed by the MICE method, while the KNN method exhibits the poorest performance. These results hold true for both watersheds. The percentage of missing data does not change this ranking: the missForest imputation method consistently remains the most efficient regardless of the proportion of missing data.
Examining trends and change points without applying an imputation technique can lead to misleading conclusions about historical trends and change points. Regarding change point detection, the KNN and MICE methods yield similar results, identifying four groups with the same change point. On the other hand, the missForest method classifies the stations into one more group than KNN and MICE. A significant improvement is observed in station grouping based on change point compared to the raw data, i.e., without any imputation. Applying the MICE method reveals that five stations in the Souss Massa basin exhibit a decreasing trend, no station shows an increasing trend, and 22 stations show no clear trend. According to the missForest method, eight stations in the Souss Massa basin display a decreasing trend, one station shows an increasing trend, and 18 stations exhibit no clear trend. The most notable change point dates detected in the Souss Massa basin are 1988, 1991, 1997, 2007, and 2010. The decrease in precipitation for stations exhibiting a downward trend (S13, S14, and S15) ranges from -60 mm to -137 mm according to the MICE method, and from -40 mm to -186 mm according to the missForest method.

Fig. 1. Spatial distribution of rainfall stations in the Souss Massa watershed

• Step 1 - retrieval of daily precipitation data from the two watersheds, namely the Moulouya and Souss Massa watersheds.
• Step 2 - data formatting - during this stage, the collected data (in matrix format) were transformed into a date-value format to facilitate their use in the R software.
• Step 3 - visual verification aimed at eliminating irrelevant data such as text (rainfall traces, instrument malfunctions, observer leave, etc.) and symbols within the datasets.
• Step 4 - gap filling - in this stage, we filled the gaps detected in the series using three imputation techniques, subsequently evaluating the performance of each technique.
• Step 5 - based on the completed daily rainfall data, we generated annual rainfall series to assess the impact of each imputation technique on trend analysis and change point detection.
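Steps 3 and 5 above can be illustrated with a small Python sketch. It is only an approximation of the workflow, which the authors carried out in R with a visual check in Step 3: the sketch assumes date-value pairs as input, treats any non-numeric entry as missing, and sums by calendar year; whether the study used calendar or hydrological years is not stated. The names `clean_value` and `annual_totals` are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date

def clean_value(raw):
    """Step 3: return a float for numeric entries, None for text or symbols."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None

def annual_totals(daily):
    """Step 5: sum completed daily values (date, mm) per calendar year."""
    totals = defaultdict(float)
    for day, value in daily:
        if value is None:
            raise ValueError(f"gap on {day}: fill the series before aggregating")
        totals[day.year] += value
    return dict(totals)

# Hypothetical raw entries, including a text remark that becomes a gap.
raw = [(date(1990, 1, 1), "2.5"), (date(1990, 1, 2), "trace"), (date(1991, 3, 1), "7")]
cleaned = [(day, clean_value(text)) for day, text in raw]
print(cleaned[1][1])  # the text entry is now a gap to be imputed in Step 4

completed = [(day, 0.0 if value is None else value) for day, value in cleaned]
print(annual_totals(completed))
```

The zero used to complete the example series is a placeholder only; in the study this is where one of the three imputation techniques would supply the value.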

Fig. 2. Spatial distribution of rainfall stations in the Moulouya watershed

Fig. 10. Comparison of the impact of data imputation using MICE and missForest on change point detection and the amount of rainfall decrease

Several methods for calculating missing data have been developed and can be divided into two domains: time series and periodic data analysis, both univariate and multivariate. The reconstruction of missing data has been widely studied in the field of time series [Aissia 2014, Marlinda et al., 2010, Nejjari et al., 2020].
• Step 2 - set the mean-imputed values of a single variable back to missing
• Step 3 - regress this variable on the other variables
• Step 4 - predict the missing values of this variable
• Step 5 - repeat for the other variables
• Step 6 - repeat m times
• Step 7 - merge the m results

Statistical trend and breakout tests
• Trend detection test
− Mann-Kendall trend test - a nonparametric test used to determine whether a trend is identifiable in a time series that possibly includes a seasonal component. This test is an improvement of the test first studied by Mann (1945), then taken up by Kendall (1975), and finally optimized by Hirsch (1982, 1984) to take a seasonal component into account. The Pettitt (1979) and Mann-Kendall (1947) tests [Mann et al., 1947, Pettitt 1979] are so-called nonparametric statistical tests.
• Change point detection test - homogeneity tests bring together a large number of tests for which the null hypothesis is that a time series is homogeneous between two given times. In this work, we opted for Pettitt's test, since it is used by several researchers in studies similar to ours [Bousri et al., 2021].

It is evident that only 16 out of 59 stations have no missing values, accounting for 27% of the total. Accurate modeling of phenomena that rely on precipitation data requires a clear understanding of precipitation variability across different locations, which in turn requires a significant number of stations. Hence, filling in the gaps in the other stations becomes necessary. Two approaches were employed. The first involved measuring three parameters (MAE, RMSE, and CV RMSE) at the Souss Massa watershed stations. These results are presented in Table 1. The second approach measured the NRMSE parameter for varying percentages of missing values (10%, 20%, 30%, 40%, and 50%) in the Moulouya watershed stations. The results are provided in Table 2. This was done to assess the impact of the quantity of missing values on the performance of each technique. The results demonstrate that, regardless of the percentage of missing values, the missForest technique remains the most effective, as it minimizes all indicators in both watersheds. This outcome aligns with other studies conducted in different regions worldwide, such as the study by Bousri et al. [Bousri et al., 2021].
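The second approach, scoring a technique at increasing percentages of missing values, can be sketched as a masking experiment: hide a known fraction of observed values, impute them, and compare the imputed values with the hidden truth. The Python sketch below is illustrative only. The text does not give the NRMSE normalization, so dividing the RMSE by the standard deviation of the complete series is an assumption, and the mean imputer is a stand-in for the KNN, MICE, and missForest techniques rather than one of them.

```python
import math
import random

def nrmse(truth, estimate, scale):
    """RMSE between truth and estimate, divided by a scale (assumed: std dev)."""
    mse = sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)
    return math.sqrt(mse) / scale

def mean_impute(series):
    """Placeholder imputer: replace gaps with the mean of the known values."""
    known = [v for v in series if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in series]

def score_at_fraction(series, fraction, impute, seed=0):
    """Hide a fraction of a complete, non-constant series, impute, and score."""
    rng = random.Random(seed)
    hidden = rng.sample(range(len(series)), max(1, round(fraction * len(series))))
    masked = list(series)
    for idx in hidden:
        masked[idx] = None
    completed = impute(masked)
    mean_all = sum(series) / len(series)
    scale = math.sqrt(sum((v - mean_all) ** 2 for v in series) / len(series))
    return nrmse([series[i] for i in hidden], [completed[i] for i in hidden], scale)

# Hypothetical complete series, scored at the five percentages from the text.
series = [float(i % 7) for i in range(100)]
for fraction in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(fraction, round(score_at_fraction(series, fraction, mean_impute), 3))
```

Passing a different `impute` function with the same signature would allow the same loop to compare several techniques on identical masks, since the seed fixes which values are hidden.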

Table 1. Performance evaluation results of the techniques for selected stations in Souss Massa

Table 2. NRMSE calculation results for each percentage of missing data in the Moulouya basin

Table 3. Results of trend analysis using the Mann-Kendall test

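Table 4. Distribution of the number of stations by type of trend
Trend | Raw data | KNN | MICE | missForest
Downward trend | 0 | 1 | 5 | 8
Upward trend | 13 | 13 | 0 | 1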

Table 5. Results of change point detection tests using different techniques