Estimation of Water Disinfection by Using Data Mining

In this study, Artificial Neural Network (ANN) models and multiple linear regression were used to estimate the relation between the concentrations of total coliform, E. coli and Pseudomonas in wastewater and the input variables. Two techniques were used to achieve this objective. The first is a classical technique based on multiple linear regression models, while the second is data mining with two types of ANN: Multilayer Perceptron (MLP) and Radial Basis Function (RBF). The work was conducted using SPSS software. The estimated results were verified against the measured data. Data mining with the RBF model showed a good ability to recognize the relation between the input and output variables, and the statistical error analysis showed that its accuracy is acceptable. On the other hand, the results indicate that MLP and multiple linear regression have the least ability to estimate the concentrations of total coliform, E. coli and Pseudomonas in wastewater.


INTRODUCTION
Freshwater availability in countries is mainly based on precipitation as well as on water flowing from one region to another. However, the amount of available freshwater is less than 0.05%; the UN estimates that over 30 countries in the world lack freshwater resources (Barlow et al., 2017). Even though the average amount of freshwater available per person reaches over 100,000 m³ per year in a few humid and sparsely populated areas, it could be less than 50 m³ in some parts of the Middle East (World Water Assessment Programme, 2006). In fact, a recent study showed that almost every nation experiences some sort of vulnerability regarding its freshwater supplies, and that the most vulnerable is Jordan in the Middle East region (Padowski et al., 2017). Therefore, protecting water sources and improving the quality of drinking water is becoming more important every year, especially in remote and rural areas. Multiple water disinfection techniques have been implemented for this purpose, such as chlorination and water boiling; in addition, the solar water disinfection (SODIS) technique has been used, which is considered an easy, low-cost and environmentally sustainable solution for water purification at the household level (Burhan, 2015).
The solar water disinfection (SODIS) technique has gained a lot of attention in the past decade since the method is simple, cost-effective, and can be implemented in households (Stubbé et al., 2016). The technique relies on solar radiation, where ultraviolet (UV) rays produce a synergistic effect that inactivates and kills microbial pathogens in contaminated water (Boyle et al., 2008; Castro-Alférez et al., 2017). Three to five hours of sunlight exposure with solar radiation above 500 W/m² is enough to eliminate pathogens (Meierhofer and Wegelin, 2002), given little to no water turbidity and favorable ambient temperatures (Oates et al., 2003). SODIS has been investigated in previous studies with several modifications based on the conditions of the experiments and the nature of the infected water. Exposing infected water to direct sunlight contributed to a significant reduction in the growth of microbes and viruses in general, as shown in the work of Lawrie et al. Therefore, predicting the presence of microbial pathogens using data-driven techniques can enhance the water disinfection process by cutting costs and optimizing the previously stated variables. For instance, a previous study used three methods based on a data-mining technique to predict the levels of chlorine in water in order to optimize the cost of adding chlorine without sacrificing water quality (Zounemat-Kermani et al., 2018). The results showed that the multi-layer perceptron neural network (MLPNN) method yielded the greatest accuracy compared to the other methods. Other studies also investigated the concentration of chlorine in water using artificial neural networks (ANN) and genetic algorithms (Wu et al., 2014; Hernández Cervantes et al., 2015).
However, when it comes to SODIS, the sunlight exposure period plays a major part in the inactivation process of bacteria (Shekoohiyan et al., 2019). Therefore, mathematical models were developed to estimate the time period needed to kill all microscopic organisms in water. For instance, a previous study introduced a fuzzy rule-based logic model that estimates the sunlight exposure time required to remove all fecal coliforms under different turbidity levels (Haider et al., 2017). The results showed agreement between the predicted and measured values of total coliform. Another study proposed a simple equation that provides the estimated amount of lethal UV dose that is needed for solar water disinfection (Figueredo-Fernández et al., 2017).
There is very little research, however, regarding the estimation of residual microbes in water that is treated with SODIS. A previous study presented this methodology to predict the level of Coliforms and E. coli on tomato fruits and lettuce leaves after the sanitizing process, rather than in water (Keeratipibul et al., 2011). In this paper, multiple regression and Artificial Neural Network (ANN) methods were used to predict the concentrations of total coliform, E. coli and Pseudomonas in the wastewater that is treated with SODIS. The results will help us optimize this disinfection technique by identifying the factors and variables that positively or negatively impact the solar disinfection process.

EXPERIMENT SETUP
BOECO Germany laboratory glass bottles of 500 ml were used as wastewater containers, which in turn were exposed to solar radiation. These containers were installed side by side and their measurements were collected every hour. Thermometers were used to monitor temperatures.
Total coliform, E. coli and Pseudomonas were tested by means of the IDEXX setup; this technique is considered certified, rapid, easy, and accurate. In addition, a quality and quantity test was performed (Hamdan and Darabee, 2017).

Multiple linear regression
The regression model was obtained with SPSS. Time (t), water temperature (T), pH and turbidity (Tr) were used as input variables, and the concentrations of total coliform, E. coli and Pseudomonas in the wastewater were used as output variables. In total, 48 samples were used to obtain the linear equations. Table 1 presents a summary of the results obtained using this model. As shown, the value of R (coefficient of determination) indicates a strong relation between the predictors (time, water temperature, pH and turbidity) and the dependent variable for the prediction of total coliform and E. coli concentrations. On the other hand, the value of R indicates a weak relation between the same predictors and the dependent variable for the prediction of Pseudomonas concentration. Table 2 shows the relation between time, water temperature, pH and turbidity as predictors (inputs) and the concentrations of total coliform, E. coli and Pseudomonas as dependent variables.
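As a minimal sketch of the regression form described above, the following Python code fits y = b0 + b1·t + b2·T + b3·pH + b4·Tr by ordinary least squares using the normal equations. The study itself used SPSS; the function names (solve_linear_system, fit_mlr, predict), the sample values and the coefficients below are hypothetical and are not the study's measurements or results.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are left untouched.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate b0 + b1*x1 + ... for one sample."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))


# Hypothetical samples (t, T, pH, Tr); NOT the study's data.
samples = [
    (0, 25.0, 7.2, 12.0),
    (1, 30.0, 7.4, 9.0),
    (2, 36.0, 7.3, 11.0),
    (3, 41.0, 7.6, 6.0),
    (4, 44.0, 7.5, 8.0),
    (5, 45.0, 7.8, 5.0),
    (6, 43.0, 7.7, 7.0),
]
# Hypothetical noise-free target built from known coefficients,
# so that the fit can be checked against them.
true_coeffs = [2000.0, -150.0, -20.0, 50.0, 30.0]
concentration = [predict(true_coeffs, row) for row in samples]

coeffs = fit_mlr(samples, concentration)
fitted = [predict(coeffs, row) for row in samples]
```

With real measurements, one such fit would be made per output variable (total coliform, E. coli and Pseudomonas), each giving its own set of five coefficients.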

Artificial neural network model
In this work, two types of Artificial Neural Network (ANN) models were used to estimate the concentrations of total coliform, E. coli and Pseudomonas in the wastewater: Multilayer Perceptron (MLP) and Radial Basis Function (RBF). Time, water temperature, pH and turbidity were the input variables used in training the ANN models, and the concentrations of total coliform, E. coli and Pseudomonas in the wastewater were used as output variables. The obtained results were verified against the multiple regression technique.
The two types of ANN models were built and examined with the Statistical Package for the Social Sciences (SPSS) software. The experimental data from the previously obtained 48 samples were used as the input of the ANN models.

Multilayer Perceptron Model
The Multilayer Perceptron (MLP) procedure fits a particular kind of neural network called a multilayer perceptron, which is considered flexible. It uses a feedforward architecture and can have multiple hidden layers. It is one of the most commonly used neural network architectures. Table 3 shows the case processing summary, Table 4 shows the network information and Table 5 shows the model summary.
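The feedforward computation of such a network can be sketched as follows. This is only an illustration under stated assumptions: the layer sizes, the tanh hidden activation, the identity output activation and all weights are hypothetical, and the network actually built in SPSS is the one reported in Table 4.

```python
import math


def dense_layer(inputs, weights, biases, activation):
    """Apply one fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Pass inputs through each (weights, biases, activation) layer in order."""
    values = list(inputs)
    for weights, biases, activation in layers:
        values = dense_layer(values, weights, biases, activation)
    return values


def identity(v):
    return v


# Hypothetical network: 4 inputs (t, T, pH, Tr) -> 3 tanh hidden units
# -> 3 linear outputs (total coliform, E. coli, Pseudomonas).
hidden = (
    [[0.5, -0.2, 0.1, 0.3],
     [-0.4, 0.6, 0.0, -0.1],
     [0.2, 0.1, -0.3, 0.4]],
    [0.0, 0.0, 0.0],
    math.tanh,
)
output = (
    [[1.0, -0.5, 0.2],
     [0.3, 0.8, -0.6],
     [-0.7, 0.4, 0.9]],
    [0.1, -0.2, 0.05],
    identity,
)
layers = [hidden, output]  # more hidden layers can be added to this list

# Hypothetical standardized input values, not measured data.
x = [0.5, -1.0, 0.2, 0.0]
estimate = mlp_forward(x, layers)
```

Training, which adjusts the weights to reduce the error between estimates and measurements, is omitted here.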

Radial Basis Function Model
A Radial Basis Function (RBF) network is a feed-forward, supervised learning network with only one hidden layer, called the radial basis function layer. The RBF network can perform both prediction and classification, as the multilayer perceptron network can. It can be much faster than the MLP, but it is not as flexible in the types of models it can fit. Table 6 shows the case processing summary, Table 7 shows the network information and Table 8 shows the model summary. Figures 1 to 3 show the comparison between the obtained experimental data and the estimated concentrations. Table 9 summarizes the comparison of the performance of the models used, based on statistical analysis.
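The single hidden layer of an RBF network can be illustrated with a short sketch. It assumes a Gaussian basis function with a shared width and a linear output layer; the exact basis function and training procedure used by SPSS are not specified in the text, and the centers, width and weights below are hypothetical.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian basis function: exp(-||x - center||^2 / (2 * width^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_forward(x, centers, width, out_weights, out_biases):
    """Single RBF hidden layer followed by a linear output layer."""
    hidden = [gaussian_rbf(x, c, width) for c in centers]
    return [
        b + sum(w * h for w, h in zip(row, hidden))
        for row, b in zip(out_weights, out_biases)
    ]


# Hypothetical centers in a standardized (t, T, pH, Tr) space.
centers = [
    [-1.0, -1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, -1.0],
]
width = 1.0
# One row of hypothetical weights per output
# (total coliform, E. coli, Pseudomonas).
out_weights = [[2.0, 0.5], [1.5, 0.2], [1.0, 0.1]]
out_biases = [0.0, 0.0, 0.0]

# Evaluate the network at the first center: that hidden unit responds
# with 1, while the distant unit contributes almost nothing.
estimate = rbf_forward(centers[0], centers, width, out_weights, out_biases)
```

Because each hidden unit responds only to inputs near its center, the output weights can be found by a linear fit once the centers are fixed, which is one reason such networks can train faster than an MLP.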

CONCLUSIONS
In this study, neural network models and multiple linear regression were successfully used to estimate the relation between the concentrations of total coliform, E. coli and Pseudomonas in the wastewater and the input variables. Two techniques were used to achieve this objective. The first is a classical technique based on a multiple linear regression model, while the second is data mining with two types of ANN (Multilayer Perceptron and Radial Basis Function). The comparisons between the estimated data and the experimental data showed that data mining with the RBF model is able to recognize the relation between the input and output variables. Moreover, the statistical error analysis confirmed the accuracy of data mining with the RBF model.
On the other hand, the obtained results indicate that MLP and multiple linear regression have the least ability to estimate the concentrations of total coliform, E. coli and Pseudomonas in the wastewater.
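A statistical comparison of estimated against measured values can be sketched as follows. The text does not state which error statistics were computed; root mean square error (RMSE), mean absolute error (MAE) and R² are assumed here as common choices, and all values are hypothetical.

```python
import math


def rmse(measured, estimated):
    """Root mean square error between measured and estimated values."""
    n = len(measured)
    return math.sqrt(sum((m - e) ** 2 for m, e in zip(measured, estimated)) / n)


def mae(measured, estimated):
    """Mean absolute error between measured and estimated values."""
    return sum(abs(m - e) for m, e in zip(measured, estimated)) / len(measured)


def r_squared(measured, estimated):
    """1 - SS_res / SS_tot, relative to the mean of the measured values."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical concentrations, not the study's data.
measured = [100.0, 80.0, 60.0, 40.0]
model_a = [98.0, 82.0, 58.0, 42.0]  # estimates close to the measurements
model_b = [90.0, 90.0, 50.0, 50.0]  # estimates further from the measurements

for name, est in (("model_a", model_a), ("model_b", model_b)):
    print(name, rmse(measured, est), mae(measured, est), r_squared(measured, est))
```

Under these statistics, a model with lower RMSE and MAE and an R² closer to 1 tracks the measurements more closely.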