Hydro-Geochemical Attributes Based Classifiers for Groundwater Analysis

Freshwater supply is critical for domestic, agricultural and industrial purposes. A good supply of clean water is normally obtained from surface water and groundwater bodies. Nonetheless, many localities rely heavily on the latter as their main water resource. Therefore, proper mapping, exploitation and conservation of groundwater resources should become a primary focus in the years to come. In this study, groundwater samples collected from Bamanghati, Odisha were assigned to three classes (excellent, good and bad) based on the guidelines provided by the World Health Organization in 1984. These water quality assignments were completed via a combined approach of hydro-geochemical information and an artificial neural network to construct a classifier for groundwater analysis. Here, the probabilistic approach and the boosted instance selection method were used to remove inconsistencies in the dataset and to determine the classification accuracy, respectively. Finally, the transformed dataset was used for the kernel estimator-based Bayesian and decision tree (J48) classification approaches. The findings of the present study confirm that preprocessing using statistical analysis, combined with the hydro-geochemical attributes-based classification approach, is encouraging, while the decision tree approach is better than the Bayesian neural network classifier in terms of precision, recall, F-measure, and Kappa statistics.


INTRODUCTION
The task of classification (Mishra and Dehuri, 2014) and prediction in data mining is a great challenge that attracts many investigators and researchers to develop robust and accurate models from hidden data (Pati et al., 2020). However, the accuracy of the developed model is limited by the quality of the data used in the data mining process. Researchers working in the fields of statistics, neural networks, and machine learning have developed different types of classification methods (Dehuri and Cho, 2010b). Here, the concept of the Bayesian neural network, inspired by probabilistic theory, has been used for accurate classification (Dehuri and Cho, 2010a).
Classification and prediction of groundwater quality is crucial because groundwater is the major source of drinking water in many societies. It is also used for agricultural, industrial, and various other domestic purposes (Pati et al., 2021; Nandi et al., 2015, 2016, 2017). The chemical composition and concentration of groundwater are subjected to various damaging pollutants (Borowski and Ghazal, 2019; Kannan and Joseph, 2009; Mondal et al., 2011; Park et al., 2005). Groundwater quality assessment could therefore support decisions on managing the environment properly (Yan et al., 2010). In recent years, researchers have used principal component analysis (PCA), discriminant function analysis, and cluster analysis for groundwater quality assessment (Panda et al., 2006; Raghavachari, 2001). However, groundwater records are nonlinear, and the aforementioned linear and semi-automated techniques seem to be inappropriate for analyzing them. Hence, the probabilistic approach of the Bayesian neural network as a classifier has become a suitable approach for solving this problem (Lahiri et al., 2021).
The objective of this paper was to develop a classifier that handles unpredictable data with a compromised architecture and simple learning methods, building an ANN model that can evaluate, assess and classify the groundwater quality of the Bamanghati subdivision of Mayurbhanj district, Odisha, India. The article is organized into five sections. Section 1 gives an introduction to the proposed research. Section 2 discusses the hydrogeologic framework through the Geographical Information System (GIS) along with hydro-geochemical information about the groundwater samples. It also discusses the boosting instance approach of attribute selection along with descriptions of the Bayesian neural network architecture and learning process. Section 3 proposes a boosting instance selection based Bayesian classifier for the classification of groundwater quality. Section 4 presents the experimental work followed by the result analysis. Finally, Section 5 gives the concluding remarks, followed by references.

METHODS AND RELATED WORK
Geology and hydrogeology of the study area
The study area, Bamanghati, is one of the remote subdivisions of the Mayurbhanj district of Odisha. It is one of the four subdivisions of Mayurbhanj and is part of the Chhotanagpur plateau, falling in the Survey of India toposheets 73J/2, 73J/3, 73J/4, 73J/7, 73J/8, 73F/14, 73F/15, 73F/16 and 73K/1. The total area of the Bamanghati subdivision is 1917 km². It is bounded by the Singhbhum district to the north and west, the Panchpir subdivision to the south, and the Baripada subdivision to the east (Figure 1). The subdivision extends from 85°55′E to 86°30′E longitude and from 22°0′N to 22°35′N latitude. According to the 2011 census, the Bamanghati subdivision has a population of 495,005, with 242,020 males and 252,984 females. The study area shows conspicuous physiographic variations marked by hills with intervening narrow intermountain valleys.

Hydro-geochemical information and model development
According to World Health Organization (WHO-1984), the groundwater quality index used in this paper is classified into three classes: (1) excellent, (2) good, and (3) bad, as shown in Table 1.
The groundwater is classified with respect to its safety for drinking. For example, the pH of the water is expressed on a scale ranging from 0 to 14, where 7 represents neutrality. A pH value below 7 indicates an acidic nature, whereas a pH value above 7 indicates a basic nature of the water. Accordingly, a pH value within 7.5-8.5 represents "excellent", a pH value within 7.1-7.5 represents "good", and a value within the ranges 0.01-7.0 and 8.51-14.00 is assigned to "bad". According to WHO-1984, a value below 6.5 or above 8.5 is considered unsuitable (Osmanaj et al., 2021). The groundwater is considered safe for drinking when the EC value is below 1,500 μS/cm, but it is considered saline as per WHO-1984 when the EC value exceeds 1,500 μS/cm (Brown et al., 1970).
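The pH ranges described above can be sketched as a small lookup function. This is only an illustration, not the authors' implementation: the function name `classify_ph` is hypothetical, and two boundary choices are assumptions, namely that the shared boundary 7.5 counts as "excellent" and that readings between 7.0 and 7.1, which the stated ranges leave unassigned, count as "good".

```python
def classify_ph(ph: float) -> str:
    """Assign a pH reading to one of the three quality classes in the text."""
    if not 0.0 <= ph <= 14.0:
        raise ValueError("pH must lie on the 0-14 scale")
    # Assumption: the shared boundary 7.5 is counted as "excellent".
    if 7.5 <= ph <= 8.5:
        return "excellent"
    # Assumption: readings between 7.0 and 7.1 are counted as "good".
    if 7.0 < ph < 7.5:
        return "good"
    # Everything else (0-7.0 and above 8.5) falls in the "bad" class.
    return "bad"


if __name__ == "__main__":
    for reading in (6.8, 7.3, 8.0, 9.2):  # hypothetical readings
        print(reading, classify_ph(reading))
```

The same pattern would extend to the other attributes once their class boundaries from Table 1 are available.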
According to WHO-1984, groundwater with a TDS value below 1,000 mg/l is safe for drinking. Similarly, it is safe for use with up to 300 mg/l of HCO₃⁻ ions, below 200 mg/l of Cl⁻ ions, below 200 mg/l of SO₄²⁻ ions, less than 45 mg/l of NO₃⁻ ions, and 75 mg/l of Ca²⁺ and 30 mg/l of Mg²⁺ ions, respectively. In saline water, Na⁺, K⁺ and F⁻ ions also play a major role in deciding whether groundwater can be considered for drinking. As per WHO-1984, Na⁺, K⁺ and F⁻ ions should have values below 200 mg/l, 100 mg/l, and 1.1 mg/l, respectively. The hydro-geochemical attributes discussed above are taken as inputs to the model as shown in equation (1), where pH, EC, TDS, HCO₃⁻, Cl⁻, SO₄²⁻, NO₃⁻, Ca²⁺, Mg²⁺, Na⁺, K⁺ and F⁻ denote the hydrogen ion concentration, electrical conductivity, total dissolved solids, bicarbonate, chloride, sulfate, nitrate, calcium, magnesium, sodium, potassium, and fluoride of the water samples, respectively. The mean, standard deviation, and skew of the model attributes are shown in Figure 2.
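The limits quoted above lend themselves to a simple table-driven check. The sketch below is illustrative only: the dictionary keys, the function name `exceeded`, and the sample values are hypothetical, and treating a reading as acceptable when it does not exceed the quoted limit is an assumption about how the boundaries are handled.

```python
from typing import Dict, List

# Upper limits quoted in the text (mg/l unless noted); key names are illustrative.
LIMITS: Dict[str, float] = {
    "EC": 1500.0,   # electrical conductivity, in uS/cm
    "TDS": 1000.0,
    "HCO3": 300.0,
    "Cl": 200.0,
    "SO4": 200.0,
    "NO3": 45.0,
    "Ca": 75.0,
    "Mg": 30.0,
    "Na": 200.0,
    "K": 100.0,
    "F": 1.1,
}


def exceeded(sample: Dict[str, float]) -> List[str]:
    """Return the attributes whose measured value is above the quoted limit."""
    # Assumption: a value equal to the limit is still treated as acceptable.
    return sorted(
        name for name, value in sample.items()
        if name in LIMITS and value > LIMITS[name]
    )


if __name__ == "__main__":
    # Hypothetical sample, not taken from the study's dataset.
    sample = {"EC": 1620.0, "TDS": 840.0, "NO3": 51.0, "F": 0.6}
    print(exceeded(sample))
```

Keeping the limits in one table makes it easy to swap in a different guideline edition without touching the checking logic.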

Sources of Hydro-geochemical information
For the groundwater quality assessment, 89 water samples were systematically collected from different tube wells (TW), dug wells (DW), and bore wells (BW) of the Bamanghati subdivision during the pre- and post-monsoon seasons of 2017, in polyethylene bottles with a capacity of one liter. The water samples were collected from wells that are used regularly for domestic and irrigation purposes, covering the whole area.
The bottles were cleaned with distilled water, dried, and closed before their use for sample collection. Before a water sample was collected, each bottle was first rinsed with water from the respective well and then filled with the well water. The method of collection and analysis followed the work of previous researchers (Brown et al., 1970; Rahman et al., 2021).

Boosting Instance Selection Approach
In data mining research, the instance selection process (Song and Shepperd, 2007) plays a very important role. The boosting instance selection method is needed to manage data properly, that is, for efficient processing, efficient storage, and data reduction. It is also essential for avoiding needless precision, removing noise and outliers, smoothing data, etc. With these capabilities, new developments and applications can be carried forward. Here, the task was to find a meaningful subset of the set of training instances. The authors used the stochastic gradient boosting technique to avoid overfitting during boosting instance selection.

Bayesian Neural Network
As per the approach, it is assumed that all the features are equally valuable and independent of each other. In order to model a feature, the Gaussian curve takes the role of the probability of membership. Following the work of Moore and Zuev (2005), this approach is refined to explain the impact of the enhanced features on classification accuracy. Algorithmically, naïve Bayesian kernel estimation and the naïve Bayesian method are similar; the only difference lies in estimating the membership of an instance to a specific group …k) and the estimation by the use of a kernel function (2), where: h is called the kernel parameter and K(t) is any kernel, a kernel being defined as any non-negative normalized function.
The gradient of the error E_s is repeatedly evaluated by the back-propagation learning algorithm (Lauret et al., 2008). The tan sigmoid function is also used as the cost function. A suitable prior probability distribution P(w) of the weights is considered in the Bayesian approach. The posterior probability distribution for the weights, P(w|s), can be given as follows:
$P(w \mid s) = \frac{P(s \mid w)\,P(w)}{P(s)}$ (4)
where: P(s|w) is the data set likelihood function and P(s) is the normalizing factor. The distribution of outputs y for a given input vector x can be written in the form given below (Maiti and Tiwari, 2010):
$P(y \mid x, s) = \int P(y \mid x, w)\,P(w \mid s)\,dw$ (5)
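The kernel-estimator variant of the naïve Bayesian method can be illustrated with a minimal sketch. It assumes a Gaussian kernel and the standard kernel density estimate, (1 / (n h)) Σ K((x − x_i) / h), with one fixed kernel parameter h for every attribute; the function names, the value of h, and the toy data are all hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def gaussian_kernel(t):
    """Standard normal density: non-negative and integrates to one."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kde(x, points, h):
    """Kernel density estimate at x: (1 / (n h)) * sum of K((x - x_i) / h)."""
    return sum(gaussian_kernel((x - p) / h) for p in points) / (len(points) * h)


def fit(rows, labels):
    """Group the training rows by class label; the 'model' is the grouped data."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    return dict(by_class)


def predict(model, row, h=0.5):
    """Pick the class with the largest log prior plus summed log densities."""
    total = sum(len(members) for members in model.values())
    best_label, best_score = None, -math.inf
    for label, members in model.items():
        score = math.log(len(members) / total)  # class prior
        for j, x in enumerate(row):
            # Naive assumption: attributes are independent given the class.
            density = kde(x, [m[j] for m in members], h)
            score += math.log(density + 1e-300)  # guard against log(0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical two-attribute toy data, not the study's samples.
    rows = [(1.0, 1.1), (1.2, 0.9), (3.0, 3.2), (3.1, 2.9)]
    labels = ["good", "good", "bad", "bad"]
    model = fit(rows, labels)
    print(predict(model, (1.1, 1.0)), predict(model, (3.0, 3.0)))
```

In practice the attributes would need comparable scales, or a per-attribute kernel parameter, for a single h to be meaningful.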

Boosting instance selection based Bayesian neural network classification modeling
The considered work involves two steps, data preprocessing and classification, as shown in the proposed model (Figure 3). In the first step, specific approaches for data preprocessing were considered. Then, to select the appropriate instances, the preprocessed dataset was given as input to the boosted instance selection approach. The boosting instance selection approach was applied to the classification process to remove redundant and insignificant instances. In the second step, the hydro-geochemical attributes and the Bayesian neural network were integrated to construct the classifier (Figure 3). In the proposed model, the input layer contains the 12 features described in Section 2, followed by a hidden layer and an output layer. In the considered work, the output nodes relate to the three classes of membership. There can be more than one hidden layer, each comprising a number of nodes. Each connection carries a weight $w_{ij}$. In the hidden layer, the input $u_j$ to the activation function $g_j(u_j)$ is defined as:
$u_j = \sum_i w_{ij}\,x_i$ (6)
where: the sum is over all nodes i, $x_i$ is the output of node i, and the bias node is fixed at 1. The hyperbolic tangent function (Eq. 7) was used to capture the non-linearity found in the problem domain:
$g_j(u_j) = \tanh(u_j)$ (7)
The proposed work is enumerated broadly in the form of an algorithm.

EXPERIMENTAL STUDY
Description of the model setup and parameters
Here, the dataset was divided into two mutually exclusive parts: a training set and a testing set. The model is built using the training set and evaluated on the testing set. To obtain the accuracy of the proposed model, 500 iterations were considered. Bayesian neural networks and decision trees were used to compare the classification accuracy of the proposed work.
The WEKA 3 tool was used to gauge the performance of Naïve Bayes and decision tree classifiers (http://www.cs.waikato.ac.nz/ml/weka/).
According to the WHO-1984 standard limit, the training samples were produced taking hydro-geochemical information, as given in Table 1. The neural network model used in this groundwater quality assessment is 12-7-3, i.e., there are twelve input nodes, seven hidden nodes, and three output nodes (Figure 4).
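To make the 12-7-3 layout concrete, the following sketch runs a single forward pass through such a network with tanh hidden units. It is a plain deterministic pass under stated assumptions, not the Bayesian training procedure: the random weight initialization, the softmax applied at the output layer, and all function names are assumptions introduced here for illustration.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights for one layer; the extra column is the bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(x, hidden_w, output_w):
    """Forward pass: tanh hidden layer, then class scores turned into probabilities."""
    xb = list(x) + [1.0]  # bias input fixed at 1
    hidden = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in hidden_w]
    hb = hidden + [1.0]
    scores = [sum(w * v for w, v in zip(row, hb)) for row in output_w]
    # Assumption: a softmax converts the three output scores to probabilities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    hidden_w = init_layer(12, 7, rng)   # 12 inputs -> 7 hidden nodes
    output_w = init_layer(7, 3, rng)    # 7 hidden nodes -> 3 classes
    sample = [0.1] * 12                 # hypothetical, already-scaled attributes
    print(forward(sample, hidden_w, output_w))
```

The three outputs would correspond to the excellent, good and bad classes, with the largest value taken as the predicted class.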
The input nodes take the hydro-geochemical attributes. For the accuracy of the model, a three-layer Bayesian neural network architecture was used. Seven hidden nodes were found to be sufficient for the optimization process. The three nodes at the output layer denote the groundwater quality assessment (GQA) index. Here, the cross-validation technique was used to keep the model development process uniform. Throughout model development, two settings of the Bayesian approach were fixed: use of the kernel estimator and 10-fold cross-validation.
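The 10-fold cross-validation mentioned above can be sketched as an index splitter. This is a generic, unstratified version written for illustration; the function name `k_fold_indices` and the fixed seed are assumptions, and a tool such as WEKA may split the data differently (for example, with stratification).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 89 matches the number of collected samples reported in the text.
    for train, test in k_fold_indices(89, k=10):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every sample is used for evaluation once and for training in the remaining folds.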
In the defined error model of the data likelihood, the objective functions are defined as in Dash et al. (2015). In this study, the root mean square error (RMSE), Kappa statistics, precision, recall, F-measure, and the confusion matrix were employed as performance measures for the groundwater quality assessment classification using the Bayesian neural network. The equations for these measures are as follows:
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$ (8)
where: $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
$\kappa = \frac{p_o - p_e}{1 - p_e}$ (9)
where: $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.
$\mathrm{precision} = \frac{TP}{TP + FP}$ (10)
where: TP represents the True Positives and FP represents the False Positives. Similarly, recall is mathematically defined as follows:
$\mathrm{recall} = \frac{TP}{TP + FN}$ (11)
where: TP represents the True Positives and FN represents the False Negatives. Precision and recall play a tug of war in the classification process; thus, both play an important role in examining and evaluating the effectiveness of a model. Similarly, F-measure is mathematically defined as follows:
$F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ (12)
In the classification evaluation process, neither precision nor recall alone can determine the effectiveness of the model. There may be situations where precision and recall differ in importance, for which the F-measure has been taken into consideration, as it combines both into a single score.
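The performance measures named above are short enough to write out directly. The sketch below uses their standard definitions; the function names and the numbers in the usage lines are hypothetical and are not results from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def kappa(p_o, p_e):
    """Cohen's kappa from observed agreement p_o and chance agreement p_e."""
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    p = precision(tp=8, fp=2)   # hypothetical counts
    r = recall(tp=8, fn=8)
    print(p, r, f_measure(p, r), kappa(0.8, 0.5))
```

The usage lines show how a classifier with high precision but low recall ends up with an F-measure between the two, closer to the lower value.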
Further, the confusion matrix and its associated metrics were taken as an alternative tool for evaluating the considered method. The confusion matrix is a table of rows and columns that summarizes the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
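A three-class confusion matrix of the kind described above can be built in a few lines. In this sketch, rows are taken to be the true classes and columns the predicted classes, which is an assumed convention; the function names and the example labels are hypothetical.

```python
from collections import Counter

CLASSES = ("excellent", "good", "bad")


def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def per_class_counts(matrix, i):
    """True positives, false positives and false negatives for class i."""
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp
    fp = sum(row[i] for row in matrix) - tp
    return tp, fp, fn


if __name__ == "__main__":
    # Hypothetical labels, not results from the study.
    actual = ["excellent", "excellent", "good", "bad", "bad", "good"]
    predicted = ["excellent", "good", "good", "bad", "good", "good"]
    matrix = confusion_matrix(actual, predicted)
    print(matrix, accuracy(matrix), per_class_counts(matrix, 1))
```

The per-class counts are the inputs needed to compute precision and recall separately for each of the three quality classes.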

Result analysis
The groundwater quality governs the suitability of water for drinking, irrigation, and industrial uses. The chemical quality of the groundwater is affected to some extent by the chemical composition of the rock and the mass of the soil. The chemical composition of groundwater is modified by chemical reactions such as oxidation and reduction (Schoeller, 1960). In the study, Table 2 describes different statistical behaviors of the collected groundwater samples (N=89). It was found that most of the parameters show wide ranges and high standard deviations (Table 2). Hence, it is essential to study the performance of each parameter. The performance of the classifiers obtained through fifteen independent runs of the experiment for the groundwater quality assessment is illustrated in Tables 3 to 12. The Cohen's Kappa (κ) results and the model classification accuracy obtained from the two classifiers, Bayesian and Decision Tree, using the attributes taken for the classification are shown in Table 13.
Moreover, the confusion matrix was used as the performance measure for the Bayesian and Decision Tree (J48) classification algorithms. Here, the comparative analysis was performed using classification accuracy, classification error, sensitivity or recall, specificity, precision, and the Matthews Correlation Coefficient (MCC), and the results obtained are shown in Table 14 and Figure 5. From the above-mentioned analysis, it was found that, with some exceptions, the groundwater from both the shallow and deeper aquifers falls under the potable category with respect to the maximum permissible limit. It is also a general observation that the water from deeper aquifers has better quality than that of the shallow aquifers. Therefore, from the quality point of view, the water from deep bore wells is the most suitable for drinking purposes.

CONCLUSIONS
Groundwater is of greater importance than surface water. Thus, proper planning of groundwater use has become essential. The classification of groundwater in this work was carried out in two stages. In the first stage, the preprocessing task was carried out using the boosting instance approach. In the second stage, a hydro-geochemical attributes-based Bayesian classifier model was developed for the groundwater quality assessment using the boosting approach and the kernel estimator algorithm. Finally, the results obtained were compared with the Decision Tree (J48) classification algorithm, and it was concluded that the classification accuracy of the Decision Tree (J48) algorithm is better than that of the Bayesian neural network classifier in terms of precision, recall, F-measure, and Kappa statistics. Furthermore, this effort may be extended to a nature-inspired classification that treats precision and recall as the two important parameters. Future studies may apply other bio-inspired metaheuristic approaches, with larger datasets, to the classification of groundwater from different geographical locations.