Performance evaluation and comparison of data mining methods in estimating the SAR quality index (Case study: Ajichai River, East Azerbaijan)

Authors

Abstract

Clean water is one of the key factors in the development of any region. Because Iran lies in a hot, dry region with scarce water resources, protecting water and securing the quality required for its various uses is doubly important. Qualitative assessment of surface water is usually costly and time-consuming, so a method that can predict water quality with reasonable accuracy from a minimum number of hydrochemical parameters is preferred. One of the most important water quality parameters for agriculture is the sodium adsorption ratio (SAR), which needs to be estimated and evaluated accurately. This study examined the feasibility of estimating the SAR quality index in the Ajichai River in East Azerbaijan from various hydrochemical parameters, using the M5 rules model tree and the support vector machine. Four statistics were used to assess the accuracy of the two models: the correlation coefficient (R), the Nash-Sutcliffe coefficient (NSC), the root mean square error (RMSE), and the mean absolute error (MAE). The values obtained were R = 0.98, NSC = 0.97, RMSE = 6.22 (mg/l), and MAE = 6.06 (mg/l) for the support vector machine, and R = 0.98, NSC = 0.96, RMSE = 7.33 (mg/l), and MAE = 3.9 (mg/l) for the M5 model. The comparison showed that both methods performed well in estimating SAR, but that, within the range of the data used, the M5 rules model tree provides simple linear relationships that are more practical to apply.

Keywords


Article Title [English]

Performance evaluation and comparison of data-mining methods in estimating SAR quality index (Case study: Ajichay river in East Azerbaijan)

Authors [English]

  • Ali Rezazadeh Joudi
  • Mohammad Taghi Sattari
Abstract [English]

Clean water is one of the important factors in the development of any region. Because Iran is located in an arid and semi-arid area with scarce water resources, preserving the water required for various uses and maintaining its quality is doubly important. Evaluating surface water quality is normally costly and time-consuming, so a method that yields a relatively accurate prediction of water quality from a minimum number of hydrochemical parameters is preferred. One of the most significant water quality parameters for agricultural use is the sodium adsorption ratio (SAR), which should be estimated and evaluated accurately. This research used various hydrochemical parameters, the M5-Rules model tree, and a Support Vector Machine to study the feasibility of estimating the SAR quality index in the Ajichai River in East Azerbaijan Province. Four statistics were used to assess the accuracy of the M5 model and the Support Vector Machine: the correlation coefficient (R), the Nash-Sutcliffe coefficient (NSC), the Root Mean Square Error (RMSE), and the Mean Absolute Error (MAE).
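As an illustration of the quantities named above, the following Python sketch computes SAR from ion concentrations and the four accuracy statistics. It is not the authors' code: the function names are hypothetical, the SAR expression is the standard definition with concentrations assumed to be in meq/l, and the Nash-Sutcliffe coefficient is assumed to take its usual form of one minus the ratio of the squared errors to the squared deviations of the observations from their mean.

```python
import math


def sar(na, ca, mg):
    """Sodium adsorption ratio; na, ca, mg are assumed to be in meq/l."""
    return na / math.sqrt((ca + mg) / 2.0)


def mean(values):
    return sum(values) / len(values)


def correlation(obs, pred):
    """Pearson correlation coefficient R (undefined if either series is constant)."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def nash_sutcliffe(obs, pred):
    """Nash-Sutcliffe coefficient: 1 minus squared error over observed spread."""
    mo = mean(obs)
    sq_err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mo) ** 2 for o in obs)
    return 1.0 - sq_err / spread


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(obs, pred)]))


def mae(obs, pred):
    """Mean absolute error."""
    return mean([abs(o - p) for o, p in zip(obs, pred)])


if __name__ == "__main__":
    # Made-up numbers for illustration only, not data from the study.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(correlation(observed, predicted), nash_sutcliffe(observed, predicted))
    print(rmse(observed, predicted), mae(observed, predicted))
```

R measures only linear association, so a model with a constant bias can still score a high R; the Nash-Sutcliffe coefficient, RMSE, and MAE penalise such bias, which is one reason to report all four together.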
The study region was the Ajichai River on the northern slopes of Mount Sahand. Hydrochemical data from the Vanyar hydrometric station were used to evaluate and predict SAR in the river. The Vanyar station lies at longitude 46°24′ E and latitude 38°7′ N, at an altitude of 1460 meters. The effects of Total Dissolved Solids (TDS), Electrical Conductivity (EC), pH, chloride (Cl⁻), sulfate (SO₄²⁻), calcium (Ca²⁺), magnesium (Mg²⁺), and sodium (Na⁺) on the estimated SAR were examined. The M5-Rules model tree is a relatively new data mining method whose main idea is derived from regression trees. The difference is that its leaves hold regression functions instead of constant values or class labels. The major advantage of the M5-Rules model tree over regression trees is that it is much smaller; furthermore, its regression functions normally do not include many parameters. A decision tree usually consists of four parts: a root, branches, nodes, and leaves. Each node corresponds to a particular attribute, and the branches represent intervals of values of that attribute. Branching takes place on one of the predictor variables, and the branching intervals are selected so that the sum of squared deviations from the mean of the data in each node is minimized. The branching criterion reflects the error in the node concerned, and the model calculates the minimum expected error that results from testing each attribute at that node. The model error is generally assessed by measuring how accurately the model predicts target values for unseen cases. In this research, the WEKA software, developed at the University of Waikato in New Zealand, was used to build the M5 model. Modeling was performed with the M5-Rules option of this software, which produces simple linear rules.
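The branching rule described above, choosing the interval boundary that minimizes the sum of squared deviations from the mean within the resulting nodes, can be sketched for a single predictor as follows. This is a minimal illustration under stated assumptions, not WEKA's M5-Rules implementation: `sse` and `best_split` are hypothetical names, only one predictor is searched, and the linear models fitted in the leaves are left out.

```python
def sse(values):
    """Sum of squared deviations of the values from their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split on predictor x.

    Candidate thresholds are midpoints between consecutive distinct x values.
    Returns None when x has no two distinct values to split between.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


if __name__ == "__main__":
    # Made-up values: a predictor with two clearly separated regimes.
    predictor = [1, 2, 3, 10, 11, 12]
    target = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
    print(best_split(predictor, target))
```

A full tree builder would repeat this search over every predictor at every node and stop when a node holds too few cases or its spread is already small.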
Like the M5 model tree and artificial neural networks, Support Vector Machines are data mining algorithms. They fall into two groups: Support Vector Classification (SVC) and Support Vector Regression (SVR). Support Vector Machines are based on the concept of decision planes that define decision boundaries, i.e. a decision plane separates data with different labels from each other. In linear regression with a Support Vector Machine, given input values x_i and output values y_i, the goal is to find a function with the minimum deviation (ε) from the y_i values, where ε is the amount of deviation. In this research, the Statistica software was used to model the SAR values with Support Vector Regression.
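The role of the deviation tolerance in Support Vector Regression can be illustrated with the ε-insensitive loss that is commonly used in its standard formulation: errors no larger than the tolerance cost nothing, and larger errors are charged only for the excess. The sketch below assumes that formulation, which the source does not spell out; the function names and sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Loss for one sample: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def total_loss(observed, predicted, epsilon):
    """Sum of epsilon-insensitive losses over paired observations and predictions."""
    return sum(
        epsilon_insensitive_loss(o, p, epsilon)
        for o, p in zip(observed, predicted)
    )


if __name__ == "__main__":
    # Made-up values: the first prediction is inside the tube, the second is not.
    print(total_loss([5.0, 5.0], [5.3, 6.0], epsilon=0.5))
```

A larger tolerance ignores more of the small errors and tends to give a flatter function with fewer support vectors, so its value is normally tuned alongside the kernel parameters.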
When the SAR values were modeled with the M5-Rules model tree, the best result was obtained when 66 percent of the data was allocated to training and the rest to testing. To model the SAR values with the Support Vector Machine, various kernel functions were tested, and the RBF function gave the best performance. Of the 10 scenarios studied in this research, the best one was selected. The accuracy of the M5 model and the Support Vector Machine was assessed with the correlation coefficient (R), the Nash-Sutcliffe coefficient (NSC), the Root Mean Square Error (RMSE), and the Mean Absolute Error (MAE). The values obtained were R = 0.98, NSC = 0.97, RMSE = 6.22 (mg/l), and MAE = 6.06 (mg/l) for the Support Vector Machine, and R = 0.98, NSC = 0.96, RMSE = 7.33 (mg/l), and MAE = 3.9 (mg/l) for the M5 model. The comparison indicated that both methods studied in this work, Support Vector Regression and the M5 model, were highly capable of predicting the SAR values in the Ajichai River from the available data. However, the M5 model is recommended because the formulas it produces are simple and linear.
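Two of the modeling choices mentioned above, the 66 percent training share and the RBF kernel, can be sketched as follows. The abstract does not say whether the split was random or sequential, nor which kernel width was used, so this sketch assumes a split that keeps the original record order and treats `gamma` as a free parameter; all names are hypothetical.

```python
import math


def holdout_split(records, train_fraction=0.66):
    """Split records into (train, test), keeping the original order."""
    cut = int(round(len(records) * train_fraction))
    return records[:cut], records[cut:]


def rbf_kernel(a, b, gamma):
    """RBF kernel exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    samples = list(range(100))  # placeholder records, not study data
    train, test = holdout_split(samples)
    print(len(train), len(test))
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

The kernel equals 1 for identical inputs and decays toward 0 as they move apart, with `gamma` controlling how quickly that happens.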

Keywords [English]

  • M5 Model Tree Rules; Sodium Adsorption Ratio; Water Quality; Support Vector Machine