Efficiency of Random Sampling Based Data Size Reduction on Computing Time and Validity of Clustering in Data Mining
DOI:
https://doi.org/10.17700/jai.2016.7.1.266
Abstract
In data mining, cluster analysis is one of the most widely used techniques for discovering existing groups in datasets. However, traditional clustering algorithms have become insufficient for analyzing big data, which has emerged from the enormous increase in the amount of data collected in recent years. Scalability has therefore been one of the most intensively studied research topics in clustering big data. Parallel clustering algorithms and techniques based on the Map-Reduce framework running on multiple machines are becoming popular ways to scale big data analysis. However, applying sampling techniques to big datasets can still be an alternative or complementary approach that allows traditional algorithms to run on a single machine. The results obtained in this study showed that data size reduction by simple random sampling could be used successfully in cluster analysis of large datasets. The clustering validities obtained by running the K-means algorithm on the sample datasets were found to be as high as those obtained on the complete datasets. Additionally, the execution time required for cluster analysis on the sample datasets was significantly shorter than that required for the complete datasets.
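The workflow described in the abstract can be illustrated with a minimal sketch: draw a simple random sample from a dataset, run K-means on both the sample and the complete data, and compare clustering quality and run time. Everything below is an assumption made for illustration and is not taken from the study: the synthetic three-cluster data, the 10% sampling rate, the farthest-first initialization, the function names, and the use of mean squared distance to the nearest center as a simple stand-in for a validity index.

```python
import random
import time


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iters=10):
    """Plain Lloyd-style K-means on a list of (x, y) tuples.

    Initialization is farthest-first (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from all centers chosen so far.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iters):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def mean_sq_error(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


# Hypothetical synthetic data: three well-separated Gaussian clusters.
rng = random.Random(42)
true_centers = [(0.0, 0.0), (20.0, 0.0), (10.0, 18.0)]
full_data = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in true_centers
    for _ in range(2000)
]

# Simple random sampling without replacement (10% is an arbitrary choice).
sample_data = rng.sample(full_data, len(full_data) // 10)

start = time.perf_counter()
full_centers = kmeans(full_data, k=3)
full_time = time.perf_counter() - start

start = time.perf_counter()
sample_centers = kmeans(sample_data, k=3)
sample_time = time.perf_counter() - start

# Both sets of centers are scored on the complete data so the numbers
# are directly comparable.
full_error = mean_sq_error(full_data, full_centers)
sample_error = mean_sq_error(full_data, sample_centers)

print(f"complete data: error={full_error:.3f}, time={full_time:.3f}s")
print(f"10% sample:    error={sample_error:.3f}, time={sample_time:.3f}s")
```

Scoring the sample-derived centers against the complete dataset, rather than against the sample itself, is a design choice of this sketch: it checks whether the centers found on the reduced data still describe the full data well.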
Published
2016-04-29
How to Cite
Cebeci, Z., & Yildiz, F. (2016). Efficiency of Random Sampling Based Data Size Reduction on Computing Time and Validity of Clustering in Data Mining. Journal of Agricultural Informatics, 7(1). https://doi.org/10.17700/jai.2016.7.1.266
Section
Journal of Agricultural Informatics