Efficiency of Random Sampling Based Data Size Reduction on Computing Time and Validity of Clustering in Data Mining

Authors

  • Zeynel Cebeci, Çukurova University
  • Figen Yildiz

DOI:

https://doi.org/10.17700/jai.2016.7.1.266

Abstract

In data mining, cluster analysis is one of the most widely used techniques for discovering existing groups in datasets. However, traditional clustering algorithms have become insufficient for analyzing big data, which has emerged from the enormous increase in the amount of data collected in recent years. Scalability has therefore become one of the most intensively studied research topics in clustering big data. Parallel clustering algorithms and techniques based on the MapReduce framework running on multiple machines are gaining popularity as ways to scale big data analysis. Nevertheless, applying sampling techniques to big datasets can still serve as an alternative or complementary approach that allows traditional algorithms to run on a single machine. The results obtained in this study showed that data size reduction by simple random sampling can be used successfully in cluster analysis of large datasets. The clustering validity obtained by running the K-means algorithm on the sample datasets was as high as that obtained on the complete datasets. Additionally, the execution time required for cluster analysis on the sample datasets was significantly shorter than that required for the complete datasets.
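
The workflow summarized in the abstract (draw a simple random sample, run K-means on it, then compare validity and execution time against a run on the complete dataset) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic three-cluster data, the 10% sampling fraction, the farthest-point seeding, and the use of mean within-cluster squared distance as a stand-in validity measure.

```python
import random
import time


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Farthest-point seeding (an assumption; keeps the sketch deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=10):
    """Plain Lloyd-style K-means; returns the final centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids


def mean_sse(points, centroids):
    """Mean squared distance to the nearest centroid (lower means tighter clusters)."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


# Hypothetical "complete" dataset: three well-separated Gaussian clusters.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
full = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in centers
    for _ in range(2000)
]

# Simple random sample without replacement (10% is an assumed fraction).
sample = rng.sample(full, len(full) // 10)

start = time.perf_counter()
full_centroids = kmeans(full, 3)
t_full = time.perf_counter() - start

start = time.perf_counter()
sample_centroids = kmeans(sample, 3)
t_sample = time.perf_counter() - start

# Score both sets of centroids on the complete dataset for a fair comparison.
full_score = mean_sse(full, full_centroids)
sample_score = mean_sse(full, sample_centroids)

print(f"complete: n={len(full)}, time={t_full:.3f}s, mean SSE={full_score:.3f}")
print(f"sample:   n={len(sample)}, time={t_sample:.3f}s, mean SSE={sample_score:.3f}")
```

Scoring the sample-derived centroids against the complete dataset, rather than against the sample itself, is one way to check whether the reduced data still describes the full cluster structure. On real data the appropriate sampling fraction and validity index would need to be chosen to suit the dataset.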

Author Biography

Zeynel Cebeci, Çukurova University

Faculty of Agriculture, Div. of Biometry and Genetics

Published

2016-04-29

How to Cite

Cebeci, Z., & Yildiz, F. (2016). Efficiency of Random Sampling Based Data Size Reduction on Computing Time and Validity of Clustering in Data Mining. Journal of Agricultural Informatics, 7(1). https://doi.org/10.17700/jai.2016.7.1.266

Section

Journal of Agricultural Informatics