Efficiency of Random Sampling Based Data Size Reduction on Computing Time and Validity of Clustering in Data Mining

Zeynel Cebeci; Figen Yildiz

doi:10.17700/jai.2016.7.1.266

Authors

Zeynel Cebeci, Çukurova University
Figen Yildiz

DOI:

https://doi.org/10.17700/jai.2016.7.1.266

Abstract

In data mining, cluster analysis is one of the most widely used techniques for discovering existing groups in datasets. However, traditional clustering algorithms have become insufficient for analyzing big data, which has emerged from the enormous increase in the amount of data collected in recent years. Scalability has therefore been one of the most intensively studied research topics in clustering big data. Parallel clustering algorithms and techniques based on the Map-Reduce framework, running on multiple machines, are becoming popular approaches to scalability in big data analysis. However, applying sampling techniques to big datasets can still be an alternative or complementary way to run traditional algorithms on a single machine. The results obtained in this study show that data size reduction by simple random sampling can be used successfully in cluster analysis of large datasets. The clustering validity obtained by running the K-means algorithm on the sample datasets was as high as that obtained on the complete datasets. In addition, the execution time required for cluster analysis on the sample datasets was significantly shorter than that required for the complete datasets.
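
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or experimental setup. The synthetic three-cluster dataset, the 10% sampling fraction, the plain Lloyd's K-means in `kmeans`, and the mean squared distance to the nearest center used here as a stand-in validity score are all assumptions made for illustration; the abstract does not state which datasets, sample sizes, or validity indices were used.

```python
import random
import time


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means on a list of (x, y) tuples; returns the k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers


def mean_sq_dist(points, centers):
    """Average squared distance from each point to its nearest center (lower is better)."""
    return sum(
        min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
        for p in points
    ) / len(points)


def make_data(n_per_cluster, seed=1):
    """Hypothetical dataset: three Gaussian blobs with unit variance."""
    rng = random.Random(seed)
    true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [
        (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in true_centers
        for _ in range(n_per_cluster)
    ]


full = make_data(2000)                       # complete dataset, 6000 points
sample = random.Random(2).sample(full, 600)  # 10% simple random sample, no replacement

t0 = time.perf_counter()
centers_full = kmeans(full, k=3)
time_full = time.perf_counter() - t0

t0 = time.perf_counter()
centers_sample = kmeans(sample, k=3)
time_sample = time.perf_counter() - t0

# Score both sets of centers on the complete dataset so the comparison is like for like.
score_full = mean_sq_dist(full, centers_full)
score_sample = mean_sq_dist(full, centers_sample)
print(f"full:   {time_full:.3f}s, mean squared distance {score_full:.3f}")
print(f"sample: {time_sample:.3f}s, mean squared distance {score_sample:.3f}")
```

Scoring the sample-derived centers against the complete dataset, rather than against the sample itself, is a design choice of this sketch: it asks whether clusters found on the reduced data still describe the full data. Because K-means depends on its initial centers, a fuller experiment would repeat each run with several seeds before comparing times and scores.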

Author Biography

Zeynel Cebeci, Çukurova University

Faculty of Agriculture, Div. of Biometry and Genetics
