Data Analytics using Machine Learning
A study report on Machine Learning algorithms used for analyzing Large Datasets

Ashika Avula
College of Computing and Informatics
University of North Carolina
Charlotte, North Carolina

Abstract: A wide range of Machine Learning algorithms is now readily available for analyzing many kinds of datasets, while the volume of data used for analysis keeps growing exponentially. This study report surveys the Machine Learning algorithms used for Data Analysis, compares them in depth on several parameters and on their applicability, and tries to identify which algorithm is more useful in which context. The algorithms considered in this report are Decision Trees, Neural Networks, k-Nearest Neighbors (K-NN), Ensemble methods, Support Vector Machines (SVM) and Naïve Bayes.
1) Decision Trees: Decision tree learning is commonly used for approximating discrete-valued functions.
2) Neural Networks: Neural network learning is robust and can be applied to problems such as speech recognition and robotics.
3) K-Nearest Neighbors: An instance-based learning approach that classifies a new instance by its distances to stored training instances.
4) Ensemble Methods: These include bagging and boosting, which can be combined with other algorithms to improve their performance.
5) Support Vector Machines: This learning is based on the kernel used in the algorithm and is popular in text, pattern and image recognition.
6) Naïve Bayes: This type of learning uses a probabilistic approach for inference.

Keywords: Accuracy, performance, decision trees, neural networks, K-NN, Ensemble, Naïve Bayes.

I. INTRODUCTION
In the present scenario, data has evolved into a vital resource for any field.
Using existing data to analyze patterns plays a significant role in a company's growth and profit. There are many types of data, such as web data, social media data, business data, machine and sensor data, genomic data, astronomy data, GPS data and human-related data. The data keeps growing, and at present petabytes of data are accessible over the internet for analysis. Such analysis can be used to predict future trends or to derive meaningful insights that generate profit for enterprises and organizations. For example, consider Facebook's social media data. Content such as status updates, wall posts, photos, comments and videos is shared on Facebook at a rate of approximately 2.5 billion items per day. The Facebook data warehouse takes in this data at a rate of about 600 terabytes per day Ref: 1. Facebook uses this data to derive meaningful insights. Business data can be mined to manage supply chains, understand market trends and formulate pricing strategies. As more people connect through devices, more data is generated. We have entered an era called the “Internet of Things”, in which the world is digitalized. What matters today is not the amount of data we have but what can be done with it. Using data analytics and machine learning technologies to gather and link massive datasets will support new means of economic growth. Storage capacity has increased dramatically, but data access speeds have not kept up, so handling the data is becoming more complex.

Big Data Deluge: The big data deluge can be characterized in terms of volume, velocity and variety, defined as follows:
Volume: As the name suggests, a massive amount of data is added every second, and the total is growing exponentially.
Approximately 15 petabytes of new information are generated every day. As of 2016, the world's information base was reported to double every 12 hours Ref: 2.
Velocity: As the amount of data has increased, the way it is processed has also changed, shifting from batch processing to stream processing. Stream processing makes it possible to take decisions in a fraction of a second.
Variety: 90% of new data is unstructured and is generated largely by e-mail, images, video and audio Ref: 2. The variety of data includes structured, semi-structured and unstructured data.
1.) Structured Data: It includes data from traditional databases, distributed file systems and parallel databases.
2.) Semi-structured Data: This type of data includes web logs, system logs and XML data.
3.) Unstructured Data: It includes social media content such as images, video, audio and graphics, as well as text and word-processing files.

Fig 1: Characteristics of Big Data Ref: 3.

Handling a variety of data for analysis is made easier by combining different machine learning algorithms. The following sections discuss some of the important machine learning algorithms: Decision Trees, Neural Networks, K-NN, Ensemble methods, SVMs and Naïve Bayes.

II. DECISION TREES
Decision Trees build a model in the form of a tree structure, as shown in figure 2. They can be used to build classification and regression models. This is a supervised learning algorithm. It picks one attribute from the dataset as a decision node using the ID3 algorithm Ref: 5. The ID3 algorithm uses entropy and information gain to choose the decision node. The topmost decision node is called the root node.

A) Entropy:
For a two-class problem, if the dataset is equally divided then the entropy is one, and if the dataset is homogeneous then the entropy is zero. We use the following formula to calculate entropy for one attribute.
$$E(T) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{1}$$
Here $T$ is the attribute, $c$ is the number of classes and $p_i$ is the proportion of samples belonging to class $i$. The entropy of the target attribute is calculated first.
Then the entropy of the target is calculated with respect to each other attribute of the dataset. The entropy formula used for two attributes is as follows:
$$E(T, X) = \sum_{v \in X} P(v)\, E(T_v) \tag{2}$$
Here $T$ is the target attribute, $X$ is the attribute being evaluated, $P(v)$ is the proportion of samples for which $X$ takes the value $v$, and $E(T_v)$ is the entropy of the target within those samples.

B) Information Gain:
The information gain is the decrease in entropy obtained by splitting the dataset on an attribute, calculated using the two-attribute formula. The formula for calculating information gain is as follows:
$$\mathrm{Gain}(T, X) = E(T) - E(T, X) \tag{3}$$
where $E(T)$ is the entropy of the target attribute on its own.

Fig 2: Decision tree structure for sample data Ref: 5

The attribute whose split leaves the least entropy has the highest information gain, and it is taken as the decision node (the root node at the top of the tree). The procedure is repeated on each branch until the tree is completely built or all the attributes in the dataset have been used. The main drawback of decision trees is over-fitting: a fully grown tree fits the training data too closely and needs pruning to achieve high accuracy on new data. Normal decision trees are also not capable of analyzing the causal relationship between a predictor variable and the outcome variable. For this purpose, the causal decision tree algorithm was proposed. Causal decision trees inherit the advantages of normal decision trees and, in addition, provide a graphical representation of causal relationships. Constructing causal decision trees is reported to be fast Ref: 6.
Causal decision trees follow a divide and conquer approach, which makes them scalable and well suited to analyzing large datasets. Knowing the causal relationships between variables provides better insight into the data and helps in correct decision making. Hence, finding causal relationships in data is an important data analytics task. Randomised controlled trials (RCTs) can help in the causal inference of relationships, but they cannot be conducted frequently because they are not cost effective. Therefore, in practice this approach requires domain experts' knowledge in data collection or selection.
With the increasing accessibility of observational data, the causal decision tree method can be a valuable tool for the automated detection of causal relationships in data, supporting better decision making and action planning in various areas.

III. ARTIFICIAL NEURAL NETWORKS
Artificial Neural Networks are computational models inspired by biological neural networks. They are an important part of artificial intelligence and are used in building intelligent systems. An artificial neuron follows the same approach as a biological neuron. In the human brain, billions of biological neurons are interconnected. In the same way, an artificial neural network interconnects a large number of artificial neurons. Artificial neural networks follow parallel processing methods, like biological neurons Ref: 10. They approach problems differently from conventional, explicitly programmed algorithms, and can be used for tasks such as identifying and classifying image data and pattern recognition. A neural network contains one input layer, one output layer and one or more hidden layers, as shown in figure 3.

Fig 3: Neural Networks Structure Ref: 7

Artificial Neural Networks are powerful computational devices. Their high degree of parallelism makes them very efficient. They learn automatically from the training data, so they do not require extensive task-specific programming. Neural networks are fault tolerant and noise tolerant. Multilayer perceptrons are feedforward neural networks with one or more layers of nodes, called hidden layers, between the input and output nodes. They use the Back Propagation algorithm to calculate weights. While implementing the algorithm, we can specify the number of hidden layers, and the Back Propagation computation changes accordingly. Neural networks have been widely used as pattern classifiers in many applications. Nevertheless, determining the ideal number of hidden nodes required for a given decision task still seems to be an open question.
Recent pattern classification experiments generally treat the number of hidden nodes as a function of the number of input training patterns in order to achieve optimum results.
Some real-world applications of neural networks are: financial modeling such as stock market prediction, time series prediction such as climate and weather, pattern recognition such as speech recognition and sonar signal classification, DNA sequencing in bioinformatics, data compression and data mining.

IV. K-NEAREST NEIGHBORS
K-Nearest Neighbors is an instance-based learning algorithm. It does not construct a general model but only stores the instances of the training data. It can be used for both classification and regression problems. Classification is based on the nearest neighbors, as shown in figure 4. It is a lazy learning algorithm.

Fig 4: K-Nearest Neighbors illustration Ref: 12.

In K-NN, K is a parameter specified by the user. For example, consider the diagram in figure 4. The training data is labeled red or green. The blue point is unlabeled, since it is test data, and must be classified based on the value of K. Suppose K is 3. The three nearest neighbors of the blue point are then one green point and two red points. K-NN uses a distance measure such as the Euclidean distance or the Manhattan distance to find the nearest neighbors.
A) Euclidean Distance Formula:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$
B) Manhattan Distance Formula:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{2}$$
Here $x$ and $y$ are two instances with $n$ features each.
So, in the example shown in figure 4, the blue point is assigned to the red category, because two of its three nearest neighbors are red. In this way, K-NN is used for the classification of test data. K-NN is one of the simplest classification algorithms in machine learning Ref: 12. K-means clustering is sometimes confused with K-NN, but K-NN is a supervised classification algorithm whereas K-means is an unsupervised clustering algorithm. These algorithms are used for an initial step in data analysis called data classification.
Based on the data classification results, we can decide which algorithm to choose in the next step of data analysis.

V. ENSEMBLE METHODS
An ensemble model is a combination of two or more models. Combining models generally increases accuracy and robustness compared with using a single model. Some applications of ensemble methods are distributed computing and large-scale text data. The method uses a divide and conquer approach, which suits complex problems because a complex problem can be divided into a set of easier, solvable problems. Some ensemble methods are Bagging, Boosting, Adaptive Boosting and Random Forests. Consider Decision Trees, for example. Suppose a single decision tree model achieves X% accuracy on a sample dataset. If we instead train a Random Forest, which combines many decision trees, the accuracy will typically be greater than X%. This method is called Random Decision Trees. Its training is very efficient and does not require pruning, and it can be used for large datasets.
Random Forests are one of the popular methods of the ensemble learning approach and are very powerful for extracting insights from data. A Random Forest is a group of decision trees in which each tree is built from a random subset of the samples and features, while the forest as a whole covers all the features. When applying Random Forests to any data, we can specify the number of trees to build. Every tree in a Random Forest is trained on its own random sample of the given data, drawn with replacement; this is known as bootstrap aggregation (bagging). The samples that are not drawn for a given tree are called ‘out-of-bag’ samples and can be used to test that tree. In addition, each tree does feature bagging at each node split, considering only a random subset of the features, to reduce the effect of features that are highly correlated with the outcome and would otherwise make the trees too similar. Random Forests predict the label of a new sample by the majority vote of all the trees.
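The bootstrap sampling, out-of-bag samples and majority vote described above can be illustrated with a minimal sketch in Python. It is not a full Random Forest: each tree is reduced to a one-split stump on a single randomly chosen feature, and the function names, toy data, number of trees and seed are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy data: (features, label) pairs with two well-separated classes.
DATA = [
    ((1.0, 1.2), "a"), ((1.5, 0.8), "a"), ((2.0, 1.0), "a"),
    ((6.0, 5.5), "b"), ((6.5, 6.2), "b"), ((7.0, 5.8), "b"),
]


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement; also return the out-of-bag indices."""
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    in_bag = [data[i] for i in chosen]
    out_of_bag = sorted(set(range(n)) - set(chosen))
    return in_bag, out_of_bag


def train_stump(rows, rng):
    """Fit a one-split tree on one randomly chosen feature (crude feature bagging)."""
    feature = rng.randrange(len(rows[0][0]))
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label


def train_forest(data, n_trees, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, _out_of_bag = bootstrap_sample(data, rng)
        forest.append(train_stump(in_bag, rng))
    return forest


def predict(forest, x):
    """Return the label that receives the majority of the trees' votes."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


forest = train_forest(DATA, n_trees=25)
print(predict(forest, (0.5, 0.5)), predict(forest, (8.0, 8.0)))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, but the vote across many stumps is far more stable, which is the motivation for bagging.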
The main problem with ensemble learning is that the errors of the individual models are often not independent, which limits the benefit of combining them. Training time and classification time also grow with the number of models in the ensemble. Ensembles are better suited to large datasets than to small ones. In Random Forests, an individual tree might be sensitive to outliers, but the overall ensemble model is much less sensitive to them.

VI. SUPPORT VECTOR MACHINES
Support Vector Machines are, in their basic form, linear classifiers. They find the hyperplane that maximizes the margin separating positive and negative samples. A hyperplane is a decision surface that splits the input space; in 2-D space we can envision it as a line. In SVM, a hyperplane is chosen to best separate the points in the input space by their class, either class 0 or class 1. The distance between the line and the nearest data points is called the margin. The best or optimum line separating the two classes is the one with the largest margin. This is called the Maximal-Margin hyperplane. Only the nearest data points are relevant for defining the line and building the classifier. These points are known as the support vectors: they support or define the hyperplane. If the dataset is linearly separable, as shown in figure 5, it can be classified easily using a linear SVM. If the dataset is not linearly separable, as shown in figure 6, we need to map it to a higher-dimensional space in which a separating hyperplane with a margin can be found. When training, we can use a Gaussian kernel to map the dataset to a high-dimensional space. For linear classification, we can use a linear kernel.
A) Classifying based on linear kernel:

Fig 5: linear kernel classifier in SVM Ref: 15

B) Classifying based on non-linear kernel:

Fig 6: Gaussian kernel classifier in SVM Ref: 15

An advantage of Support Vector Machines is that they can classify even unbalanced data by using penalty parameters.
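The linear and Gaussian kernels mentioned above can be written as small similarity functions. The following Python sketch is illustrative only: the function names and the default value of `gamma` (the width parameter of the Gaussian kernel) are assumptions, and a real SVM would use these functions inside an optimizer that finds the support vectors.

```python
import math


def linear_kernel(x, z):
    """Dot product of x and z: similarity measured in the original input space."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2): 1 for identical points, decaying toward 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


point = (0.0, 0.0)
near = (1.0, 0.0)
far = (3.0, 4.0)
print(linear_kernel(near, far))
print(gaussian_kernel(point, near), gaussian_kernel(point, far))
```

The Gaussian kernel never constructs the high-dimensional mapping explicitly; it only returns the similarity that the mapped points would have, which is what makes the non-linear case tractable.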
One disadvantage of SVMs is that they do not directly provide probability estimates; these are instead calculated using 5-fold cross validation, which is very expensive. If the number of features is much greater than the number of samples, over-fitting can occur. SVMs also use more memory for computation.

VII. NAÏVE BAYES
Naïve Bayes is a supervised learning algorithm. It applies Bayes' theorem, as shown in figure 7, with the naïve assumption that the features are conditionally independent given the class. Naïve Bayes is a very simple but very powerful machine learning algorithm. It can handle large datasets very easily. Some well-known applications of Naïve Bayes are text classification, document classification and spam e-mail filtering.

Fig 7: Bayesian Probability terminology

Naïve Bayes is popular for text data analysis. It uses probabilistic classifiers, which are fast and often competitive with more sophisticated methods. Three common classifiers are Gaussian Naïve Bayes, Multinomial Naïve Bayes and Bernoulli Naïve Bayes. They differ mainly in the assumptions they make about the probability distribution $P(x_i \mid y)$.
A) Gaussian Naïve Bayes: This type of classifier can be used when the data is normally distributed, that is, when the attributes have continuous values. It takes only one parameter: priors. The formula used for Gaussian Naïve Bayes is as follows:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \tag{1}$$
where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $i$ within class $y$.
B) Multinomial Naïve Bayes: This type of classifier can be used for multinomially distributed data. It is one of the standard Naïve Bayes variants used for text and document classification. The formula for this is as follows:
$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n} \tag{2}$$
where $\hat{\theta}_{yi}$ is the estimate of $P(x_i \mid y)$, $N_{yi}$ is the number of times feature $i$ appears in samples of class $y$, $N_y$ is the total count of all features for class $y$, $n$ is the number of features and $\alpha$ is a smoothing parameter.
C) Bernoulli Naïve Bayes: This type of classifier can be used for data following multivariate Bernoulli distributions. Even though there may be multiple features, each feature is represented as a binary-valued variable. The formula used is as follows:
$$P(x_i \mid y) = P(i \mid y)\, x_i + (1 - P(i \mid y))(1 - x_i) \tag{3}$$

VIII. RESULTS
I experimented with the six supervised learning approaches discussed above on two different datasets.
I used the Digit Recognition dataset and the Amazon Review dataset. The Digit Recognition data is numerical, and the Amazon Review data has both text and numeric values. The algorithms were compared on various aspects of performance: accuracy, complexity, precision, recall, confusion matrix, training speed, prediction speed, performance when the number of observations is small, and noise handling. The following are the definitions of the measures I used for analyzing the performance of the various algorithms.
Accuracy: Accuracy is a measure of how often the classifier correctly predicts the label.
Precision: Precision is the ratio of the number of True Positives to the number of predicted positives.
Recall: Recall is the ratio of the number of True Positives to the number of actual positives in the dataset.
Training Speed: Training Speed is a measure of how quickly the algorithm trains.
Prediction Speed: Prediction Speed is a measure of how quickly the algorithm predicts the label.
Performance with small number of observations: This measures how the classifier performs when the number of training samples is small.
Handles High Noise: This indicates whether or not the classifier handles high noise in the dataset.
The results for both datasets are as follows:

Table 1: Digit Recognition Dataset
The Digit Recognition dataset contains recorded digits.

Measure                         Decision Tree    Neural Networks  K-NN             Ada Boosting     SVM              Naïve Bayes
Accuracy                        86.02%           95.15%           98.21%           96.38%           96.16%           89.14%
Precision                       0.84             0.94             0.97             0.95             0.97             0.89
Recall                          0.77             0.90             0.91             0.90             0.92             0.85
F Score                         0.81             0.92             0.95             0.93             0.95             0.87
Train Speed                     Fast             Slow             Fast             Slow             Medium           Fast
Prediction Speed                Fast             Fast             Slow             Fast             Fast             Fast
Performance (few observations)  Performs poorly  Performs poorly  Performs poorly  Performs poorly  Performs poorly  Performs well
Handle High Noise               No               Yes              No               No               Yes              Yes

Table 2: Amazon Review Dataset
The Amazon Review dataset contains reviews and ratings of various products.

Measure                         Decision Tree    Neural Networks  K-NN             Ada Boosting     SVM              Naïve Bayes
Accuracy                        83.58%           90.67%           60.19%           65.55%           77.07%           76.79%
Precision                       0.57             0.63             0.75             0.65             0.75             0.75
Recall                          0.53             0.59             0.71             0.61             0.70             0.70
F Score                         0.55             0.61             0.73             0.63             0.73             0.73
Train Speed                     Fast             Slow             Fast             Slow             Medium           Fast
Prediction Speed                Fast             Fast             Slow             Fast             Fast             Fast
Performance (few observations)  Performs poorly  Performs poorly  Performs poorly  Performs poorly  Performs poorly  Performs well
Handle High Noise               No               Yes              No               No               Yes              Yes

Considering all parameters on both datasets, Naïve Bayes with MultinomialNB performs well, followed by the Support Vector Machine with a linear kernel. In general, accuracy is high for Artificial Neural Networks on both datasets.

IX. CONCLUSION
Machine Learning is a field that is still growing, and many new algorithms are proposed every year. We cannot generalize about which algorithm will perform better; it all depends on the dataset. We can always try different algorithms on a given dataset and compare their accuracy and performance on that particular dataset. Based on the results, we can finally select the best-suited algorithm for that dataset. Generally, Naïve Bayes performs well for text classification even with a small amount of training data. Nowadays, Random Forests are also powerful classifiers that are in high demand in industry.

X. REFERENCES
1 https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
2 https://www.ibm.com
3 https://www.wordpress.com
4 Dan Ji, Jianlin Qiu, Jianping Chen, Li Chen, Peng He, “An improved decision tree algorithm and its application in maize seed breeding”, Natural Computation (ICNC) 2010 Sixth International Conference on, vol. 1, pp. 117-121, 2010.
5 Chen Jin, Luo De-lin, Mu Fen-xiang, “An Improved ID3 Decision Tree Algorithm”, Computer Science & Education, 4th International Conference on, 2009.
ICCSE ’09.
6 Jiuyong Li, Saisai Ma, Thuc Le, Lin Liu, and Jixue Liu, “Causal Decision Trees”, IEEE Transactions on Knowledge and Data Engineering, Volume: 29, Issue: 2, Feb. 1, 2017.
7 Dejan Tanikić and Vladimir Despotovic, “Artificial Intelligence Techniques For Modelling Of Temperature In The Metal Cutting Process”.
8 Homayoun Valafar, Faramarz Valafar, Okan Ersoy, “Parallel, Self-Organizing, Consensus Neural Networks”, Neural Networks, International Joint Conference on, 1999. IJCNN ’99.
9 G. Mirchandani, W. Cao, “On hidden nodes for neural nets”, IEEE Transactions on Circuits and Systems, May 1989.
10 Manish Mishra, Monika Srivastava, “A View of Artificial Neural Network”, Advances in Engineering and Technology Research (ICAETR), 2014 International Conference on, 1-2 Aug. 2014.
11 Shiliang Sun, Rongqing Huang, “An Adaptive k-Nearest Neighbor Algorithm”, Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on, 10-12 Aug. 2010.
12 Jorma Laaksonen, Erkki Oja, “Classification with Learning k-Nearest Neighbors”, Neural Networks, 1996, IEEE International Conference on, 3-6 June 1996.
13 Simon Bernard, Laurent Heutte, Sebastien Adam, “On the Selection of Decision Trees in Random Forests”, Neural Networks, 2009. International Joint Conference on, IJCNN 2009.
14 Arnu Pretorius, Surette Bierman, Sarel J. Steel, “A Meta-Analysis of Research in Random Forests for Classification”, Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), 2016.
15 https://lasseschultebraucks.com/support-vector-machines/
16 Christopher J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery 2, 121-167, 1998.
17 Yuguang Huang, Lei Li, “Naïve Bayes Classification Algorithm based on small dataset”, Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on, 15-17 Sept. 2011.