
Most Used Algorithms by Data Scientists

We will discuss the machine learning algorithms that are most important for data scientists and classify them as supervised or unsupervised. I will provide an outline of each algorithm that you can apply to improve your data science work. Here is the list of top data science algorithms that you should know to become a data scientist. Let’s start with the first one:


1. Linear Regression

Linear Regression is a method of measuring the relationship between two continuous variables. The two variables are –
  • Independent Variable – “x”
  • Dependent Variable – “y”
In simple linear regression, there is only one independent variable, and it acts as the predictor. The relationship between x and y can be described as:
y = mx + c
Where m is the slope and c is the intercept.
Based on the difference between the predicted output and the actual output, we estimate m and c, typically by choosing the values that minimize the sum of squared errors (ordinary least squares).
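As a rough illustration of how m and c in y = mx + c can be estimated, here is a minimal sketch using ordinary least squares. The post does not name a fitting method, so least squares is an assumption here, and the function name fit_line and the sample data are made up for the example.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data that lies exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(m, c)  # 2.0 1.0
```

With noisy data the fitted line will not pass through every point; it is the line with the smallest total squared error.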

2. Logistic Regression

Logistic Regression is used for binary classification of data points. Its output assigns each input to one of two classes (1 or 0). For example, predicting whether or not it will rain based on the weather conditions is a logistic regression problem.
The two important parts of Logistic Regression are the hypothesis and the sigmoid curve. The hypothesis gives the likelihood of an event. Its output is passed through the logistic function, which forms an S-shaped curve called the ‘sigmoid’ and maps any real value to a probability between 0 and 1. Based on this probability, we determine the class, typically by comparing it with a threshold such as 0.5.
We generate the sigmoid curve with the help of the logistic function:
1 / (1 + e^-x)
Here, e is the base of the natural logarithm, and the resulting S-shaped curve takes values between 0 and 1. The equation for logistic regression is written as:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Here, b0 is the intercept and b1 is the coefficient of the input x. These coefficients are estimated from the data through “maximum likelihood estimation”.
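The two formulas above can be sketched in a few lines. This is only an illustration: the coefficient values and the 0.5 threshold are assumptions chosen for the example, not values estimated from data.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(x, b0, b1):
    """Probability of class 1; equal to e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    return sigmoid(b0 + b1 * x)


def predict_class(x, b0, b1, threshold=0.5):
    """Assign class 1 when the probability reaches the (assumed) threshold."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0


# Hypothetical coefficients, chosen by hand for illustration
b0, b1 = -4.0, 2.0
print(predict_class(3.0, b0, b1))  # 1
print(predict_class(1.0, b0, b1))  # 0
```

In practice b0 and b1 would come from maximum likelihood estimation rather than being set by hand.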

3. K-Means Clustering

K-means clustering is an iterative algorithm that partitions a dataset of n values into k subgroups (clusters). Each of the n values belongs to the cluster with the nearest mean. In other words, given a group of objects, we partition that group into several subgroups based on similarity, measured by the distance from each data point to the centroid (mean) of its subgroup. K-means clustering is one of the most popular unsupervised learning algorithms. It is easy to understand and implement.
The objective of K-means clustering is to minimize the sum of squared Euclidean distances between each point and the centroid of its cluster. This quantity is known as the intra-cluster variance and is expressed by the following squared error function:
J = Σ(j=1..k) Σ(i=1..n) ||x_i(j) - c_j||^2
Where J is the objective function, k is the number of clusters and n is the number of cases. c_j is the centroid of cluster j, and x_i(j) is the i-th data point assigned to cluster j, whose Euclidean distance to that centroid we measure. Let us have a look at the algorithm for K-means clustering:
  • First, we randomly select k points as the initial means (centroids).
  • We use the Euclidean distance to assign each data point to the cluster whose centre is closest to it.
  • Then we recompute each centroid as the mean of all the points in its cluster.
  • We repeat steps 2 and 3 until the cluster assignments no longer change.
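The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the seed parameter, the max_iter cap and the sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k of the data points at random as the initial centroids
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Hypothetical data: two well-separated groups of points
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))  # [(0.0, 0.5), (10.0, 10.5)]
```

Because the starting centroids are random, different seeds can give different results on data that is less cleanly separated, which is why K-means is often run several times.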

4. Principal Component Analysis

One of the most important aspects of data science is dimensionality. A dataset can have many dimensions (features), and their number is denoted by n. For example, suppose that as a data scientist working in a financial company, you have to deal with customer data that includes credit score, personal details, salary and hundreds of other parameters. To identify the features that contribute most to our model, we use dimensionality reduction. PCA is one such reduction algorithm.
With the help of PCA, we can reduce the number of dimensions while retaining most of the variance in the data. PCA produces as many principal components (PCs) as there are dimensions, and each one is perpendicular (orthogonal) to the others. The dot product of any two distinct PCs is 0.
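To make the orthogonality of the principal components concrete, here is a small sketch for two-dimensional data only. It uses the closed-form angle of the main axis of a 2x2 covariance matrix, which is an assumption that keeps the example free of external libraries; real datasets with many dimensions would use a linear algebra library instead. The function names and sample points are hypothetical.

```python
import math


def pca_2d(points):
    """Return the mean and the two principal components of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Angle of the direction of largest variance (2x2 symmetric case)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))  # perpendicular to pc1
    return (mean_x, mean_y), pc1, pc2


def project(points, mean, pc):
    """Reduce 2-D points to 1-D coordinates along one component."""
    return [(p[0] - mean[0]) * pc[0] + (p[1] - mean[1]) * pc[1] for p in points]


# Hypothetical data lying along the line y = x
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, pc1, pc2 = pca_2d(points)
dot = pc1[0] * pc2[0] + pc1[1] * pc2[1]
print(dot)  # 0.0
reduced = project(points, mean, pc1)
```

Keeping only the coordinates along pc1 reduces the data from two dimensions to one while preserving the direction in which it varies most.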

5. Support Vector Machines

Support Vector Machines are powerful classifiers for binary data. They are also used in facial recognition and genetic classification. SVMs have a built-in regularization model that lets them automatically minimize the classification error. This, in turn, helps to increase the geometric margin, which is an essential part of an SVM classifier.
Support Vector Machines map the input vectors to an n-dimensional space, where they build a maximum-separation hyperplane. SVMs are based on structural risk minimization. There are also two other hyperplanes, one on either side of the initially constructed hyperplane. We measure the distance from the central hyperplane to these two hyperplanes; this distance is the margin that the SVM maximizes.
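The distance between the central hyperplane and the two outer hyperplanes can be computed directly once the hyperplane is known. The sketch below assumes the common convention that the central hyperplane is w·x + b = 0 and the outer ones are w·x + b = +1 and w·x + b = -1; the weights used are made up for illustration and are not the result of training.

```python
import math


def decision_value(w, b, x):
    """Signed score w·x + b of a point relative to the central hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Label a point +1 or -1 by the side of the hyperplane it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_distance(w):
    """Distance from w·x + b = 0 to either outer hyperplane w·x + b = ±1."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane parameters
w, b = (3.0, 4.0), -5.0
print(margin_distance(w))          # 0.2
print(classify(w, b, (3.0, 4.0)))  # 1
print(classify(w, b, (0.0, 0.0)))  # -1
```

Training an SVM amounts to finding the w and b that make this distance as large as possible while still classifying the training points correctly (or nearly so).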

6. Artificial Neural Networks

Neural Networks are modeled after the neurons in the human brain. A network comprises many layers of neurons that are structured to transmit information from the input layer to the output layer. Between the input and the output layer, there are hidden layers, which can be many or just one. The simplest neural network, a single layer of neurons with no hidden layer, is known as a Perceptron; networks with one or more hidden layers are called multilayer perceptrons.

In a simple neural network, an input layer takes the input in the form of a vector. This input is then passed to the hidden layer, which applies various mathematical functions to it. For example, given images of cats and dogs, the hidden layers perform various mathematical operations to find the class with the maximum probability for the input image. This is an example of binary classification, where each image is assigned to one of two classes, dog or cat.
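The flow from input layer to hidden layer to output can be sketched as a forward pass. This is a minimal illustration that assumes sigmoid activations and one hidden layer with two neurons; the weights are arbitrary placeholders, not trained values, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One layer: weighted sum of the inputs plus bias, then sigmoid, per neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Pass an input vector through the hidden layer and a single output neuron."""
    hidden = dense(x, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)[0]


# Placeholder weights: 3 inputs -> 2 hidden neurons -> 1 output
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

p = forward([1.0, 2.0, 3.0], hidden_w, hidden_b, out_w, out_b)
label = "dog" if p >= 0.5 else "cat"  # the class names are only an illustration
```

The output p can be read as the probability of one of the two classes; training adjusts the weights so that this probability matches the labels.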

7. Decision Trees

With the help of decision trees, you can perform both regression (predicting a numeric value) and classification. We use decision trees to make decisions from a given set of inputs. Consider the following example:
Suppose you want to go to the market to buy a product. First, you assess whether you really need the product, that is, you will go to the market only if you do not already have it. After that, you check whether it is raining. Only if the sky is clear will you go to the market; otherwise, you will not go. This sequence of choices can be drawn as a decision tree.

Using the same principle, we build a hierarchical tree to reach a result through a series of decisions. There are two steps to building a tree: induction and pruning. Induction is the process in which we build the tree, whereas in pruning we simplify the tree by removing branches that add complexity without improving the result.
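The market example can be written out as nested conditions, which is essentially what a decision tree encodes. The function and parameter names below are hypothetical, and this hand-written tree only mirrors the example; a learned tree would derive its conditions from data.

```python
def go_to_market(have_product, is_raining):
    """Follow the market example: check need first, then the weather."""
    if have_product:
        # Root node: no need to buy what you already have
        return False
    if is_raining:
        # Second node: stay home when it rains
        return False
    # Leaf: product needed and sky is clear
    return True


print(go_to_market(have_product=False, is_raining=False))  # True
print(go_to_market(have_product=False, is_raining=True))   # False
print(go_to_market(have_product=True, is_raining=False))   # False
```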

8. Recurrent Neural Networks

Recurrent Neural Networks are used for learning sequential information. Unlike feed-forward networks, RNNs contain cycles: the result of one time-step is fed back into the network at the next time-step. To process such data, the network needs a memory (its hidden state) that stores information from the previous step. The input is represented as a series of time-steps, which makes RNNs well suited to problems such as text processing.
In the context of text processing, RNNs are useful for predicting the next words in a sequence. RNNs that are stacked on top of one another are referred to as Deep Recurrent Neural Networks. RNNs are used for generating text, composing music and time-series forecasting. Chatbots, recommendation systems and speech recognition systems use varying architectures of Recurrent Neural Networks.
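How the memory of the previous step enters the computation can be shown with a single recurrent unit. The sketch below assumes scalar inputs, a scalar hidden state and a tanh activation, with hand-picked weights; real RNNs use vectors and learned weight matrices, and the names here are hypothetical.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(xs, w_x, w_h, b, h0=0.0):
    """Process a sequence one time-step at a time, carrying the state forward."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# The second input is 0, yet the second state is non-zero:
# it still carries information from the first time-step.
states = run_rnn([1.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
```

Setting w_h to 0 removes the recurrent connection, and each state then depends only on the current input.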

9. Apriori

In 1994, R. Agrawal and R. Srikant developed the Apriori algorithm. It is used for finding frequently occurring itemsets in order to mine Boolean association rules. The algorithm is called Apriori because it makes use of ‘prior’ knowledge of the properties of frequent itemsets. It applies an iterative, level-wise search in which the frequent k-itemsets are used to find the frequent (k+1)-itemsets.
Apriori relies on the following properties:
  • The subsets of a frequent itemset must also be frequent.
  • Supersets of an infrequent itemset must also be infrequent.
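The level-wise search and the two properties above can be sketched as follows. This is a simplified illustration rather than the exact procedure from the original paper: candidates are formed by joining frequent k-itemsets, and any candidate with an infrequent subset is discarded before its support is counted. The function name, the min_support threshold and the sample transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        frequent = set(current)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-item subset of a candidate must be frequent
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k))
            and support(c) >= min_support
        ]
        k += 1
    return result


# Hypothetical shopping baskets
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
found = frequent_itemsets(baskets, min_support=0.5)
print(len(found))  # 5
```

Here {milk, butter} appears in only one of four baskets, so it is infrequent, and the three-item set containing it is pruned without being counted.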
The three significant components of an Apriori Algorithm are –
  • Support
  • Confidence
  • Lift
Support is a measure of the baseline popularity (frequency) of an item X. It is calculated by dividing the number of transactions in which X appears by the total number of transactions.
Support(X) = (transactions containing X) / (total transactions)
The confidence of a rule X → Y is the number of transactions containing both X and Y divided by the number of transactions containing X.
Confidence(X → Y) = (transactions containing both X and Y) / (transactions containing X)
Lift measures how much more likely Y is to be purchased when X has already been purchased, taking into account the popularity of the item Y. A lift greater than 1 means that X and Y are bought together more often than their individual popularity would suggest.
Lift(X → Y) = Confidence(X → Y) / Support(Y)
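The three measures can be computed directly from a list of transactions. The sketch below is a minimal illustration; the function names and the sample baskets are hypothetical, and lift is taken as confidence divided by the support of Y, which is the usual definition.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of X -> Y relative to the baseline popularity of Y."""
    return confidence(transactions, x, y) / support(transactions, y)


# Hypothetical shopping baskets
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

print(support(baskets, {"bread"}))  # 0.75
conf_bm = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
lift_bm = lift(baskets, {"bread"}, {"milk"})        # below 1 for this data
```

In this made-up data the lift of bread → milk is below 1, so buying bread makes milk slightly less likely than its overall popularity would suggest.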
