Introduction to Data Science

Data Science has become one of the most in-demand jobs of the 21st century.

What is Data Science?

“Data Science is about extraction, preparation, analysis, visualization, and maintenance of information. It is a cross-disciplinary field which uses scientific methods and processes to draw insights from data.” As a data scientist, you take a complex business problem, gather research on it, turn that research into data, and then use the data to solve the problem. A data scientist not only analyzes data but also uses machine learning algorithms to predict future occurrences of an event. We can therefore understand Data Science as a multidisciplinary field, combining mathematics, statistics, and computer science, that deals with processing data, analyzing it, and extracting insights from it using statistical methods and computer algorithms.

Why Data Science?

Now that we know what Data Science is, why is it important? Data has become the fuel of industry; it is often called the new electricity. Companies require data to function, grow and improve their businesses, and Data Scientists work with that data to help companies make sound decisions. Companies take this data-driven approach with the help of Data Scientists, who analyze large amounts of data to derive meaningful insights.

What does a Data Scientist do?

Data scientists work across several areas. Each is crucial to finding solutions to problems and requires specific knowledge. These areas include data acquisition, preparation, mining and modeling, and model maintenance. Data scientists take raw data and, with the help of machine learning algorithms, turn it into a goldmine of information that answers the questions businesses bring to them. Each area can be described as follows:

Data Acquisition

Here, data scientists take data from all its raw sources, such as databases and flat files. Then, they integrate and transform it into a homogeneous format and collect it in what is known as a “data warehouse,” a system from which information can be extracted easily. This step is also known as ETL (extract, transform, load) and can be done with tools such as Talend Studio, DataStage and Informatica.
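
The extract, transform and load steps can be sketched in a few lines of Python. This is only a minimal illustration, not how Talend Studio, DataStage or Informatica work internally: the two raw sources, the `sales` table and the in-memory SQLite database standing in for a data warehouse are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw sources: a flat file (CSV text) and rows from an operational database.
FLAT_FILE = "customer_id,amount\n1, 19.99\n2,5.00\n"
DATABASE_ROWS = [(3, "12.50"), (4, "7.25")]


def extract():
    """Read records from both raw sources as (id, amount) pairs of strings."""
    reader = csv.DictReader(io.StringIO(FLAT_FILE))
    flat_rows = [(row["customer_id"], row["amount"]) for row in reader]
    db_rows = [(str(cid), amount) for cid, amount in DATABASE_ROWS]
    return flat_rows + db_rows


def transform(rows):
    """Convert every record to one homogeneous format: (int id, float amount)."""
    return [(int(cid.strip()), float(amount.strip())) for cid, amount in rows]


def load(rows):
    """Load the cleaned rows into an in-memory SQLite table standing in for a warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    warehouse.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    warehouse.commit()
    return warehouse


warehouse = load(transform(extract()))
total = warehouse.execute("SELECT ROUND(SUM(amount), 2) FROM sales").fetchone()[0]
print(total)
```

Once the data sits in one place in one format, a single query can answer a question that previously needed two different sources.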

Data Preparation

This is the most important stage, and 60 percent of a data scientist’s time is spent on it, because data is often “dirty” or unfit for use and must be made scalable, productive and meaningful. Five sub-steps exist here:
Data Cleaning: Bad data can lead to bad models, so this step handles missing, null or void values that might cause the models to fail. Ultimately, it improves business decisions and productivity.
Data Transformation: Takes raw data and turns it into the desired output by normalizing it, for example with min-max normalization or z-score normalization.
Handling Outliers: Outliers are values that fall far outside the range of the rest of the data. Using exploratory analysis, a data scientist uses plots and graphs to see why the outliers are there and decide what to do with them. Outlier detection is also often used for fraud detection.
Data Integration: This compiles multiple sources of data into one, and the data scientist ensures the combined data is accurate and reliable.
Data Reduction: This reduces the volume of data by eliminating duplicate, redundant data, which frees storage capacity and reduces costs.
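
A few of these preparation steps can be sketched with Python’s standard library. The sample column, the choice of the median to fill missing values, and the 1.5 × IQR rule for flagging outliers are assumptions chosen for illustration; a real project would pick these to suit its own data.

```python
import statistics

# Hypothetical raw column with a missing value (None) and one extreme value.
raw = [12.0, 15.0, None, 14.0, 13.0, 95.0]

# Data cleaning: fill missing values with the median of the observed values.
observed = [x for x in raw if x is not None]
median = statistics.median(observed)
cleaned = [median if x is None else x for x in raw]


def min_max(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def z_scores(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def iqr_outliers(values):
    """Flag values more than 1.5 * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]


print(cleaned)
print(min_max(cleaned))
print(iqr_outliers(cleaned))
```

Here the value 95.0 is flagged as an outlier. Whether to drop it, cap it or investigate it (for example as possible fraud) remains a judgment call for the data scientist.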

Data Mining 

Here, data scientists uncover patterns and relationships in the data to make better business decisions. It is a discovery process for finding hidden and useful knowledge, and it overlaps with what is commonly known as exploratory data analysis. Data mining is useful for predicting future trends, recognizing customer patterns, supporting decisions, detecting fraud quickly and choosing suitable algorithms. Tableau works nicely for the visual side of data mining.

Model Building

This goes further than simple data mining and requires building a machine learning model. The model is built by selecting a machine learning algorithm that suits the data, the problem statement and the available resources. Two main types of machine learning algorithms are covered here: supervised and unsupervised.

Supervised

Supervised learning algorithms are used when the data is labeled. There are two types:

Regression: Used when you need to predict continuous values. Linear and multiple regression suit variables that are linearly related; decision trees and random forests can also be used and do not require a linear relationship.
Classification: Used when you need to predict categorical values. Common classification algorithms include KNN, logistic regression, SVM and Naive Bayes.
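
Both kinds of supervised learning can be shown with small pure-Python sketches: a simple linear regression fitted by ordinary least squares, and a k-nearest-neighbours (KNN) classifier. The data points and labels are made up for illustration, and in practice a library implementation would normally be used instead of hand-written code.

```python
from collections import Counter


def fit_line(xs, ys):
    """Simple linear regression by ordinary least squares: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    def squared_distance(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))

    nearest = sorted(zip(points, labels), key=lambda pair: squared_distance(pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Regression: hypothetical labeled data that follows y = 2x + 1 exactly.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Classification: hypothetical labeled points in two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["low", "low", "low", "high", "high", "high"]
prediction = knn_predict(points, labels, (2, 2))

print(slope, intercept, prediction)
```

In both cases the algorithm learns from examples whose correct answers (the `ys` values or the `labels`) are already known, which is what makes the learning supervised.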

Unsupervised

Unsupervised learning algorithms are used when the data is unlabeled, so there are no labels to learn from. There are two types:

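Clustering: This is the method of dividing objects into groups so that objects in the same group are similar to each other and dissimilar to objects in other groups. K-Means is a commonly used clustering algorithm; PCA, which is a dimensionality-reduction technique rather than a clustering algorithm, is often used alongside it.
Association-rule analysis: This is used to discover interesting relations between variables. The Apriori algorithm is a common choice; Hidden Markov Models, which model sequences rather than association rules, are sometimes mentioned alongside it.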
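
As a rough sketch of clustering, here is a plain k-means loop in Python. The 2-D points and the hand-picked starting centroids are assumptions for the example; real implementations usually choose starting centroids randomly or with a smarter initialization.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cluster) / len(cluster), sum(p[1] for p in cluster) / len(cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled points that form two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 8)])

print(centroids)
print(clusters)
```

No labels are supplied anywhere: the algorithm groups the points purely by how close they are to each other.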

Model Maintenance

After gathering data and performing the mining and model building, data scientists must maintain the model’s accuracy. They take the following steps:

Assess: Periodically running sample data through the model to make sure it remains accurate.
Retrain: When the results of the assessment are not good enough, the data scientist must retrain the algorithm so that it provides correct results again.
Rebuild: If retraining does not restore accuracy, the model must be rebuilt.
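
The assess, retrain and rebuild cycle can be sketched as a small decision function. Everything here is hypothetical: the one-feature threshold “model”, the `train_threshold` and `maintain` helpers and the 0.9 accuracy cut-off are stand-ins chosen only to make the flow concrete.

```python
def accuracy(model, samples):
    """Share of labeled samples (feature, label) that the model predicts correctly."""
    correct = sum(1 for feature, label in samples if model(feature) == label)
    return correct / len(samples)


def train_threshold(samples):
    """Hypothetical trainer: learn a cut-off halfway between the two class means."""
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    cutoff = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return lambda x: 1 if x >= cutoff else 0


def maintain(model, fresh_samples, minimum_accuracy=0.9):
    """Assess on fresh data, retrain if accuracy dropped, flag a rebuild if that fails."""
    if accuracy(model, fresh_samples) >= minimum_accuracy:
        return model, "ok"
    retrained = train_threshold(fresh_samples)
    if accuracy(retrained, fresh_samples) >= minimum_accuracy:
        return retrained, "retrained"
    return model, "rebuild needed"


old_data = [(1, 0), (2, 0), (8, 1), (9, 1)]
new_data = [(11, 0), (12, 0), (18, 1), (19, 1)]  # the data has shifted upwards

model = train_threshold(old_data)
_, status_old = maintain(model, old_data)
_, status_new = maintain(model, new_data)
print(status_old, status_new)
```

For brevity the sketch checks the retrained model on the same samples it was trained on; a real assessment would use separate held-out data.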

Tools for Data Science

Data Scientists use traditional statistical methodologies that form the backbone of Machine Learning algorithms. They also use Deep Learning algorithms to generate robust predictions. Data Scientists use the following tools and programming languages:

i. R
R is a scripting language tailored for statistical computing. It is widely used for data analysis, statistical modeling, time-series forecasting, clustering and similar statistical work. It also has the features of an object-oriented programming language. R is an interpreted language and is popular across multiple industries.

ii. Python
Like R, Python is an interpreted, high-level programming language. It is versatile and is used for both Data Science and Software Development. Python has gained popularity due to its ease of use and code readability, and as a result it is widely used for Data Analysis, Natural Language Processing, and Computer Vision. Python offers various graphical and statistical packages such as Matplotlib, NumPy and SciPy, and more advanced packages for Deep Learning such as TensorFlow, PyTorch and Keras. We use Python for data mining, wrangling, visualization and developing predictive models, which makes it a very flexible programming language.

iii. SQL
SQL stands for Structured Query Language. Data Scientists use SQL for managing and querying data stored in relational databases, which are collections of data organized in tables. Being able to extract information from databases is the first step towards analyzing the data. We use SQL for extracting, managing and manipulating the data. For example, a Data Scientist working in the banking industry uses SQL to extract customer information. While relational databases use SQL, ‘NoSQL’ is a popular choice for non-relational or distributed databases. NoSQL has recently been gaining popularity due to its flexible scalability and dynamic schema design, and many NoSQL systems are open source. MongoDB, Redis, and Cassandra are some of the popular NoSQL databases.
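
As a small illustration of the banking example, the following sketch runs a SQL query from Python using the built-in `sqlite3` module. The `customers` table, its columns and the sample rows are invented for the example.

```python
import sqlite3

# An in-memory database with a hypothetical customers table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
connection.executemany(
    "INSERT INTO customers (name, balance) VALUES (?, ?)",
    [("Asha", 1200.0), ("Ben", 300.0), ("Chen", 5400.0)],
)

# Extract the customers whose balance exceeds a threshold, largest first.
rows = connection.execute(
    "SELECT name, balance FROM customers WHERE balance > ? ORDER BY balance DESC",
    (1000,),
).fetchall()
print(rows)
```

The `?` placeholders pass values to the query as parameters instead of pasting them into the SQL text, which is the safer habit when values come from users.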

iv. Hadoop
Big data is another trending term; it deals with the management and storage of huge amounts of data, which may be structured or unstructured. A Data Scientist must be familiar with complex data and must know tools that manage the storage of massive datasets. One such tool is Hadoop. Hadoop is open-source software that combines a distributed storage system (HDFS) with a processing model called ‘MapReduce’. Several tools are built around Hadoop, such as Apache Pig, Hive and HBase. Due to its ability to process colossal amounts of data quickly, its scalable architecture and its low-cost deployment, Hadoop has grown to become one of the most popular software platforms for Big Data.
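
The idea behind MapReduce can be shown with the classic word-count example. This is a single-machine sketch of the model in plain Python, not Hadoop’s actual API; the function names and sample documents are made up. On a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data needs big storage", "data drives decisions"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work splits naturally across machines, which is what lets this model scale to very large datasets.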

v. Tableau
Tableau is Data Visualization software specializing in graphical analysis of data. It allows users to create interactive visualizations and dashboards, which makes it an ideal choice for showing trends and insights in the data as interactive charts such as treemaps, histograms and box plots. An important feature of Tableau is its ability to connect to spreadsheets, relational databases, and cloud platforms. This allows Tableau to work with the data directly, which makes things easier for users.

vi. Weka
For Data Scientists who want to get familiar with Machine Learning in action, Weka can be an ideal option. Weka is generally used for Data Mining but also includes various tools for Machine Learning. It is open-source software with a graphical user interface, so users can work with it without writing a line of code.

Applications of Data Science

Data Science has gained a strong foothold in several industries such as medicine, banking, manufacturing and transportation, and it has a wide variety of uses. Some applications of Data Science are:

i. Data Science in Healthcare
Data Science has been playing a pivotal role in the Healthcare Industry. With the help of classification algorithms, doctors can detect cancer and tumors at an early stage using Image Recognition software. The genetics industry uses Data Science to analyze and classify patterns in genomic sequences. Virtual assistants also help patients with their physical and mental ailments.

ii. Data Science in E-commerce
Amazon uses a recommendation system that suggests products to users based on their purchase history. Data Scientists have developed recommendation systems that predict user preferences using Machine Learning.

iii. Data Science in Manufacturing
Industrial robots have taken over mundane and repetitive roles in manufacturing units. These robots are autonomous and use Data Science technologies such as Reinforcement Learning and Image Recognition.

iv. Data Science in Conversational Agents
Amazon’s Alexa and Apple’s Siri use Speech Recognition to understand users. Data Scientists develop the speech recognition systems that convert human speech into text. These assistants also use Machine Learning algorithms to classify user queries and provide an appropriate response.

v. Data Science in Transport
Self-driving cars use autonomous agents that rely on Reinforcement Learning and detection algorithms. Thanks to advancements in Data Science, self-driving cars are no longer fiction.
