
Data Science: Mandatory Skills for 2020

The standard job description for a Data Scientist has long highlighted skills in R, Python, SQL, and Machine Learning. As the field evolves, these core competencies are no longer enough to stay competitive in the job market. Data Science is a competitive field, and people are quickly building more skills and experience. This has given rise to the booming job title of Machine Learning Engineer, and my advice for 2020 is therefore that all Data Scientists need to be developers as well.


To stay competitive, make sure to prepare yourself for new ways of working that come with new tools.

1. Agile

Agile is a method of organizing work that is already widely used by dev teams. Data Science roles are increasingly filled by people whose original skillset is pure software development, which gives rise to the role of Machine Learning Engineer. Increasingly, Data Scientists and Machine Learning Engineers are managed as developers: continuously making improvements to Machine Learning elements in an existing codebase.
For this type of role, Data Scientists have to know the Agile way of working based on the Scrum method. It defines several roles for different people, and this role definition makes sure that continuous improvement can be implemented smoothly.

2. GitHub

Git and GitHub are developer tools that are of great help when managing different versions of software. They track all changes that are made to a codebase, and they also make collaboration much easier when multiple developers change the same project at the same time.
With the role of Data Scientist becoming more dev-heavy, it becomes key to be able to handle those dev tools. Git is becoming a serious job requirement, and it takes time to get used to best practices for using Git. It is easy to start working with Git when you’re alone or when your co-workers are new to it, but when you join a team of Git experts and you’re still a newbie, you might struggle more than you think.
Git is the real skill to know for GitHub.

3. Industrialization

What is also changing in Data Science is the way we think about our projects. The Data Scientist is still the person who answers business questions with machine learning, as has always been the case. But Data Science projects are more and more often developed for production systems, for example as a microservice in a larger software system.
AWS is the biggest Cloud Vendor.
At the same time, advanced types of models are getting more and more CPU and RAM intensive to execute, especially when working with Neural Networks and Deep Learning.
For the job description of a Data Scientist, it is becoming more important to think not only about the accuracy of your model but also about its execution time and other industrialization aspects of your project.
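As a minimal sketch of what looking at both accuracy and execution time can mean in practice, the snippet below scores two stand-in models on both criteria. The functions `evaluate`, `fast_model`, and `slow_model`, as well as the sample data, are hypothetical and only serve as an illustration; a real project would plug in its own model and test set.

```python
import time


def evaluate(predict, samples, labels):
    """Return accuracy and average prediction time for a predict function."""
    start = time.perf_counter()
    predictions = [predict(x) for x in samples]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "accuracy": correct / len(labels),
        "seconds_per_prediction": elapsed / len(samples),
    }


def fast_model(x):
    # Hypothetical cheap model: a single threshold comparison.
    return int(x > 0.5)


def slow_model(x):
    # Hypothetical heavier model: a better threshold, but with extra
    # computation added on purpose to mimic a costly prediction step.
    total = 0.0
    for i in range(1, 2000):
        total += (x - 0.42) / i
    return int(total > 0)


# Made-up test data, for illustration only.
samples = [0.1, 0.4, 0.45, 0.6, 0.9]
labels = [0, 0, 1, 1, 1]

fast_report = evaluate(fast_model, samples, labels)
slow_report = evaluate(slow_model, samples, labels)
print("fast:", fast_report)
print("slow:", slow_report)
```

Reporting both numbers side by side makes the trade-off explicit: the more accurate model is not automatically the better choice if it is too slow for the production system it has to run in.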
Google also has a cloud service, just like Microsoft (Azure).

4. Cloud and Big Data

While industrialization of Machine Learning is becoming a more serious constraint for Data Scientists, it has also become a serious constraint for Data Engineers and IT in general.
Where the Data Scientist can work on reducing the time needed by a model, IT can contribute by moving to faster compute services, which are generally obtained in one or both of the following ways:
  • Cloud: moving compute resources to external vendors like AWS, Microsoft Azure, or Google Cloud makes it very easy to set up a very fast Machine Learning environment that can be accessed remotely. This requires Data Scientists to have a basic understanding of how the Cloud works, for example: working on remote servers instead of your own computer, or working on Linux rather than on Windows / Mac.
  • Big Data: a second aspect of faster IT is using Hadoop and Spark, which are tools that allow for the parallelization of tasks on many computers at the same time (worker nodes). This requires a different approach to implementing models as a Data Scientist, because your code must allow for parallel execution.
PySpark lets you write Python for parallel (Big Data) systems.
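To illustrate what code that allows for parallel execution looks like, here is a minimal sketch in plain Python. It is not Spark code: a local thread pool stands in for the worker nodes, and the helper names (`split_into_partitions`, `partial_sum_and_count`, `parallel_mean`) are hypothetical. The point is the structure: each partition is processed independently, and only small partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Cut a list into at most n_partitions chunks of similar size."""
    size = -(-len(values) // n_partitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(partition):
    # Work done independently on each partition (one "worker" each).
    return sum(partition), len(partition)


def parallel_mean(values, n_workers=4):
    """Compute a mean by combining per-partition partial results."""
    if not values:
        raise ValueError("values must not be empty")
    partitions = split_into_partitions(values, n_workers)
    # The thread pool is only a local stand-in for a cluster of worker nodes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(parallel_mean(list(range(1, 101)), n_workers=4))
```

A mean cannot be computed by simply averaging the per-partition means when partitions differ in size, which is why each worker returns a sum and a count. Rewriting a computation into this kind of independent pieces plus a final combine step is the change in approach that parallel systems ask for.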

5. NLP, Neural Networks, and Deep Learning

Until recently, it was still acceptable for a Data Scientist to consider NLP and image recognition as mere specializations of Data Science that not everyone has to master.
But the use cases for image classification and NLP are becoming more and more frequent, even in ‘regular’ business. Today, it has become unacceptable not to have at least basic knowledge of such models.
You will need to understand Deep Learning: Machine Learning loosely based on the idea of the human brain.
Even if you do not have direct applications of such models in your job, a hands-on project is easy to find and will allow you to understand the steps needed in image and text projects.
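As a very small starting point for such a hands-on project, the sketch below trains a single artificial neuron, the basic building block of a neural network, in plain Python. The task (learning the logical OR function), the function names, and the hyperparameters are all illustrative assumptions; real image and text projects stack many such units and rely on a Deep Learning library.

```python
import math


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Output of one neuron: weighted sum of the inputs, then sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit one neuron with per-sample gradient descent on the log-loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Gradient of the log-loss with respect to the weighted sum.
            error = predict(weights, bias, x) - y
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Toy data: the logical OR of two binary inputs.
OR_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
OR_LABELS = [0, 1, 1, 1]

weights, bias = train_neuron(OR_INPUTS, OR_LABELS)
for x in OR_INPUTS:
    print(x, round(predict(weights, bias, x)))
```

The same loop of predicting, measuring the error, and adjusting the weights is what Deep Learning frameworks automate at a much larger scale.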
