
Mandatory Data Science Skills for 2020

The standard job description for a Data Scientist has long highlighted skills in R, Python, SQL, and Machine Learning. As the field evolves, these core competencies are no longer enough to stay competitive in the job market. Data Science is a competitive field, and people are quickly building more and more skills and experience. This has given rise to the booming job title of Machine Learning Engineer, and my advice for 2020 is therefore that every Data Scientist also needs to be a developer.


To stay competitive, make sure to prepare yourself for new ways of working that come with new tools.

1. Agile

Agile is a way of organizing work that is already widely used by development teams. Data Science roles are increasingly filled by people whose original skill set is pure software development, which is exactly what gives rise to the role of Machine Learning Engineer. More and more, Data Scientists and Machine Learning Engineers are managed like developers: continuously improving the Machine Learning components of an existing codebase.
For this type of role, Data Scientists need to know the Agile way of working, typically based on the Scrum method. Scrum defines several roles for different people, and this role definition makes sure that continuous improvement can be implemented smoothly.

2. GitHub

Git and GitHub are developer tools for managing different versions of software. They track all changes made to a code base and make collaboration much easier when multiple developers work on the same project at the same time.
With the role of Data Scientist becoming more dev-heavy, being able to handle these developer tools becomes key. Git is turning into a serious job requirement, and it takes time to get used to its best practices. It is easy to start working with Git when you are alone or when your co-workers are also new to it, but when you join a team of Git experts while you are still a newbie, you may struggle more than you think.
Git is the real skill to learn behind GitHub.

3. Industrialization

What is also changing in Data Science is the way we think about our projects. The Data Scientist is still the person who answers business questions with machine learning, as has always been the case. But Data Science projects are more and more often developed for production systems, for example as a micro-service within a larger software system.
At the same time, advanced types of models are getting more and more CPU and RAM intensive to execute, especially when working with Neural Networks and Deep Learning.
In terms of the Data Scientist's job description, it is becoming more important to think not only about the accuracy of your model but also about its execution time and the other industrialization aspects of your project.
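As a rough illustration (not from the original post; the toy data and scikit-learn model are assumptions), you can report prediction latency next to accuracy so that industrialization constraints are visible from the start:

    import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Toy data standing in for a real production dataset (assumption).
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    # Accuracy alone is no longer enough: measure prediction latency as well.
    start = time.perf_counter()
    predictions = model.predict(X_test)
    latency_ms = (time.perf_counter() - start) / len(X_test) * 1000

    print(f"accuracy:     {accuracy_score(y_test, predictions):.3f}")
    print(f"latency/row:  {latency_ms:.3f} ms")

Tracking both numbers side by side makes the trade-off explicit before the model ever reaches a production system.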

4. Cloud and Big Data

While the industrialization of Machine Learning is becoming a more serious constraint for Data Scientists, it is also becoming one for Data Engineers and IT in general.
Where the Data Scientist can work on reducing the time a model needs, IT can contribute by switching to faster compute services, generally obtained in one or both of the following ways:
  • Cloud: moving compute resources to external vendors like AWS, Microsoft Azure, or Google Cloud makes it easy to set up a very fast Machine Learning environment that can be accessed remotely. This requires Data Scientists to have a basic understanding of how the cloud works, for example working on remote servers instead of your own computer, or working on Linux rather than on Windows / macOS.
  • Big Data: a second aspect of faster IT is using Hadoop and Spark, tools that parallelize tasks across many computers (worker nodes) at the same time. This requires a different approach to implementing models as a Data Scientist, because your code must allow for parallel execution; with PySpark you write Python for such parallel (Big Data) systems, as sketched just after this list.
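Below is a minimal PySpark sketch, assuming a local SparkSession and a hypothetical events.csv file with columns feature_a, feature_b, and label; it only illustrates how model training is expressed so that Spark, not your own code, can parallelize the work across worker nodes:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Local session for illustration; on a real cluster the master URL would differ.
    spark = SparkSession.builder.appName("parallel-ml-sketch").getOrCreate()

    # Hypothetical dataset with numeric feature columns and a binary "label" column.
    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # Spark ML expects the features assembled into a single vector column.
    assembler = VectorAssembler(inputCols=["feature_a", "feature_b"],
                                outputCol="features")
    train_df = assembler.transform(df)

    # The fit is distributed across worker nodes by Spark itself.
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)
    print(model.coefficients)

    spark.stop()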

5. NLP, Neural Networks, and Deep Learning

Until recently, it was still acceptable for a Data Scientist to treat NLP and image recognition as mere specializations of Data Science that not everyone has to master.
But use cases for image classification and NLP are becoming more and more frequent even in 'regular' businesses, and it is no longer acceptable to lack at least a basic knowledge of such models.
You will need to understand Deep Learning: Machine Learning loosely inspired by the structure of the human brain.
Even if your job has no direct application for such models, hands-on projects are easy to find and will help you understand the steps needed in image and text projects; a minimal example follows.
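As a starting point, here is a minimal hands-on sketch, assuming TensorFlow 2 / Keras is installed, that trains a tiny feed-forward network on the classic MNIST digits. It is only meant to show the basic Deep Learning workflow (load data, define a network, train, evaluate), not a production-grade model:

    import tensorflow as tf

    # Classic MNIST digits as a stand-in for a real image-classification project.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    # A tiny feed-forward network: enough to see the full workflow end to end.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=3, validation_split=0.1)
    print(model.evaluate(x_test, y_test))

Swapping the dense layers for convolutional ones, or the digits for your own images or text, is a natural next step once the basic loop is familiar.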
