Data Science, Data Analytics, Big Data, Business Analytics... these are the latest trending topics in the industry. I have taken several courses on these subjects during my MSc and Udacity program this year and realized the importance of the field. So which path and program should I follow? This is my development program, based on my own research and the advice I have taken from gurus.

1. Statistics:

First and most important is statistics knowledge.
a. Probability distributions. Tip: the central limit theorem.

b. Basic Stats and Jargon:

Qualitative vs. quantitative data | Ratio/interval/ordinal/nominal data | Difference between population and sample – mean and variance | Skewness and kurtosis | Standard deviation, mean, quartiles | Chebyshev's theorem, coefficient of variation, Bayes' law | Least squares methods | Approaches to probability – classical / relative frequency / subjective | Joint, marginal, conditional probability | Mutually exclusive, exhaustive, independent events | Linear/non-linear correlation | Homoscedasticity/heteroscedasticity | Outliers/anomalies

c. Estimation and Hypothesis Testing:

Z-test | Difference of two means | Analysis of variance (ANOVA) – one-way / multiple comparisons – Fisher, Bonferroni, Tukey methods | t-test, F-test, chi-square test

d. Simple linear regression and correlation
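
To make the points above concrete, here is a minimal Python sketch (assuming NumPy and SciPy are installed, on synthetic data) that runs a two-sample t-test and fits a simple linear regression:

```python
# Minimal sketch: two-sample t-test and simple linear regression on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two-sample t-test: do the two groups have the same mean?
group_a = rng.normal(loc=5.0, scale=1.0, size=100)
group_b = rng.normal(loc=5.3, scale=1.0, size=100)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Simple linear regression: y = 2 + 0.5*x + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=200)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r_value:.2f}")
```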

 

2. SQL
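
SQL practice doesn't require a database server. A minimal sketch using Python's built-in sqlite3 module and a made-up sales table shows the kind of SELECT / GROUP BY / HAVING query a data scientist writes every day:

```python
# Minimal sketch: a typical analyst query (aggregate, filter, sort) on a toy table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 120.0), ("EU", 80.0), ("US", 200.0)])

cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 100
    ORDER BY total DESC
""")
print(cur.fetchall())
conn.close()
```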

3. Machine Learning Models

  • K-fold, Stratified, and Leave-one-out Cross-Validation | Validation curve
  • Multiclass Evaluation – Macro and Micro Averaging
  • Ensemble Learning
  • Random Forest
  • Voting Classifier – Soft and Hard Voting
  • Sampling Techniques – Bagging, Pasting
  • Boosting – Adaptive Boosting (AdaBoost) and Gradient Boosting: both can be used for classification and regression

Plus, Neural Networks: Perceptron – Hebb's rule; Multilayer Perceptron – backpropagation, epochs, batch size.
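
As a quick illustration of several of these ideas (k-fold cross-validation, a random forest, gradient boosting, and a soft-voting ensemble), here is a minimal scikit-learn sketch on a built-in toy dataset; the model choices and hyperparameters are arbitrary, not a recipe:

```python
# Minimal sketch: 5-fold cross-validation of a random forest, a gradient-boosting
# model, and a soft-voting ensemble on scikit-learn's breast cancer toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
gb = GradientBoostingClassifier(random_state=0)
lr = LogisticRegression(max_iter=5000)

# Soft voting averages the predicted class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[("rf", rf), ("gb", gb), ("lr", lr)], voting="soft")

for name, model in [("random forest", rf), ("gradient boosting", gb),
                    ("voting ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```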

 

4. Data Visualization

a. Used for model building & descriptive analytics – for data cleaning, model improvement, or any of the steps above. These plots are generally for developers and are not fancy; they are mainly there to get a sense of the data and the predictions.

b. Used for Non-Technical Stakeholders
Tools: Tableau, Power BI. In fact, R and Python also have some nice packages for this.
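
For (a), here is a minimal Python sketch (assuming pandas and matplotlib are installed, on synthetic data) of the "non-fancy" developer-side plots: a histogram to check a distribution and a scatter plot to eyeball a relationship:

```python
# Minimal sketch: quick exploratory plots on a synthetic dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 12, 500).clip(18, 80),
    "income": rng.lognormal(mean=10, sigma=0.4, size=500),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df["age"], bins=30)          # distribution check
ax1.set_title("Age distribution")
ax2.scatter(df["age"], df["income"], alpha=0.3)  # relationship check
ax2.set_title("Age vs. income")
plt.tight_layout()
plt.show()
```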

 

5. Communication

The whole point of a Data Scientist's job is to communicate the findings to senior-level people, including VPs and directors. If you can't convey your message effectively, all your effort will go to waste and your findings won't turn into a product. You should always be ready to present or to talk to a lot of people. Proper visualizations, as discussed above, and good public-speaking skills will make this much easier.

 

—> This is enough to become a Data Scientist

Normally, for an average data scientist, these skills are enough. But I believe a data scientist shouldn't depend on a Data Engineer for data every time: request and re-request, on and on, just to get cleaner, properly structured data. I feel I should be able to stand on my own two feet and pull the updated data whenever I want and however I want it. It increases my productivity by saving me time.
Easier said than done. Because of the data boom, SQL knowledge alone will not cut it. You should also understand the big data architecture you pull data from.

 

 

6. Big Data – Hadoop

Now the sample you build the model on is nothing compared to the actual world – the big fish. You need to pull data from the Hadoop infrastructure or ask one of your Data Engineers to do so. Data Engineers are responsible for cleaning, preparing, and optimizing data for consumption. I feel that being independent enough to get the data you need is one big step towards being a scientist.
Knowledge of Hadoop architecture, basic Linux and HDFS commands, Sqoop, Impala, Hive, Pig Latin, Flume, Solr, Spark, and Scala would be enough. Plus, you can always refer to the documentation when you need the exact code. Basic working knowledge of these tools is sufficient for a Data Scientist.
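
As an example of pulling your own data, here is a minimal PySpark sketch (assuming pyspark and pandas are installed and a Hive metastore is configured; the table name "warehouse.transactions" and its columns are made up for illustration):

```python
# Minimal sketch: read a Hive table with Spark, aggregate it, and pull a
# small sample down to pandas for local analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("pull-my-own-data")
         .enableHiveSupport()
         .getOrCreate())

df = (spark.table("warehouse.transactions")          # hypothetical Hive table
      .filter(F.col("event_date") >= "2018-01-01")
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total_spend"),
           F.count("*").alias("n_orders")))

sample_pdf = df.limit(1000).toPandas()   # small sample for local analysis
print(sample_pdf.head())

spark.stop()
```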

7. Amazon Web Services (AWS): Cloud Computing Services

Building and maintaining a big data architecture is very expensive and time-consuming, so here comes cloud computing. You can do all your analytical processing for a production instance in the cloud with the help of these services, and the major giant in this market is Amazon Web Services (AWS). Basic knowledge of how to use services like EC2, S3, DynamoDB, EMR, Athena, Lambda, and Elasticsearch should be enough.
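
For instance, here is a minimal boto3 sketch (assuming boto3 is installed and AWS credentials are configured; the bucket name and key prefix below are made up) showing how to list and download a dataset from S3 for local analysis:

```python
# Minimal sketch: list objects under an S3 prefix and download one file.
import boto3

s3 = boto3.client("s3")

bucket = "my-company-data-lake"   # hypothetical bucket
prefix = "exports/sales/2018/"    # hypothetical prefix

# List the objects under the prefix
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one file locally
s3.download_file(bucket, prefix + "part-0001.csv", "sales_part_0001.csv")
```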

8. Certifications

Now, if you really want to showcase these skills to the world, you can start with the three certifications below:
1. Dell EMC – Data Scientist Associate (DECA-DS) Certification
2. Cloudera – CCA Spark and Hadoop Developer Exam (CCA175)
3. Amazon Web Services – AWS Certified Developer/ AWS Certified Solutions Architect – Associate
Also, I did a Salesforce certification to understand CRM behavior better. Salesforce Wave for Big Data can access data from on-premise Hadoop clusters and cloud-based big data repositories (AWS). It just broadens your knowledge (not necessary; your choice).

 

Sources:

https://www.linkedin.com/pulse/you-data-scientist-article-guide-become-one-dipendu-chanda-/

https://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python

https://www.datasciencecentral.com/profiles/blogs/selecting-forecasting-methods-in-data-science

https://www.datasciencecentral.com/profiles/blogs/time-series-analysis-and-forecasting-novel-business-perspectives

https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/

https://www.datasciencecentral.com/profiles/blogs/22-great-articles-about-statistics-for-data-scientists

https://www.datasciencecentral.com/profiles/blogs/steps-of-modelling