The Data Science Portfolio, Part Two: The Building Blocks
(This is the second installment of The Data Science Portfolio series. You can catch up on the first part here.)
Here is a sample plan for crafting a Data Science portfolio project:
1. Data Merging and Cleaning
If your data is from multiple sources or in different formats, now is the time to merge it all together and resolve any issues like missing data, incoherent or inconsistent entries and outliers.
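As a rough illustration of what this step can look like, here is a minimal sketch that joins two small sources and handles a missing value, an inconsistent label and an outlier. Everything in it is an assumption made for the example: the customer and order records, the column names, the alias table and the age threshold are all hypothetical. It uses only the Python standard library so that it runs anywhere; on a real project a dedicated data-frame library would usually do this work.

```python
import csv
import io
from statistics import median

# Hypothetical data: two sources describing the same customers.
CRM_CSV = """customer_id,country,age
1,US,34
2,usa,
3,DE,29
4,US,420
"""
ORDERS = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 2, "total": 75.5},
    {"customer_id": 3, "total": 42.0},
    {"customer_id": 5, "total": 10.0},  # no matching customer record
]

# Hypothetical mapping from inconsistent spellings to one canonical label.
COUNTRY_ALIASES = {"usa": "US", "us": "US", "de": "DE"}


def load_customers(text):
    """Parse the CSV, convert types, and normalize country labels."""
    customers = {}
    for row in csv.DictReader(io.StringIO(text)):
        raw_age = row["age"].strip()
        age = int(raw_age) if raw_age else None
        # Treat implausible ages as missing instead of silently keeping them.
        if age is not None and not 0 < age < 120:
            age = None
        raw_country = row["country"].strip()
        country = COUNTRY_ALIASES.get(raw_country.lower(), raw_country.upper())
        customers[int(row["customer_id"])] = {"country": country, "age": age}
    return customers


def impute_median_age(customers):
    """Fill missing ages with the median of the known ages."""
    known = [c["age"] for c in customers.values() if c["age"] is not None]
    fill = median(known)
    for c in customers.values():
        if c["age"] is None:
            c["age"] = fill


def merge(customers, orders):
    """Inner join on customer_id; also report orders that found no match."""
    merged, unmatched = [], []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            unmatched.append(order["customer_id"])
        else:
            merged.append({**order, **customer})
    return merged, unmatched


customers = load_customers(CRM_CSV)
impute_median_age(customers)
merged, unmatched = merge(customers, ORDERS)

print(f"{len(merged)} merged rows, unmatched order ids: {unmatched}")
```

Median imputation and a hard age cutoff are only two of many possible choices. Whichever you pick, write it down: these are exactly the assumptions you will need to disclose when you present your results.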
2. Data Exploration and Hypothesis Formation
When starting a new project, the natural first step should always be an exploration of the data in order to gain an intuitive understanding of each variable and of the relationships between them. In data science this step is known as exploratory data analysis, or EDA for short. EDA includes, but is not limited to, data summarization, aggregation, visualization and correlation. This is the stage when the data scientist starts asking questions of the data and forming hypotheses about what the answers might be. At the end of this stage, the data professional picks one or two questions or hypotheses, and proceeds to delve deeper into the advanced analytics needed to obtain an answer.
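To make the four EDA activities concrete, here is a small sketch covering summarization, aggregation, a crude text "visualization" and correlation. The visit records, the field names and the helper functions `summarize` and `pearson` are hypothetical and exist only for this example. It sticks to the Python standard library; real projects typically lean on data-frame and plotting libraries instead.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical dataset: one record per website visit.
visits = [
    {"channel": "email", "minutes": 4.0, "spend": 20.0},
    {"channel": "email", "minutes": 6.0, "spend": 28.0},
    {"channel": "email", "minutes": 5.0, "spend": 26.0},
    {"channel": "search", "minutes": 10.0, "spend": 52.0},
    {"channel": "search", "minutes": 12.0, "spend": 61.0},
    {"channel": "search", "minutes": 9.0, "spend": 47.0},
    {"channel": "social", "minutes": 2.0, "spend": 8.0},
    {"channel": "social", "minutes": 3.0, "spend": 15.0},
    {"channel": "social", "minutes": 2.5, "spend": 11.0},
]


def summarize(values):
    """Basic descriptive statistics for one numeric variable."""
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
        "stdev": stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


minutes = [v["minutes"] for v in visits]
spend = [v["spend"] for v in visits]

# Summarization: one row of statistics per variable.
summary = {"minutes": summarize(minutes), "spend": summarize(spend)}

# Aggregation: average spend per acquisition channel.
by_channel = defaultdict(list)
for v in visits:
    by_channel[v["channel"]].append(v["spend"])
mean_spend = {channel: mean(values) for channel, values in by_channel.items()}

# Correlation: do longer visits go together with higher spend?
r = pearson(minutes, spend)

for name, stats in summary.items():
    print(name, stats)
# Visualization: a text bar chart, one '#' per 5 units of average spend.
for channel, avg in sorted(mean_spend.items(), key=lambda item: -item[1]):
    print(f"{channel:<7} {'#' * round(avg / 5)} {avg:.1f}")
print(f"Pearson r (minutes vs spend): {r:.2f}")
```

Output like this is where questions start to form. In this made-up data, for example, you might ask whether search visitors really spend more, or whether they simply stay longer. A strong correlation on its own does not tell you which.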
3. Hypothesis Testing and/or Machine Learning
At this step in the project, we have some intuition about the data, and we have a clear hypothesis or question for further research. Now, it’s time to do a deep dive and look for evidence that answers our question or supports our hypothesis. At this step, we do statistical analysis, feature engineering, feature and model selection, model training and predictions, cross-validation analysis, and model performance analysis. Did your chosen model/methodology do well on the testing data? Can you do better? This step is typically not a one-off process. You will often want to go back to the beginning and try different approaches until you come up with a satisfactory and defensible answer to your research question, or with evidence for (or against) your hypothesis.
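For the hypothesis-testing half of this step, one simple and assumption-light option is a permutation test. The sketch below is only an illustration: the two groups of measurements, the function name `permutation_test` and its parameters are hypothetical, and a permutation test is just one of many tests you could reasonably choose. It asks how often a difference in means at least as large as the observed one would appear if group labels were assigned at random.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, then split them into two fake groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical measurements for two variants of a page.
variant_a = [12.1, 11.8, 12.5, 12.9, 12.2, 13.0]
variant_b = [10.1, 10.4, 9.8, 10.9, 10.2, 10.0]

observed, p_value = permutation_test(variant_a, variant_b)
print(f"observed difference in means: {observed:.2f}")
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value says the observed gap would be unusual under random labeling. It does not say how large or how useful the effect is, so report the size of the difference alongside it.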
If you need a refresher on statistics, this book is a great source:
4. Interpret Results and Make Recommendations
Finally, once you are satisfied that your science is sound and your results are tested and validated, it’s time to communicate your results and make business recommendations based on them. Whether you are trying to convince your CMO that one marketing strategy is far superior to another, or your CEO that the new recommender system you have built will deliver a much better user experience and increased profits compared to the old one, now is your time to shine! While all the hard work that you have put in so far is very important to the success of your project, this step can make or break it. Why? Because all your work so far is very likely either not visible or not accessible to the decision makers in your company, and people tend to mistrust what they don’t understand. Present your results to your stakeholders in the wrong way, and your project has a good chance of not gaining any traction, or of never seeing the light of day for its intended purpose. Telling a compelling story with data and getting buy-in for your recommendations is a very broad and important topic, but here are a few bullet points to steer you in the right direction:
- Plain English, please!
- Evidence, evidence, evidence!
- If it seems too good to be true, it probably is.
- What’s next?
When communicating your results to business stakeholders and making recommendations, use plain, jargon-free language, and steer clear of technical details as much as possible. Start with the bottom line of your results or recommendations, and with how they would best be put into practice for maximum impact.
Link the recommendations you make to specific results from your analysis. Give some background on your thinking and the logic that brought you to that particular conclusion. The more evidence you can provide, the better.
No data science project is ever perfect. As data scientists, we make assumptions and inferences, and we define thresholds for success for our projects based on them. There are always caveats to our findings and limitations to our analyses. When communicating results to stakeholders, it is important to be upfront about the assumptions you’ve made in your work, and to be specific about the scope of your recommendations.
While you have done your best to test and document the solution to your chosen research question or hypothesis, a data story is never finished. If you look back to your EDA step, there are probably several other interesting questions and hypotheses that you uncovered in the data, and that you were not able to pursue in the current project. What are they? What other directions or angles are left to be pursued, that would add to the company’s understanding of the subject being researched?
Congratulations! You have completed your first data science portfolio project! After many hours of work, and hopefully many rounds of feedback from friends and mentors, your data story is informative and compelling. The next step is to share your work and insights with the world by posting your project or linking to it on GitHub, LinkedIn, Kaggle (if using data from there), your personal blog, etc. Answering relevant questions and engaging with the community on Stack Overflow or Quora will also help establish you as an expert in the field. It will also increase your visibility in the professional channels that recruiters and hiring managers often turn to when looking to add data scientists to their teams. As with anything, the first project in a data science portfolio is always the hardest, but you’ve got this: if anyone can do it, it’s you!