Table of Contents
- What Exactly is Data Science?
- What Do Data Scientists Do?
- How Do You Extract, Transform, and Load Data?
- What is Exploratory Data Analysis?
- Why is Data Mining Important?
- Why is Data Cleaning Important?
- What is Cloud Computing?
- What Does Database Management Mean?
- What is Machine Learning and Deep Learning?
- Why Are Data Visualization and Storytelling Important?
- Why Does Data Science Need Specialization?
- What Are The Ethics of Data Science?
On the internet, an estimated 2.5 quintillion bytes of data is generated every day. For reference, an average hard drive can store 1 terabyte of data. Multiply that by 2.5 million, and that’s about how much storage you’ll need to store just one day’s worth of new information.
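For readers who want to check that arithmetic, here is a quick sketch. It assumes decimal (SI) units, where one quintillion is 10^18 and one terabyte is 10^12 bytes; the variable names are just for illustration.

```python
# Assumption: decimal (SI) units, so 1 quintillion = 10**18 and 1 TB = 10**12 bytes.
BYTES_PER_DAY = 2_500_000_000_000_000_000  # 2.5 quintillion bytes generated daily
BYTES_PER_DRIVE = 10**12                   # capacity of one 1 TB hard drive

drives_needed = BYTES_PER_DAY // BYTES_PER_DRIVE
print(f"{drives_needed:,} one-terabyte drives per day")
```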
Unfortunately, all of this data is not worth the cost of the servers it is stored on without the right humans to process it, analyze it, and build the models needed to separate the nuggets of valuable insight from the torrents of noise they are buried in. This is the job of data scientists.
A 2020 survey by Analytics Insight predicted that by 2021 there would be 3,037,809 new job openings in data science worldwide. In fact, the need for data scientists is so high that job cuts and layoffs for data science roles remained much lower than for other software-related jobs, even as the economy contracted during the COVID-19 pandemic. Demand was already strong enough back in 2012 for the Harvard Business Review to label data scientist the sexiest job of the 21st century.
What Exactly is Data Science?
As history would have it, data science is more than 30 years old. However, the term “data science” was initially used as a substitute for computer science. In 2001, William S. Cleveland proposed that data science be made an academic discipline that links computer science with data. A 2005 National Science Board report, “Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century,” later defined data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators, and expert annotators, librarians, archivists, and others, who are crucial to the successful management of digital data collections.” Although this description fits the overall purpose of a data scientist, it fails to explain what data science is all about today.
Over the last decade, data science has stepped out of the academic hallway and spread into nearly every industry in the world. It has become the next big thing that industries depend on to forge ahead. By today’s standards, data science is the art of extracting valuable information from chaotic data by using various techniques, ranging from computer programming to mathematical modeling and statistics. Data scientists are the people who design the standardized methods required to derive insights from otherwise chaotic data. These insights are gold for businesses. Simply put, big data is the heartbeat that no modern company or industry can survive without, and data scientists are the people who make sense of it all and translate it into meaningful business insight.
Given the ever-increasing demand for people to do the sexiest job of the century, why are data scientists so rare as to be likened to unicorns? The practice of data science requires math, statistics, engineering, design, forecasting, analysis, communication, and business management, and a data scientist is expected to know all, or at least most, of those fields. Much like a mythical creature, one person with all of these skills is close to impossible to find. Finding 140,000 of them? Well, that might take a while.
What Do Data Scientists Do?
According to Ricardo Vladimiro, a game analytics and data science lead at Miniclip, data scientists create data products. In other words, they are the engineers behind the interfaces that human beings and machines use to work with data. This explanation can be broken down into several specific tasks that data scientists typically do day to day. To give you a clear picture of these tasks, here is a quick look at what each of them entails.
How Do You Extract, Transform, and Load Data?
Abbreviated as ETL, this process involves extracting data from various sources, transforming it into the format required for analysis, and loading it into an end target, such as a data warehouse, for processing. The data can be extracted from several sources, including public APIs, clickstream capture, web scraping, and third-party vendors. The heterogeneous data is then transformed so that it can be loaded into a single data store, for example on a Hadoop cluster, and queried in a uniform way.
In a properly designed ETL system, the data is not only extracted from source systems but also checked to see that the company’s quality and consistency standards are met. Moreover, ETL is usually a time-consuming process and is thus typically implemented in a pipelined manner. This means that extraction and transformation run concurrently: while some data is being extracted, a separate transformation process works on already extracted data and prepares it for loading. Neither process waits for the other to finish entirely.
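To make the pipelined idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two in-memory "sources", the field names, and the list standing in for a data warehouse are all hypothetical. Python generators pass each record through the transform step as soon as it is extracted, so no stage waits for the previous one to finish the whole dataset. They interleave the stages rather than running them in truly parallel processes, which a production ETL system would typically do.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical raw inputs standing in for two heterogeneous sources.
API_ROWS = [{"user": "Ann", "amount": "12.50"}, {"user": "bob", "amount": "7"}]
SCRAPED_ROWS = ["carol;3.25", "dave;not-a-number"]


def extract() -> Iterator[Dict[str, str]]:
    """Yield raw records one at a time from every source."""
    for row in API_ROWS:
        yield row
    for line in SCRAPED_ROWS:
        user, amount = line.split(";")
        yield {"user": user, "amount": amount}


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Normalize each record and drop those that fail a quality check."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # quality check: skip records with an unparseable amount
        yield {"user": row["user"].strip().lower(), "amount": amount}


def load(rows: Iterable[Dict[str, object]], target: List[Dict[str, object]]) -> None:
    """Append cleaned records to the target store (a plain list, for illustration)."""
    for row in rows:
        target.append(row)


warehouse: List[Dict[str, object]] = []
load(transform(extract()), warehouse)
print(warehouse)
```

The record with the unparseable amount is dropped during the transform step, which is one simple way the quality check described above might look in practice.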
Hadoop is a massively scalable storage and batch data processing system that companies across many industry sectors widely use. If you are preparing for a data science interview, there is a good chance you will be asked about your familiarity with Hadoop and the various technologies in its ecosystem.
Tom White's book Hadoop: The Definitive Guide offers a comprehensive overview of many Hadoop technologies and is a great starting point for beginners or those preparing for tech or data science interviews.
Apache Spark is another ETL tool that is in very high demand right now. In O'Reilly's 2016 salary survey, its use was correlated with significantly higher data scientist salaries. Spark is a fast, in-memory data processing engine with powerful and easy-to-use development APIs that allow for efficient data streaming in machine learning or SQL workflows on enormous datasets.
To get started learning Spark, Learning Spark: Lightning-Fast Big Data Analysis written by the project's authors is currently the best intro book on the market. Although the content could be better organized, this book does an excellent job of covering Spark and its complementary language, Scala, with detailed code examples in Scala, Java and Python.
What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is an important step in the data science cycle. The purpose of EDA is to begin exploring the data and to form hypotheses that will guide the collection of new data or the design of new experiments for further analysis. Basically, this step allows you to test your intuition about what you might find as you begin to scratch the surface of the data in front of you.
At this stage, you will start to see patterns in the data, try different modeling techniques, design experiments to better understand the data, and come up with an approach for continued analysis.
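A first pass at EDA often starts with summary statistics and a look at how two variables move together. The sketch below uses only Python's standard library. The ad spend and sales figures are made up for illustration, and the helper names are my own, not part of any library.

```python
import statistics

# Hypothetical sample: monthly ad spend and sales, both in thousands of dollars.
ad_spend = [10, 12, 15, 18, 20, 25]
sales = [40, 44, 52, 60, 63, 78]


def summarize(name, values):
    """Print basic descriptive statistics for one variable."""
    print(f"{name}: mean={statistics.mean(values):.1f}, "
          f"median={statistics.median(values):.1f}, "
          f"stdev={statistics.stdev(values):.1f}")


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summarize("ad_spend", ad_spend)
summarize("sales", sales)
r = pearson(ad_spend, sales)
print(f"correlation between ad spend and sales: {r:.3f}")
```

A strong correlation like the one in this toy data is a hypothesis to investigate further, not a conclusion: it suggests an experiment worth designing rather than proving that ad spend drives sales.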
A good place to get started learning EDA techniques in Python is Joel Grus’s seminal book Data Science from Scratch: First Principles with Python, which gives thorough explanations of the statistics and machine learning concepts used, and easy-to-follow code examples in Python. While this book is a great intro to data science, it assumes some familiarity with Python programming.
If you are just starting out with Python programming, I strongly recommend Zed Shaw’s Learn Python the Hard Way, which, despite its name, is hands down the easiest introduction to Python for beginners that I have come across.
Why is Data Mining Important?
Data mining is the phase where data scientists go in-depth into a dataset to draw specific insights. Using specialized data analysis tools, a data scientist transforms chaotic data into meaningful structures that suit the requirements of the business. Some of the techniques used for data mining are clustering, path analysis, and forecasting.
The way that different companies mine data depends on the needs of the industry. For example, in finance, data mining is used for detecting fraudulent or anomalous transactions. In the manufacturing industry, these techniques are used to take care of product safety and quality issues.
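As a small illustration of the finance example, the sketch below flags unusually large transactions using a modified z-score based on the median and the median absolute deviation. This is one simple, robust outlier test among many, not the method any particular company uses. The transaction amounts are invented, and the 3.5 cutoff is a commonly cited rule of thumb rather than a fixed standard.

```python
import statistics


def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold."""
    median = statistics.median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    return [a for a in amounts if 0.6745 * abs(a - median) / mad > threshold]


# Hypothetical card transactions in dollars; one is far outside the usual range.
transactions = [25.0, 30.5, 27.8, 31.2, 29.9, 26.4, 950.0, 28.7]
print(flag_anomalies(transactions))
```

Using the median instead of the mean matters here: a single huge transaction would inflate the mean and standard deviation enough to partly hide itself.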
Why is Data Cleaning Important?
A 2013 study by IBM found that poor-quality data costs the US economy up to $3.1 trillion every year. It is a fact of data science that not all data is useful for analysis. A large part of a data scientist's job is to clean and effectively sample big data into smaller, more relevant data, which is then mined for insight. Together with finding and organizing data, cleaning takes up about 80% of a data scientist's time, according to a 2018 report by Harvard Business Review, leaving only 20% for actual analysis.
Data in the real world is incomplete, noisy and inconsistent, and the quality of any analysis strongly depends on the quality of the inputs.
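Here is a minimal sketch of what cleaning incomplete, noisy, and inconsistent records can look like. The records and field names are hypothetical, and the rules (trim whitespace, normalize casing, treat unparseable ages as missing, drop duplicates) are just one reasonable set of choices.

```python
# Hypothetical raw records showing typical real-world problems.
raw_records = [
    {"name": " Alice ", "city": "new york", "age": "34"},
    {"name": "Bob", "city": "New York", "age": ""},       # missing age
    {"name": "alice", "city": "New York", "age": "34"},   # duplicate of the first
    {"name": "Carol", "city": "BOSTON", "age": "abc"},    # invalid age
]


def clean(records):
    """Normalize text fields, parse ages, and drop duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec["name"].strip().title()
        city = rec["city"].strip().title()
        age_text = rec["age"].strip()
        age = int(age_text) if age_text.isdigit() else None  # None marks missing
        key = (name, city, age)
        if key in seen:
            continue  # same person already recorded after normalization
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned


for record in clean(raw_records):
    print(record)
```

Note that the duplicate only becomes detectable after normalization, which is why cleaning steps are usually ordered with care.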
What is Cloud Computing?
Most companies have moved the bulk of their data and operations to the cloud. This is incredibly useful for enterprises that operate out of multiple locations and need access to the same data. Not only that, a shared remote server can be configured to be much more powerful for data science and machine learning operations than any local workstation.
Data as a Service (DaaS) is an emerging data sourcing paradigm where vendors supply companies with the kind of data they are looking for. This data is stored and distributed via a cloud infrastructure. To access and utilize it properly, data scientists need to be proficient at using cloud services.
Amazon Web Services, Google Cloud, and Microsoft Azure have a host of tools that cater to data scientists. These include modules for data analytics, data warehousing, business intelligence analysis, running Apache Hadoop and Spark clusters, and much more.
What Does Database Management Mean?
The volume of data that an average company deals with has increased manifold over the past decade. The average enterprise database is also constantly being updated as new data comes in and outdated records are deleted. Data scientists must be able to tap into this flow of data to keep their models up to date. This is why a data scientist needs to know database management techniques. A good management schema is imperative for making strategic decisions and maintaining a systematic workflow. It is also helpful for generating updated reports and querying data efficiently.
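The cycle described above (new records arrive, outdated ones are removed, and reports are regenerated) can be sketched with Python's built-in sqlite3 module. The table layout, the sample orders, and the cutoff date are all assumptions made for illustration. A real enterprise database would be far larger and would usually run on a dedicated server.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
)

# New data comes in (hypothetical orders).
conn.executemany(
    "INSERT INTO orders (customer, amount, order_date) VALUES (?, ?, ?)",
    [
        ("ann", 20.0, "2022-11-03"),
        ("ann", 35.0, "2023-02-14"),
        ("bob", 15.5, "2023-03-01"),
        ("bob", 4.5, "2023-03-09"),
    ],
)

# An index on the date column helps date-based queries stay efficient as data grows.
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

# Outdated records are deleted (the cutoff date is an arbitrary choice here).
conn.execute("DELETE FROM orders WHERE order_date < ?", ("2023-01-01",))

# An updated report: total spend per customer.
report = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(report)
conn.close()
```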
What is Machine Learning and Deep Learning?
Machine learning is not just a part of what data scientists do. It is typically one of the top factors that differentiate data scientists from data analysts. Machine learning is a complex subject that requires a lot of effort to master, but it is incredibly powerful for deriving real value out of big data.
Many people use the terms machine learning and deep learning interchangeably, but they are not the same thing: deep learning is a specialized subset of machine learning, which is why machine learning often serves as the umbrella term in colloquial usage. Broadly speaking, machine learning uses mathematics and statistics to learn patterns from data and derive conclusions from them. For example, a machine learning model can predict next year’s sales figures once it has been fitted to data from past years.
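To show what fitting a model to past sales can mean in the simplest case, here is a sketch of ordinary least-squares linear regression written from scratch. The yearly sales figures are invented, and a straight-line trend is an assumption: real forecasting models are usually richer and are validated on held-out data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yearly sales, in thousands of dollars.
years = [2019, 2020, 2021, 2022, 2023]
sales = [102, 108, 121, 129, 140]

slope, intercept = fit_line(years, sales)
forecast_2024 = slope * 2024 + intercept
print(f"estimated yearly growth: {slope:.1f}")
print(f"forecast for 2024: {forecast_2024:.1f}")
```

The "learning" here is just estimating two numbers, the slope and the intercept, from the data. More sophisticated models follow the same pattern with many more parameters.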
Deep learning, for its part, trains multi-layered models known as neural networks to analyze and learn from a given dataset. Neural networks are incredibly powerful at recognizing patterns within data and can be configured for a whole host of analytical tasks. This is what makes them so useful for data scientists.
If you have ever wondered how Google ranks your search results, how Amazon decides which products to recommend on your homepage, or how Match.com matches you to potential partners, the answer is machine learning, or ML, algorithms. They are everywhere on the web nowadays, and data scientists are responsible for building and maintaining them.
While there are plenty of web resources about the various ML algorithms, Toby Segaran’s Programming Collective Intelligence is still the best and most comprehensive resource I’ve found for thorough explanations of the most widely used ML techniques out there.
Why Are Data Visualization and Storytelling Important?
Data visualization, along with the story that the findings of an analysis tell, is another critical piece of a data scientist’s work. Whether it is a Seaborn graph, a dynamic d3.js visualization, or a well-designed infographic, pictures help convey meaning and insight instantaneously. They also form a necessary part of the story, which is the vehicle that delivers the value of every data science project. The rule of thumb here is to present your findings in a way that is accessible to laypeople and that helps your audience rally around your recommendations, so that they are implemented and drive real business value.
While there are many tutorials on the web to help you get up to speed with tools like Seaborn and d3.js, learning to weave results into a cohesive and compelling story is more of an art than a science, and good learning aids are scarce. However, one book that stands out from the pack is Wired for Story by Lisa Cron. This book is aimed primarily at writers and other practitioners of the written word, but I believe that data scientists could easily and effectively use the story structures and techniques that Cron describes to craft more effective presentations of their work and more readily reach their stakeholders.
Why Does Data Science Need Specialization?
As you may have gathered from the vast array of tasks that data scientists perform, it is virtually impossible for one person to handle all of them. In practice, each task described above is a specialized job profile, and for good reason. The skills required to be a data scientist are constantly evolving as automation takes root and the tools get better. Additionally, a diverse array of companies has started relying on data to drive business. For some, machine learning may take center stage. For others, cloud computing and database management might matter more than anything else.
In an interview with Harvard Business Review, Jonathan Nolis, a notable data scientist working for Fortune 500 companies, said that the ability of a data scientist to present their findings convincingly is more important than their knowledge of sophisticated deep learning models. This goes to show that data science isn’t just about the “science.” If you are unable to communicate the value of your insights, then the insights aren’t worth much. Communication is as important a skill in a data scientist’s toolkit as anything else.
What Are The Ethics of Data Science?
Among the most crucial considerations for data scientists are where they source their data from, what channels they use to collect it, and how interpretable their models are.
In the past, companies have used shady software like trackers and background processes to collect personal data from users, either to serve them intrusively targeted ads or for outright unethical purposes. A famous example of this is the Cambridge Analytica scandal. The political consulting firm harvested the personal data of millions of Facebook users without their consent and used it to target voters during the 2016 US presidential election.
Another problem that data science is faced with is that of inherent bias within datasets. An instance of that is the COMPAS recidivism risk score that has been “used across the country to predict future criminals” in the US and is “biased against Blacks,” according to ProPublica.
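One simple way to look for this kind of bias is to compare error rates across groups, for example the share of people who did not reoffend but were still flagged as high risk (the false positive rate). The sketch below does this on entirely made-up records. The group labels, the data, and the choice of metric are assumptions for illustration and do not reproduce any published analysis.

```python
from collections import defaultdict

# Entirely made-up records: (group, predicted_high_risk, reoffended).
records = [
    ("A", True, False), ("A", True, True), ("A", False, False), ("A", True, False),
    ("B", False, False), ("B", True, True), ("B", False, False), ("B", True, False),
]


def false_positive_rates(rows):
    """Per group: share of people who did not reoffend but were flagged high risk."""
    flagged = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted, actual in rows:
        if not actual:  # only people who did not reoffend count toward this rate
            negatives[group] += 1
            if predicted:
                flagged[group] += 1
    return {group: flagged[group] / negatives[group] for group in negatives}


rates = false_positive_rates(records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate = {rate:.2f}")
```

A large gap between groups on a metric like this is a signal to investigate the data and the model, although different fairness metrics can disagree with one another, so no single number settles the question.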
Incidents such as these have led to many discussions among data scientists about implementing standardized fairness policies for handling data. Data scientists need to build transparent models and explain the reasoning behind the predictions they make, as opposed to relying on black-box models like neural networks. There is also a push by legislators and tech firms like Apple to build robust privacy features into products to stop data collection without consent.
Data science is a vast and complex field of study with endless possibilities and challenges, and its importance in modern society can’t be emphasized enough. In the information age, data scientists and their tools are the bread and butter of modern businesses.
As DJ Patil, former Chief Data Scientist of the USA, put it, the dominant trait among data scientists is their passionate curiosity to unearth the best solution to a problem, ask relevant questions, and refine them into hypotheses to be tested until a valuable piece of insight is found. The world today and tomorrow is all about data, and data scientists are sorely needed as trusted advisors to the executive team of any company that wants to stay relevant in this ever-changing landscape. To find out more, be sure to check out our Intro to Data Science with Python course.