I’ve recently started a new course about Data Science on Pluralsight. The course consists of 23 modules which range from basics of data driven business to data science, AI and ML fundamentals, data visualizations, communications, and statistics foundations.
Today I’ve finished one of the modules, created by Matthew Renze and entitled “Data Science: The Big Picture”. Matthew explains in a very clear way the differences between Data Analytics, Internet of Things, Big Data and Machine Learning, and also analyzes how these four trends work together.
In this post, I share the takeaways and main ideas that have helped me better understand these four concepts and their potential for the future of humankind. Certainly, an exciting topic to dive into.
Data science is the practice of transforming raw data into actionable insights.
What is data science? It is an interdisciplinary field that results from the intersection of computer science, math and statistics, and domain expertise. Its main goal is to transform data into knowledge to help humans make better decisions. Data science is driven by economics and made possible by technology.
Why is data science important? In an automated world, the amount of data being collected is growing exponentially. In the coming years, there is going to be a huge divide in our society between the companies, governments, and individuals who have learned to use data to their advantage and those that haven’t. So you’d better learn how to swim in this sea of data if you want to survive and thrive in this new reality. Some people even predict that a new revolution, following the industrial one, is going to happen soon: the information revolution.
It may sound apocalyptic, but some studies forecast that if your current job does not involve working with data, it will likely disappear in the near future.
What tools are used to perform data science? In a survey conducted by O’Reilly in 2015, around 70% of the respondents mentioned SQL (Structured Query Language) as the most important tool for data science (including programming languages, data platforms and analytics tools), closely followed by Excel, Python, and R.
SQL is a programming language used for querying tables of data in a relational database. It is a very important tool in data science because data scientists need to spend a lot of time cleaning up data.
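As a small illustration of the kind of cleanup-style querying described above, here is a self-contained sketch using Python’s built-in sqlite3 module; the table and column names are made up for the example.

```python
import sqlite3

# In-memory database with a toy "measurements" table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", None)],
)

# A typical cleanup-style query: drop NULLs and aggregate per sensor.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM measurements "
    "WHERE value IS NOT NULL GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 2.0)]
```

Note how even this tiny query already mixes filtering out bad values with aggregation, which is where much of a data scientist’s time goes.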
How is data science performed?
- Have a question that needs an answer: a hypothesis to be tested, a decision, or a prediction to be made.
- Collect data.
- Prepare the data for analysis (also known as data wrangling).
- Create a model for the data. Models can be numerical, visual, statistical, or machine learning models. We use this model to build evidence for or against a hypothesis.
- Evaluate the model.
- Deploy the model.
This is an iterative process, as we need to repeat it for every question we would like to answer. We also use the feedback from our previous result to adapt and improve the process.
Data analytics applies data science practices to the business world.
It all started with data analysis, the predecessor to data analytics. Data analysis was performed by scientists and statisticians. Some decades ago, testing a hypothesis with data was not easy, as it took significant time, money, and effort. The software available was expensive and not very powerful. Communicating the results was not very effective either, as the quality and potential of visualization tools were very limited.
Data analytics is the new term coined for the updated version of data analysis. Data analytics has spread from the world of science to the world of business. Data is now less expensive, and its quality and quantity have increased. For example, running an A/B test on a website can be easily performed with a load balancer at a very low cost. Open-source programming languages like R and Python are relatively easy to work with and inexpensive. We also have better visualization tools like Tableau to represent data graphically and help others draw conclusions and make decisions based on it. Dashboards and infographics can also help us convey data to a wider audience.
What will the future hold? Data-driven decision making will help business decision-makers make better decisions based on data. Data will be democratized and available to more people; for example, anyone in a company will be able to run an experiment to test a hypothesis. We will even be able to ask questions to an Alexa-style tool and get answers based on the company’s data. According to MIT research, companies using data to drive their decision-making process will increase their productivity levels by 4% and their profits by 6%.
Internet of Things
The internet of things is about connecting devices and sensors to the Internet, which creates a great amount of data to be analyzed.
Not so long ago, the Internet was used by humans to communicate, collaborate, and exchange information. Costs were high and speeds were slow. The Internet was mainly used to consume information, as creating even a single web page required some technical knowledge, which most users lacked.
Nowadays, the Internet is not only designed to be used by people, but also by things. More and more devices can be connected to the cloud, which allows them to be monitored. The IoT is driven by economics: as long as the value derived from the collected data is higher than the cost of connecting a device to the Internet, there will be a business opportunity to connect it. Good examples of connected devices are fitness trackers or home devices like heating or air conditioning.
What will the future hold? We will move towards the internet of everything. Most of the devices will be connected to the Internet. In the future, an internet connection will be as common to devices as electricity. Everything existing in our world will be interconnected through the Internet.
Big data basically refers to datasets that exceed the capabilities of conventional computing architectures, as well as to the technologies used to extract information from these datasets.
The term “big data” is often used to refer to two different things:
- The rapid increase in the amount of data generated in our world (which doubles every two years!)
- Data sets whose volume, velocity, and variety go beyond the processing capabilities of conventional computers.
Big data is now analyzed by distributed computing technologies like Spark and Hadoop, which distribute data over multiple computers. When we query a data set, the query is split up into small pieces. Each computer performs the instructions included in one or more pieces.
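The split-and-merge idea behind these distributed queries can be sketched in miniature. The example below simulates a word count in a single process, with hypothetical “nodes” each counting their own chunk before the partial results are merged; a real Spark or Hadoop job does the same map-and-reduce across many machines.

```python
from collections import Counter
from functools import reduce

# Toy "distributed" count: the data set is split into chunks, each "node"
# counts its own chunk locally, and the partial results are merged.
documents = ["big data", "data science", "big big data"]
chunks = [documents[0:1], documents[1:2], documents[2:3]]  # one chunk per node

def count_words(chunk):
    # the work each node performs on its own slice of the data
    return Counter(word for doc in chunk for word in doc.split())

partials = [count_words(c) for c in chunks]    # "map" step, parallel in reality
totals = reduce(lambda a, b: a + b, partials)  # "reduce" step merges partials
print(totals)
```

The key property is that no single node ever needs to hold the whole data set, which is what makes the approach scale.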
Distributed computing also makes data lakes possible, which allow us to store less frequently used data sets in their native form, in a repository that contains a company’s raw data.
Data integration tools allow us to combine multiple data sets with different types of data and create a unified view.
There are also other types of tools that analyze structured, semi-structured and unstructured data. For example, text can be analyzed from audio recordings, and sentiment can be inferred from text.
What will the future hold? Most probably, as the processing capabilities of computers grow, big data will be known as simply “data”, as “big data” will be supported by default. We will stop programming the computer and start programming the cloud. In fact, data centers will “become” the computer.
Machine learning allows a software program to “learn” to solve problems without being explicitly programmed to do so.
The artificial intelligence we had in the past was not really “intelligent”, as the programs making decisions had to be programmed to operate successfully. The final decision was always made by a human.
Today, most AI research focuses on ML. Machine Learning is a sub-field of artificial intelligence based on statistics. With machine learning, existing data is used to learn a function that can make a prediction for new data. Some of the tasks that ML can currently perform are:
- Regression (prediction of a numeric outcome, like the price of a house given a number of variables)
- Clustering (grouping, like grouping customers into marketing segments)
- Anomaly detection
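Of the tasks listed above, anomaly detection is perhaps the easiest to sketch in a few lines. The example below uses an invented list of sensor readings and the simplest possible rule, flagging values more than two standard deviations from the mean; real anomaly detectors are far more sophisticated.

```python
import statistics

# Toy anomaly detection (illustrative data): flag readings far from the mean.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
mean = statistics.fmean(readings)
stdev = statistics.stdev(readings)

# A reading is "anomalous" if it lies more than 2 standard deviations out.
anomalies = [x for x in readings if abs(x - mean) > 2 * stdev]
print(anomalies)  # [25.0]
```

Regression and clustering follow the same pattern: learn a summary of the existing data, then use it to judge or group new points.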
Machine learning algorithms already outperform humans in many areas, such as playing games, recognizing handwritten characters, and predicting a person’s age.
What does the future hold? Deep learning is a new type of machine learning and the future of AI. Deep learning consists of multiple layers of machine learning models that form a hierarchy. For example, to teach a machine to recognize a face, the input layer is fed a set of photos of human faces, the first hidden layer learns to detect the basic geometry of the faces (e.g. horizontal and vertical lines), the second layer detects more complex facial features (eyes, noses, ears…), and the third layer detects the general pattern of entire faces. The last layer works with abstract representations of the person, like their name. Deep learning algorithms are improving every day and will likely outperform and replace humans at some tasks in the future.
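The “layers feeding layers” idea can be sketched without any ML library at all. In the toy forward pass below the weights are fixed, invented numbers; a real deep network would learn them from data, but the hierarchical structure is the same.

```python
import math

# Minimal sketch of stacked layers: each layer transforms its input and
# feeds the next. Weights are fixed and illustrative, not learned.
def layer(inputs, weights):
    # each output is a weighted sum of inputs squashed into (0, 1)
    return [
        1 / (1 + math.exp(-sum(w * x for w, x in zip(row, inputs))))
        for row in weights
    ]

pixels = [0.0, 1.0, 1.0, 0.0]                       # raw input (e.g. pixels)
h1 = layer(pixels, [[1, -1, 1, -1], [0, 1, 0, 1]])  # low-level features
h2 = layer(h1, [[2, -2], [1, 1]])                   # mid-level features
out = layer(h2, [[3, -3]])                          # high-level decision
print(out)
```

Each call to `layer` plays the role of one level in the hierarchy described above: simple patterns feed more abstract ones until a final decision emerges.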
How these Four Trends Work Together
- The devices connected to the IoT send large amounts of data to Big Data infrastructure.
- The data stored and processed by this infrastructure is used by the ML algorithms to identify patterns to make decisions and predict outcomes.
- Machine Learning applications send commands back to the devices connected to the IoT.
This leads to a fully autonomous intelligent system that eliminates human intervention and will disrupt many industries and aspects of our society. These systems are known as “smart systems”, “cloud robotics”, or “cyber-physical systems”.
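The closed loop described above can be sketched as a tiny simulation. All the names and numbers here are invented: a “device” pushes temperature readings into a stand-in data store, a trivial “model” (a running average) makes a decision, and a command goes back to the device.

```python
# Minimal sketch of the IoT -> Big Data -> ML -> IoT loop (all names invented).
readings = []  # stands in for the big-data store

def sensor_reading(value):
    # IoT step: a device pushes a new reading into the store
    readings.append(value)

def decide(threshold=22.0):
    # ML stand-in: decide from the accumulated data
    average = sum(readings) / len(readings)
    return "cooling_on" if average > threshold else "cooling_off"

for temp in [21.0, 23.5, 24.0]:  # a stream of temperature readings
    sensor_reading(temp)

command = decide()  # the command sent back to the device
print(command)     # cooling_on
```

Swap the running average for a trained model and the list for a distributed store, and this is the skeleton of the smart systems the course describes.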