Chapter 1

Introduction to Data Science

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. The aim of this textbook is to introduce you to this discipline.

There are three core aspects of effective data analysis: exploration, prediction, and inference.

Exploration involves identifying patterns in information.

Prediction involves using information we know to make informed guesses about values we wish we knew.

Inference involves quantifying our degree of certainty: will the patterns that we found also appear in new observations? How accurate are our predictions?

Though any real data analysis involves all three, this introductory text is mostly about exploration. Your brain is still better than any existing computer at identifying patterns in data, provided a computer has processed the data and displayed them in the right form. Therefore, to explore data, you must collaborate with your computer. Exploration is a two-way communication task:

  • You use a programming language to instruct a computer to process data. Instructions given to computers must be simple and precise, so computer programming involves mathematical ideas like logic and abstraction.
  • The computer formats the results for your brain to analyze. Most humans are good at interpreting visual information like shapes and colors. A list of 1,000 numbers is virtually useless, while a bar chart can convey the information in the list instantly. Visualization is the key means of computer-to-human communication about data.
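
As a small illustration of the second point, the sketch below contrasts a raw list of 1,000 numbers with a chart built from the same list. The data (simulated die rolls) and the helper names `count_values` and `text_bar_chart` are invented here for illustration. The chart is drawn with plain text characters only to keep the example self-contained; a real analysis would normally use a plotting library.

```python
import random
from collections import Counter

# Hypothetical data: 1,000 simulated rolls of a six-sided die (made up for illustration).
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(1000)]


def count_values(values):
    """Return a dict mapping each distinct value to how often it occurs, ordered by value."""
    counts = Counter(values)
    return {value: counts[value] for value in sorted(counts)}


def text_bar_chart(counts, width=40):
    """Render counts as horizontal bars of '#', scaled so the largest count spans `width` characters."""
    largest = max(counts.values())
    lines = []
    for value, count in counts.items():
        bar = "#" * round(count / largest * width)
        lines.append(f"{value} | {bar} {count}")
    return "\n".join(lines)


# The raw list is hard to take in, even when only the first 20 values are shown.
print(rolls[:20], "...")

# The chart summarizes all 1,000 values in six lines.
print(text_bar_chart(count_values(rolls)))
```

Reading the printed list tells you almost nothing about how often each face came up, while the bars let you compare the six counts at a glance.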

This text introduces you simultaneously to both means of communication. It interleaves a complete introduction to programming, which assumes no prior knowledge, with a small set of core data visualization techniques that will help you explore a vast range of real-world data. This text will also prepare you for a future course in data science, should you choose to take one.

Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, you will see techniques expressed in the same language in which they are described to the computers that execute them: a programming language.

