Interlude: Tables
Before we journey deeper into distributions, we will need a new data type. Visualizations often show several pieces of information about the individuals in a data set. For example:
- We made a scatter plot of the fuel efficiency (miles per gallon) and price of car models. For each car model, we needed to somehow link its fuel efficiency and its price.
- We made a line plot of the proportion of new comic book characters who were female versus time (in years). We needed to link years to proportions of female characters.
Visualizing distributions often has similar requirements.
It is possible to maintain a separate array for each kind of data and link individuals by their position in these arrays. For example, we could have an array of fuel efficiencies and an array of prices, where the first fuel efficiency number and the first price refer to the same car. However, there are many advantages to organizing them in a single data set.
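To make the parallel-array approach concrete, here is a minimal sketch using plain Python lists. The numbers are made-up placeholders, not data from this text, and the variable names are our own.

```python
# Two parallel lists: position i in each list refers to the same car.
# All values here are hypothetical placeholders.
fuel_efficiency_mpg = [30, 22, 41]
price_dollars = [18000, 35000, 24000]

# Pair up the measurements for each car by position.
cars = list(zip(fuel_efficiency_mpg, price_dollars))
print(cars)

# The link is fragile: reordering one list without the other silently
# attaches each price to the wrong car.
sorted_mpg = sorted(fuel_efficiency_mpg)
mismatched = list(zip(sorted_mpg, price_dollars))
print(mismatched)
```

Nothing in the lists themselves records which entries belong together, which is one reason to keep related measurements in a single structure.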
Tables
Tables are a fundamental object type for representing data sets that include multiple pieces of information about individuals. A table can be viewed in two ways:
- a sequence of named columns that each describe a single aspect of all entries in a data set, or
- a sequence of rows that each contain all information about a single entry in a data set.
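The two views can be sketched with ordinary Python containers. This is only an illustration of the idea, using a plain dictionary rather than the table type introduced in this section; the small flower data set is the one used in the examples in this section.

```python
# Column view: each label maps to the values of one variable.
columns = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

# Row view: one tuple per individual, built by reading across the columns.
rows = list(zip(*columns.values()))
print(rows)

# Going back from rows to columns recovers the same data.
labels = list(columns)
rebuilt = {label: [row[i] for row in rows] for i, label in enumerate(labels)}
print(rebuilt == columns)
```

Both views hold exactly the same information; they differ only in whether we read down a column or across a row.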
Terminology: A variable is a formal name for an 'aspect' or 'piece of information' or 'column' in a dataset. Variables are also called features. The term variable emphasizes that the piece of data can have different values for different individuals - each car model has (potentially) a different fuel efficiency.
Variables that have numerical values, such as 'fuel efficiency' or 'price,' are called quantitative or numerical variables. Variables that have non-numerical values, such as 'model name' or 'gender,' are called qualitative or categorical variables.
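As a rough sketch, we can guess the kind of a variable from the type of its values. The helper variable_kind below is hypothetical and written only for this illustration. A type check is a heuristic at best: numeric codes such as ID numbers are stored as numbers but behave like categories.

```python
from numbers import Number

def variable_kind(values):
    """Hypothetical helper: label a column by the type of its values."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return 'quantitative'
    return 'categorical'

petal_kind = variable_kind([8, 34, 5])
name_kind = variable_kind(['lotus', 'sunflower', 'rose'])
print(petal_kind)
print(name_kind)
```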
Creating an empty table
In order to use tables, import the module called datascience, a module created for this text. You can simply write import datascience, but the alternate import statement below allows us to refer to things in the datascience module without writing "datascience." everywhere in our code.
from datascience import *
Empty tables can be created by calling the Table function with no arguments. An empty table is useful because it can be extended to contain new rows and columns.
Table()
Adding columns
Recall that the replace method of a string constructs a new string based on the existing string. Thus, the value of "foo".replace("o", "e") is the string "fee", but this call to replace doesn't modify "foo".
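A short check of this behavior, giving the string a name so that we can inspect it afterwards (the names original and changed are our own):

```python
original = "foo"

# replace builds and returns a new string.
changed = original.replace("o", "e")

print(changed)   # fee
print(original)  # foo
```

The name original still refers to the unchanged string after the call.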
Analogously, the with_columns method on a table constructs and returns a new table with additional labeled columns. Each column of a table is an array. To add one new column to a table, call with_columns with a label and an array. (The with_column method can be used with the same effect.)
Below, we begin each example with an empty table that has no columns.
Table().with_columns('Number of petals', make_array(8, 34, 5))
To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.
# It's nice, but optional, to line up the arguments to with_columns
# using extra spaces. This makes it easy to see what's in the
# columns when we read the code.
Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name',             make_array('lotus', 'sunflower', 'rose')
)
We can give this table a name, and then extend the table with another column.
flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name',             make_array('lotus', 'sunflower', 'rose')
)
flowers.with_columns(
    'Color', make_array('pink', 'yellow', 'red')
)
The with_columns method creates a new table each time it is called, so the original table is not affected. For example, the table flowers still has only the two columns that it had when it was created.
flowers
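The idea of returning a new table instead of changing the old one can be sketched with a plain dictionary of lists. The function with_column_sketch and the name flowers_dict are hypothetical and written only for this illustration; this is not how the datascience module is implemented.

```python
def with_column_sketch(table, label, values):
    """Hypothetical stand-in for the idea behind with_columns.

    `table` is a plain dict mapping labels to lists.
    """
    # Every new column must have as many entries as the existing ones.
    for existing in table.values():
        if len(existing) != len(values):
            raise ValueError('Column length mismatch')
    # Shallow-copy the dict so the one passed in gains no new label.
    new_table = dict(table)
    new_table[label] = list(values)
    return new_table

flowers_dict = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}
colored = with_column_sketch(flowers_dict, 'Color', ['pink', 'yellow', 'red'])

print(list(colored))       # three labels
print(list(flowers_dict))  # still two labels
```

Calling with_column_sketch with a list of the wrong length raises an error, mirroring the requirement that all columns have the same length.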
Loading data sets
Creating tables in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read them into a table, instead of typing them all in cell by cell.
Often, tables are created from files that contain comma-separated values. Such files are called CSV files.
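To see what such a file looks like, here is a minimal sketch that parses CSV text with the csv module from Python's standard library. The text in csv_text is a hypothetical stand-in for the contents of a file, and this is only an illustration of the format, not the way tables are read in this text.

```python
import csv
import io

# Hypothetical contents of a small CSV file:
# a header line, then one line per row.
csv_text = """Number of petals,Name
8,lotus
34,sunflower
5,rose
"""

# io.StringIO lets csv.reader treat the string like an open file.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)   # the first line holds the column labels
rows = list(reader)     # the remaining lines hold the data

print(header)
print(rows)
```

Note that the csv module leaves every value as a string, so the petal counts would still need to be converted to numbers before doing arithmetic.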
Below, we use the Table method read_table to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon's Russian campaign. The data are placed in a table named minard.
minard = Table.read_table('minard.csv')
minard
Accessing columns
The most basic method for accessing data in a table is column. It takes a single argument: the name of a column to retrieve. It returns that column in the form of an array.
Suppose we would like to understand how many of Napoleon's soldiers died between each location and the next. Since the data are in chronological order, we could use np.diff to find this out. We can first retrieve the Survivors column:
minard.column("Survivors")
We can give it a name if we'd like. The name can be anything we want:
minard_survivors = minard.column("Survivors")
Since minard_survivors is just an array, we can perform array operations on it.
import numpy as np  # np.diff comes from NumPy
np.diff(minard_survivors)
The number of soldiers who died is the negative of this:
-1 * np.diff(minard_survivors)
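To see what this computation does, here is a plain-Python sketch of consecutive differences. The survivor counts are hypothetical placeholders rather than the values in the minard table, and the names sample_survivors, changes, and losses are our own.

```python
# Hypothetical survivor counts at four successive locations.
sample_survivors = [1000, 900, 650, 400]

# Difference between each entry and the one before it, which is
# what np.diff computes for a one-dimensional array.
changes = [later - earlier
           for earlier, later in zip(sample_survivors, sample_survivors[1:])]
print(changes)

# Negating turns the (negative) changes into positive loss counts.
losses = [-change for change in changes]
print(losses)
```

The result has one fewer entry than the original list, because each difference involves a pair of neighboring values.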
The proportion of survivors
Suppose we want to understand how Napoleon's army was depleted relative to its initial size. This means we want the proportion of initial soldiers who survived to reach each location. We divide each survivor count by the initial survivor count.
initial_soldier_count = minard_survivors.item(0)
proportion_surviving = minard_survivors / initial_soldier_count
proportion_surviving
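Dividing an array by a single number divides every element by that number. A plain-Python sketch of the same arithmetic, using hypothetical counts rather than the minard data, spells this out one element at a time:

```python
# Hypothetical survivor counts; the first entry is the initial size.
counts = [1000, 900, 650, 400]
initial = counts[0]

# Divide each count by the initial count, one element at a time.
proportions = [count / initial for count in counts]
print(proportions)
```

The first proportion is always 1, since the initial count is divided by itself.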
Finally, we can use with_columns to make a copy of the minard table with this added as a new column:
minard.with_columns("Proportion surviving", proportion_surviving)
We will see many ways to work with data in tables in future sections.