Interlude: Tables

Before we journey deeper into distributions, we will need a new data type. Visualizations often show several pieces of information about the individuals in a data set. For example:

We made a scatter plot of the fuel efficiency (miles per gallon) and price of car models. For each car model, we needed to somehow link its fuel efficiency and its price.
We made a line plot of the proportion of new comic book characters who were female versus time (in years). We needed to link years to proportions of female characters.

Visualizing distributions often has similar requirements.

It is possible to maintain a separate array for each kind of data and link individuals by their position in these arrays. For example, we could have an array of fuel efficiencies and an array of prices, where the first fuel efficiency number and the first price refer to the same car. However, there are many advantages to organizing them in a single data set.
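To make the position-based linking concrete, here is a minimal sketch that uses plain Python lists in place of arrays. The car values are made up for illustration and do not come from any data set in this text.

```python
# Hypothetical values, for illustration only: one entry per car model.
fuel_efficiency = [30, 22, 41]    # miles per gallon
price = [21000, 35000, 27000]     # dollars

# Position 0 in both lists describes the same car.
first_car = (fuel_efficiency[0], price[0])

# The link is only a convention. Sorting one list by itself silently
# breaks it, because the other list keeps its old order.
sorted_efficiency = sorted(fuel_efficiency)
mismatched_first_car = (sorted_efficiency[0], price[0])

print(first_car)
print(mismatched_first_car)
```

The second pair combines the fuel efficiency of one car with the price of another, which is one kind of mistake that a single data set helps to prevent.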

Tables¶

Tables are a fundamental object type for representing data sets that include multiple pieces of information about individuals. A table can be viewed in two ways:

a sequence of named columns that each describe a single aspect of all entries in a data set, or
a sequence of rows that each contain all information about a single entry in a data set.
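The two views can be sketched with ordinary Python structures. This is only an illustration of the idea, not how the Table type is implemented; it borrows the small flower data set that appears later in this section.

```python
# Column view: each label maps to the values of one aspect of every entry.
columns = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

# Row view: zip pairs up the values found at the same position,
# so each tuple holds all the information about one entry.
rows = list(zip(columns['Number of petals'], columns['Name']))

print(rows)
```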

Terminology: A variable is a formal name for an 'aspect' or 'piece of information' or 'column' in a data set. Variables are also called features. The term variable emphasizes that the piece of data can have different values for different individuals - each car model has (potentially) a different fuel efficiency.

Variables that have numerical values, such as 'fuel efficiency' or 'price,' are called quantitative or numerical variables. Variables that have non-numerical values, such as 'model name' or 'gender,' are called qualitative or categorical variables.

Creating an empty table¶

In order to use tables, import the module called datascience, a module created for this text. You can simply write import datascience, but the alternate import statement below allows us to refer to things in the datascience module without writing "datascience." everywhere in our code.

from datascience import *

Empty tables can be created by calling the Table function with no arguments. An empty table is useful because it can be extended to contain new rows and columns.

Table()

Adding columns¶

Recall that the replace method of a string constructs a new string based on the existing string. Thus, the value of "foo".replace("o", "e") is the string "fee", but this call to replace doesn't modify "foo".
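A short check of this behavior, using the same string; the variable names are chosen only for this illustration:

```python
original = "foo"
changed = original.replace("o", "e")

print(changed)   # the new string built by replace
print(original)  # the existing string, left as it was
```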

Analogously, the with_columns method on a table constructs and returns a new table with additional labeled columns. Each column of a table is an array. To add one new column to a table, call with_columns with a label and an array. (The with_column method can be used with the same effect.)

Below, we begin each example with an empty table that has no columns.

Table().with_columns('Number of petals', make_array(8, 34, 5))

Number of petals
8
34
5

To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.

# It's nice, but optional, to line up the arguments to with_columns
# using extra spaces.  This makes it easy to see what's in the
# columns when we read the code. 
Table().with_columns(
    'Number of petals', make_array(8,       34,          5),
    'Name',             make_array('lotus', 'sunflower', 'rose')
)

Number of petals	Name
8	lotus
34	sunflower
5	rose

We can give this table a name, and then extend the table with another column.

flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)

flowers.with_columns(
    'Color', make_array('pink', 'yellow', 'red')
)

Number of petals	Name	Color
8	lotus	pink
34	sunflower	yellow
5	rose	red

The with_columns method creates a new table each time it is called, so the original table is not affected. For example, the table flowers still has only the two columns that it had when it was created.

flowers

Number of petals	Name
8	lotus
34	sunflower
5	rose

Loading data sets¶

Creating tables in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read them into a table, instead of typing them all in cell by cell.

Often, tables are created from files that contain comma-separated values. Such files are called CSV files.

Below, we use the Table method read_table to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon's Russian campaign. The data are placed in a table named minard.

minard = Table.read_table('minard.csv')
minard

Longitude	Latitude	City	Direction	Survivors
32	54.8	Smolensk	Advance	145000
33.2	54.9	Dorogobouge	Advance	140000
34.4	55.5	Chjat	Advance	127100
37.6	55.8	Moscou	Advance	100000
34.3	55.2	Wixma	Retreat	55000
32	54.6	Smolensk	Retreat	24000
30.4	54.4	Orscha	Retreat	20000
26.8	54.3	Moiodexno	Retreat	12000
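To show what "comma-separated values" means, the sketch below parses a few lines of CSV text with Python's standard csv module. The exact contents of minard.csv are not shown in this text, so the CSV text here is an assumption reconstructed from the first three rows of the table above. This is not how read_table works internally; it only illustrates the file format.

```python
import csv
import io

# Assumed file contents: a header line, then one line per row.
csv_text = """Longitude,Latitude,City,Direction,Survivors
32,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
"""

reader = csv.reader(io.StringIO(csv_text))
labels = next(reader)   # first line: the column labels
rows = list(reader)     # remaining lines: one list of strings per row

# Regroup the rows into labeled columns. The values stay as strings
# here; converting them to numbers would be a separate step.
columns = {label: [row[i] for row in rows] for i, label in enumerate(labels)}

print(labels)
print(columns['City'])
```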

Accessing columns¶

The most basic method for accessing data in a table is column. It takes a single argument: the name of a column to retrieve. It returns that column in the form of an array.

Suppose we would like to understand how many of Napoleon's soldiers died between each pair of consecutive locations. Since the data are in chronological order, we could use np.diff to find this out (np refers to the NumPy module, assuming it has been imported with import numpy as np). We can first retrieve the Survivors column:

minard.column("Survivors")

array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])

We can give it a name if we'd like. The name can be anything we want:

minard_survivors = minard.column("Survivors")

Since minard_survivors is just an array, we can perform array operations on it.

np.diff(minard_survivors)

array([ -5000, -12900, -27100, -45000, -31000,  -4000,  -8000])

The number of soldiers who died is the negative of this:

-1 * np.diff(minard_survivors)

array([ 5000, 12900, 27100, 45000, 31000,  4000,  8000])
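For readers curious about what np.diff computes, here is a plain-Python sketch of the same arithmetic: each entry is a value minus the value before it. It uses a list in place of the array and is meant only as an illustration, not as a replacement for np.diff.

```python
survivors = [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000]

# Pair each value with the one after it and subtract: later minus earlier.
differences = [later - earlier
               for earlier, later in zip(survivors, survivors[1:])]

# Negating each difference gives the number of soldiers lost on each leg.
deaths = [-d for d in differences]

print(differences)
print(deaths)
```

The result has one fewer entry than the original list, because eight locations have only seven gaps between them.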

The proportion of survivors¶

Suppose we want to understand how Napoleon's army was depleted relative to its initial size. This means we want the proportion of initial soldiers who survived to reach each location. We divide each survivor count by the initial survivor count.

initial_soldier_count = minard_survivors.item(0)
proportion_surviving = minard_survivors / initial_soldier_count
proportion_surviving

array([ 1.        ,  0.96551724,  0.87655172,  0.68965517,  0.37931034,
        0.16551724,  0.13793103,  0.08275862])
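Dividing an array by a single number divides every element by that number. The plain-Python sketch below spells out the same element-by-element arithmetic with a list; the names are chosen for this illustration only.

```python
survivors = [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000]

# The first entry is the size of the army at the first recorded location.
initial = survivors[0]

# Divide every count by the initial count, one element at a time.
proportions = [count / initial for count in survivors]

print([round(p, 8) for p in proportions])
```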

Finally, we can use with_columns to make a copy of the minard table with this added as a new column:

minard.with_columns("Proportion surviving", proportion_surviving)

Longitude	Latitude	City	Direction	Survivors	Proportion surviving
32	54.8	Smolensk	Advance	145000	1
33.2	54.9	Dorogobouge	Advance	140000	0.965517
34.4	55.5	Chjat	Advance	127100	0.876552
37.6	55.8	Moscou	Advance	100000	0.689655
34.3	55.2	Wixma	Retreat	55000	0.37931
32	54.6	Smolensk	Retreat	24000	0.165517
30.4	54.4	Orscha	Retreat	20000	0.137931
26.8	54.3	Moiodexno	Retreat	12000	0.0827586

We will see many ways to work with data in tables in future sections.
