Tables
Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:
- a sequence of named columns that each describe a single aspect of all entries in a data set, or
- a sequence of rows that each contain all information about a single entry in a data set.
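The two views above can be sketched in plain Python without the datascience module. The names and values below are made up for illustration; the sketch only shows that the column view and the row view hold the same information.

```python
# Column view: each label maps to the values of one attribute for all entries.
# The data are made up for illustration.
columns = {
    "Fruit": ["apple", "pear", "plum"],
    "Count": [3, 5, 2],
}

# Row view: each row holds all information about a single entry.
labels = list(columns)
rows = [dict(zip(labels, values)) for values in zip(*columns.values())]

# Converting the rows back to columns recovers the original structure.
rebuilt = {label: [row[label] for row in rows] for label in labels}

print(rows[0])
print(rebuilt == columns)
```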
In order to use tables, import all of the module called datascience, a module created for this text.
from datascience import *
We have already seen several of the basic functions and methods for working with tables. Here is a summary:
Name | Type | Purpose | Example |
---|---|---|---|
Table | function | Create an empty table | t = Table() |
Table.read_table | function | Create a table from a file | minard = Table.read_table("minard.csv") |
with_columns | method | Create a table with additional columns | t.with_columns("Nums", np.array(3)) |
column | method | Create an array containing data from one column of a table | t.column("Nums") |
sort | method | Create a copy of a table that's sorted based on one column | t.sort("Nums") |
group | method | Create a table containing the count distribution of one column | t.group("Nums") |
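As a rough illustration of what "sorted based on one column" and "count distribution" mean in the summary, here is a plain Python sketch on a single made-up list of values. It does not use the datascience module, and the variable names are hypothetical.

```python
from collections import Counter

# Made-up values standing in for one column of a table.
nums = [3, 1, 3, 2, 1, 3]

# A sorted copy; the original list is left unchanged.
sorted_nums = sorted(nums)

# The count distribution: each distinct value paired with how often it occurs.
counts = sorted(Counter(nums).items())

print(sorted_nums)
print(counts)
```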
Let us work again with the data from Minard's map of Napoleon's invasion of Russia.
minard = Table.read_table('minard.csv')
minard
We will use this small table to demonstrate some useful Table methods and some new ways of using the methods we've already seen. We will then use those same methods, and develop other methods, on much larger tables of data.
The Size of the Table
num_columns gives the number of columns in the table, and num_rows gives the number of rows. Both are written without parentheses.
minard.num_columns
minard.num_rows
Column Labels
labels can be used to list the labels of all the columns. With minard we don't gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.
minard.labels
Notice that there are no parentheses after labels. That's because labels isn't actually a method; rather it's something called a field. A field is anything that's accessed using dot syntax that isn't a method. A field is a value like a number, string, or array; it doesn't need to be called using parentheses.
We can change column labels using the relabeled method. This creates a new table with a different label for the 'City' column:
minard.relabeled('City', 'City Name')
However, calling this method does not change the original table.
minard
A common pattern is to assign the original name minard to the new table, so that all future uses of minard will refer to the relabeled table.
minard = minard.relabeled('City', 'City Name')
minard
Using this pattern in your code can lead to confusion, because the same name then refers to different tables at different points in the code. Some people prefer never to reassign existing names in this way. If you do reassign a name, it is best to do all of your reassigning in the same cell, before you go on to use the table for analysis.
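The behavior described above, where relabeling produces a new table and the name is then reassigned, can be mimicked with a plain dictionary of columns. The relabeled function and the values below are hypothetical and written only for this sketch; they are not the datascience implementation.

```python
def relabeled(table, old, new):
    """Return a new dict of columns with one label changed; the input is untouched."""
    return {(new if label == old else label): values
            for label, values in table.items()}


# Made-up values for illustration.
cities = {"City": ["A", "B"], "Survivors": [10, 8]}

renamed = relabeled(cities, "City", "City Name")
labels_before = list(cities)     # the original labels are unchanged
labels_renamed = list(renamed)

# Reassigning the original name makes later uses refer to the new table.
cities = relabeled(cities, "City", "City Name")
labels_after = list(cities)

print(labels_before)
print(labels_renamed)
print(labels_after)
```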
Accessing the Data in a Column
We can use a column's label to access the array of data in the column.
minard.column('Survivors')
The 5 columns are indexed 0, 1, 2, 3, and 4. The column Survivors can also be accessed by using its column index.
minard.column(4)
The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using item, as with any array.
minard.column(4).item(0)
minard.column(4).item(5)
Working with the Data in a Column
Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the number of survivors at each city as a proportion of the initial number, which is the first entry in the Survivors column.
initial = minard.column('Survivors').item(0)
minard = minard.with_columns(
    'Percent Surviving', minard.column('Survivors') / initial
)
minard
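The arithmetic in this step can be tried on a NumPy array by itself. The counts below are made up for illustration and are not the values in minard.csv; the last line uses ordinary Python string formatting to show the proportions as percents, which is an assumption of this sketch rather than what the table display does.

```python
import numpy as np

# Made-up counts for illustration; these are not the values in minard.csv.
survivors = np.array([200, 150, 100, 50])

initial = survivors.item(0)          # the first entry, as a plain Python int
proportions = survivors / initial    # elementwise division gives a new array

# Format each proportion as a percent string for display.
as_percents = [f"{p:.0%}" for p in proportions]

print(proportions)
print(as_percents)
```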
To make the proportions in the new column appear as percents, we can use the method set_format with the option PercentFormatter. The set_format method takes Formatter objects, which exist for dates (DateFormatter), currencies (CurrencyFormatter), numbers, and percentages.
minard.set_format('Percent Surviving', PercentFormatter)
Choosing Sets of Columns
The method select creates a new table that contains only the specified columns.
minard.select('Longitude', 'Latitude')
The same selection can be made using column indices instead of labels.
minard.select(0, 1)
The result of using select is a new table, even when you select just one column.
minard.select('Survivors')
Notice that the result is a table, unlike the result of column, which is an array.
minard.column('Survivors')
Another way to create a new table consisting of a set of columns is to drop the columns you don't want.
minard.drop('Longitude', 'Latitude', 'Direction')
Neither select nor drop changes the original table. Instead, they create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.
minard
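The idea that selecting or dropping columns yields a new, smaller table while leaving the original intact can be sketched with a dictionary of columns. The select and drop functions below are hypothetical helpers written for this illustration, with made-up values; they only mimic the behavior described above and are not the datascience implementation.

```python
def select(table, *labels):
    """Return a new table (a dict of columns) holding only the given labels."""
    return {label: table[label] for label in labels}


def drop(table, *labels):
    """Return a new table holding every column except the given labels."""
    return {label: values for label, values in table.items()
            if label not in labels}


# Made-up values for illustration.
table = {"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]}

small = select(table, "x", "y")
rest = drop(table, "x", "y")

print(list(small))                # labels of the selected columns
print(list(rest))                 # labels left after dropping
print(list(table))                # the original still has all three columns
print(small["x"] is table["x"])   # the column data is shared, not copied
```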
All of the methods that we have used above can be applied to any table.