Tables

Interact

Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:

  • a sequence of named columns that each describe a single aspect of all entries in a data set, or
  • a sequence of rows that each contain all information about a single entry in a data set.

In order to use tables, import all of the module called datascience, a module created for this text.

from datascience import *

We have already seen several of the basic functions and methods for working with tables. Here is a summary:

Name Type Purpose Example
Table function Create an empty table t = Table()
Table.read_table function Create a table from a file minard = Table.read_table("minard.csv")
with_columns method Create a table with additional columns t.with_columns("Nums", np.array(3))
column method Create an array containing data from one column of a table t.column("Nums")
sort method Create a copy of a table that's sorted based on one column t.sort("Nums")
group method Create a table containing the count distribution of one column t.group("Nums")

Let us work again with the data from Minard's map of Napoleon's invasion of Russia.

minard = Table.read_table('minard.csv')
minard
Longitude Latitude City Direction Survivors
32 54.8 Smolensk Advance 145000
33.2 54.9 Dorogobouge Advance 140000
34.4 55.5 Chjat Advance 127100
37.6 55.8 Moscou Advance 100000
34.3 55.2 Wixma Retreat 55000
32 54.6 Smolensk Retreat 24000
30.4 54.4 Orscha Retreat 20000
26.8 54.3 Moiodexno Retreat 12000

We will use this small table to demonstrate some useful Table methods and some new ways of using the methods we've already seen. We will then use those same methods, and develop other methods, on much larger tables of data.

The Size of the Table

The method num_columns gives the number of columns in the table, and num_rows the number of rows.

minard.num_columns
5
minard.num_rows
8

Column Labels

labels can be used to list the labels of all the columns. With minard we don't gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.

minard.labels
('Longitude', 'Latitude', 'City', 'Direction', 'Survivors')

Notice that there are no parentheses after labels. That's because labels isn't actually a method; rather it's something called a field. A field is anything that's accessed using dot syntax that isn't a method. A field is a value like a number, string, or array; it doesn't need to be called using parentheses.

We can change column labels using the relabeled method. This creates a new table with a different label for the 'City' column:

minard.relabeled('City', 'City Name')
Longitude Latitude City Name Direction Survivors
32 54.8 Smolensk Advance 145000
33.2 54.9 Dorogobouge Advance 140000
34.4 55.5 Chjat Advance 127100
37.6 55.8 Moscou Advance 100000
34.3 55.2 Wixma Retreat 55000
32 54.6 Smolensk Retreat 24000
30.4 54.4 Orscha Retreat 20000
26.8 54.3 Moiodexno Retreat 12000

However, calling this method does not change the original table.

minard
Longitude Latitude City Direction Survivors
32 54.8 Smolensk Advance 145000
33.2 54.9 Dorogobouge Advance 140000
34.4 55.5 Chjat Advance 127100
37.6 55.8 Moscou Advance 100000
34.3 55.2 Wixma Retreat 55000
32 54.6 Smolensk Retreat 24000
30.4 54.4 Orscha Retreat 20000
26.8 54.3 Moiodexno Retreat 12000

A common pattern is to assign the original name minard to the new table, so that all future uses of minard will refer to the relabeled table.

minard = minard.relabeled('City', 'City Name')
minard
Longitude Latitude City Name Direction Survivors
32 54.8 Smolensk Advance 145000
33.2 54.9 Dorogobouge Advance 140000
34.4 55.5 Chjat Advance 127100
37.6 55.8 Moscou Advance 100000
34.3 55.2 Wixma Retreat 55000
32 54.6 Smolensk Retreat 24000
30.4 54.4 Orscha Retreat 20000
26.8 54.3 Moiodexno Retreat 12000

Using this pattern in your code can lead to confusion. Some people prefer never to reassign existing names in this way. If you do, it is best to do all of your reassigning in the same cell, before you go on to use the table for analysis.

Accessing the Data in a Column

We can use a column's label to access the array of data in the column.

minard.column('Survivors')
array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])

The 5 columns are indexed 0, 1, 2, 3, and 4. The column Survivors can also be accessed by using its column index.

minard.column(4)
array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])

The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using item, as with any array.

minard.column(4).item(0)
145000
minard.column(4).item(5)
24000

Working with the Data in a Column

Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk.

initial = minard.column('Survivors').item(0)
minard = minard.with_columns(
    'Percent Surviving', minard.column('Survivors')/initial
)
minard
Longitude Latitude City Name Direction Survivors Percent Surviving
32 54.8 Smolensk Advance 145000 100.00%
33.2 54.9 Dorogobouge Advance 140000 96.55%
34.4 55.5 Chjat Advance 127100 87.66%
37.6 55.8 Moscou Advance 100000 68.97%
34.3 55.2 Wixma Retreat 55000 37.93%
32 54.6 Smolensk Retreat 24000 16.55%
30.4 54.4 Orscha Retreat 20000 13.79%
26.8 54.3 Moiodexno Retreat 12000 8.28%

To make the proportions in the new columns appear as percents, we can use the method set_format with the option PercentFormatter. The set_format method takes Formatter objects, which exist for dates (DateFormatter), currencies (CurrencyFormatter), numbers, and percentages.

minard.set_format('Percent Surviving', PercentFormatter)
Longitude Latitude City Name Direction Survivors Percent Surviving
32 54.8 Smolensk Advance 145000 100.00%
33.2 54.9 Dorogobouge Advance 140000 96.55%
34.4 55.5 Chjat Advance 127100 87.66%
37.6 55.8 Moscou Advance 100000 68.97%
34.3 55.2 Wixma Retreat 55000 37.93%
32 54.6 Smolensk Retreat 24000 16.55%
30.4 54.4 Orscha Retreat 20000 13.79%
26.8 54.3 Moiodexno Retreat 12000 8.28%

Choosing Sets of Columns

The method select creates a new table that contains only the specified columns.

minard.select('Longitude', 'Latitude')
Longitude Latitude
32 54.8
33.2 54.9
34.4 55.5
37.6 55.8
34.3 55.2
32 54.6
30.4 54.4
26.8 54.3

The same selection can be made using column indices instead of labels.

minard.select(0, 1)
Longitude Latitude
32 54.8
33.2 54.9
34.4 55.5
37.6 55.8
34.3 55.2
32 54.6
30.4 54.4
26.8 54.3

The result of using select is a new table, even when you select just one column.

minard.select('Survivors')
Survivors
145000
140000
127100
100000
55000
24000
20000
12000

Notice that the result is a table, unlike the result of column, which is an array.

minard.column('Survivors')
array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])

Another way to create a new table consisting of a set of columns is to drop the columns you don't want.

minard.drop('Longitude', 'Latitude', 'Direction')
City Name Survivors Percent Surviving
Smolensk 145000 100.00%
Dorogobouge 140000 96.55%
Chjat 127100 87.66%
Moscou 100000 68.97%
Wixma 55000 37.93%
Smolensk 24000 16.55%
Orscha 20000 13.79%
Moiodexno 12000 8.28%

Neither select nor drop change the original table. Instead, they create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

minard
Longitude Latitude City Name Direction Survivors Percent Surviving
32 54.8 Smolensk Advance 145000 100.00%
33.2 54.9 Dorogobouge Advance 140000 96.55%
34.4 55.5 Chjat Advance 127100 87.66%
37.6 55.8 Moscou Advance 100000 68.97%
34.3 55.2 Wixma Retreat 55000 37.93%
32 54.6 Smolensk Retreat 24000 16.55%
30.4 54.4 Orscha Retreat 20000 13.79%
26.8 54.3 Moiodexno Retreat 12000 8.28%

All of the methods that we have used above can be applied to any table.

results matching ""

    No results matching ""