Drawing Maps

The new table stations contains geographical information about each bike station, including latitude, longitude, and a "landmark" which is the name of the city where the station is located.

stations = Table.read_table('station.csv')
stations.show(3)
station_id | name                              | lat     | long     | dockcount | landmark | installation
2          | San Jose Diridon Caltrain Station | 37.3297 | -121.902 | 27        | San Jose | 8/6/2013
3          | San Jose Civic Center             | 37.3307 | -121.889 | 15        | San Jose | 8/5/2013
4          | Santa Clara at Almaden            | 37.334  | -121.895 | 11        | San Jose | 8/6/2013

... (67 rows omitted)

We can draw a map of where the stations are located, using Marker.map_table. The function operates on a table, whose columns are (in order) latitude, longitude, and an optional identifier for each point.

Marker.map_table(stations.select('lat', 'long', 'name'))

The map is created using OpenStreetMap, which is an open online mapping system that you can use just as you would use Google Maps or any other online map. Zoom in to San Francisco to see how the stations are distributed. Click on a marker to see which station it is.

You can also represent points on a map by colored circles. Here is such a map of the San Francisco bike stations.

sf = stations.where('landmark', are.equal_to('San Francisco'))
sf_map_data = sf.select('lat', 'long', 'name')
Circle.map_table(sf_map_data, color='green', radius=200)

More Informative Maps

The bike stations are located in five different cities in the Bay Area. As a simple starting example, let us distinguish the points by using a different color for each city. We will again use apply, this time to assign each station a color by looking up its city's color in a table.

cities = stations.group('landmark').relabeled('landmark', 'city')
cities
city          | count
Mountain View | 7
Palo Alto     | 5
Redwood City  | 7
San Francisco | 35
San Jose      | 16

colors = cities.with_column('color', make_array('blue', 'red', 'green', 'orange', 'purple'))
colors
city          | count | color
Mountain View | 7     | blue
Palo Alto     | 5     | red
Redwood City  | 7     | green
San Francisco | 35    | orange
San Jose      | 16    | purple

Now we can write a function to look up the color of a station by its city, using the colors table.

def find_color(city_name):
    return colors.where("city", are.equal_to(city_name)).column("color").item(0)

with_colors = stations.with_column("color", stations.apply(find_color, "landmark"))
with_colors.show(3)
station_id | name                              | lat     | long     | dockcount | landmark | installation | color
2          | San Jose Diridon Caltrain Station | 37.3297 | -121.902 | 27        | San Jose | 8/6/2013     | purple
3          | San Jose Civic Center             | 37.3307 | -121.889 | 15        | San Jose | 8/5/2013     | purple
4          | Santa Clara at Almaden            | 37.334  | -121.895 | 11        | San Jose | 8/6/2013     | purple

... (67 rows omitted)

Marker.map_table(with_colors.select('lat', 'long', 'name', 'color'))

Now the markers have five different colors for the five different cities.
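The lookup performed by find_color can also be pictured as a dictionary from city to color. The sketch below uses plain Python rather than the Table methods used in this section, so it is an illustration of the idea and not the approach taken above. The names color_by_city and lookup_color are introduced here for illustration only.

```python
# Cities and colors, in the same order as the colors table above.
cities = ['Mountain View', 'Palo Alto', 'Redwood City', 'San Francisco', 'San Jose']
palette = ['blue', 'red', 'green', 'orange', 'purple']

# color_by_city is an illustrative name, not part of the original code.
color_by_city = dict(zip(cities, palette))

def lookup_color(city_name):
    """Return the color assigned to the given city."""
    return color_by_city[city_name]

# The first three stations shown above are all in San Jose.
landmarks = ['San Jose', 'San Jose', 'San Jose']
station_colors = [lookup_color(city) for city in landmarks]
print(station_colors)
```

A dictionary finds each color directly, whereas find_color filters the whole colors table once per station. With only five cities either approach is quick.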

Where do most of the rentals originate?

To see where most of the bike rentals originate, let's identify the start stations:

starts = trips.group('Start Station').sort('count', descending=True)
starts.show(3)
Start Station                            | count
San Francisco Caltrain (Townsend at 4th) | 26304
San Francisco Caltrain 2 (330 Townsend)  | 21758
Harry Bridges Plaza (Ferry Building)     | 17255

... (67 rows omitted)

We can include this information in stations, again by using apply. We previously defined the function find_trip_count, which is reproduced below:

starts = trips.group("Start Station")

def find_trip_count(station_name):
    return starts.where("Start Station", are.equal_to(station_name)).column("count").item(0)

Some of the stations in the stations dataset are not present in the trips dataset. We must filter them out before applying find_trip_count to the remainder.

stations_with_trip_data = stations.where("name", are.contained_in(starts.column("Start Station")))
count_by_station = stations_with_trip_data.with_column(
    "Number of trips",
    stations_with_trip_data.apply(find_trip_count, "name"))
count_by_station
station_id | name                              | lat     | long     | dockcount | landmark | installation | Number of trips
2          | San Jose Diridon Caltrain Station | 37.3297 | -121.902 | 27        | San Jose | 8/6/2013     | 4968
3          | San Jose Civic Center             | 37.3307 | -121.889 | 15        | San Jose | 8/5/2013     | 774
4          | Santa Clara at Almaden            | 37.334  | -121.895 | 11        | San Jose | 8/6/2013     | 1958
5          | Adobe on Almaden                  | 37.3314 | -121.893 | 19        | San Jose | 8/5/2013     | 562
6          | San Pedro Square                  | 37.3367 | -121.894 | 15        | San Jose | 8/7/2013     | 1418
7          | Paseo de San Antonio              | 37.3338 | -121.887 | 15        | San Jose | 8/7/2013     | 856
8          | San Salvador at 1st               | 37.3302 | -121.886 | 15        | San Jose | 8/5/2013     | 495
9          | Japantown                         | 37.3487 | -121.895 | 15        | San Jose | 8/5/2013     | 885
10         | San Jose City Hall                | 37.3374 | -121.887 | 15        | San Jose | 8/6/2013     | 832
11         | MLK Library                       | 37.3359 | -121.886 | 19        | San Jose | 8/6/2013     | 1099

... (58 rows omitted)
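The filtering step matters because find_trip_count calls item(0) on the matching counts, which would raise an error when a station name has no matching row in starts. The plain-Python sketch below mimics that situation with a dictionary. The two counts are taken from the table above, while 'Station With No Trips' is a made-up name used only for illustration.

```python
# Trip counts keyed by start station (two values taken from the table above).
trip_counts = {
    'San Jose Diridon Caltrain Station': 4968,
    'San Jose Civic Center': 774,
}

# Station names; the last one is hypothetical and has no recorded trips.
station_names = [
    'San Jose Diridon Caltrain Station',
    'San Jose Civic Center',
    'Station With No Trips',
]

# Keep only stations that appear in the trip data, then look up their counts.
names_with_trip_data = [name for name in station_names if name in trip_counts]
counts = [trip_counts[name] for name in names_with_trip_data]

print(names_with_trip_data)
print(counts)
```

Looking up 'Station With No Trips' directly in trip_counts would fail with a KeyError, which plays the same role here as the failed item(0) call.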

Now we extract just the data needed for drawing our map, adding a color and an area to each station. The area is 1000 times the number of rentals starting at each station; the constant 1000 was chosen so that the circles would appear at an appropriate scale on the map.

starts_map_data = count_by_station.select('lat', 'long', 'name').with_columns(
    'color', 'blue',
    'area', count_by_station.column('Number of trips') * 1000
)
starts_map_data.show(3)
Circle.map_table(starts_map_data)
lat     | long     | name                              | color | area
37.3297 | -121.902 | San Jose Diridon Caltrain Station | blue  | 4968000
37.3307 | -121.889 | San Jose Civic Center             | blue  | 774000
37.334  | -121.895 | Santa Clara at Almaden            | blue  | 1958000

... (65 rows omitted)

That huge blob in San Francisco shows that the eastern section of the city is the unrivaled capital of bike rentals in the Bay Area.
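As a quick check on the area arithmetic, the sketch below recomputes the area column for the three stations shown, using only plain Python. The names AREA_PER_TRIP and radius_ratio are introduced here for illustration. It also shows a geometric consequence of scaling area by count: because a circle's area grows with the square of its radius, a station with four times as many trips is drawn only twice as wide.

```python
import math

AREA_PER_TRIP = 1000  # illustrative name for the scaling constant in the text

# Trip counts for the three stations shown in the table above.
trip_counts = {
    'San Jose Diridon Caltrain Station': 4968,
    'San Jose Civic Center': 774,
    'Santa Clara at Almaden': 1958,
}

# Area of each circle: number of trips times the scaling constant.
areas = {name: count * AREA_PER_TRIP for name, count in trip_counts.items()}
for name, area in areas.items():
    print(name, area)

def radius_ratio(count_a, count_b):
    """Ratio of circle radii when areas are proportional to trip counts."""
    return math.sqrt(count_a / count_b)

# Four times the trips gives twice the radius.
print(radius_ratio(4 * 774, 774))
```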

Trip duration

Recall the first part of our hypothesis: "There is a negative association between urban density and duration of bike trips." The map, plus a bit of background knowledge about the Bay Area, gives us a rough picture of urban density. San Francisco to the north and San Jose to the south are large, relatively dense cities. In between, the South Bay and Peninsula are relatively less dense. If we can display average trip duration on this map, we can use this knowledge about density to check our hypothesis.

We will start by adding the average trip duration to our count_by_station table, just as we did above.

durations = trips.group("Start Station", np.mean)

def find_average_duration(station_name):
    return durations.where("Start Station", are.equal_to(station_name)).column("Duration mean").item(0)

with_duration = count_by_station.with_column(
    "Average trip duration",
    count_by_station.apply(find_average_duration, "name"))
with_duration.show(3)
station_id | name                              | lat     | long     | dockcount | landmark | installation | Number of trips | Average trip duration
2          | San Jose Diridon Caltrain Station | 37.3297 | -121.902 | 27        | San Jose | 8/6/2013     | 4968            | 884.375
3          | San Jose Civic Center             | 37.3307 | -121.889 | 15        | San Jose | 8/5/2013     | 774             | 5458.04
4          | Santa Clara at Almaden            | 37.334  | -121.895 | 11        | San Jose | 8/6/2013     | 1958            | 850.924

... (65 rows omitted)
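The call trips.group("Start Station", np.mean) gathers the trips for each start station and averages them, which is where the "Duration mean" column comes from. The plain-Python sketch below illustrates the same idea. It uses made-up (station, duration) pairs rather than real rows from trips, and the names sample_trips and mean_duration are hypothetical.

```python
from collections import defaultdict

# Hypothetical (start station, duration) pairs; not real rows from trips.
sample_trips = [
    ('Station A', 600),
    ('Station A', 900),
    ('Station B', 300),
    ('Station B', 500),
    ('Station B', 1000),
]

# Collect the durations of all trips that start at each station.
durations_by_station = defaultdict(list)
for station, duration in sample_trips:
    durations_by_station[station].append(duration)

# Average the collected durations, one value per station.
mean_duration = {
    station: sum(values) / len(values)
    for station, values in durations_by_station.items()
}
print(mean_duration)
```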

Now we will create a column of colors. Bright red will show locations with long trips, and dark red or black will show locations with shorter trips.

Unfortunately, the map_table function requires us to specify colors in a particular format, and converting to that format involves some rather technical details about color encodings. We have written a function called duration_to_color to convert average trip duration numbers to map_table's color format. Don't worry about the implementation (the body of the function); the docstring describes what the function does. We simply apply the duration_to_color function to our "Average trip duration" column to produce colors.

def duration_to_color(average_duration):
    """Converts an average trip duration to a string describing a color.
    
    Longer durations will be closer to bright red, and shorter durations
    will be closer to black.
    
    Args:
      average_duration (float): The average trip duration for one
        station.
    
    Returns:
      (string): A string describing a color based on the given average
        trip duration.  The string is in 6-digit hexadecimal format,
        which is a common way to describe colors."""
    max_duration_color = 255
    color_bits = 8
    rescaled_duration = min(max_duration_color, int(256 * average_duration / 5000))
    red_amount = 2**(2*color_bits) * rescaled_duration
    color = '#{:06X}'.format(red_amount)
    return color

duration_map_data = with_duration.select('lat', 'long', 'name').with_columns(
    'color', with_duration.apply(duration_to_color, 'Average trip duration'),
    'area', with_duration.column('Number of trips') * 4000,
)
duration_map_data.show(3)
Circle.map_table(duration_map_data, fill_opacity=1)
lat     | long     | name                              | color   | area
37.3297 | -121.902 | San Jose Diridon Caltrain Station | #2D0000 | 19872000
37.3307 | -121.889 | San Jose Civic Center             | #FF0000 | 3096000
37.334  | -121.895 | Santa Clara at Almaden            | #2B0000 | 7832000

... (65 rows omitted)
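To see how the colors in the table arise, the sketch below repeats the arithmetic of duration_to_color for the three average durations shown earlier. It restates the function above with the intermediate step named; red_level is an illustrative name, not part of the original code. Durations are scaled so that 5000 corresponds to 256, capped at 255, and placed in the red byte of a #RRGGBB string.

```python
def red_level(average_duration):
    """Scale a duration to an integer red intensity between 0 and 255."""
    return min(255, int(256 * average_duration / 5000))

def duration_to_color(average_duration):
    """Return a '#RRGGBB' string whose red byte encodes the duration."""
    # Shifting left by 16 bits moves the value into the red byte,
    # which is the same as multiplying by 2**16.
    return '#{:06X}'.format(red_level(average_duration) << 16)

# Average durations for the three stations shown in the table above.
for duration in [884.375, 5458.04, 850.924]:
    print(duration, red_level(duration), duration_to_color(duration))
```

Any average duration of about 5000 or more is capped at the brightest red, which is why San Jose Civic Center (5458.04) appears as #FF0000.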

Conclusions

It seems that the locations with long trip durations are mostly in Palo Alto and Redwood City, with one exception in San Jose. These are the least urban bike stations on the map. The data are therefore compatible with our hypothesis.

Until now, we have not proposed a causal mechanism for the association. Here are a few that are plausible:

  • Palo Alto and Redwood City are close to long bike routes in the hills to the southwest. Perhaps people take long recreational biking trips through the hills.
  • Perhaps Stanford students rent bicycles to get around campus for days at a time.
  • Perhaps some people who live or work in the long suburban peninsula between San Francisco and San Jose commute for long distances by bicycle.

Question for thought: The trips dataset includes the date and time of day for the start and end of each trip. How might we use this information to test some of the proposals above?
