Drawing Maps
The new table stations contains geographical information about each bike station, including latitude, longitude, and a "landmark", which is the name of the city where the station is located.
stations = Table.read_table('station.csv')
stations.show(3)
We can draw a map of where the stations are located, using Marker.map_table. The function operates on a table whose columns are (in order) latitude, longitude, and an optional identifier for each point.
Marker.map_table(stations.select('lat', 'long', 'name'))
The map is created using OpenStreetMap, which is an open online mapping system that you can use just as you would use Google Maps or any other online map. Zoom in to San Francisco to see how the stations are distributed. Click on a marker to see which station it is.
You can also represent points on a map by colored circles. Here is such a map of the San Francisco bike stations.
sf = stations.where('landmark', are.equal_to('San Francisco'))
sf_map_data = sf.select('lat', 'long', 'name')
Circle.map_table(sf_map_data, color='green', radius=200)
More Informative Maps
The bike stations are located in five different cities in the Bay Area. As a simple starting example, let us distinguish the points by using a different color for each city. We will again use apply, this time to assign each station a color by looking up its city's color in a table.
cities = stations.group('landmark').relabeled('landmark', 'city')
cities
colors = cities.with_column('color', make_array('blue', 'red', 'green', 'orange', 'purple'))
colors
Now we can write a function to look up the color of a station by its city, using the colors table.
def find_color(city_name):
    return colors.where("city", are.equal_to(city_name)).column("color").item(0)
with_colors = stations.with_column("color", stations.apply(find_color, "landmark"))
with_colors.show(3)
Marker.map_table(with_colors.select('lat', 'long', 'name', 'color'))
Now the markers have five different colors for the five different cities.
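The lookup that find_color performs can also be pictured as a dictionary from city to color. The sketch below uses plain Python rather than Table methods. The city names, their alphabetical order, and the station rows are assumptions made for illustration, and find_color_dict is a hypothetical helper name.

```python
# Assumed city list; in the text it comes from stations.group('landmark').
cities = ['Mountain View', 'Palo Alto', 'Redwood City', 'San Francisco', 'San Jose']
palette = ['blue', 'red', 'green', 'orange', 'purple']

# Pair each city with a color, in order, like the 'city' and 'color' columns.
color_of = dict(zip(cities, palette))

def find_color_dict(city_name):
    """Return the color assigned to a city (dictionary version of find_color)."""
    return color_of[city_name]

# Made-up (station name, landmark) rows standing in for the stations table.
sample_stations = [
    ('Station A', 'San Francisco'),
    ('Station B', 'San Jose'),
    ('Station C', 'Palo Alto'),
]

station_colors = [find_color_dict(city) for _, city in sample_stations]
print(station_colors)
```

A dictionary lookup does the same job as filtering the colors table with where, but it avoids scanning the table once per station.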
Where do most of the rentals originate?
To see where most of the bike rentals originate, let's identify the start stations:
starts = trips.group('Start Station').sort('count', descending=True)
starts.show(3)
We can include this information in stations, again by using apply. We previously defined the function find_trip_count, which is reproduced below:
starts = trips.group("Start Station")
def find_trip_count(station_name):
    return starts.where("Start Station", are.equal_to(station_name)).column("count").item(0)
Some of the stations in the stations dataset are not present in the trips dataset. We must filter them out before applying find_trip_count to the remainder.
stations_with_trip_data = stations.where("name", are.contained_in(starts.column("Start Station")))
count_by_station = stations_with_trip_data.with_column(
    "Number of trips",
    stations_with_trip_data.apply(find_trip_count, "name"))
count_by_station
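The filtering step matters because find_trip_count asks for item(0) of the matching counts, and that lookup has nothing to return for a station with no rows in starts. The plain-Python sketch below mirrors the idea with made-up data: count trips per start station, then keep only the stations that have a count. The station names and trip list are hypothetical.

```python
from collections import Counter

# Made-up trip records: each entry is the start station of one trip.
trip_starts = ['Station A', 'Station B', 'Station A', 'Station A', 'Station B']

# Made-up station list; 'Station C' never appears as a start station.
station_names = ['Station A', 'Station B', 'Station C']

# Counterpart of trips.group('Start Station'): one count per start station.
trip_counts = Counter(trip_starts)

# Counterpart of the are.contained_in filter: drop stations with no trips.
stations_with_trips = [name for name in station_names if name in trip_counts]

# Every remaining station is guaranteed to have a count to look up.
count_by_name = {name: trip_counts[name] for name in stations_with_trips}
print(count_by_name)
```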
Now we extract just the data needed for drawing our map, adding a color and an area to each station. The area is 1000 times the number of rentals starting at each station; the constant 1000 was chosen so that the circles would appear at an appropriate scale on the map.
starts_map_data = count_by_station.select('lat', 'long', 'name').with_columns(
    'color', 'blue',
    'area', count_by_station.column('Number of trips') * 1000
)
starts_map_data.show(3)
Circle.map_table(starts_map_data)
That huge blob in San Francisco shows that the eastern section of the city is the unrivaled capital of bike rentals in the Bay Area.
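Because the rental count is mapped to a circle's area rather than its radius, a station with four times as many rentals is drawn only twice as wide. The sketch below works through that arithmetic. It assumes the area value behaves like an ordinary geometric area (area = pi times radius squared); the trip counts are made up and radius_for_count is a hypothetical helper.

```python
import math

AREA_PER_TRIP = 1000  # the scaling constant used for the 'area' column

def radius_for_count(trip_count, area_per_trip=AREA_PER_TRIP):
    """Radius of a circle whose area is area_per_trip * trip_count."""
    area = area_per_trip * trip_count
    return math.sqrt(area / math.pi)

small = radius_for_count(1000)  # hypothetical station with 1,000 rentals
large = radius_for_count(4000)  # hypothetical station with 4,000 rentals

# Four times the rentals gives four times the area but only twice the radius.
print(large / small)
```

Scaling by area keeps the amount of ink proportional to the number of rentals, which is usually the fairer visual comparison.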
Trip duration
Recall the first part of our hypothesis: "There is a negative association between urban density and duration of bike trips." The map, plus a bit of background knowledge about the Bay Area, gives us a rough picture of urban density. San Francisco to the north and San Jose to the south are large, relatively dense cities. In between, the South Bay and Peninsula are relatively less dense. If we can display average trip duration on this map, we can use this knowledge about density to check our hypothesis.
We will start by adding the average trip duration to our count_by_station table, just as we did above.
durations = trips.group("Start Station", np.mean)
def find_average_duration(station_name):
    return durations.where("Start Station", are.equal_to(station_name)).column("Duration mean").item(0)
with_duration = count_by_station.with_column(
    "Average trip duration",
    count_by_station.apply(find_average_duration, "name"))
with_duration.show(3)
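Grouping with np.mean replaces each station's trips by their average duration. A minimal plain-Python sketch of the same computation follows, using made-up durations and hypothetical station names rather than the trips table.

```python
from collections import defaultdict

# Made-up (start station, duration) pairs standing in for the trips table.
sample_trips = [
    ('Station A', 600),
    ('Station A', 900),
    ('Station B', 3000),
    ('Station B', 5000),
    ('Station B', 4000),
]

# Collect the durations of all trips that start at each station.
durations_by_station = defaultdict(list)
for station, duration in sample_trips:
    durations_by_station[station].append(duration)

# Average each station's durations, like the 'Duration mean' column.
average_duration = {
    station: sum(values) / len(values)
    for station, values in durations_by_station.items()
}
print(average_duration)
```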
Now we will create a column of colors. Bright red will show locations with long trips, and dark red or black will show locations with shorter trips.
Unfortunately, the map_table function requires us to specify colors in a particular format, and converting to that format involves some rather technical details about color encodings. We have written a function called duration_to_color to convert average trip duration numbers to map_table's color format. Don't worry about the implementation (the body of the function); the docstring describes what the function does. We simply apply the duration_to_color function to our "Average trip duration" column to produce colors.
def duration_to_color(average_duration):
    """Converts an average trip duration to a string describing a color.

    Longer durations will be closer to bright red, and shorter durations
    will be closer to black.

    Args:
        average_duration (float): The average trip duration for one
          station.

    Returns:
        (string): A string describing a color based on the given average
          trip duration. The string is in 6-digit hexadecimal format,
          which is a common way to describe colors."""
    max_duration_color = 255
    color_bits = 8
    rescaled_duration = min(max_duration_color, int(256 * average_duration / 5000))
    red_amount = 2**(2*color_bits) * rescaled_duration
    color = '#{:06X}'.format(red_amount)
    return color
duration_map_data = with_duration.select('lat', 'long', 'name').with_columns(
    'color', with_duration.apply(duration_to_color, 'Average trip duration'),
    'area', with_duration.column('Number of trips') * 4000,
)
duration_map_data.show(3)
Circle.map_table(duration_map_data, fill_opacity=1)
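To see what duration_to_color produces, it helps to trace a few values by hand. The function scales the duration so that 5000 or more maps to the maximum red level of 255, then shifts that level into the red position of a six-digit hex color (the leading two digits). The sketch below repeats the same arithmetic in condensed form so that it runs on its own; the sample durations are made up.

```python
def duration_to_color(average_duration):
    """Same arithmetic as duration_to_color in the text, condensed."""
    red_level = min(255, int(256 * average_duration / 5000))
    # Multiplying by 2**16 moves the level into the red byte of RRGGBB.
    return '#{:06X}'.format(2**16 * red_level)

sample_durations = [0, 1250, 2500, 5000, 10000]
sample_colors = [duration_to_color(d) for d in sample_durations]

for duration, color in zip(sample_durations, sample_colors):
    print(duration, color)
# 0 #000000
# 1250 #400000
# 2500 #800000
# 5000 #FF0000
# 10000 #FF0000
```

Any average of 5000 or more is drawn in the same bright red, so the map cannot distinguish among the very longest averages.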
Conclusions
It seems that the locations with long trip durations are mostly in Palo Alto and Redwood City, with one exception in San Jose. These are the least urban bike stations on the map. The data are therefore compatible with our hypothesis.
Until now, we have not proposed a causal mechanism for the association. Here are a few that are plausible:
- Palo Alto and Redwood City are close to long bike routes in the hills to the southwest. Perhaps people take long recreational biking trips through the hills.
- Perhaps Stanford students rent bicycles to get around campus for days at a time.
- Perhaps some people who live or work in the long suburban peninsula between San Francisco and San Jose commute for long distances by bicycle.
Question for thought: The trips dataset includes the date and time of day for the start and end of each trip. How might we use this information to test some of the proposals above?