How many people live in this area?

Not long ago I was tasked with finding out how many people live within an arbitrary polygon. In this particular case, the polygon represented the portion of the United States within a drive-time of 10 hours. For this example, the polygon(s) can be anything you wish. This post will act as a tutorial of sorts on how to answer questions like these using python. Sorry to my Deutsch Freunden on this site, but this will be a U.S based answer as using the Census API is a key part of it.

This is a classic case of the modifiable area unit problem. Our arbitrary polygon will likely not be an exact geometric union of census tracts. If for some reason, your area of choice is a collection of census tracts, then do not do this method. You should literally just identify the tracts themselves. For the census tracts our polygon partially intersects, we need to estimate a number of people based on some other ratio. Without using any other data, we will pro-rate the population numbers in each tract by the percent of the tract area our polygon intersects. This a crude, but common, way to approach this problem as the assumption we are making is that census tracts contain equally spatially distributed populations. This is often the opposite of reality. Therefore the smaller or more populous a polygon, the higher the error via this estimation. In the census tract I reside in, all but a few houses are in one little corner of it.

Let’s get started. These are the python libraries we will be using:

pandas – DataFrame structure
fiona – Reading and Writing geographic data formats
shapely – Manipulating geometry
census – a wrapper for the census api

All of these can be install via pip, but if you are on a windows system you might have some trouble with fiona and shapely. I will not go into detail on the installation here, but leave that as an exercise for the avid googler.

Step 1: Obtaining the total population for each census tract.

To start off lets import the libraries and initiate our census object with an api key and the year of data we want. It is really important to remember that the corresponding geographic data we use later needs to be of the same year, or our unique identifier(FIPS) and possible even the geographic extent may not represent the same area the population values are for. They rarely change, but they do change.

from census import Census
from us import states

cen = Census(my_api_key,year=2012) #most recent american community survey is 2012

The census api has a limit on how many records it will return with one call. So we cannot ask for every tract in the country at once. To get around this we ask for each state one at a time, and mash all the results into one long list.

data = []
for s in states.STATES:
    data.extend(cen.acs.get(["B01003_001E"],{'for':'tract:*','in':'state:'+s.fips+'+county:*'}))

data[0:5]

[{u'B01003_001E': u'1812',
  u'county': u'001',
  u'state': u'01',
  u'tract': u'020100'},
 {u'B01003_001E': u'2218',
  u'county': u'001',
  u'state': u'01',
  u'tract': u'020200'},
 {u'B01003_001E': u'3155',
  u'county': u'001',
  u'state': u'01',
  u'tract': u'020300'},
 {u'B01003_001E': u'4337',
  u'county': u'001',
  u'state': u'01',
  u'tract': u'020400'},
 {u'B01003_001E': u'10498',
  u'county': u'001',
  u'state': u'01',
  u'tract': u'020500'}]

There are many variables one can query for, MANY! Here we ask for ‘B01003_001E’ which are estimates for total population. Depending on my question, I could just as easily ask for variables for different sex, race, age, people who bike to work, income etc.

So we have this giant list of dictionaries, each dictionary representing the key:value pairs for one census tract. To make this easier to digest and understand we turn to pandas. If you haven’t used pandas before, commit the next week of your life to learn it, it is that useful.

import pandas as pd
population_tracts = pd.DataFrame(data)
population_tracts.head()

	B01003_001E	county	state	tract
0	1812	001	01	020100
1	2218	001	01	020200
2	3155	001	01	020300
3	4337	001	01	020400
4	10498	001	01	020500

Now we have the more familiar table-like structure we are used to seeing when working with data. Pandas brings the power R has in dataframes, improves it (my opinion) and brings it to python. At this point, we have the total population for each census tract in the entire U.S.

Step 2: Reading in the geometry.

Next we need to open up a shapefile that contains all the above census tracts for the same year, as well as our polygons. What I do here is create a dictionary where the key is the tract GEOID and the value is the tract geometry. There is nothing else we need to bring into memory from the tract shapefile. The GEOID is a string concatenation of the StateFIPS,CountyFIPS,and Tract FIPS. It is also necessary to do this concatenation in the pandas dataframe so that the values will match the shapefile’s.

import fiona
from shapely.geometry import shape

#create GEOID in the dataframe
population_tracts["GEOID"] = population_tracts.state+population_tracts.county +population_tracts.tract

#read the tract geometry
tracts = {t['properties']['GEOID']:shape(t['geometry']) for t in fiona.collection('tracts.shp')}

#read the polygon geometry
polygons= {t['properties']['id']:shape(t['geometry']) for t in fiona.collection('polygons.shp')}

That can take awhile, but now we have all the geometry in memory.

Step 3: The fun part

Next we need to get a list of tracts GEOID’s that intersect each polygon. The following will create a dictionary of polygon object id’s as the key and a list tract GEOID’s as the value for each tract a polygon intersects.

polygon_tracts={objectid:[geoid for geoid,tract_geom in tracts.iteritems() if polygon_geom.intersects(tract_geom)] for objectid,tract_geom in polygons.iteritems()}

The final step is to take this dictionary and use it to add up percent of area from each tract we need times its population.

#this line really does it all, break it up if needed
poly_pop= pd.DataFrame([(i,sum([polygons[i].intersection(tracts[geoid]).area / tracts[geoid].area * float(population_tracts[population_tracts.GEOID == geoid]['B01003_001E']) for geoid in tract_list])) for i,tract_list in polygon_tracts.iteritems()])
poly_pop.columns = ['id','population']

poly_pop

	id	population
0	1	2009863.173037
1	2	1338960.355493

There you have it. Enjoy!

Reference

Everything is in a github repo (except the tracts)

This bash script will make your own tracts shapefile.

#!/bin/bash
wget -r ftp://ftp2.census.gov/geo/tiger/TIGER2000/TRACT/
mv ftp2.census.gov/geo/tiger/TIGER2013/TRACT/*.zip .
unzip \*tract.zip
DATA=`find . -name '*tract.shp'`
for i in $DATA
do
	ogr2ogr -append -update tracts.shp $i -f "Esri Shapefile"
done