Saturday, August 19, 2023

How to look at Pittsburgh arrest data using an API - Beginner-friendly

Find the data you want to access

The Western PA Regional Data Center (WPRDC) is an awesome resource with lots of Pittsburgh-related data. I wanted to access Pittsburgh city arrest data.

After you find your data file, you'll need to find the file's resource ID: a long string of letters and numbers broken up by four hyphens. There are a few places to find it. If you download a sample file, the name of that file is the ID. The resource ID is also the last section of the page's URL. If you're visiting this webpage:
https://data.wprdc.org/datastore/dump/e03a89dd-134a-4ee8-a2bd-62c40aeebc6f

then your resource ID is  

e03a89dd-134a-4ee8-a2bd-62c40aeebc6f 

 

Build your request URL

WPRDC houses its data in a DataStore (part of the CKAN software the site runs on). You can ask WPRDC to send you data by building a URL in a format the database understands; that's the API. To make a request, we'll use the datastore_search call. Here's the base of the URL:

https://data.wprdc.org/api/3/action/datastore_search?


Now we need to tell the database which file we're looking for. Add your resource ID:

https://data.wprdc.org/api/3/action/datastore_search?resource_id=e03a89dd-134a-4ee8-a2bd-62c40aeebc6f

Since we're just getting started, let's add a limit on the amount of data WPRDC will send us at once. That way we won't accidentally request huge data files and waste resources for both us and WPRDC. I'll start with a limit of 5 rows of data. Once we know our program is working, we can come back later and remove the limit.

https://data.wprdc.org/api/3/action/datastore_search?resource_id=e03a89dd-134a-4ee8-a2bd-62c40aeebc6f&limit=5  

There are lots of other helpful tricks you can use when building your URL. For example, you can add filters or a search term to reduce the amount of irrelevant data you receive (see the example below). The datastore_search documentation explains more. For now, just hold on to that URL.
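As an illustration, datastore_search accepts a q parameter that runs a full-text search across the records. A URL like this one, using "robbery" as a made-up search term (whether it matches anything depends on how offenses are recorded in this dataset), would only return rows containing that word:

https://data.wprdc.org/api/3/action/datastore_search?resource_id=e03a89dd-134a-4ee8-a2bd-62c40aeebc6f&limit=5&q=robbery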

Install Python and a few libraries

Install Python 3
Install requests and pandas. The easiest way is to use pip:
pip install requests

pip install pandas 


(Optional) I like to use Jupyter notebooks to mess around with tabular data. Consider installing Jupyter Notebook and running your code in a notebook.
pip install notebook 

jupyter notebook 


Request your data using Python

Open a new Python file or notebook, and import your libraries:
import requests

import pandas as pd
 
Paste in the request URL we built earlier:
url = "https://data.wprdc.org/api/3/action/datastore_search?resource_id=e03a89dd-134a-4ee8-a2bd-62c40aeebc6f&limit=5"

Ask (request) WPRDC to send you that data, then parse the JSON response into a Python dictionary.

resp = requests.get(url).json()


The response contains a bunch of extra metadata that we don't need right now. So let's grab the meaty part of the response ("result") and pull out the actual data ("records").  We'll use pandas to turn that into a nice spreadsheet table (a "DataFrame"). 

data = pd.DataFrame(resp['result']['records'])

data (if you're using a notebook)

print(data.to_string()) (if you're not using a notebook) 
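
Putting the pieces together, here's the whole thing as one short script. This is just the steps above gathered in one place, plus a basic check that the request succeeded:

import requests
import pandas as pd

# The request URL we built earlier: our resource ID plus a limit of 5 rows
url = "https://data.wprdc.org/api/3/action/datastore_search?resource_id=e03a89dd-134a-4ee8-a2bd-62c40aeebc6f&limit=5"

# Ask WPRDC for the data
response = requests.get(url)
response.raise_for_status()  # stop with an error if the request failed

# Parse the JSON response and pull the actual rows into a DataFrame
resp = response.json()
data = pd.DataFrame(resp['result']['records'])

print(data.to_string())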


 

A note on API politeness

Every time you run requests.get(url), you're connecting to the web, asking WPRDC to send you data, and downloading the response. When you're using an API, be conscientious about the frequency and size of the requests you're making. Many databases enforce a limit on the number of requests you can make in a day, and some will even ban your IP address if they think you're trying to abuse their servers with tons of spammy requests. I didn't find any documentation about WPRDC's API limits, but it's still best practice to design your code so it only requests new data when you actually need it.
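
One simple way to do that is to save the response to a local file the first time you download it, then reuse the saved copy on later runs. Here's a rough sketch of the idea (the filename arrest_data.json is just something I made up):

import json
import os

import requests
import pandas as pd

url = "https://data.wprdc.org/api/3/action/datastore_search?resource_id=e03a89dd-134a-4ee8-a2bd-62c40aeebc6f&limit=5"
cache_file = "arrest_data.json"  # local copy of the response

if os.path.exists(cache_file):
    # We already have the data; read it from disk instead of hitting the API
    with open(cache_file) as f:
        resp = json.load(f)
else:
    # First run: request the data once and save it for next time
    resp = requests.get(url).json()
    with open(cache_file, "w") as f:
        json.dump(resp, f)

data = pd.DataFrame(resp['result']['records'])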

 

This tutorial was written in August 2023.