@routineactivity

Using Open Data to Explore Boston Crime

Boston Crime Analysis Mini-Project

Logo Logo

In this mini-project, I used three tools:


Step 1: Schema Design in PostgreSQL

Within my local PostgreSQL instance, I spun up a database named usa and defined a boston schema to house all raw and processed tables.

First, I ingested the unaltered CSV exports and shapefiles from Analyze Boston into staging tables alongside the corresponding geometry layers (neighbourhood boundaries, police districts, census tracts). From there, I built a consolidated view that:

This clean view will be the single source of truth for downstream analysis and visualisation in Power BI. The most recent 12 months will be used for a record-level analysis report page in Power BI.

Additionally, I created a materialised view that aggregates incidents by crime category, police district, and ISO week. By precomputing weekly counts, this view will support a responsive performance report page in Power BI, eliminating the need to query large raw and row-level tables in real time and ensuring fast, seamless rendering of key trends.

Logo


Step 2: ETL in Python

To keep the crime datasets up to date, I needed a way to dynamically retrieve the latest CSV download links from the Analyze Boston data pages, since the URLs change with each dataset refresh. I used Python to scrape the current download links, read the updated data, and append only new rows to the existing PostgreSQL tables for crime incidents and shootings.

To automate the update process, you can schedule the script to run at regular intervals using tools like a Power Automate Desktop flow or a simple .bat file combined with Windows Task Scheduler. The best approach depends on your setup—whether you’re working on a corporate machine with Power Automate access, a personal Windows laptop, or a Linux-based environment where a cron job might be more appropriate.

Scrape URLs.

# Scrape the CSV URLs from Analyze Boston crime incidents
# and shootings pages to keep our tables up to date

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_csv_url(page_url):
    r = requests.get(page_url); r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    link = soup.find("a", href=lambda h: h and h.endswith(".csv"))
    return urljoin(page_url, link["href"])

crime_url = fetch_csv_url(
    "https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system"
)
shootings_url = fetch_csv_url(
    "https://data.boston.gov/dataset/shootings/resource/73c7e069-701f-4910-986d-b950f46c91a1"
)

print(crime_url, shootings_url)

Then append new rows.

import pandas as pd
from sqlalchemy import create_engine, text
from io import StringIO

user = "your_username"
password = "your_password"
host = "your_localhost"
port = "your_port_number"
database = "usa"

engine = create_engine(
    "postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
)

def update_table(csv_url, table_name):
    res = requests.get(csv_url); res.raise_for_status()
    df = pd.read_csv(StringIO(res.text), low_memory=False)
    # fetch existing keys
    with engine.connect() as conn:
        existing = pd.read_sql(
            text(f"SELECT incident_number FROM boston.{table_name}"),
            conn
        )
    new = df[~df["INCIDENT_NUMBER"].isin(existing["incident_number"])]
    new.columns = new.columns.str.lower()
    if "shooting" in new:
        new["shooting"] = new["shooting"].astype(bool)
    if not new.empty:
        new.to_sql(table_name, engine, schema="boston", if_exists="append", index=False)
        print(f"Inserted {len(new)} new rows into {table_name}")
    else:
        print(f"No new rows for {table_name}")

update_table(crime_url, '"crime_incidents"')
update_table(shootings_url, "shootings")

Step 3: SQL Queries for Power BI

From my PostgreSQL views, I created two key queries to load into Power BI:

One query pulls data from the materialised weekly summary view, optimised for fast rendering of trends and comparisons. The second query extracts 12 months of record level crime data to support more granular analysis of locations, patterns, and temporal distributions. In addition, I imported a Boston Police Districts boundary file into Power BI by converting its geometry to WKT format and loading it as a CSV, making it easy to map without requiring direct spatial database connections.

Query for the performance data:

SELECT * FROM boston.part_one_two_weekly_data

Query for the crime trends data:

SELECT
  fid,
  incident_number,
  offense_code,
  offense_code_group,
  offense_description_p,
  custom_group,
  district_nm,
  occurred_timestamp,
  date_from,
  year,
  month,
  day_of_week,
  hour,
  
  -- 1) round up into 3-hour buckets (0–2→3, 3–5→6 etc)
  CEIL( (hour + 1) / 3.0 ) * 3 AS three_hour_bin,
  
  -- 2) ISO weekday number (1 = Mon, 2 = Tue etc)
  EXTRACT( ISODOW FROM date_from ) AS weekday_number,
  
  -- 3) Propercase street + space + district + census tract
  INITCAP(street || ' ' || district_nm || ' ' || ct20) AS street_district,
  
  street,
  lat,
  long,
  ct20,
  ucr_part,
  shooting_type_v2,
  victim_gender,
  victim_race,
  multi_victim

FROM boston.crime_with_shootings
WHERE date_from >= (
    SELECT MAX(date_from) - INTERVAL '12 months'
    FROM boston.crime_with_shootings
)
  AND date_from <= (
    SELECT MAX(date_from)
    FROM boston.crime_with_shootings
)

Step 4: Building the Power BI Dashboard

This is the fun part! I created two simple report view pages:

Performance Overview

Crime Patterns

Report link below, opens in new tab.

👉 View Power BI Report

This mini-project highlights how open data, combined with a few open accessible tools — Python, PostgreSQL, and Power BI — can help deliver timely insights.

I hope this inspires others to explore public datasets and build their own data-driven workflows.


Further reading