Canopy Height Model Creation: A Python Workflow for Forestry and Ecological Applications

Canopy Height Model Creation serves as the foundational step for quantifying vertical forest structure, enabling precise biomass estimation, habitat suitability modeling, and disturbance monitoring. For foresters, ecologists, and spatial developers, generating a reliable CHM requires a reproducible Python-based pipeline that transforms raw airborne LiDAR returns into a continuous raster surface representing vegetation height above ground. This workflow integrates geospatial validation, algorithmic ground filtering, and raster arithmetic to produce ecologically meaningful outputs. The broader framework of Canopy Height Modeling & Terrain Extraction establishes the methodological standards necessary for cross-project consistency and regulatory compliance in conservation research.

Data Ingestion & Point Cloud Validation

Raw LiDAR datasets rarely arrive analysis-ready. Before height derivation, point clouds must undergo rigorous quality control, including coordinate system verification, outlier removal, and classification standardization. Spatial developers typically rely on laspy for low-level LAS/LAZ parsing, while PDAL pipelines handle large-scale batch processing and format translation. Critical preprocessing steps involve removing non-vegetation artifacts such as power lines, building footprints, and atmospheric noise. Properly executed LiDAR Point Cloud Preprocessing ensures that subsequent ground classification algorithms operate on clean, topologically sound data, minimizing false elevation peaks that would otherwise propagate into the final canopy surface.

Spatial accuracy at this stage hinges on strict CRS enforcement. All point clouds must be projected to a metric coordinate reference system (e.g., UTM or State Plane) with vertical datum alignment (e.g., NAVD88 or EGM96). Misaligned vertical datums introduce systematic biases that compound during raster arithmetic, rendering ecological inferences invalid.

Ground Classification & Terrain Extraction

The accuracy of any CHM is directly constrained by the precision of the underlying Digital Terrain Model (DTM). Python workflows typically implement iterative morphological filtering, cloth simulation filtering (CSF), or progressive triangulated irregular network (TIN) densification to separate ground returns from vegetation. PDAL provides filters.csf and filters.smrf for ground point extraction. Once ground points are isolated, writers.gdal with output_type="min" generates a continuous bare-earth surface from the classified returns. A rigorously validated Digital Terrain Model Generation pipeline accounts for steep slopes, riparian corridors, and micro-topographic variation, which are critical for accurate height normalization in complex forest ecosystems.

When generating the DTM, cell size selection must align with the intended ecological scale. A 1-meter resolution typically captures understory structure, while 0.5-meter grids are preferred for precision forestry and individual tree crown delineation. The USGS 3D Elevation Program (3DEP) provides authoritative guidelines for vertical accuracy thresholds that should be referenced during QA/QC phases.

Raster Alignment & Height Derivation

With aligned DSM and DTM rasters, canopy height is derived through pixel-wise subtraction. However, spatial alignment, resolution matching, and extent clipping must be enforced programmatically using rasterio and numpy. The workflow requires resampling both surfaces to a common grid cell size and verifying affine matrix compatibility before arithmetic operations.

import rasterio
import numpy as np
from rasterio.enums import Resampling
from rasterio.warp import calculate_default_transform, reproject

def create_chm(dsm_path: str, dtm_path: str, chm_output: str, cell_size: float = 0.5):
    """
    Derive a Canopy Height Model by subtracting a DTM from a DSM.
    Both rasters are resampled onto a common grid at `cell_size` resolution
    in the DTM's CRS, so the pixel-wise subtraction is shape-safe.
    """
    with rasterio.open(dtm_path) as dtm_src:
        target_crs = dtm_src.crs
        # Build the common target grid from the DTM's extent.
        transform, width, height = calculate_default_transform(
            dtm_src.crs, target_crs,
            dtm_src.width, dtm_src.height,
            *dtm_src.bounds, resolution=(cell_size, cell_size),
        )

        # Reproject the DTM onto the target grid.
        dtm_data = np.full((height, width), np.nan, dtype=np.float32)
        reproject(
            source=rasterio.band(dtm_src, 1),
            destination=dtm_data,
            src_transform=dtm_src.transform,
            src_crs=dtm_src.crs,
            dst_transform=transform,
            dst_crs=target_crs,
            resampling=Resampling.bilinear,
        )

        # Reproject the DSM onto the same target grid.
        dsm_data = np.full((height, width), np.nan, dtype=np.float32)
        with rasterio.open(dsm_path) as dsm_src:
            reproject(
                source=rasterio.band(dsm_src, 1),
                destination=dsm_data,
                src_transform=dsm_src.transform,
                src_crs=dsm_src.crs,
                dst_transform=transform,
                dst_crs=target_crs,
                resampling=Resampling.bilinear,
            )

        # Capture the profile before the DTM dataset closes.
        out_profile = dtm_src.profile.copy()

    # Raster arithmetic with NaN handling
    chm = np.where(
        (dsm_data > dtm_data) & np.isfinite(dsm_data) & np.isfinite(dtm_data),
        dsm_data - dtm_data,
        np.nan,
    )

    # Clip negative values (sub-canopy noise) to zero
    chm = np.where(np.isnan(chm), np.nan, np.clip(chm, 0.0, None))

    out_profile.update(
        driver='GTiff',
        dtype='float32',
        height=height,
        width=width,
        transform=transform,
        crs=target_crs,
        count=1,
        nodata=np.nan,
    )

    with rasterio.open(chm_output, 'w', **out_profile) as dst:
        dst.write(chm.astype(np.float32), 1)

    return chm_output

This implementation guarantees memory-safe operations, preserves geospatial metadata, and prevents negative height artifacts caused by interpolation overshoot. For production environments, chunked processing via rasterio.windows is recommended to handle regional-scale LiDAR coverage without exceeding system RAM.

Post-Processing & Structural Metrics

Raw CHM rasters often contain high-frequency noise from understory vegetation, sensor artifacts, or edge effects at canopy boundaries. Applying spatial smoothing with scipy.ndimage.gaussian_filter improves ecological interpretability while preserving dominant canopy peaks. Choose sigma values that correspond to the expected crown diameter distribution: for a stand where crowns average 4 m in diameter at 0.5 m resolution, sigma ≈ 2–3 pixels provides appropriate smoothing without merging adjacent crowns.

from scipy.ndimage import gaussian_filter
import numpy as np

def smooth_chm(chm: np.ndarray, sigma: float = 1.5, nodata: float = np.nan) -> np.ndarray:
    """
    Apply Gaussian smoothing to a CHM array, preserving nodata regions.
    sigma is in pixel units; convert from metres using sigma_px = sigma_m / cell_size.
    """
    valid = np.isfinite(chm)
    filled = np.where(valid, chm, 0.0)
    smoothed = gaussian_filter(filled, sigma=sigma)
    # Restore nodata where the original had none
    smoothed[~valid] = nodata
    return smoothed

Once smoothed, the CHM serves as the primary input for structural metrics. Threshold-based binarization enables Calculating canopy cover from CHM in python, which is essential for habitat fragmentation studies and microclimate modeling.

Validation & Reproducibility Standards

Ecological validity requires independent ground-truthing. Field-measured tree heights, UAV photogrammetry, or terrestrial laser scanning (TLS) should be used to compute RMSE and bias metrics against the derived CHM. A standard validation approach extracts CHM values at the location of measured trees and regresses predicted against observed height. All pipeline parameters—including filtering thresholds, interpolation kernels, and resampling methods—must be version-controlled and documented alongside the output rasters. Adhering to these standards ensures that Canopy Height Model Creation remains a transparent, auditable process suitable for peer-reviewed research and regulatory reporting.