Project Showcase - ParticleBox

Overview

ParticleBox is a C++ particle simulator built to simulate the interactions of thousands of particles in real time. It uses modern C++ features and platform-specific optimizations to sustain high frame rates, even with large particle counts.

Key performance techniques implemented in ParticleBox:

Spatial Partitioning: Implements a grid-based partitioning system (physics.cpp) to dramatically reduce the number of pairwise collision checks, bringing the cost from O(n²) towards O(n) when particles are spread fairly evenly across cells. Cell indices are calculated with a bit shift (>> CELL_SIZE_SHIFT) on integer coordinates rather than a division.
Multithreaded Processing: Leverages multi-core processors for parallel computation. On Apple Silicon, uses Apple's Grand Central Dispatch (GCD) via dispatch/dispatch.h for task scheduling (simulation.cpp). On Linux/Windows, combines OpenMP (#pragma omp parallel for in physics.cpp) with std::async/std::future (simulation.cpp) for cross-platform parallelization.
SIMD Vector Operations: Utilizes native SIMD instructions on Apple Silicon via the Accelerate framework (simd/simd.h, simd_float2) for vector math (particle.h, physics.cpp, simulation.cpp). Compiler flags like -ftree-vectorize and -march=native (Makefile) encourage auto-vectorization on other platforms.
Optimized Memory Patterns: Uses std::vector::reserve() to pre-allocate memory for particles (simulation.cpp), minimizing reallocations during runtime. Employs thread_local random number generators (simulation.cpp) to reduce contention in multithreaded particle spawning. The grid uses flat std::vectors for cell data, improving cache locality.

Technical Highlights

Zero-Overhead Abstractions: Leverages C++17 features and compiler optimizations to create efficient code structures.
Advanced Math Optimizations: Implements the Fast Inverse Square Root algorithm (Vec2::fastInvSqrt in particle.h) and uses squared distance checks to avoid costly square roots in collision detection loops (physics.cpp). Employs bitwise shift operations for fast grid cell index calculation.
Data-Oriented Design Influence: Prioritizes efficient data access patterns through contiguous storage in std::vector, pre-allocation, and the flat-array structure of the spatial grid. Uses thread-local storage for RNGs. Particles are stored as an Array of Structs (AoS).
Minimized Runtime Allocation: The core simulation update loop (force calculation, position/velocity updates) avoids dynamic memory allocation because particle storage is pre-allocated with reserve().

Gallery

Particle simulation screenshot - Example 1

Animated particle simulation - Example 2

Background

Mathematical Foundation

ParticleBox simulates particle interactions using optimized numerical methods.

Collision Detection Mathematics

Collision detection between two particles (p1, p2) primarily uses squared distances to avoid expensive square root calculations in the inner loop. The squared distance \(d^2\) is calculated:

\[ d^2 = (p2.x - p1.x)^2 + (p2.y - p1.y)^2 \]

A collision is detected if \(d^2 < (r1 + r2)^2\), where \(r1\) and \(r2\) are the particle radii. When the actual distance or a normalized direction vector is required (e.g., for force calculation), the optimized Fast Inverse Square Root (fastInvSqrt) function is used to calculate \(1/\sqrt{d^2}\).
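The two checks above can be sketched as follows. This is a minimal illustration, not the code from particle.h or physics.cpp: the Circle struct and the overlaps helper are hypothetical names, and fastInvSqrt is written here as the classic bit-trick version with one Newton-Raphson step, which may differ from the project's Vec2::fastInvSqrt.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the project's particle type (field names are assumptions).
struct Circle {
    float x, y, radius;
};

// Classic fast inverse square root with one Newton-Raphson refinement step.
// std::memcpy performs the float/integer reinterpretation without undefined behavior.
inline float fastInvSqrt(float value) {
    assert(value > 0.0f);
    const float half = 0.5f * value;
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    bits = 0x5f3759dfu - (bits >> 1);       // initial guess from the bit pattern
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y * (1.5f - half * y * y);       // approximate result, roughly 0.2% relative error
}

// True when the two circles overlap: d^2 < (r1 + r2)^2, so no square root is needed.
inline bool overlaps(const Circle& p1, const Circle& p2) {
    const float dx = p2.x - p1.x;
    const float dy = p2.y - p1.y;
    const float radiusSum = p1.radius + p2.radius;
    return dx * dx + dy * dy < radiusSum * radiusSum;
}
```

Only pairs that pass the squared-distance test need the inverse square root, so the approximation runs on a small fraction of the candidate pairs.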

Force Calculation

A simple linear repulsion force is applied when particles overlap. The force \(\mathbf{F}_r\) on particle \(i\) from an overlapping particle \(j\) points away from \(j\), with a magnitude proportional to the overlap depth:

\[ \mathbf{F}_r = -\,\text{repulsionStrength} \cdot \delta \cdot \hat{\mathbf{n}} \]

Where \(\delta\) is the overlap distance \((r1 + r2) - d\), \(\text{repulsionStrength}\) is a constant factor (REPULSION_STRENGTH in physics.h), and \(\hat{\mathbf{n}}\) is the normalized direction vector from particle \(i\) (p1) to particle \(j\) (p2), calculated efficiently using fastInvSqrt. Force calculations leverage SIMD vector types (simd_float2) on Apple platforms.
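A minimal sketch of the repulsion formula follows. The function name repulsionOnI and the value of REPULSION_STRENGTH are placeholders (the real constant is defined in physics.h and is not given here), and std::sqrt is used in place of fastInvSqrt so the arithmetic is exact and easy to check.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 {
    float x, y;
};

// Placeholder value for illustration only; the project defines its own in physics.h.
constexpr float REPULSION_STRENGTH = 50.0f;

// Force on particle i caused by overlap with particle j; zero when they do not overlap.
inline Vec2 repulsionOnI(Vec2 pi, float ri, Vec2 pj, float rj) {
    assert(ri > 0.0f && rj > 0.0f);
    const float dx = pj.x - pi.x;
    const float dy = pj.y - pi.y;
    const float d2 = dx * dx + dy * dy;
    const float radiusSum = ri + rj;
    // Skip non-overlapping pairs and the degenerate case of coincident centers.
    if (d2 >= radiusSum * radiusSum || d2 <= 0.0f) {
        return Vec2{0.0f, 0.0f};
    }
    const float invD = 1.0f / std::sqrt(d2);   // the project uses fastInvSqrt here
    const float d = d2 * invD;                 // distance recovered without a second sqrt
    const float delta = radiusSum - d;         // overlap depth
    // (dx, dy) * invD is the unit vector n from i to j; the force points along -n.
    const float scale = -REPULSION_STRENGTH * delta * invD;
    return Vec2{dx * scale, dy * scale};
}
```

For two unit-radius particles one unit apart along the x axis, the overlap depth is 1, so particle i is pushed in the negative x direction with magnitude equal to the strength constant.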

Integration Method

The simulation uses a Semi-Implicit Euler integration method to update particle positions and velocities:

\[ \mathbf{v}_{t+\Delta t} = \mathbf{v}_t + (\mathbf{F}_t / m) \cdot \Delta t \]
\[ \mathbf{p}_{t+\Delta t} = \mathbf{p}_t + \mathbf{v}_{t+\Delta t} \cdot \Delta t \]

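Where \(\mathbf{p}\) is position, \(\mathbf{v}\) is velocity, \(\mathbf{F}\) is the net force, \(m\) is mass, and \(\Delta t\) is the time step. Velocity is updated first and the new velocity is then used to advance the position, which is what distinguishes this scheme from explicit Euler. The method is straightforward to implement and computationally inexpensive.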
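The update rule can be sketched in a few lines. The Particle layout and the integrate function below are assumptions made for illustration, and the sketch multiplies by a stored inverse mass instead of dividing by \(m\), in line with the invMass optimization described elsewhere in this document.

```cpp
#include <cassert>
#include <vector>

struct Vec2 {
    float x, y;
};

// Hypothetical particle layout; the project's Particle struct may differ.
struct Particle {
    Vec2 position;
    Vec2 velocity;
    Vec2 force;
    float invMass;   // 1 / m, precomputed so the loop multiplies instead of divides
};

// Semi-implicit Euler: update velocity from the force, then move using the NEW velocity.
inline void integrate(std::vector<Particle>& particles, float dt) {
    assert(dt > 0.0f);
    for (Particle& p : particles) {
        p.velocity.x += p.force.x * p.invMass * dt;
        p.velocity.y += p.force.y * p.invMass * dt;
        p.position.x += p.velocity.x * dt;
        p.position.y += p.velocity.y * dt;
    }
}
```

With velocity (1, 0), force (2, 0), mass 2 and a time step of 0.5, the new velocity is 1.5 and the position advances by 0.75. Explicit Euler would have advanced it by only 0.5, because it uses the old velocity.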

Implementation

System Architecture

ParticleBox is structured into several key components:

Main Application (main.cpp): Initializes SDL, creates windows/renderers, manages the main event loop, and orchestrates simulation updates and rendering.
Simulation Core (simulation.h, simulation.cpp): Manages the collection of particles (std::vector), controls the simulation state (running/stopped), handles particle creation/deletion, delegates physics updates, manages frame timing, and triggers rendering. Contains the logic for selecting multithreading strategies (GCD vs. OpenMP/async).
Physics Engine (physics.h, physics.cpp): Calculates forces (gravity, repulsion), performs collision detection (using grid or brute-force), applies boundary conditions, and contains toggles for physics features (gravity, grid, pairwise optimization). Implements the spatial partitioning grid logic.
Particle Representation (particle.h, particle.cpp): Defines the Particle struct and the Vec2 structure (with SIMD optimizations for Apple platforms). Handles individual particle rendering and state (position, velocity, mass, etc.). Includes particle texture caching.
GUI (gui.h, gui.cpp): Manages the separate control window using SDL and SDL_ttf, providing buttons, toggles, and performance metrics display (FPS, particle count, graphs).

Spatial Partitioning (Grid)

When enabled, the spatial partitioning system uses a uniform grid to optimize collision detection:

Grid Structure: Divides the simulation space into fixed-size cells (CELL_SIZE = 8.0f).
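Optimized Cell Indexing: Uses fast bit-shifting (pos.x >> CELL_SIZE_SHIFT) to calculate the grid cell indices for each particle's position. Because >> is defined only for integer types in C++, the coordinate must be in integer form when shifted; with CELL_SIZE = 8, a shift of 3 is equivalent to dividing a non-negative integer coordinate by 8.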
Neighbor Search: For each particle, collision checks are limited to particles within the same cell and adjacent neighboring cells (typically 9 cells total, can be reduced based on `reducedPairwiseComparisonsEnabled`).
Data Structure: Uses flat std::vectors (cellCounts, cellParticles, cellStartIndices) to store particle indices sorted by cell, promoting better cache locality during neighbor iteration compared to nested structures.
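One common way to fill flat arrays like these is a counting sort over cell indices, sketched below. Only the three array names and CELL_SIZE come from the description above; the Grid struct, the function names, the shift value of 3 (chosen because 2^3 = 8), and the assumption that positions are non-negative and inside the grid are all illustrative, so the code in physics.cpp may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 {
    float x, y;
};

constexpr int CELL_SIZE_SHIFT = 3;   // assumed: 2^3 = 8 matches CELL_SIZE = 8.0f

// Hypothetical container for the three flat arrays named in the text.
struct Grid {
    int cols = 0;
    int rows = 0;
    std::vector<int> cellCounts;        // number of particles in each cell
    std::vector<int> cellStartIndices;  // offset of each cell's run in cellParticles
    std::vector<int> cellParticles;     // particle indices grouped by cell
};

// Truncate to int, then shift instead of dividing by the cell size.
inline int cellIndex(const Grid& grid, Vec2 pos) {
    const int cx = static_cast<int>(pos.x) >> CELL_SIZE_SHIFT;
    const int cy = static_cast<int>(pos.y) >> CELL_SIZE_SHIFT;
    assert(cx >= 0 && cx < grid.cols && cy >= 0 && cy < grid.rows);
    return cy * grid.cols + cx;
}

inline void buildGrid(Grid& grid, const std::vector<Vec2>& positions) {
    const std::size_t cells = static_cast<std::size_t>(grid.cols) * grid.rows;
    grid.cellCounts.assign(cells, 0);
    grid.cellStartIndices.assign(cells, 0);
    grid.cellParticles.assign(positions.size(), 0);

    // Pass 1: count particles per cell.
    for (const Vec2& p : positions) {
        ++grid.cellCounts[cellIndex(grid, p)];
    }
    // Pass 2: an exclusive prefix sum gives each cell's starting offset.
    int running = 0;
    for (std::size_t c = 0; c < cells; ++c) {
        grid.cellStartIndices[c] = running;
        running += grid.cellCounts[c];
    }
    // Pass 3: scatter particle indices into their cell's contiguous run.
    std::vector<int> cursor = grid.cellStartIndices;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const int cell = cellIndex(grid, positions[i]);
        grid.cellParticles[cursor[cell]++] = static_cast<int>(i);
    }
}
```

After the build, the particles of cell c occupy cellParticles[cellStartIndices[c]] through cellParticles[cellStartIndices[c] + cellCounts[c] - 1], so a neighbor search reads one contiguous run per cell.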

Multithreaded Physics Engine

The physics engine distributes computational work across available CPU cores:

Platform-Specific Parallelism:
- On macOS/iOS (Apple Silicon): Utilizes Grand Central Dispatch (GCD) via dispatch_async and semaphores for efficient, low-overhead task scheduling managed by the OS.
- On Linux/Windows: Uses a combination of std::async with std::launch::async to distribute work across threads (managed by the C++ runtime) and OpenMP directives (#pragma omp parallel for, #pragma omp atomic) within the force calculation loops in physics.cpp.
Dynamic Thread Count: Determines the number of threads based on hardware capabilities using std::thread::hardware_concurrency() (with a fallback).
Work Distribution: Divides the particle list into chunks, assigning each chunk to a separate task/thread for force calculation and state updates.
Synchronization: In the OpenMP path, #pragma omp atomic protects updates to shared force contributions. In the GCD path, work is separated into independent tasks, with semaphores used where explicit synchronization is needed.
Thread-local Storage: Employs thread_local std::mt19937 random number generators to avoid lock contention during concurrent particle creation.
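The chunked work distribution on the std::async path might look roughly like this. The helper name parallelChunks and the fallback of 4 threads are assumptions, and the sketch covers only the portable path, not the GCD or OpenMP variants. On some toolchains it must be built with thread support enabled (for example -pthread with GCC or Clang).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Splits [0, count) into roughly equal chunks and runs fn(begin, end) on each
// chunk as its own std::async task. fn must only touch elements in its own range.
template <typename Fn>
void parallelChunks(std::size_t count, Fn fn) {
    unsigned threads = std::thread::hardware_concurrency();
    if (threads == 0) {
        threads = 4;   // hardware_concurrency() may return 0; fallback value is an assumption
    }
    const std::size_t chunk = (count + threads - 1) / threads;   // ceiling division

    std::vector<std::future<void>> tasks;
    tasks.reserve(threads);
    for (std::size_t begin = 0; begin < count; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, count);
        assert(begin < end);
        tasks.push_back(std::async(std::launch::async, fn, begin, end));
    }
    // get() blocks until the task finishes and rethrows any exception it raised.
    for (std::future<void>& task : tasks) {
        task.get();
    }
}
```

Because each task writes only to its own index range, a per-particle update such as integration needs no locking. Pairwise force accumulation writes to both particles of a pair, which is where atomics or per-thread buffers come in.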

SIMD Optimizations

Vector operations are accelerated using Single Instruction, Multiple Data (SIMD) techniques:

Apple Accelerate/simd: Explicitly uses Apple's `simd/simd.h` types (simd_float2) and functions (e.g., simd_dot) for Vec2 operations (addition, subtraction, dot product) and within the physics/update loops on macOS/iOS, providing native performance. Enabled via `-framework Accelerate` and `-mcpu=apple-m1` in the Makefile.
Auto-Vectorization: Uses compiler flags like -O3, -ffast-math, -ftree-vectorize, and -march=native in the Makefile to encourage the compiler (like GCC/Clang) to automatically generate SIMD instructions (SSE, AVX, NEON depending on the target architecture) for loops and vectorizable code on other platforms.
Usage: Applied in force accumulation, velocity/position updates, and distance calculations within physics.cpp and simulation.cpp.
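A Vec2 that switches between Apple's simd types and plain floats could be structured as in the sketch below. This is an assumed layout rather than the contents of particle.h: the member names, accessor functions, and the choice of operators are illustrative, and only simd_float2, simd_make_float2, and simd_dot are taken from Apple's simd/simd.h.

```cpp
#include <cassert>
#if defined(__APPLE__)
#include <simd/simd.h>
#endif

// Illustrative 2D vector: simd_float2 on Apple platforms, two floats elsewhere.
struct Vec2 {
#if defined(__APPLE__)
    simd_float2 v;
    Vec2(float px, float py) : v(simd_make_float2(px, py)) {}
    explicit Vec2(simd_float2 raw) : v(raw) {}
    float x() const { return v.x; }
    float y() const { return v.y; }
    Vec2 operator+(Vec2 o) const { return Vec2(v + o.v); }    // one vector add
    Vec2 operator*(float s) const { return Vec2(v * s); }     // vector-scalar multiply
    float dot(Vec2 o) const { return simd_dot(v, o.v); }
#else
    float xv, yv;
    Vec2(float px, float py) : xv(px), yv(py) {}
    float x() const { return xv; }
    float y() const { return yv; }
    // Scalar fallback; the compiler may auto-vectorize loops over these operations.
    Vec2 operator+(Vec2 o) const { return Vec2(xv + o.xv, yv + o.yv); }
    Vec2 operator*(float s) const { return Vec2(xv * s, yv * s); }
    float dot(Vec2 o) const { return xv * o.xv + yv * o.yv; }
#endif
    // Shared by both paths: division expressed as multiplication by the reciprocal.
    Vec2 operator/(float s) const {
        assert(s != 0.0f);
        return *this * (1.0f / s);
    }
};
```

Keeping the platform split inside the type means calling code such as force accumulation can be written once against the same operators on every platform.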

Memory Optimizations

Memory access and allocation patterns are optimized for performance:

Pre-allocation: Uses std::vector::reserve() when resetting or spawning particles to pre-allocate sufficient memory, reducing the likelihood of costly reallocations during the simulation run.
Contiguous Storage: Leverages std::vector which stores elements contiguously, improving cache locality during iteration compared to linked structures.
Array of Structs (AoS): Particle data (position, velocity, mass, etc.) is stored using the standard AoS pattern within std::vector. While straightforward, SIMD utilization might be less optimal than a Structure of Arrays (SoA) approach for certain operations.
Texture Caching: Implements a simple cache (circleTextureCache in particle.cpp) for pre-rendered particle circle textures, reducing redundant texture creation and GPU overhead.
Thread-local RNGs: Avoids contention on a shared random number generator during multithreaded particle creation.
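Pre-allocation and thread-local RNGs can be combined in a spawn routine along these lines. The Particle fields, the threadRng and spawnParticles names, and the seeding from std::random_device are assumptions for illustration, not the code in simulation.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical particle layout used only for this sketch.
struct Particle {
    float x, y, vx, vy, invMass;
};

// One generator per thread, seeded on first use in that thread, so concurrent
// spawners never share (or lock) a generator.
inline std::mt19937& threadRng() {
    thread_local std::mt19937 rng{std::random_device{}()};
    return rng;
}

// Appends n particles at random positions inside a width x height area.
inline void spawnParticles(std::vector<Particle>& particles, std::size_t n,
                           float width, float height) {
    assert(width > 0.0f && height > 0.0f);
    // Reserve once up front so the push_back calls below cause at most one reallocation.
    particles.reserve(particles.size() + n);

    std::uniform_real_distribution<float> distX(0.0f, width);
    std::uniform_real_distribution<float> distY(0.0f, height);
    std::mt19937& rng = threadRng();
    for (std::size_t i = 0; i < n; ++i) {
        const float px = distX(rng);
        const float py = distY(rng);
        particles.push_back(Particle{px, py, 0.0f, 0.0f, 1.0f});
    }
}
```

The thread-local generator covers random number generation only. If several threads append to the same std::vector, the container itself still needs external synchronization or per-thread buffers that are merged afterwards.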

Math Optimizations

Mathematical calculations are optimized for speed:

Fast Inverse Square Root: Uses the classic `fastInvSqrt` algorithm for normalizing vectors and calculating distances when needed.
Squared Distance Checks: Avoids `sqrt` in the main collision detection loop by comparing squared distances.
Bitwise Operations: Uses fast bit-shifting (>>) for grid cell index calculations instead of division.
Multiplication by Inverse: Uses pre-calculated inverse mass (invMass) to replace division in force-to-acceleration calculations.

Performance Optimization

Multithreading Implementation

ParticleBox utilizes multi-core CPUs effectively through:

Adaptive Parallelism: Employs GCD for optimal OS-level scheduling on Apple Silicon and a combination of `std::async` and OpenMP for broad platform compatibility.
Workload Balancing: Divides particles into roughly equal chunks for processing by different threads/tasks.
Minimized Synchronization: Leverages atomic operations (`#pragma omp atomic`) in OpenMP and task-based separation in GCD to reduce explicit locking overhead.
Contention Reduction: Uses thread_local RNGs for parallel particle spawning without locks.

Memory Optimization Techniques

Memory performance is enhanced through:

Pre-allocation Strategy: Consistent use of std::vector::reserve() minimizes runtime reallocations and memory fragmentation.
Cache-Friendly Data Structures: Use of contiguous std::vector and the flat array layout within the spatial grid promotes better CPU cache utilization.
Texture Caching: Reduces GPU overhead by reusing particle textures.
Data Locality Focus: While not strictly SoA, the grid implementation and use of `std::vector` prioritize keeping related data close in memory.

Profiling-Driven Optimizations

Performance improvements were guided by analyzing bottlenecks. Key areas addressed include:

Collision Detection Cost: Addressed by implementing the spatial partitioning grid, significantly reducing pairwise checks.
CPU-Bound Calculations: Parallelized force computation and particle updates using multithreading (GCD/OpenMP/async).
Vector Math Intensity: Accelerated using SIMD instructions (explicitly on Apple Silicon, encouraged via compiler flags elsewhere).
Profiling Tools Used: Valgrind, Intel VTune, Apple Instruments.

Technology Stack

Core Language: C++17
Graphics/Windowing API: SDL2, SDL2_ttf
Build System: Make (Makefile) with platform detection and optimization flags (-O3, -march=native, -mcpu=apple-m1, -ftree-vectorize, -fopenmp).
Parallelism: OpenMP, C++11 Threads (std::async, std::future), Apple Grand Central Dispatch (GCD)
SIMD: Apple Accelerate Framework (simd/simd.h), Compiler Auto-Vectorization (via flags)
Key Optimization Techniques: Spatial Partitioning (Grid), Multithreading, SIMD, Memory Pre-allocation, Fast Math Algorithms, Texture Caching.
Profiling Tools: Valgrind, Intel VTune, Apple Instruments
Development Environment: Visual Studio Code, CLion
Platform Support: Windows, macOS (with Apple Silicon optimizations), Linux.