<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://darshanmakwana412.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://darshanmakwana412.github.io/" rel="alternate" type="text/html" /><updated>2026-05-06T00:41:54+05:30</updated><id>https://darshanmakwana412.github.io/feed.xml</id><title type="html">Darshan Makwana</title><subtitle>code for my public webpage: https://darshanmakwana412.github.io</subtitle><author><name>Darshan Makwana</name></author><entry><title type="html">After 3 months of standing desk</title><link href="https://darshanmakwana412.github.io/2026/05/standing-desk/" rel="alternate" type="text/html" title="After 3 months of standing desk" /><published>2026-05-06T00:00:00+05:30</published><updated>2026-05-06T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/05/standing-desk</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/05/standing-desk/"><![CDATA[<p>For the last two months I have had to change the environment in which I do my long hours of deep work. One part of that change was a standing desk. It was not really a desk per se, just a platform on top of which I could keep my laptop and start coding away. The platform was just tall enough that I couldn’t sit and work, and just short enough that I had to bend my knees, so I added a wooden plank on top of it to level it up.</p>

<p>So I just ran with this setup. The first couple of days were not bad; my legs were already in pretty good shape since I have a regular running habit. I wasn’t feeling fatigued or lethargic at all. If anything, I was more aware of my physical body and of how much energy it had to exert on tasks I used to do while sitting. That awareness was slightly distracting, but not enough to stop me from getting any work done. After the first week it became natural to do long hours of work in this new setup of mine.</p>

<p>I have observed that the speed at which I approach and finish things has slightly increased, and I have become more proactive and more communicative. I now love standing while working. I keep two 10L water jugs alongside my desk and have been experimenting with doing bicep curls/hammer curls/triceps extensions with them at every 30-minute interval. I have also experimented with adding pushups to this combo, and it has been working out fantastically for me. I think there is a clear connection between productivity and putting slight pressure on the body to keep it in order: every time we lift weights or make our muscles work under load, we unconsciously remind our mind that we are in control of our physical selves and that our actions do have a meaningful impact on the world we live in.</p>

<p>I remain energetic even after long hours of deep work, and I don’t get tired after a long day the way I usually did before. I can’t say whether the gamified setup of exercise + work or the standing desk itself is the cause, but I can confirm there is definitely a correlation, and it has been working out fantastically for me.</p>

<p>I gave a couple of interviews over the last month and cleared a majority of them while standing. I now speak more energetically and aggressively when I am standing. I have also noticed that when I don’t get any exercise, or my body does not undergo any form of activity, I simply cannot concentrate, and if I am in a chair I end up spending my day changing positions.</p>]]></content><author><name>Darshan Makwana</name></author><category term="productivity" /><category term="health" /><category term="lifestyle" /><summary type="html"><![CDATA[For the last two months I have had to change the environment in which I do my long hours of deep work. One part of that change was a standing desk. It was not really a desk per se, just a platform on top of which I could keep my laptop and start coding away. The platform was just tall enough that I couldn’t sit and work, and just short enough that I had to bend my knees, so I added a wooden plank on top of it to level it up.]]></summary></entry><entry><title type="html">Gaussian Splatting for Dummies</title><link href="https://darshanmakwana412.github.io/2026/04/gaussian-splatting/" rel="alternate" type="text/html" title="Gaussian Splatting for Dummies" /><published>2026-04-12T00:00:00+05:30</published><updated>2026-04-12T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/04/gaussian-splatting</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/04/gaussian-splatting/"><![CDATA[<p>Gaussian Splatting is a fascinating scene reconstruction technique introduced by <a href="https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/">INRIA</a>, and last year I had a lot of fun tinkering with it while on my semex.
I recently discovered some of my notes on it and decided to digitize them this weekend. Along the way I reimplemented the forward rasterization pass in Rust and figured it would be fun to write a tutorial explaining gaussian splatting to everyone, so here it is</p>
<h2 id="what-is-a-gaussian-splat">what is a gaussian splat?</h2>

<p>a 3D Gaussian splat is an oriented ellipsoid in space that carries a color and an opacity. you can think of it as a fuzzy colored blob. a scene is made of hundreds of thousands of these blobs, and when you look at them from a particular viewpoint, they overlap and blend to form the final image.</p>

<p>We represent each gaussian with these attributes:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Splat</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">pos</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>      <span class="c1">// center position in world space</span>
    <span class="k">pub</span> <span class="n">scale</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>    <span class="c1">// size along each local axis</span>
    <span class="k">pub</span> <span class="n">rot</span><span class="p">:</span> <span class="n">Quat</span><span class="p">,</span>      <span class="c1">// orientation as a unit quaternion</span>
    <span class="k">pub</span> <span class="n">color</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>    <span class="c1">// RGB color (already decoded from spherical harmonics)</span>
    <span class="k">pub</span> <span class="n">opacity</span><span class="p">:</span> <span class="nb">f32</span><span class="p">,</span>   <span class="c1">// how opaque this blob is, in [0, 1]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>scales are stored in log space, opacities as logits, colors as spherical harmonics coefficients, and quaternions are normalized to unit length, which keeps each attribute within its valid range</p>
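<p>as a sketch, the decoding step looks like this in plain rust (plain arrays instead of the glam types used later, and the function names are mine):</p>

```rust
// decode raw stored splat parameters into their usable ranges
// (illustrative sketch; a real loader reads these fields from the .ply)
fn decode_scale(log_scale: [f32; 3]) -> [f32; 3] {
    // scales are stored in log space; exp() maps them back to positive sizes
    [log_scale[0].exp(), log_scale[1].exp(), log_scale[2].exp()]
}

fn decode_opacity(logit: f32) -> f32 {
    // opacities are stored as logits; the sigmoid maps them into [0, 1]
    1.0 / (1.0 + (-logit).exp())
}

fn normalize_quat(q: [f32; 4]) -> [f32; 4] {
    // quaternions are normalized to unit length so they encode pure rotations
    let n = (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]).sqrt();
    [q[0] / n, q[1] / n, q[2] / n, q[3] / n]
}
```

<p>a logit of 0 decodes to opacity 0.5, and a log-scale of 0 decodes to a scale of 1.</p>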

<p><img src="/assets/images/posts/Pasted image 20260412233548.png" alt="Pasted image 20260412233548.png" /></p>

<p>spherical harmonic (SH) coefficients are just a frequency-domain representation of a color function defined over the unit sphere. why spherical harmonics? because in the real world, the color of a surface depends on the viewing direction, and SH coefficients encode this view-dependent appearance compactly.</p>

<p>SH functions are organized in bands (like octaves in music). as you go higher up in the bands you have more coefficients, which capture finer details. the INRIA 3DGS format stores up to band 3 (48 coefficients per splat for RGB)</p>

<p>To decode band 0: the band-0 SH basis function is $Y_0^0 = \frac{1}{2\sqrt{\pi}} \approx 0.282$, and the conversion from SH coefficient to RGB is:</p>

\[\text{color} = \text{clamp}\left(0.5 + C_0 \cdot f_{dc},\ 0,\ 1\right)\]

<p>where $C_0 = Y_0^0$ and $f_{dc}$ is the 3-component DC coefficient from the file.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">const</span> <span class="n">SH_C0</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="mf">0.28209479177387814</span><span class="p">;</span>

<span class="k">pub</span> <span class="k">fn</span> <span class="nf">sh_band0_to_rgb</span><span class="p">(</span><span class="n">f_dc</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Vec3</span> <span class="p">{</span>
    <span class="p">(</span><span class="nn">Vec3</span><span class="p">::</span><span class="nf">splat</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span> <span class="o">+</span> <span class="n">SH_C0</span> <span class="o">*</span> <span class="n">f_dc</span><span class="p">)</span><span class="nf">.clamp</span><span class="p">(</span><span class="nn">Vec3</span><span class="p">::</span><span class="n">ZERO</span><span class="p">,</span> <span class="nn">Vec3</span><span class="p">::</span><span class="n">ONE</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="the-forward-pass-pipeline">the forward pass pipeline</h2>

<p>the forward pass turns a list of 3D Gaussians + a camera into a 2D image. here is an overview of the rendering pipeline:</p>

<p><img src="/assets/images/posts/Pasted image 20260412233610.png" alt="Pasted image 20260412233610.png" /></p>

<h2 id="step-1-projecting-splats">Step 1: Projecting Splats</h2>

<h3 id="11-building-the-3d-covariance-matrix">1.1: building the 3D covariance matrix</h3>

<p>for each splat given the raw <code class="language-plaintext highlighter-rouge">(scale, rotation)</code> pairs we need to construct a 3D covariance matrix $\Sigma$ that describes the shape and orientation of the Gaussian in world space. the formula is:</p>

\[\Sigma = R \cdot S \cdot S^T \cdot R^T\]

<p>where R is the 3×3 rotation matrix from the quaternion, and S is a diagonal matrix of scales. if we let M = R·S, this simplifies to:</p>

\[\Sigma = M \cdot M^T\]

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">r_mat</span> <span class="o">=</span> <span class="nn">Mat3</span><span class="p">::</span><span class="nf">from_quat</span><span class="p">(</span><span class="n">s</span><span class="py">.rot</span><span class="p">);</span>
<span class="k">let</span> <span class="n">s_mat</span> <span class="o">=</span> <span class="nn">Mat3</span><span class="p">::</span><span class="nf">from_diagonal</span><span class="p">(</span><span class="n">s</span><span class="py">.scale</span><span class="p">);</span>
<span class="k">let</span> <span class="n">m</span> <span class="o">=</span> <span class="n">r_mat</span> <span class="o">*</span> <span class="n">s_mat</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov3d</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">m</span><span class="nf">.transpose</span><span class="p">();</span>
</code></pre></div></div>

<hr />

<p><strong>Note: why do we decompose the covariance this way?</strong></p>

<p>Covariance matrices have physical meaning only when they are <strong>positive semi-definite</strong>. gradient descent cannot easily be constrained to produce valid matrices, but by expressing the covariance as $M \cdot M^T$ it is guaranteed to be positive semi-definite, since a matrix of the form $A A^T$ always is. this is a reparametrization trick: we optimize <code class="language-plaintext highlighter-rouge">scale</code> and <code class="language-plaintext highlighter-rouge">rotation</code> separately, which are unconstrained, and the covariance we derive from them is always valid</p>

<hr />

<p>what does this matrix actually look like? for a splat with <code class="language-plaintext highlighter-rouge">scale = (0.1, 0.05, 0.02)</code> and identity rotation:</p>

\[\Sigma = \text{diag}(0.1^2,\; 0.05^2,\; 0.02^2) =
\begin{pmatrix}
0.01 &amp; 0 &amp; 0 \\
0 &amp; 0.0025 &amp; 0 \\
0 &amp; 0 &amp; 0.0004
\end{pmatrix}\]

<p>with identity rotation, it is just the squared scales on the diagonal, an axis-aligned ellipsoid</p>
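<p>here is the same construction written out with plain 3×3 arrays instead of glam (the quaternion layout is (x, y, z, w), and the helper names are mine). with the identity rotation it reproduces the diagonal matrix above:</p>

```rust
// standard rotation matrix from a unit quaternion q = (x, y, z, w)
fn quat_to_mat3(q: [f32; 4]) -> [[f32; 3]; 3] {
    let (x, y, z, w) = (q[0], q[1], q[2], q[3]);
    [
        [1.0 - 2.0 * (y * y + z * z), 2.0 * (x * y - w * z), 2.0 * (x * z + w * y)],
        [2.0 * (x * y + w * z), 1.0 - 2.0 * (x * x + z * z), 2.0 * (y * z - w * x)],
        [2.0 * (x * z - w * y), 2.0 * (y * z + w * x), 1.0 - 2.0 * (x * x + y * y)],
    ]
}

// Sigma = (R * S) * (R * S)^T, symmetric positive semi-definite by construction
fn cov3d(q: [f32; 4], scale: [f32; 3]) -> [[f32; 3]; 3] {
    let r = quat_to_mat3(q);
    // M = R * S: multiplying by a diagonal matrix scales the columns of R
    let mut m = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            m[i][j] = r[i][j] * scale[j];
        }
    }
    // Sigma = M * M^T
    let mut cov = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                cov[i][j] += m[i][k] * m[j][k];
            }
        }
    }
    cov
}
```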

<h3 id="12-transforming-into-view-space">1.2: transforming into view space</h3>

<p>the 3D covariance we just computed lives in world space. to project it onto the camera’s image plane, we first need to rotate it into view space, the coordinate system where the camera is at the origin, looking down −z</p>

<p>for the splat center, this is just a matrix-vector multiply with the 4×4 view matrix:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">p_view4</span> <span class="o">=</span> <span class="n">view</span> <span class="o">*</span> <span class="nn">Vec4</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">s</span><span class="py">.pos.x</span><span class="p">,</span> <span class="n">s</span><span class="py">.pos.y</span><span class="p">,</span> <span class="n">s</span><span class="py">.pos.z</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">p_view</span> <span class="o">=</span> <span class="nn">Vec3</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">p_view4</span><span class="py">.x</span><span class="p">,</span> <span class="n">p_view4</span><span class="py">.y</span><span class="p">,</span> <span class="n">p_view4</span><span class="py">.z</span><span class="p">);</span>
<span class="k">if</span> <span class="n">p_view</span><span class="py">.z</span> <span class="o">&gt;</span> <span class="o">-</span><span class="n">znear</span> <span class="p">||</span> <span class="n">p_view</span><span class="py">.z</span> <span class="o">&lt;</span> <span class="o">-</span><span class="n">zfar</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">zc</span> <span class="o">=</span> <span class="o">-</span><span class="n">p_view</span><span class="py">.z</span><span class="p">;</span>
</code></pre></div></div>

<p>note <code class="language-plaintext highlighter-rouge">zc = -p_view.z</code>. our view space is right-handed with the camera looking down <strong>−z</strong>, so points in front of the camera have negative z. we use <code class="language-plaintext highlighter-rouge">zc</code> (positive in front) as the depth for sorting and projection.</p>

<p>for the covariance, we rotate it by the 3×3 part of the view matrix W:</p>

\[\Sigma_{view} = W \cdot \Sigma \cdot W^T\]

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">w_mat</span> <span class="o">=</span> <span class="nn">Mat3</span><span class="p">::</span><span class="nf">from_mat4</span><span class="p">(</span><span class="n">view</span><span class="p">);</span>
<span class="k">let</span> <span class="n">w_mat_t</span> <span class="o">=</span> <span class="n">w_mat</span><span class="nf">.transpose</span><span class="p">();</span>

<span class="k">let</span> <span class="n">cov3d_view</span> <span class="o">=</span> <span class="n">w_mat</span> <span class="o">*</span> <span class="n">cov3d</span> <span class="o">*</span> <span class="n">w_mat_t</span><span class="p">;</span>
</code></pre></div></div>

<p>this is just the standard basis-change formula for a covariance matrix. the shape of the ellipsoid does not change; in fact we are only re-expressing it in the camera’s coordinate system.</p>

<h3 id="13-projecting-to-2d">1.3: projecting to 2D</h3>

<p>now we have a 3D Gaussian in view space and we need to project it onto the 2D image plane. the projection is perspective, which means a 3D Gaussian does not project to an exact 2D Gaussian because perspective is a nonlinear transform. but we can locally linearize it using the Jacobian of the projection function, and the result is close enough.</p>

<p>the projection function maps a 3D point (x, y, z) in view space to pixel coordinates (u, v):</p>

<p>\(u = f_x \cdot \frac{x}{z_c} + c_x\)
\(v = f_y \cdot \frac{y}{z_c} + c_y\)</p>

<p>where $f_x, f_y$ are the focal lengths and $c_x, c_y$ are the principal point (image center). for simplicity of calculation we can also assume $f_x = f_y$</p>

<p>the Jacobian J of this projection evaluated at the splat center is:</p>

\[J = \begin{bmatrix} \frac{f_x}{z_c} &amp; 0 &amp; \frac{f_x \cdot x_v}{z_c^2} \\ 0 &amp; \frac{f_y}{z_c} &amp; \frac{f_y \cdot y_v}{z_c^2} \\ 0 &amp; 0 &amp; 0 \end{bmatrix}\]

<p>the structure of this matrix is very sparse: only 4 of the 9 entries are nonzero. we could do the full $J \Sigma_{view} J^T$ with two 3×3 matrix multiplies (~54 scalar multiplies), but we only need the top-left $2\times 2$ of the result, since the third row of $J$ is all zeros and the first two rows of $J$ each have only two nonzero entries. so instead of two full matrix multiplies, we can compute the 2D covariance with ~20 scalar multiplies by expanding the product by hand:</p>

<hr />

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">c</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">cov3d_view</span><span class="p">;</span>
<span class="k">let</span> <span class="n">inv_zc</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">zc</span><span class="p">;</span>
<span class="k">let</span> <span class="n">inv_zc2</span> <span class="o">=</span> <span class="n">inv_zc</span> <span class="o">*</span> <span class="n">inv_zc</span><span class="p">;</span>

<span class="k">let</span> <span class="n">j00</span> <span class="o">=</span> <span class="n">fx</span> <span class="o">*</span> <span class="n">inv_zc</span><span class="p">;</span>
<span class="k">let</span> <span class="n">j02</span> <span class="o">=</span> <span class="n">fx</span> <span class="o">*</span> <span class="n">xv</span> <span class="o">*</span> <span class="n">inv_zc2</span><span class="p">;</span>
<span class="k">let</span> <span class="n">j11</span> <span class="o">=</span> <span class="n">fy</span> <span class="o">*</span> <span class="n">inv_zc</span><span class="p">;</span>
<span class="k">let</span> <span class="n">j12</span> <span class="o">=</span> <span class="n">fy</span> <span class="o">*</span> <span class="n">yv</span> <span class="o">*</span> <span class="n">inv_zc2</span><span class="p">;</span>

<span class="c1">// Row 0 of J * C: [j00*c00 + j02*c20, j00*c01 + j02*c21, j00*c02 + j02*c22]</span>
<span class="k">let</span> <span class="n">t0x</span> <span class="o">=</span> <span class="n">j00</span> <span class="o">*</span> <span class="n">c</span><span class="py">.x_axis.x</span> <span class="o">+</span> <span class="n">j02</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.x</span><span class="p">;</span>
<span class="k">let</span> <span class="n">t0y</span> <span class="o">=</span> <span class="n">j00</span> <span class="o">*</span> <span class="n">c</span><span class="py">.y_axis.x</span> <span class="o">+</span> <span class="n">j02</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.y</span><span class="p">;</span>
<span class="k">let</span> <span class="n">t0z</span> <span class="o">=</span> <span class="n">j00</span> <span class="o">*</span> <span class="n">c</span><span class="py">.x_axis.z</span> <span class="o">+</span> <span class="n">j02</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.z</span><span class="p">;</span>

<span class="c1">// Row 1 of J * C: [j11*c10 + j12*c20, j11*c11 + j12*c21, j11*c12 + j12*c22]</span>
<span class="k">let</span> <span class="n">t1y</span> <span class="o">=</span> <span class="n">j11</span> <span class="o">*</span> <span class="n">c</span><span class="py">.y_axis.y</span> <span class="o">+</span> <span class="n">j12</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.y</span><span class="p">;</span>
<span class="k">let</span> <span class="n">t1z</span> <span class="o">=</span> <span class="n">j11</span> <span class="o">*</span> <span class="n">c</span><span class="py">.y_axis.z</span> <span class="o">+</span> <span class="n">j12</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.z</span><span class="p">;</span>

<span class="c1">// 2D cov = (J*C) * J^T, top-left 2x2:</span>
<span class="k">let</span> <span class="n">cov2d_00</span> <span class="o">=</span> <span class="n">t0x</span> <span class="o">*</span> <span class="n">j00</span> <span class="o">+</span> <span class="n">t0z</span> <span class="o">*</span> <span class="n">j02</span> <span class="o">+</span> <span class="n">eps2d</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov2d_01</span> <span class="o">=</span> <span class="n">t0y</span> <span class="o">*</span> <span class="n">j11</span> <span class="o">+</span> <span class="n">t0z</span> <span class="o">*</span> <span class="n">j12</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov2d_11</span> <span class="o">=</span> <span class="n">t1y</span> <span class="o">*</span> <span class="n">j11</span> <span class="o">+</span> <span class="n">t1z</span> <span class="o">*</span> <span class="n">j12</span> <span class="o">+</span> <span class="n">eps2d</span><span class="p">;</span>
</code></pre></div></div>

<p>notice the eps2d on the diagonal entries. that is a small dilation (default 0.3) added for numerical stability; it ensures the 2D covariance is strictly positive definite (not just semi-definite), which means it is always invertible.</p>

<hr />

<p><strong>Note: why the eps2d trick works</strong></p>

<p>by construction, the 2D covariance $JCJ^T$ is only positive semi-definite ($A^T A$ form). but we need to invert it later (for evaluating the Gaussian at each pixel). a singular matrix is not invertible.</p>

<p>adding eps2d to the diagonal means adding $\lambda I$ to the matrix. for any vector x:</p>

\[x^T \cdot (A^T A + \lambda I) \cdot x = \|Ax\|^2 + \lambda \|x\|^2 &gt; 0\]

<p>this is strictly positive for any nonzero x, which is the definition of positive definite, invertible, with all eigenvalues strictly positive</p>
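<p>a quick numeric illustration (helper names are mine): a perfectly degenerate 2×2 covariance has zero determinant and cannot be inverted, but after the eps2d dilation its determinant is strictly positive:</p>

```rust
// determinant of a 2x2 matrix
fn det2(m: [[f32; 2]; 2]) -> f32 {
    m[0][0] * m[1][1] - m[0][1] * m[1][0]
}

// the eps2d dilation: add eps * I to the matrix
fn add_eps(m: [[f32; 2]; 2], eps: f32) -> [[f32; 2]; 2] {
    [[m[0][0] + eps, m[0][1]], [m[1][0], m[1][1] + eps]]
}
```

<p>for example, [[1, 1], [1, 1]] has det 0; after adding eps2d = 0.3 the det is 1.3·1.3 - 1 = 0.69, so the matrix is invertible.</p>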

<hr />

<p>next we invert the $2\times 2$ covariance. for a $2\times 2$ matrix the inverse has a closed form:</p>

\[\begin{bmatrix} a &amp; b \\ b &amp; d \end{bmatrix}^{-1} = \frac{1}{ad - b^2} \begin{bmatrix} d &amp; -b \\ -b &amp; a \end{bmatrix}\]

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">det</span> <span class="o">=</span> <span class="n">cov2d_00</span> <span class="o">*</span> <span class="n">cov2d_11</span> <span class="o">-</span> <span class="n">cov2d_01</span> <span class="o">*</span> <span class="n">cov2d_01</span><span class="p">;</span>
<span class="k">if</span> <span class="n">det</span> <span class="o">&lt;=</span> <span class="mf">0.0</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">inv_det</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">det</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov2d_inv</span> <span class="o">=</span> <span class="nn">Mat2</span><span class="p">::</span><span class="nf">from_cols</span><span class="p">(</span>
    <span class="nn">Vec2</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">cov2d_11</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">,</span> <span class="o">-</span><span class="n">cov2d_01</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">),</span>
    <span class="nn">Vec2</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="o">-</span><span class="n">cov2d_01</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">,</span> <span class="n">cov2d_00</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">),</span>
<span class="p">);</span>
</code></pre></div></div>

<p>and the screen position is a standard perspective divide:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">sx</span> <span class="o">=</span> <span class="n">fx</span> <span class="o">*</span> <span class="n">xv</span> <span class="o">*</span> <span class="n">inv_zc</span> <span class="o">+</span> <span class="n">cx</span><span class="p">;</span>
<span class="k">let</span> <span class="n">sy</span> <span class="o">=</span> <span class="n">fy</span> <span class="o">*</span> <span class="n">yv</span> <span class="o">*</span> <span class="n">inv_zc</span> <span class="o">+</span> <span class="n">cy</span><span class="p">;</span>
</code></pre></div></div>

<p>at this point, for each splat we have: screen position <code class="language-plaintext highlighter-rouge">(sx, sy)</code>, depth <code class="language-plaintext highlighter-rouge">zc</code>, and the inverse 2D covariance matrix <code class="language-plaintext highlighter-rouge">cov2d_inv</code>. this is everything we need to evaluate the Gaussian at any pixel on screen.</p>
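<p>to make that concrete, evaluating the gaussian at a pixel is just a quadratic form with the inverse covariance (a sketch with plain 2×2 arrays; the name is mine):</p>

```rust
// falloff of the projected gaussian at a pixel offset (dx, dy) from the
// splat center: exp(-0.5 * d^T * cov2d_inv * d), in (0, 1]
fn gaussian_weight(dx: f32, dy: f32, inv: [[f32; 2]; 2]) -> f32 {
    // the two off-diagonal terms are equal because the covariance is symmetric
    let power = -0.5 * (inv[0][0] * dx * dx + 2.0 * inv[0][1] * dx * dy + inv[1][1] * dy * dy);
    power.exp()
}
```

<p>multiplying this weight by the splat’s opacity gives its alpha contribution to the pixel.</p>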

<h2 id="step-2-computing-bounding-boxes">step 2: computing bounding boxes</h2>

<p>in principle we would now have to evaluate the splat at every pixel on the screen, because a 2D gaussian has infinite support, it never truly reaches zero. instead, we compute a bounding box that encloses the region where the gaussian has any visible effect and only evaluate the pixels inside it</p>

<p>the original 3DGS code computes the two eigenvalues $\lambda_1$, $\lambda_2$ of the 2D covariance (the variances along the two principal axes of the ellipse), takes $r = 3\sqrt{\max(\lambda_1, \lambda_2)}$ (the 3-sigma rule, covering 99.7% of the Gaussian), and uses a circle of that radius as the bounding box</p>

<p>this is simple but wasteful. when a Gaussian is elongated (one eigenvalue much larger than the other), the bounding circle will also include a lot of empty space</p>

<p><img src="/assets/images/posts/Pasted image 20260412233627.png" alt="Pasted image 20260412233627.png" /></p>
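<p>for reference, the bounding-circle radius of the classic approach can be sketched like this (closed-form eigenvalues of the symmetric 2×2 covariance [[a, b], [b, d]]; the function name is mine):</p>

```rust
// classic 3DGS bounding radius: 3 sigma along the longest principal axis
fn bounding_radius(a: f32, b: f32, d: f32) -> f32 {
    // eigenvalues of the symmetric matrix [[a, b], [b, d]] are mid +/- half
    let mid = 0.5 * (a + d);
    let half = (0.25 * (a - d) * (a - d) + b * b).sqrt();
    let lambda_max = mid + half;
    3.0 * lambda_max.sqrt()
}
```

<p>for cov2d = [[4, 0], [0, 1]] this gives a radius of 6, i.e. a 12×12 bounding square, while per-axis extents of $3\sqrt{\Sigma_{00}} = 6$ and $3\sqrt{\Sigma_{11}} = 3$ would give a tighter 12×6 box.</p>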

<p>We can create a tighter bounding box by observing that:</p>

<ol>
  <li>
    <p>the extent along each axis is $k\sqrt{\Sigma_{ii}}$ where $\Sigma_{ii}$ is the diagonal entry of the 2D covariance for that axis. for an elongated ellipse, the short-axis extent is much smaller than the long-axis extent, so the bounding box is tighter.</p>
  </li>
  <li>
    <p>the classic $3\sigma$ rule is conservative. a faint splat (low opacity) does not need $3\sigma$ because its contribution drops below the visibility threshold much sooner. the cutoff distance $k$ can be computed from the point where the splat’s contribution falls below a threshold $\tau$</p>
  </li>
</ol>

<p>based on this, the alpha a splat contributes to a pixel at offset $\mathbf{d}$ from its center is</p>

\[\alpha = \text{opacity} \cdot \exp(-\tfrac{1}{2} \cdot \mathbf{d}^T \Sigma^{-1} \mathbf{d})\]

<p>we want $\alpha \geq \tau$, which rearranges to:</p>

\[\mathbf{d}^T \Sigma^{-1} \mathbf{d} \leq 2 \ln\left(\frac{\text{opacity}}{\tau}\right) = k^2\]

<p>for a near-opaque splat (opacity = 1) with $\tau = 1/255$, $k^2 = 2\ln(255) \approx 11.1$, which is close to the $3\sigma$ value of 9. for a faint splat (opacity = 0.1), $k^2 = 2\ln(25.5) \approx 6.5$, so the box and the computation shrink substantially.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">s</span><span class="py">.opacity</span> <span class="o">&lt;=</span> <span class="n">alpha_threshold</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">k2</span> <span class="o">=</span> <span class="p">(</span><span class="mf">2.0</span> <span class="o">*</span> <span class="p">(</span><span class="n">s</span><span class="py">.opacity</span> <span class="o">/</span> <span class="n">alpha_threshold</span><span class="p">)</span><span class="nf">.ln</span><span class="p">())</span><span class="nf">.min</span><span class="p">(</span><span class="n">max_k2</span><span class="p">);</span>
<span class="k">if</span> <span class="o">!</span><span class="p">(</span><span class="n">k2</span> <span class="o">&gt;</span> <span class="mf">0.0</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">rx_f</span> <span class="o">=</span> <span class="p">(</span><span class="n">k2</span> <span class="o">*</span> <span class="n">cov2d_00</span><span class="p">)</span><span class="nf">.sqrt</span><span class="p">();</span>
<span class="k">let</span> <span class="n">ry_f</span> <span class="o">=</span> <span class="p">(</span><span class="n">k2</span> <span class="o">*</span> <span class="n">cov2d_11</span><span class="p">)</span><span class="nf">.sqrt</span><span class="p">();</span>
<span class="k">if</span> <span class="o">!</span><span class="n">rx_f</span><span class="nf">.is_finite</span><span class="p">()</span> <span class="p">||</span> <span class="o">!</span><span class="n">ry_f</span><span class="nf">.is_finite</span><span class="p">()</span> <span class="p">||</span> <span class="n">rx_f</span> <span class="o">&lt;</span> <span class="mf">1.0</span> <span class="p">||</span> <span class="n">ry_f</span> <span class="o">&lt;</span> <span class="mf">1.0</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">x0</span> <span class="o">=</span> <span class="p">(</span><span class="n">sx</span> <span class="o">-</span> <span class="n">rx_f</span><span class="p">)</span><span class="nf">.floor</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">y0</span> <span class="o">=</span> <span class="p">(</span><span class="n">sy</span> <span class="o">-</span> <span class="n">ry_f</span><span class="p">)</span><span class="nf">.floor</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">x1</span> <span class="o">=</span> <span class="p">(</span><span class="n">sx</span> <span class="o">+</span> <span class="n">rx_f</span><span class="p">)</span><span class="nf">.ceil</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">y1</span> <span class="o">=</span> <span class="p">(</span><span class="n">sy</span> <span class="o">+</span> <span class="n">ry_f</span><span class="p">)</span><span class="nf">.ceil</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>

<span class="c1">// Clip to framebuffer.</span>
<span class="k">let</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">x0</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">y0</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">x1</span> <span class="o">=</span> <span class="n">x1</span><span class="nf">.min</span><span class="p">(</span><span class="n">w_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">let</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">y1</span><span class="nf">.min</span><span class="p">(</span><span class="n">h_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="n">x0</span> <span class="o">&gt;</span> <span class="n">x1</span> <span class="p">||</span> <span class="n">y0</span> <span class="o">&gt;</span> <span class="n">y1</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also, because each splat is independent of the others, the projection computation is embarrassingly parallel. we use rayon’s <code class="language-plaintext highlighter-rouge">par_iter().filter_map()</code> to project all splats across all cores:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">project</span><span class="p">(</span>
    <span class="n">splats</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Splat</span><span class="p">],</span>
    <span class="n">camera</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">OrbitCamera</span><span class="p">,</span>
    <span class="n">params</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">RenderParams</span><span class="p">,</span>
    <span class="n">pool</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="nn">rayon</span><span class="p">::</span><span class="n">ThreadPool</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Projected</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="c1">// ... precompute view matrix, intrinsics, etc (once per frame) ...</span>

    <span class="k">let</span> <span class="n">do_project</span> <span class="o">=</span> <span class="p">||</span> <span class="p">{</span>
        <span class="n">splats</span>
            <span class="nf">.par_iter</span><span class="p">()</span>
            <span class="nf">.filter_map</span><span class="p">(|</span><span class="n">s</span><span class="p">|</span> <span class="p">{</span>
                <span class="c1">// ... all the math above, returning Some(Projected) or None ...</span>
            <span class="p">})</span>
            <span class="nf">.collect</span><span class="p">()</span>
    <span class="p">};</span>

    <span class="k">match</span> <span class="n">pool</span><span class="nf">.as_ref</span><span class="p">()</span> <span class="p">{</span>
        <span class="nf">Some</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="n">p</span><span class="nf">.install</span><span class="p">(</span><span class="n">do_project</span><span class="p">),</span>
        <span class="nb">None</span> <span class="k">=&gt;</span> <span class="nf">do_project</span><span class="p">(),</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>we can now pack everything into a <code class="language-plaintext highlighter-rouge">Projected</code> struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Projected</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">screen</span><span class="p">:</span> <span class="n">Vec2</span><span class="p">,</span>       <span class="c1">// pixel center (sx, sy)</span>
    <span class="k">pub</span> <span class="n">depth</span><span class="p">:</span> <span class="nb">f32</span><span class="p">,</span>         <span class="c1">// zc (positive in front)</span>
    <span class="k">pub</span> <span class="n">cov2d_inv</span><span class="p">:</span> <span class="n">Mat2</span><span class="p">,</span>    <span class="c1">// inverse 2D covariance</span>
    <span class="k">pub</span> <span class="n">bbox</span><span class="p">:</span> <span class="p">[</span><span class="nb">i32</span><span class="p">;</span> <span class="mi">4</span><span class="p">],</span>     <span class="c1">// inclusive: x0, y0, x1, y1</span>
    <span class="k">pub</span> <span class="n">color</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>        <span class="c1">// RGB</span>
    <span class="k">pub</span> <span class="n">opacity</span><span class="p">:</span> <span class="nb">f32</span><span class="p">,</span>       <span class="c1">// [0, 1]</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="step-3-depth-sort">step 3: depth sort</h2>

<p>For alpha compositing we use <a href="https://en.wikipedia.org/wiki/Painter's_algorithm">painter’s algorithm in reverse</a>: the projected splats must be sorted by depth before compositing. the order matters because if a close splat occludes a far one, the close splat must be composited first so it carries more weight in the blend. because all depths are positive (we culled anything behind the camera), the bit pattern of an <code class="language-plaintext highlighter-rouge">f32</code> preserves float ordering when reinterpreted as a <code class="language-plaintext highlighter-rouge">u32</code>, which lets us use <code class="language-plaintext highlighter-rouge">depth.to_bits()</code> as a sort key.</p>
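<p>a quick standalone sanity check of this bit trick (illustrative only, not part of the renderer):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn main() {
    // For non-negative f32 values, the IEEE-754 bit pattern sorts in
    // the same order as the float itself.
    let depths: Vec&lt;f32&gt; = vec![3.5, 0.25, 1e6, 0.0, 42.0];

    // Sort one copy by float comparison, another by the u32 bit pattern.
    let mut by_float = depths.clone();
    by_float.sort_by(|a, b| a.partial_cmp(b).unwrap());

    let mut by_bits = depths.clone();
    by_bits.sort_by_key(|d| d.to_bits());

    assert_eq!(by_float, by_bits);
}
</code></pre></div></div>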

<p>for the actual sorting we use a simple 2-pass 16-bit radix sort. for our input sizes (~100k–200k elements) it is faster than comparison-based sorting (like Rust’s <code class="language-plaintext highlighter-rouge">sort_unstable_by_key</code>) since it runs in O(n) time with small constants. 2 passes of 16 bits worked better than 4 passes of 8 bits: fewer passes means fewer traversals of the data, and the 65536-entry histograms (256KB each) fit comfortably in my laptop’s L2 cache. at 200k splats this is consistently 2 times faster than Rust’s comparison sort.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">sort_by_depth</span><span class="p">(</span><span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">Projected</span><span class="p">],</span> <span class="n">scratch</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">ScratchBuffers</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">n</span> <span class="o">=</span> <span class="n">projected</span><span class="nf">.len</span><span class="p">();</span>
    <span class="k">if</span> <span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.reserve</span><span class="p">(</span><span class="n">n</span><span class="nf">.saturating_sub</span><span class="p">(</span><span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.capacity</span><span class="p">()));</span>
    <span class="k">unsafe</span> <span class="p">{</span> <span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.set_len</span><span class="p">(</span><span class="n">n</span><span class="p">);</span> <span class="p">}</span>
    <span class="k">let</span> <span class="n">aux</span> <span class="o">=</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">scratch</span><span class="py">.sort_aux</span><span class="p">;</span>

    <span class="c1">// Both histograms can be computed in a single pass over the keys.</span>
    <span class="c1">// Stack-allocated to avoid heap alloc.</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">counts_lo</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">counts_hi</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
    <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">projected</span><span class="nf">.iter</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">k</span> <span class="o">=</span> <span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">();</span>
        <span class="n">counts_lo</span><span class="p">[(</span><span class="n">k</span> <span class="o">&amp;</span> <span class="mi">0xFFFF</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">counts_hi</span><span class="p">[(</span><span class="n">k</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Pass 1: sort by low 16 bits</span>
    <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">offsets</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0u32</span><span class="p">;</span>
        <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">65536</span> <span class="p">{</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
            <span class="n">sum</span> <span class="o">+=</span> <span class="n">counts_lo</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">projected</span><span class="nf">.iter</span><span class="p">()</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">bucket</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mi">0xFFFF</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">aux</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1">// Pass 2: sort by high 16 bits</span>
    <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">offsets</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0u32</span><span class="p">;</span>
        <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">65536</span> <span class="p">{</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
            <span class="n">sum</span> <span class="o">+=</span> <span class="n">counts_hi</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">aux</span><span class="nf">.iter</span><span class="p">()</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">bucket</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">()</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">projected</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>when a thread pool is available and there are at least 50k splats, we hand the work to rayon’s parallel comparison sort instead:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">sort_by_depth_parallel</span><span class="p">(</span>
    <span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">Projected</span><span class="p">],</span>
    <span class="n">scratch</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">ScratchBuffers</span><span class="p">,</span>
    <span class="n">pool</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="nn">rayon</span><span class="p">::</span><span class="n">ThreadPool</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">projected</span><span class="nf">.len</span><span class="p">()</span> <span class="o">&lt;</span> <span class="mi">50_000</span> <span class="p">||</span> <span class="n">pool</span><span class="nf">.is_none</span><span class="p">()</span> <span class="p">{</span>
        <span class="nf">sort_by_depth</span><span class="p">(</span><span class="n">projected</span><span class="p">,</span> <span class="n">scratch</span><span class="p">);</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">pool</span><span class="nf">.as_ref</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.install</span><span class="p">(||</span> <span class="p">{</span>
        <span class="n">projected</span><span class="nf">.par_sort_unstable_by_key</span><span class="p">(|</span><span class="n">p</span><span class="p">|</span> <span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">());</span>
    <span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="step-4-tile-binning">step 4: tile binning</h2>

<p>we now have a depth-sorted list of projected splats. the naive approach to compositing would be: for each splat, iterate every pixel in its bounding box and accumulate color. this works but it is cache-hostile because different splats touch overlapping pixel regions in unpredictable order.</p>

<p>the original 3DGS implementation divides the image into tiles (16×16 pixel blocks) and builds an index that tells each tile exactly which splats overlap it. then each tile composites only its own splats, in order, touching only its own pixels.</p>

<p>the binning uses a two-pass count-then-scatter approach. in pass 1, for each splat we count how many tiles its bbox touches, building a per-tile count array. we then prefix-sum the counts into offsets (so <code class="language-plaintext highlighter-rouge">offsets[i]</code> = start of tile i’s bucket). in pass 2, we scatter each splat’s index into its tile buckets, advancing a per-tile cursor. because we iterate splats in depth-sorted order, each tile’s bucket ends up depth-sorted automatically, so no additional per-tile sort is needed.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">bin_splats</span><span class="p">(</span>
    <span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Projected</span><span class="p">],</span>
    <span class="n">width</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">height</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">bins</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">TileBins</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">num_tiles_x</span> <span class="o">=</span> <span class="p">((</span><span class="n">width</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">+</span> <span class="n">TILE_W</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles_y</span> <span class="o">=</span> <span class="p">((</span><span class="n">height</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">+</span> <span class="n">TILE_H</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles</span> <span class="o">=</span> <span class="p">(</span><span class="n">num_tiles_x</span> <span class="o">*</span> <span class="n">num_tiles_y</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>

    <span class="c1">// offsets is used first as a count array, then prefix-summed in place.</span>
    <span class="n">bins</span><span class="py">.offsets</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">bins</span><span class="py">.offsets</span><span class="nf">.resize</span><span class="p">(</span><span class="n">num_tiles</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="c1">// Pass 1: count tile touches per splat.</span>
    <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">projected</span> <span class="p">{</span>
        <span class="k">let</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">tx0</span> <span class="o">=</span> <span class="p">(</span><span class="n">x0</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty0</span> <span class="o">=</span> <span class="p">(</span><span class="n">y0</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">tx1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty1</span> <span class="o">=</span> <span class="p">(</span><span class="n">y1</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="n">tx0</span> <span class="o">&gt;</span> <span class="n">tx1</span> <span class="p">||</span> <span class="n">ty0</span> <span class="o">&gt;</span> <span class="n">ty1</span> <span class="p">{</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">ty</span> <span class="k">in</span> <span class="n">ty0</span><span class="o">..=</span><span class="n">ty1</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">row</span> <span class="o">=</span> <span class="p">(</span><span class="n">ty</span> <span class="o">*</span> <span class="n">num_tiles_x</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">tx</span> <span class="k">in</span> <span class="n">tx0</span><span class="o">..=</span><span class="n">tx1</span> <span class="p">{</span>
                <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">row</span> <span class="o">+</span> <span class="n">tx</span> <span class="k">as</span> <span class="nb">usize</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1">// Prefix sum: offsets[i] = start of bucket i.</span>
    <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">..=</span><span class="n">num_tiles</span> <span class="p">{</span>
        <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">let</span> <span class="n">total</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">num_tiles</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>

    <span class="n">bins</span><span class="py">.splat_indices</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">bins</span><span class="py">.splat_indices</span><span class="nf">.resize</span><span class="p">(</span><span class="n">total</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="c1">// Cursor starts at each bucket's begin and advances as we scatter.</span>
    <span class="n">bins</span><span class="py">.cursor</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">bins</span><span class="py">.cursor</span><span class="nf">.extend_from_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="o">..</span><span class="n">num_tiles</span><span class="p">]);</span>

    <span class="c1">// Pass 2: scatter splat indices into their tile buckets.</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span> <span class="k">in</span> <span class="n">projected</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">let</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">tx0</span> <span class="o">=</span> <span class="p">(</span><span class="n">x0</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty0</span> <span class="o">=</span> <span class="p">(</span><span class="n">y0</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">tx1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty1</span> <span class="o">=</span> <span class="p">(</span><span class="n">y1</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="n">tx0</span> <span class="o">&gt;</span> <span class="n">tx1</span> <span class="p">||</span> <span class="n">ty0</span> <span class="o">&gt;</span> <span class="n">ty1</span> <span class="p">{</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">ty</span> <span class="k">in</span> <span class="n">ty0</span><span class="o">..=</span><span class="n">ty1</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">row</span> <span class="o">=</span> <span class="p">(</span><span class="n">ty</span> <span class="o">*</span> <span class="n">num_tiles_x</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">tx</span> <span class="k">in</span> <span class="n">tx0</span><span class="o">..=</span><span class="n">tx1</span> <span class="p">{</span>
                <span class="k">let</span> <span class="n">tile</span> <span class="o">=</span> <span class="n">row</span> <span class="o">+</span> <span class="n">tx</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
                <span class="k">let</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.cursor</span><span class="p">[</span><span class="n">tile</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
                <span class="n">bins</span><span class="py">.splat_indices</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span> <span class="o">=</span> <span class="n">idx</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">;</span>
                <span class="n">bins</span><span class="py">.cursor</span><span class="p">[</span><span class="n">tile</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>let me show what the data structure looks like with a small example. say we have 4 splats (A, B, C, D); after depth sorting we get something like this:</p>

<p><img src="/assets/images/posts/Pasted image 20260412233641.png" alt="Pasted image 20260412233641.png" /></p>
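<p>to make the index layout concrete, here is a minimal standalone sketch of the count–prefix-sum–scatter pipeline on made-up data (4 depth-sorted splats over 2 tiles); the arrays mirror <code class="language-plaintext highlighter-rouge">offsets</code> and <code class="language-plaintext highlighter-rouge">splat_indices</code> above:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// CSR-style tile index: offsets[t]..offsets[t+1] is tile t's bucket in
// splat_indices. Splat/tile overlaps here are made up for illustration.
fn main() {
    // 4 depth-sorted splats; for each, the tiles its bbox touches.
    let touches: [&amp;[usize]; 4] = [&amp;[0], &amp;[0, 1], &amp;[1], &amp;[0]];
    let num_tiles = 2;

    // Pass 1: count touches per tile (shifted by one for the prefix sum).
    let mut offsets = vec![0u32; num_tiles + 1];
    for tiles in &amp;touches {
        for &amp;t in *tiles {
            offsets[t + 1] += 1;
        }
    }
    for i in 1..=num_tiles {
        offsets[i] += offsets[i - 1];
    }

    // Pass 2: scatter splat indices using a per-tile cursor.
    let mut cursor: Vec&lt;u32&gt; = offsets[..num_tiles].to_vec();
    let mut splat_indices = vec![0u32; offsets[num_tiles] as usize];
    for (idx, tiles) in touches.iter().enumerate() {
        for &amp;t in *tiles {
            splat_indices[cursor[t] as usize] = idx as u32;
            cursor[t] += 1;
        }
    }

    // tile 0 holds splats {0, 1, 3}, tile 1 holds {1, 2}; each bucket
    // comes out depth-sorted because we scattered in sorted order.
    assert_eq!(offsets, vec![0, 3, 5]);
    assert_eq!(splat_indices, vec![0, 1, 3, 1, 2]);
}
</code></pre></div></div>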

<h2 id="step-5-alpha-compositing">step 5: alpha compositing</h2>

<p>now for each tile, we iterate its splats front-to-back, and for each pixel in the splat’s bounding box (clipped to the tile), we evaluate the 2D Gaussian and blend the color.</p>

<p>for a pixel at position $(p_x, p_y)$ and a splat centered at $(s_x, s_y)$, the displacement is $\mathbf{d} = (p_x - s_x, p_y - s_y)$. the Gaussian exponent is:</p>

\[\text{power} = -\tfrac{1}{2} \cdot \mathbf{d}^T \cdot \Sigma^{-1} \cdot \mathbf{d}\]

<p>expanding for a 2×2 symmetric inverse covariance with entries $(a, b; b, d)$:</p>

\[\text{power} = -\tfrac{1}{2} \cdot (a \cdot dx^2 + 2b \cdot dx \cdot dy + d \cdot dy^2)\]

<p>the alpha for this splat at this pixel is:</p>

\[\alpha = \text{opacity} \cdot \exp(\text{power})\]

<p>we now composite front-to-back. the framebuffer stores <code class="language-plaintext highlighter-rouge">(rgb_accum, alpha_accum)</code> per pixel, both starting at zero. for each splat, the contribution is:</p>

\[T = 1 - \alpha_{accum}\]

\[\text{contrib} = T \cdot \alpha\]

\[\text{rgb}_{accum} \mathrel{+}= \text{contrib} \cdot \text{color}\]

\[\alpha_{accum} \mathrel{+}= \text{contrib}\]

<p>here $T$ is the transmittance, i.e. how much light can still pass through after the splats already composited. once <code class="language-plaintext highlighter-rouge">alpha_accum</code> reaches the saturation threshold (0.999), the pixel is considered opaque and we skip further splats on it.</p>
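<p>the update rules above can be checked with a tiny standalone sketch (made-up alphas and colors, not the renderer’s real types):</p>

```rust
// Front-to-back compositing of (alpha, rgb) pairs, already depth sorted.
fn composite(splats: &[(f32, [f32; 3])]) -> ([f32; 3], f32) {
    let (mut rgb, mut alpha_accum) = ([0.0f32; 3], 0.0f32);
    for &(alpha, color) in splats {
        let t = 1.0 - alpha_accum; // transmittance: light still passing through
        let contrib = t * alpha;
        for c in 0..3 {
            rgb[c] += contrib * color[c];
        }
        alpha_accum += contrib;
        if alpha_accum >= 0.999 {
            break; // pixel saturated, skip the rest
        }
    }
    (rgb, alpha_accum)
}

fn main() {
    // a red splat in front of a blue one, both alpha 0.5:
    // the front one contributes 0.5, the back one (1 - 0.5) * 0.5 = 0.25
    let (rgb, a) = composite(&[(0.5, [1.0, 0.0, 0.0]), (0.5, [0.0, 0.0, 1.0])]);
    assert!((a - 0.75).abs() < 1e-6);
    assert!((rgb[0] - 0.5).abs() < 1e-6 && (rgb[2] - 0.25).abs() < 1e-6);
    println!("rgb={rgb:?} alpha={a}");
}
```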

<p>the inner pixel loop is the hottest code in the entire rasterizer. to make it as fast as possible, we hoist row-constant terms outside the inner (column) loop. for a fixed row $p_y$, $d_y = p_y - s_y$ is constant, so we precompute every row-dependent term up front and the inner loop reduces to evaluating a simple quadratic in $dx$ with all coefficients precomputed:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">composite_splat</span><span class="p">(</span><span class="n">p</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Projected</span><span class="p">,</span> <span class="n">fb</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[(</span><span class="n">Vec3</span><span class="p">,</span> <span class="nb">f32</span><span class="p">)],</span> <span class="n">w</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">params</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">RenderParams</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>

    <span class="c1">// Extract the inverse covariance matrix elements.</span>
    <span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="n">p</span><span class="py">.cov2d_inv.x_axis.x</span><span class="p">;</span> <span class="c1">// (0,0)</span>
    <span class="k">let</span> <span class="n">b</span> <span class="o">=</span> <span class="n">p</span><span class="py">.cov2d_inv.x_axis.y</span><span class="p">;</span> <span class="c1">// (0,1) = (1,0) since symmetric</span>
    <span class="k">let</span> <span class="n">d</span> <span class="o">=</span> <span class="n">p</span><span class="py">.cov2d_inv.y_axis.y</span><span class="p">;</span> <span class="c1">// (1,1)</span>

    <span class="c1">// Coefficients for the decomposed quadratic.</span>
    <span class="k">let</span> <span class="n">dx_coeff</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">dy_coeff</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">cross_coeff</span> <span class="o">=</span> <span class="o">-</span><span class="n">b</span><span class="p">;</span> <span class="c1">// -0.5 * 2 * b</span>

    <span class="k">let</span> <span class="n">saturation</span> <span class="o">=</span> <span class="n">params</span><span class="py">.saturation</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">alpha_threshold</span> <span class="o">=</span> <span class="n">params</span><span class="py">.alpha_threshold</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">opacity</span> <span class="o">=</span> <span class="n">p</span><span class="py">.opacity</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">color</span> <span class="o">=</span> <span class="n">p</span><span class="py">.color</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">sx</span> <span class="o">=</span> <span class="n">p</span><span class="py">.screen.x</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">sy</span> <span class="o">=</span> <span class="n">p</span><span class="py">.screen.y</span><span class="p">;</span>

    <span class="k">for</span> <span class="n">py</span> <span class="k">in</span> <span class="n">y0</span><span class="o">..=</span><span class="n">y1</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">py</span> <span class="k">as</span> <span class="nb">f32</span> <span class="o">-</span> <span class="n">sy</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">row_base</span> <span class="o">=</span> <span class="n">dy_coeff</span> <span class="o">*</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">row_slope</span> <span class="o">=</span> <span class="n">cross_coeff</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">row_offset</span> <span class="o">=</span> <span class="n">py</span> <span class="k">as</span> <span class="nb">usize</span> <span class="o">*</span> <span class="n">w</span><span class="p">;</span>

        <span class="k">for</span> <span class="n">px</span> <span class="k">in</span> <span class="n">x0</span><span class="o">..=</span><span class="n">x1</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">row_offset</span> <span class="o">+</span> <span class="n">px</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">cell</span> <span class="o">=</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">fb</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
            <span class="k">if</span> <span class="n">cell</span><span class="na">.1</span> <span class="o">&gt;=</span> <span class="n">saturation</span> <span class="p">{</span>
                <span class="k">continue</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">px</span> <span class="k">as</span> <span class="nb">f32</span> <span class="o">-</span> <span class="n">sx</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">power</span> <span class="o">=</span> <span class="n">dx_coeff</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">row_slope</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">row_base</span><span class="p">;</span>
            <span class="k">if</span> <span class="n">power</span> <span class="o">&gt;</span> <span class="mf">0.0</span> <span class="p">{</span>
                <span class="k">continue</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">alpha</span> <span class="o">=</span> <span class="p">(</span><span class="n">opacity</span> <span class="o">*</span> <span class="nf">fast_exp</span><span class="p">(</span><span class="n">power</span><span class="p">))</span><span class="nf">.min</span><span class="p">(</span><span class="mf">0.999</span><span class="p">);</span>
            <span class="k">if</span> <span class="n">alpha</span> <span class="o">&lt;</span> <span class="n">alpha_threshold</span> <span class="p">{</span>
                <span class="k">continue</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">t</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">cell</span><span class="na">.1</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">contrib</span> <span class="o">=</span> <span class="n">t</span> <span class="o">*</span> <span class="n">alpha</span><span class="p">;</span>
            <span class="n">cell</span><span class="na">.0</span> <span class="o">+=</span> <span class="n">contrib</span> <span class="o">*</span> <span class="n">color</span><span class="p">;</span>
            <span class="n">cell</span><span class="na">.1</span> <span class="o">+=</span> <span class="n">contrib</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="tiled-parallel-compositing">tiled parallel compositing</h3>

<p>now because each tile owns a disjoint set of pixels, we can composite all tiles in parallel with zero synchronization, no atomics and no locks. this is very effective on the CPU.</p>

<p>we use rayon’s <code class="language-plaintext highlighter-rouge">into_par_iter</code> over tile indices, and raw-pointer writes to give each tile direct access to its pixel rectangle without violating Rust’s <code class="language-plaintext highlighter-rouge">&amp;mut</code> aliasing rules:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">composite_tiled</span><span class="p">(</span>
    <span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Projected</span><span class="p">],</span>
    <span class="n">bins</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">TileBins</span><span class="p">,</span>
    <span class="n">fb</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[(</span><span class="n">Vec3</span><span class="p">,</span> <span class="nb">f32</span><span class="p">)],</span>
    <span class="n">width</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">height</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">params</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">RenderParams</span><span class="p">,</span>
    <span class="n">pool</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="nn">rayon</span><span class="p">::</span><span class="n">ThreadPool</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">w</span> <span class="o">=</span> <span class="n">width</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">h_i</span> <span class="o">=</span> <span class="n">height</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">w_i</span> <span class="o">=</span> <span class="n">width</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles_x</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.num_tiles_x</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles</span> <span class="o">=</span> <span class="n">bins</span><span class="nf">.num_tiles</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">fb_ptr</span> <span class="o">=</span> <span class="nf">FbPtr</span><span class="p">(</span><span class="n">fb</span><span class="nf">.as_mut_ptr</span><span class="p">());</span>

    <span class="k">let</span> <span class="n">do_composite</span> <span class="o">=</span> <span class="p">||</span> <span class="p">{</span>
        <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="n">num_tiles</span><span class="p">)</span><span class="nf">.into_par_iter</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(</span><span class="k">move</span> <span class="p">|</span><span class="n">tile_idx</span><span class="p">|</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">fbp</span> <span class="o">=</span> <span class="n">fb_ptr</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">tile_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">tile_idx</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">%</span> <span class="n">num_tiles_x</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">tile_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">tile_idx</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">/</span> <span class="n">num_tiles_x</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">px0</span> <span class="o">=</span> <span class="n">tile_x</span> <span class="o">*</span> <span class="n">TILE_W</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">py0</span> <span class="o">=</span> <span class="n">tile_y</span> <span class="o">*</span> <span class="n">TILE_H</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">px1</span> <span class="o">=</span> <span class="p">(</span><span class="n">px0</span> <span class="o">+</span> <span class="n">TILE_W</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">w_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
            <span class="k">let</span> <span class="n">py1</span> <span class="o">=</span> <span class="p">(</span><span class="n">py0</span> <span class="o">+</span> <span class="n">TILE_H</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">h_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
            <span class="k">if</span> <span class="n">px0</span> <span class="o">&gt;</span> <span class="n">px1</span> <span class="p">||</span> <span class="n">py0</span> <span class="o">&gt;</span> <span class="n">py1</span> <span class="p">{</span>
                <span class="k">return</span><span class="p">;</span>
            <span class="p">}</span>

            <span class="k">let</span> <span class="n">start</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">tile_idx</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">end</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">tile_idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">if</span> <span class="n">start</span> <span class="o">==</span> <span class="n">end</span> <span class="p">{</span>
                <span class="k">return</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">splat_ids</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">bins</span><span class="py">.splat_indices</span><span class="p">[</span><span class="n">start</span><span class="o">..</span><span class="n">end</span><span class="p">];</span>

            <span class="k">for</span> <span class="o">&amp;</span><span class="n">sid</span> <span class="k">in</span> <span class="n">splat_ids</span> <span class="p">{</span>
                <span class="k">let</span> <span class="n">p</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="n">projected</span><span class="nf">.get_unchecked</span><span class="p">(</span><span class="n">sid</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">};</span>
                <span class="k">let</span> <span class="p">[</span><span class="n">bx0</span><span class="p">,</span> <span class="n">by0</span><span class="p">,</span> <span class="n">bx1</span><span class="p">,</span> <span class="n">by1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>
                <span class="k">let</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">bx0</span><span class="nf">.max</span><span class="p">(</span><span class="n">px0</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">by0</span><span class="nf">.max</span><span class="p">(</span><span class="n">py0</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">x1</span> <span class="o">=</span> <span class="n">bx1</span><span class="nf">.min</span><span class="p">(</span><span class="n">px1</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">by1</span><span class="nf">.min</span><span class="p">(</span><span class="n">py1</span><span class="p">);</span>
                <span class="k">if</span> <span class="n">x0</span> <span class="o">&gt;</span> <span class="n">x1</span> <span class="p">||</span> <span class="n">y0</span> <span class="o">&gt;</span> <span class="n">y1</span> <span class="p">{</span>
                    <span class="k">continue</span><span class="p">;</span>
                <span class="p">}</span>
                <span class="k">unsafe</span> <span class="p">{</span>
                    <span class="nf">composite_splat_region</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">fbp</span><span class="na">.0</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">params</span><span class="p">);</span>
                <span class="p">}</span>
            <span class="p">}</span>
        <span class="p">});</span>
    <span class="p">};</span>

    <span class="k">match</span> <span class="n">pool</span><span class="nf">.as_ref</span><span class="p">()</span> <span class="p">{</span>
        <span class="nf">Some</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="n">p</span><span class="nf">.install</span><span class="p">(</span><span class="n">do_composite</span><span class="p">),</span>
        <span class="nb">None</span> <span class="k">=&gt;</span> <span class="nf">do_composite</span><span class="p">(),</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<hr />
<p><strong>Note: the <code class="language-plaintext highlighter-rouge">FbPtr</code> trick</strong></p>

<p>Rust’s borrow checker does not let multiple threads hold <code class="language-plaintext highlighter-rouge">&amp;mut</code> references to the same buffer. but we know that each tile writes to a disjoint pixel rectangle; the geometry guarantees it. so we wrap the raw pointer in a <code class="language-plaintext highlighter-rouge">Send + Sync</code> newtype and give each tile direct access:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">FbPtr</span><span class="p">(</span><span class="o">*</span><span class="k">mut</span> <span class="p">(</span><span class="n">Vec3</span><span class="p">,</span> <span class="nb">f32</span><span class="p">));</span>
<span class="k">unsafe</span> <span class="k">impl</span> <span class="nb">Send</span> <span class="k">for</span> <span class="n">FbPtr</span> <span class="p">{}</span>
<span class="k">unsafe</span> <span class="k">impl</span> <span class="nb">Sync</span> <span class="k">for</span> <span class="n">FbPtr</span> <span class="p">{}</span>
</code></pre></div></div>

<p>this is the one place in the rasterizer where we need <code class="language-plaintext highlighter-rouge">unsafe</code>. the safety argument is spatial: tile <code class="language-plaintext highlighter-rouge">(tx, ty)</code> only writes to pixels <code class="language-plaintext highlighter-rouge">[tx*16..(tx+1)*16, ty*16..(ty+1)*16]</code>, and no two tiles share the same <code class="language-plaintext highlighter-rouge">(tx, ty)</code>.</p>

<hr />

<h2 id="optimizations">optimizations</h2>

<p>Here are some additional optimizations that make the rasterizer fast enough for realtime use on the CPU.</p>

<h3 id="fast-approximate-exp">fast approximate exp</h3>

<p>the <code class="language-plaintext highlighter-rouge">exp()</code> function in the inner loop is by far the most expensive operation and dominates everything else. if we don’t need full libm precision, the Schraudolph trick replaces it by reinterpreting a scaled float as an IEEE 754 bit pattern:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">fast_exp</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">f32</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">f32</span> <span class="p">{</span>
    <span class="c1">// clamp: anything below about -87 underflows f32 anyway</span>
    <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="nf">.max</span><span class="p">(</span><span class="o">-</span><span class="mf">87.0</span><span class="p">);</span>
    <span class="c1">// 12102203 ~= 2^23 / ln(2), 1065353216 = 127 &lt;&lt; 23 (the exponent bias)</span>
    <span class="k">let</span> <span class="n">v</span> <span class="o">=</span> <span class="p">(</span><span class="mf">12102203.0f32</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="mf">1065353216.0</span><span class="p">)</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
    <span class="nn">f32</span><span class="p">::</span><span class="nf">from_bits</span><span class="p">(</span><span class="n">v</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
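<p>a quick standalone sanity check of these constants against libm’s <code class="language-plaintext highlighter-rouge">exp</code> (with this bias constant the worst-case relative error works out to about +6%, which is plenty for a Gaussian falloff feeding an 8-bit framebuffer):</p>

```rust
// Standalone copy of fast_exp, compared against the exact std exp.
fn fast_exp(x: f32) -> f32 {
    let x = x.max(-87.0);
    // 12102203 ~= 2^23 / ln(2), 1065353216 = 127 << 23 (the exponent bias)
    let v = (12102203.0f32 * x + 1065353216.0) as i32;
    f32::from_bits(v as u32)
}

fn main() {
    // exact at x = 0 by construction
    assert!((fast_exp(0.0) - 1.0).abs() < 1e-6);
    // relative error stays under ~6-7% across the range the rasterizer uses
    for &x in &[-0.1f32, -1.0, -3.0, -8.0] {
        let (approx, exact) = (fast_exp(x), x.exp());
        let rel = ((approx - exact) / exact).abs();
        assert!(rel < 0.07, "x={x}: relative error {rel}");
    }
    println!("fast_exp(-1.0) = {} vs exact {}", fast_exp(-1.0), (-1.0f32).exp());
}
```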

<h3 id="row-level-early-out">row-level early-out</h3>

<p>in the tiled compositor, before entering the inner pixel loop for a row, we check whether the peak of the Gaussian along that row is below the visibility threshold. the Gaussian exponent along a row is a concave quadratic in dx:</p>

\[\text{power}(dx) = \text{dx\_coeff} \cdot dx^2 + \text{row\_slope} \cdot dx + \text{row\_base}\]

<p>the peak of this quadratic (at the vertex) is:</p>

\[\text{row\_peak} = \text{row\_base} + \frac{\text{row\_slope}^2}{2 \cdot a}\]

<p>if even the peak is below <code class="language-plaintext highlighter-rouge">ln(alpha_threshold / opacity)</code>, the entire row is invisible and we skip it:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">row_peak_cutoff</span> <span class="o">=</span> <span class="p">(</span><span class="n">alpha_threshold</span> <span class="o">/</span> <span class="n">opacity</span><span class="p">)</span><span class="nf">.ln</span><span class="p">();</span>
<span class="k">let</span> <span class="n">inv_a</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">a</span><span class="p">;</span>

<span class="k">for</span> <span class="n">py</span> <span class="k">in</span> <span class="n">y0</span><span class="o">..=</span><span class="n">y1</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">py</span> <span class="k">as</span> <span class="nb">f32</span> <span class="o">-</span> <span class="n">sy</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">row_base</span> <span class="o">=</span> <span class="n">dy_coeff</span> <span class="o">*</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">row_slope</span> <span class="o">=</span> <span class="n">cross_coeff</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">row_peak</span> <span class="o">=</span> <span class="n">row_base</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">row_slope</span> <span class="o">*</span> <span class="n">row_slope</span> <span class="o">*</span> <span class="n">inv_a</span><span class="p">;</span>
    <span class="k">if</span> <span class="n">row_peak</span> <span class="o">&lt;</span> <span class="n">row_peak_cutoff</span> <span class="p">{</span>
        <span class="k">continue</span><span class="p">;</span>    <span class="c1">// entire row is below threshold, skip it</span>
    <span class="p">}</span>
    <span class="c1">// ... inner pixel loop ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>a subtlety: the peak is at the vertex of the quadratic, not at $dx = 0$. using <code class="language-plaintext highlighter-rouge">row_base</code> alone (which is the value at $dx = 0$) would miss peaks that sit off-center and could incorrectly skip visible rows for elongated, tilted Gaussians.</p>
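<p>this is easy to verify numerically; a standalone sketch with made-up inverse-covariance entries that brute-forces the row maximum and checks it never exceeds the vertex formula:</p>

```rust
// Peak of power(dx) = dx_coeff*dx^2 + row_slope*dx + row_base along one row,
// using the vertex formula row_base + row_slope^2 / (2a).
fn row_peak(a: f32, b: f32, d: f32, dy: f32) -> f32 {
    let row_base = -0.5 * d * dy * dy;
    let row_slope = -b * dy;
    row_base + 0.5 * row_slope * row_slope / a
}

fn main() {
    // made-up entries for an elongated, tilted Gaussian (a*d - b*b > 0)
    let (a, b, d) = (0.30f32, 0.25, 0.40);
    let dy = 3.0f32; // some row offset from the splat center
    let peak = row_peak(a, b, d, dy);

    // brute-force the quadratic along the row; it must never exceed the peak
    let (dx_coeff, row_slope, row_base) = (-0.5 * a, -b * dy, -0.5 * d * dy * dy);
    let mut max_power = f32::NEG_INFINITY;
    for i in -160..=160 {
        let dx = i as f32 * 0.1;
        max_power = max_power.max(dx_coeff * dx * dx + row_slope * dx + row_base);
    }
    assert!(max_power <= peak + 1e-5);
    // row_base (the value at dx = 0) underestimates the true peak when b != 0
    assert!(peak > row_base);
    println!("row_base={row_base} peak={peak} brute_force_max={max_power}");
}
```

<p>with these entries the vertex sits at $dx = -2.5$, well off-center, so the <code class="language-plaintext highlighter-rouge">row_base</code>-only check would wrongly skip a visible row.</p>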

<h3 id="scratch-buffer-reuse">scratch buffer reuse</h3>

<p>the radix sort needs an auxiliary buffer the same size as the input. allocating this every frame would be wasteful. instead we allocate it once and reuse it across frames via a <code class="language-plaintext highlighter-rouge">ScratchBuffers</code> struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">ScratchBuffers</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">sort_aux</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Projected</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">tiles</span><span class="p">:</span> <span class="n">TileBins</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
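<p>the reuse leans on the fact that <code class="language-plaintext highlighter-rouge">Vec::clear</code> drops the length but keeps the capacity; a tiny standalone check of that behavior (hypothetical frame loop, not the renderer’s types):</p>

```rust
// Simulate a frame loop that clears and refills a scratch Vec, recording
// its capacity after every frame.
fn warm_capacities(frames: usize, n: usize) -> Vec<usize> {
    let mut sort_aux: Vec<u64> = Vec::new();
    let mut caps = Vec::new();
    for _ in 0..frames {
        sort_aux.clear();      // len -> 0, capacity untouched
        sort_aux.resize(n, 0); // allocates only on the first frame
        caps.push(sort_aux.capacity());
    }
    caps
}

fn main() {
    let caps = warm_capacities(3, 1000);
    assert!(caps[0] >= 1000);
    // capacity never changes after frame 0: zero per-frame heap allocation
    assert_eq!(caps[0], caps[1]);
    assert_eq!(caps[1], caps[2]);
    println!("{caps:?}");
}
```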

<p>the <code class="language-plaintext highlighter-rouge">sort_aux</code> vec grows to fit the first frame and never shrinks; this enables us to do zero per-frame heap allocation once the system has warmed up.</p>]]></content><author><name>Darshan Makwana</name></author><category term="splatting" /><category term="rendering" /><category term="3d" /><category term="reconstruction" /><summary type="html"><![CDATA[Gaussian Splatting is a fascinating scene reconstruction technique introduced by INRIA and last year I had a lot of fun tinkering with it while on my semex. I recently discovered some of my notes related to it and decided to digitize it this weekend, along the way I reimplemented the forward rasterization pass in rust and decided it would be fun to write a tutorial explaining gaussian splatting to everyone, so here it is what is a gaussian splat?]]></summary></entry><entry><title type="html">What Happens When You SFT a Human on an LLM</title><link href="https://darshanmakwana412.github.io/2026/04/sft-a-human-on-an-llm/" rel="alternate" type="text/html" title="What Happens When You SFT a Human on an LLM" /><published>2026-04-08T00:00:00+05:30</published><updated>2026-04-08T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/04/sft-a-human-on-an-llm</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/04/sft-a-human-on-an-llm/"><![CDATA[<p>We have spent the last few years training LLMs on human conversation data with the explicit goal of making them sound like us, but we have also been training ourselves on them in return. I have spent a lot of hours talking to claude over the past couple of months, my juniors grew up doing their assignments with gpt, and there are kids in undergrad right now whose first real intellectual sparring partner was a language model instead of another human. This is not a one way street. 
The gradients are running in both directions, they are doing SGD on us and we are doing something like SFT on them, and the effect is small enough per interaction that you don’t notice it until one day you start hearing “Yes absolutely I can help you with that” from friends and you realize you have never heard them say that phrase in your life. There is a <a href="https://news.ycombinator.com/item?id=47673541">USC study</a> that measures this drift in spoken youtube videos after chatgpt, words like delve and meticulous and underscore spiking in frequency in actual human speech, not copy pasted text but mouths, and the comments under the post are full of people confessing that they now say “you’re absolutely right” out loud in real conversations, which is maybe the most claude coded sentence a human being can utter and also the kind of thing you would only notice if someone else said it to you first</p>

<p>The usual response I get when I bring this up with friends is that tools have always shaped the people using them, the printing press did it, radio did it, standardized spelling did it, twitter did it, so what is the big deal about this one. And I think the big deal is that the loop is closed now and the loop is short. Previous tools did not update, the printing press in 1500 was the same printing press in 1600, twitter is roughly the same twitter it was five years ago. But claude 4.6 is not claude 3.5 and it will not be claude 5 in six months, and every retraining cycle grounds on an updated human interaction, and the 2026 internet is written by humans who have spent the last two years being subtly finetuned by the 2024 version of the model, which was itself trained on humans that had been talking to the 2022 version. The model distribution and the human distribution are idk sort of converging, and maybe at one point the reference you would use to even measure the drift, stops existing because nobody is untouched anymore. I don’t know if this converges into something stable or diverges into something strange but either way the concept of “what unassisted human thinking sounds like” quietly diminishes, and I don’t think we are going to notice the day that happens</p>

<p>I heard the other day that people have stopped open sourcing work that they would have happily published a few years ago, specifically because they know it will end up inside a training corpus and the edge they got from writing it will evaporate. The funny second order thing here is that the models got good in the first place because people were generous in public, and the models being good is now the exact reason people are not being generous in public anymore. The commons is being fed into a thing that makes continuing to contribute to the commons feel like a bad trade, and I don’t see any version of that where the commons ends up richer</p>

<p>The other thought which I came across while writing this post is a quiet shift in what humans are actually for in this whole arrangement. I met a second year undergrad a couple weeks ago building a startup, smart kid, and he walked me through his workflow with real pride. He maintains this enormous knowledge base about his company and the agent acts as the CEO in his framing, and he is the hands, the agent drafts the email and he clicks send, the agent preps the talking points and he goes and says them to the investor, the agent reads the contract and tells him where to sign, and I could not figure out in the moment how to tell him that from where I was sitting he had just described himself as a peripheral, an io device with legs, a face and a bank account rented out to something else’s cognition. And the same shape is showing up in people who route feedback and delegation and even performance reviews through a model, so your manager is not writing your review and your response will also not be written by you, and the two models are going to have a little conversation about your career while the two humans pretend to have opinions in the meeting afterwards. The core limiting factor of most industries used to be intelligence, and the story of the last century is basically the story of that factor getting commoditized in stages, first physical labour, then access to knowledge, and now the parsing of that knowledge itself, which is the thing we used to call being smart. What you cannot hand off yet is the legal identity, the bank account, the face on the zoom call, the body that walks into the room, and so that is quietly what humans are being optimized into in this new stack. Not the thinker anymore, the thing that makes the agent’s output count as a real action in the real world, maybe humans will become just an extended version of tools to interact with the world</p>

<p>What if all this becomes the norm and we all shrug and get along with it. In an earlier post I wrote about <a href="/2026/03/claude-code-chronicles/">asking claude to apply for my visa</a> and I framed it as a fun anecdote because it genuinely was, but there is a darker version of that same story and it is the one where my juniors are firing up a claude and asking it to apply to 400 jobs a night while recruiters on the other side are firing up their own claude to screen the same 400 applications, and the only humans in the loop are the ones who wrote the prompts on either end. I already see it happening, my juniors are not landing interviews and I do not think they are less capable than I was at their age, the signal to noise ratio of the application pool has just collapsed because the noise is free to generate and the signal costs the same it always did. I do not want to live in a world where two models negotiate the hiring decision in the middle while the humans on both sides wait around to be told the outcome, and I am not sure if we can get a vote on this</p>]]></content><author><name>Darshan Makwana</name></author><category term="ai" /><category term="llm" /><category term="philosophy" /><category term="society" /><summary type="html"><![CDATA[We have spent the last few years training LLMs on human conversation data with the explicit goal of making them sound like us, but we have also been training ourselves on them in return. I have spent a lot of hours talking to claude over the past couple of months, my juniors grew up doing their assignments with gpt, and there are kids in undergrad right now whose first real intellectual sparring partner was a language model instead of another human. This is not a one way street. 
The gradients are running in both directions, they are doing SGD on us and we are doing something like SFT on them, and the effect is small enough per interaction that you don’t notice it until one day you start hearing “Yes absolutely I can help you with that” from friends and you realize you have never heard them say that phrase in your life. There is a USC study that measures this drift in spoken youtube videos after chatgpt, words like delve and meticulous and underscore spiking in frequency in actual human speech, not copy pasted text but mouths, and the comments under the post are full of people confessing that they now say “you’re absolutely right” out loud in real conversations, which is maybe the most claude coded sentence a human being can utter and also the kind of thing you would only notice if someone else said it to you first]]></summary></entry><entry><title type="html">Claude Code Chronicles</title><link href="https://darshanmakwana412.github.io/2026/03/claude-code-chronicles/" rel="alternate" type="text/html" title="Claude Code Chronicles" /><published>2026-03-29T00:00:00+05:30</published><updated>2026-03-29T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/03/claude-code-chronicles</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/03/claude-code-chronicles/"><![CDATA[<p>A couple of months back I was facing a kernel panic “Unable to mount root fs” while rebooting my ubuntu in normal mode, I found a quick workaround via recovery mode by dropping to a root shell and patching some systemd files to boot with an older kernel version. This is how I had been restarting my computer since then. Today I noticed my brightness keys were somehow not functioning at all, I dropped into a root shell and asked claudecode to fix it, left for lunch and then came back an hour later to find all my terminals gone and my browser closed</p>

<p>I opened up a new terminal and ran <code class="language-plaintext highlighter-rouge">claude -c</code> and started reading what claude essentially did, so it fixed the brightness issue, but then it also noticed that I was in recovery mode, questioned why I was in recovery mode, ran some bash commands and later deduced that <code class="language-plaintext highlighter-rouge">initramfs</code> files were never generated and thus NVMe storage drivers couldn’t be loaded in normal mode so the kernel would always panic, it fixed the issue by generating the files and then rebooted my computer, the last command which was registered in my <code class="language-plaintext highlighter-rouge">~/.bash_history</code> was <code class="language-plaintext highlighter-rouge">reboot</code></p>

<hr />

<p>Asked claude code to apply for my visa application (inside a headed browser so I can monitor alongside). Used a <a href="https://github.com/vercel-labs/agent-browser">browser cli skill</a> for interacting with chromium which has all my profiles and session ids logged in. Requires credentials for the visa portal log in. Claude starts searching for credentials, opened my obsidian vault, grepped for files with visa in the name, there was indeed a file where I had stored my visa credentials. Took the credentials and then logged in as usual and started filling my application. I guess claude’s <a href="https://code.claude.com/docs/en/memory">automemory</a> feature remembered how I use obsidian and store my notes and probably figured I might have stored them somewhere for him to use</p>]]></content><author><name>Darshan Makwana</name></author><category term="claudecode" /><category term="ai" /><category term="productivity" /><category term="anthropic" /><summary type="html"><![CDATA[A couple of months back I was facing a kernel panic “Unable to mount root fs” while rebooting my ubuntu in normal mode, I found a quick workaround via recovery mode by dropping to a root shell and patching some systemd files to boot with an older kernel version. This is how I had been restarting my computer since then. 
Today I noticed my brightness keys were somehow not functioning at all, I dropped into a root shell and asked claudecode to fix it, left for lunch and then came back an hour later to find all my terminals gone and my browser closed]]></summary></entry><entry><title type="html">Quantization, Floating Points and TurboQuant</title><link href="https://darshanmakwana412.github.io/2026/03/quantization-float-points-turboquant/" rel="alternate" type="text/html" title="Quantization, Floating Points and TurboQuant" /><published>2026-03-28T00:00:00+05:30</published><updated>2026-03-28T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/03/quantization-float-points-turboquant</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/03/quantization-float-points-turboquant/"><![CDATA[<p>A lot of effort is spent to make LLM inference cheaper and more performant. <a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization">Quantization</a> is the standard way to do this, where we reduce a model’s size by representing its parameters with fewer bits so they take up less memory and move faster through the memory hierarchy. The progression from 32-bit -&gt; mixed precision -&gt; 16-bit -&gt; 8-bit -&gt; 4-bit formats has been one of the most impactful practical developments in LLM inference</p>
<h2 id="floating-point-formats">Floating Point Formats</h2>

<p>A <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic">floating point</a> number consists of a sign bit, $E$ exponent bits and $M$ mantissa bits. If $e$ is the value of the exponent bits (potentially <a href="https://en.wikipedia.org/wiki/Exponent_bias">biased</a>) and $m$ is the value of the mantissa bits, the represented value is</p>

\[f = \text{sign} \cdot 2^e \cdot \left(1 + \frac{m}{2^M}\right)\]

<p>The exponent determines the rough scale of the number and the mantissa determines the precise value within that scale. Standard <a href="https://en.wikipedia.org/wiki/Single-precision_floating-point_format">float32</a> uses $E = 8, M = 23$ for 32 bits total. This is the reference precision for most LLM training</p>

<p>For 16-bit inference it has become popular to use <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a> ($E = 8, M = 7$) over traditional <a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">float16</a> ($E = 5, M = 10$). The key reason is that bfloat16 preserves the same exponent range as float32, so quantizing from float32 to bfloat16 is straightforward, we can just truncate the mantissa. Having a wider dynamic range matters more than fine grained precision for ML workloads where gradients and activations can span several orders of magnitude.</p>

<h2 id="nvfp4-and-the-limits-of-scalar-quantization">NVFP4 and the Limits of Scalar Quantization</h2>

<p>Things get interesting when we go below 8 bits, with 4 bits we can only represent 16 distinct values. At that point it is not even obvious that anything useful can be preserved. <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">NVFP4</a> is NVIDIA’s answer to this, and it pushes scalar quantization to the extreme</p>

<p>NVFP4 is not really a standalone data type. It is a format for an entire <a href="https://docs.pytorch.org/docs/stable/tensors.html">tensor</a>. Each element is stored as a 4-bit float ($E = 2, M = 1$) but the tensor also carries an 8-bit scaling factor for every 16 elements and a single 32-bit scaling factor for the whole tensor. The per-block scale captures the local magnitude distribution and the per-tensor scale captures the global one. Together they compensate for the limited range of 4 raw bits.</p>

<p>The overhead works out to about 0.5 extra bits per element (the 8-bit scale amortized over 16 values), bringing effective storage to 4.5 bits per weight. NVIDIA built hardware support for this into the <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">Blackwell architecture</a> so the complexity is abstracted away from CUDA kernel developers.</p>
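<p>To make the per-block scaling idea concrete, here is a toy version of it (not NVIDIA’s actual implementation: the FP8 encoding of the block scales and the global FP32 scale are omitted, and the names are mine):</p>

```python
import numpy as np

# magnitudes representable by a 4-bit E2M1 float, the NVFP4 element format
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blocks(x, block=16):
    """Snap each 16-element block to scale * E2M1 grid, where scale is
    chosen so the block's max magnitude maps to the grid's max (6.0)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        amax = np.abs(chunk).max()
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        # nearest grid point for each scaled magnitude, sign restored after
        idx = np.abs(np.abs(chunk) / scale - FP4_GRID[:, None]).argmin(axis=0)
        out[i:i + block] = np.sign(chunk) * FP4_GRID[idx] * scale
    return out
```

<p>Values that already sit on the scaled grid survive exactly; everything else lands on the nearest of the 16 representable values in its block, and because each block carries its own scale a block of small weights is not crushed by a large weight elsewhere in the tensor.</p>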
<h2 id="turboquant">TurboQuant</h2>

<p>TurboQuant is a recently popular vector quantization algorithm. It takes a vector of numbers and produces a quantized version that uses less memory. At its core the idea is pretty simple, before quantizing, apply a random rotation in the n-dimensional vector space the vector lives in, and during dequantization apply the inverse rotation. This rotation is not learned, not input-dependent, not sampled from some special distribution. It is just a random orthogonal transformation. And it dramatically improves quantization quality</p>
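<p>The rotation trick is easy to sketch. This is a generic illustration of rotate → quantize → rotate back with a plain uniform scalar quantizer, not TurboQuant’s actual quantizer:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n):
    """Random orthogonal matrix: QR of a Gaussian matrix, with signs
    fixed so the result is uniformly distributed over rotations."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

def quantize_uniform(x, bits=4):
    """Plain uniform scalar quantizer over the vector's own range."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((x - lo) / step) * step

n = 64
R = random_rotation(n)
x = rng.standard_normal(n)
x_hat = R.T @ quantize_uniform(R @ x)   # rotate, quantize, rotate back
err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

<p>Because R is orthogonal it preserves norms and dot products exactly, so all of the reconstruction error comes from the scalar quantizer; the rotation’s job is to spread the vector’s energy evenly across coordinates so no single outlier coordinate dominates the quantization range.</p>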

<p>TurboQuant then uses a second correction step to eliminate bias in the attention block computation, I haven’t read about it entirely yet but they use a <a href="https://arxiv.org/abs/2504.19874">Quantized Johnson-Lindenstrauss</a> transform to preserve dot products accurately, and provide theoretical guarantees to support it</p>

<p><strong>References:</strong></p>

<ol>
  <li><a href="https://arxiv.org/abs/2504.19874">TurboQuant: Online Vector Quantization (Zandieh et al. 2025)</a></li>
  <li><a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">NVFP4 for Efficient Low-Precision Inference - NVIDIA</a></li>
</ol>]]></content><author><name>Darshan Makwana</name></author><category term="llm" /><category term="quantization" /><category term="ml" /><category term="inference" /><summary type="html"><![CDATA[A lot of effort is spent to make LLM inference cheaper and more performant. Quantization is the standard way to do this, where we reduce a model’s size by representing its parameters with fewer bits so they take up less memory and move faster through the memory hierarchy. The progression from 32-bit -&gt; mixed precision -&gt; 16-bit -&gt; 8-bit -&gt; 4-bit formats has been one of the most impactful practical developments in LLM inference Floating Point Formats]]></summary></entry><entry><title type="html">A System of Journaling</title><link href="https://darshanmakwana412.github.io/2026/03/a-system-of-journaling/" rel="alternate" type="text/html" title="A System of Journaling" /><published>2026-03-06T00:00:00+05:30</published><updated>2026-03-06T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/03/a-system-of-journaling</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/03/a-system-of-journaling/"><![CDATA[<p>One of the biggest hurdles I faced in consistently maintaining a blog like this site is having to manually copy paste my notes into my github.io directory as markdown files. This friction compounded over time and I would end up with a backlog of drafts that never made it to this site. So I decided to tinker around this a bit and create a more automated solution</p>

<h1 id="the-markdown-era">The Markdown Era</h1>

<p>Before any of this I used to maintain a single markdown file named <code class="language-plaintext highlighter-rouge">journal.md</code> to log everything like passwords, things I am currently working on, upcoming deadlines, calendar. The system was dead simple, I had several sections in the file like</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gu">## ToDo</span>
<span class="p">-</span> [ ] ...
<span class="p">-</span> [ ] ...

<span class="gu">## Deadlines</span>
<span class="p">-</span> ...
<span class="p">-</span> ...
  
<span class="gu">## some stuff</span>
<span class="p">-</span> ...
<span class="p">-</span> ...
</code></pre></div></div>

<p>Whenever I had to log something I used to create a section for it in the journal and add a couple of checkpoints and bullets around it for context, and when something became a priority it moved to the top of the #ToDo section. This kind of worked because I wasn’t logging a lot of things, but once I started journaling this system just wasn’t going to scale</p>

<h1 id="age-of-google-docs">Age of Google Docs</h1>

<p>So then to scale this I started maintaining a google doc named “Agency” and used it as a journal where I jotted down my daily thoughts, reflections, notes from the books I am reading, learnings, plans for new year resolutions and everything that usually happens around ones life</p>

<p><img src="/assets/images/posts/Pasted%20image%2020260306195036.png" alt="Pasted image 20260306195036.png" /></p>

<p>I used this method of journaling extensively in 2025 but it grew so enormous in size, over time reaching more than 80 tabs in a single google doc, that it was becoming hard for me to refactor and review everything. I broke it down into categories of months and added cross links between tabs so each month had a single tab and other pages were linked from a single index page. This worked for a while but when the number of tabs reached 150+ this system still broke down. I had to remember the title of the tab where I had stored some info which I wanted to revisit. The lack of a global search feature to search across tabs and across tab titles made it even worse</p>

<p>This was around the same time that I got my hands on <a href="https://cursor.com/">cursor</a> which is an AI assisted IDE. Cursor was the tool which exposed me to AI agents and their scaffolding around codebases. It was my first hand experience of feeling the <a href="https://www.anthropic.com/research/labor-market-impacts">impact of AI in the labour market</a>. This was also around the same time that I wanted to tinker with the idea of using these agents to traverse this mesh of thoughts that I have and render them transparent so <a href="https://docs.google.com/document/d/19-ajYTp2hwOW9WcirY9OIoSvMdfyivup8LMI92PkS20/edit?tab=t.0">I can ask them clarifying questions on them</a>. This meant going back to the <a href="2026-03-6-a-system-of-journaling#^878953">markdown era</a> but with the organizational system that I had already developed for myself around google docs</p>

<h1 id="enter-obsidian">Enter Obsidian</h1>

<p>I discovered <a href="https://obsidian.md/">obsidian</a> around the same time. It’s a markdown based editor that comes with a lot of functionality and plugins addressing exactly the reasons I was considering leaving google docs. It has builtin support for search across notes, the ability to cross link notes with <code class="language-plaintext highlighter-rouge">[wikilinks](wikilinks)</code>, templates, calendar support, workspace layouts and awesome themes. I have customized my obsidian to look like the dark version of <a href="https://www.lesswrong.com/">lesswrong</a> hehe</p>

<table>
  <thead>
    <tr>
      <th><img src="/assets/images/posts/Pasted%20image%2020260306202438.png" alt="Pasted image 20260306202438.png" /></th>
      <th><img src="/assets/images/posts/Pasted%20image%2020260306202538.png" alt="Pasted image 20260306202538.png" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td> </td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>It also comes with a graph viewer which renders your entire vault as an interconnected graph where connections naturally emerge when you cross link and cross reference things. I just started using obsidian a month ago and have been absolutely loving it</p>

<h1 id="publishing-system">Publishing System</h1>

<p>So now I had a great writing environment but the original problem remained, how do I get stuff from obsidian to my github pages site without manually copying things around? I needed a way to sync my vault content to a remote repository where it gets rendered as a static site</p>

<p>This site uses <a href="https://jekyllrb.com/">Jekyll</a> which is a static site generator that github pages natively supports. This site uses a custom css theme derived from <a href="https://github.com/jekyll/minima">minima</a>. Jekyll expects markdown files with YAML frontmatter in a specific directory structure (<code class="language-plaintext highlighter-rouge">_posts/</code> for blog posts, root for standalone pages) while Obsidian writes markdown with its own conventions: <code class="language-plaintext highlighter-rouge">[wikilinks](wikilinks)</code> for cross references, <code class="language-plaintext highlighter-rouge">![embeds](/assets/images/embeds)</code> for images, and <a href="https://github.com/blacksmithgu/obsidian-dataview">dataview</a> queries for dynamic content. So I also had to ensure that this conversion process is handled correctly</p>

<h1 id="trying-quartz-syncer">Trying Quartz Syncer</h1>

<p>My first attempt was <a href="https://github.com/jackyzha0/quartz">Quartz</a>, a batteries included static site generator that transforms markdown content into fully functional websites. Quartz already handles obsidian flavored markdown so the translation problem goes away. All I needed was a way to sync content from obsidian to the remote quartz repository</p>

<p>I tried <a href="https://github.com/saberzero1/quartz-syncer">quartz-syncer</a>, an obsidian plugin built exactly for this. I set everything up following the <a href="https://saberzero1.github.io/quartz-syncer-docs/Guides/GitHub-Setup">documentation</a>, added my github repo, and the connection showed successful
<img src="/assets/images/posts/Pasted%20image%2020260306212144.png" alt="Pasted image 20260306212144.png" />
I wrote a dummy page with <code class="language-plaintext highlighter-rouge">publish: true</code> in the frontmatter and it correctly showed up as unpublished in the publication center. When I hit publish it said publication successful. But there were no commits in my repo. Nothing actually happened
<img src="/assets/images/posts/Pasted%20image%2020260306212204.png" alt="Pasted image 20260306212204.png" />
I <a href="https://github.com/saberzero1/quartz-syncer/issues/110">opened an issue</a> about this, tried giving all repo access to my classic token thinking permission access could be the reason, but it still failed to create commits. I spent a good amount of time debugging this and tweaking settings but couldn’t get it to work</p>

<h1 id="writing-my-own-script">Writing My Own Script</h1>

<p>At that point I decided to just write it myself. The requirements were simple enough:</p>
<ol>
  <li>Crawl specific folders and files from my vault based on a config</li>
  <li>Convert obsidian markdown to jekyll compatible markdown</li>
  <li>Handle image embeds and wikilinks</li>
  <li>Push to the upstream repo</li>
</ol>

<p>This is the system design of the entire setup that I put together
<img src="/assets/images/posts/Pasted%20image%2020260306212258.png" alt="Pasted image 20260306212258.png" />
I put together a <a href="https://gist.github.com/darshanmakwana412/287883670407b5f8880d159c45ac6571">python script</a> that does exactly this. The setup is driven by a <code class="language-plaintext highlighter-rouge">publish.md</code> config file in the vault root where I specify the vault path, the site repo path and which paths to sync:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">site_repo</span><span class="pi">:</span> <span class="s">/path/to/darshanmakwana412.github.io</span>
<span class="na">obsidian_vault</span><span class="pi">:</span> <span class="s">/path/to/obsidian</span>
<span class="na">paths</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">posts/</span>
  <span class="pi">-</span> <span class="s">bookmarks.md</span>
  <span class="pi">-</span> <span class="s">birding.md</span>
  <span class="pi">-</span> <span class="s">Bookshelf.md</span>
<span class="nn">---</span>
</code></pre></div></div>

<p>Directories ending with <code class="language-plaintext highlighter-rouge">/</code> are synced as jekyll blog posts into <code class="language-plaintext highlighter-rouge">_posts/</code>, standalone files are synced as pages at the root. The script handles the obsidian to jekyll translation: image embeds are rewritten to standard markdown image links under <code class="language-plaintext highlighter-rouge">/assets/images/</code>, <code class="language-plaintext highlighter-rouge">[wikilinks](wikilinks)</code> become standard markdown links, and it copies over all the image assets to the right places</p>
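<p>The translation step is basically a couple of regex passes. Here is a simplified sketch of the idea; the target path layout and slug rule are illustrative assumptions, not the gist’s actual code:</p>

```python
import re

def obsidian_to_jekyll(text: str) -> str:
    """Minimal sketch of the obsidian -> jekyll markdown translation
    (hypothetical helper; the real rules live in the publish script)."""
    # ![[image.png]] -> ![image.png](/assets/images/posts/image.png)
    text = re.sub(
        r"!\[\[([^\]|]+)\]\]",
        lambda m: f"![{m.group(1)}](/assets/images/posts/"
                  f"{m.group(1).replace(' ', '%20')})",
        text,
    )
    # [[Note Title|label]] -> [label](/note-title/), [[Note Title]] likewise
    def link(m):
        target, _, label = m.group(1).partition("|")
        slug = target.strip().lower().replace(" ", "-")
        return f"[{label or target}](/{slug}/)"
    return re.sub(r"\[\[([^\]]+)\]\]", link, text)
```

<p>The image pass runs first so its output, which no longer contains double brackets, is not touched by the wikilink pass.</p>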

<p>Running <code class="language-plaintext highlighter-rouge">./publish.py sync</code> crawls the vault and syncs everything. <code class="language-plaintext highlighter-rouge">./publish.py push</code> commits and pushes to remote. <code class="language-plaintext highlighter-rouge">./publish.py publish</code> does both in sequence. There is also a <code class="language-plaintext highlighter-rouge">./publish.py watch</code> mode that uses <a href="https://github.com/gorakhargosh/watchdog">watchdog</a> to detect file changes in the vault and auto syncs, which is pretty nice when you are actively writing and want to see changes reflected quickly. So I usually just keep this script running in the background all the time</p>

<h1 id="some-glue-work">Some Glue Work</h1>

<p>The one thing that still needs work is <a href="https://github.com/blacksmithgu/obsidian-dataview">dataview</a> parsing. Dataview is an obsidian plugin that lets you query your vault like a database. I use it on my bookshelf page to render a table of books I am currently reading:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TABLE WITHOUT ID
	link(file.link, title) AS Book,
	author AS Author,
	"&lt;progress value='" + chapters_read + "' max='" + total_chapters + "'&gt;&lt;/progress&gt; " + chapters_read + "/" + total_chapters AS Progress,
	notes as Notes,
	embed(link(meta(cover).path)) AS Cover
FROM #book
WHERE status = "reading"
SORT file.mtime DESC
</code></pre></div></div>

<p><img src="/assets/images/posts/Pasted%20image%2020260306212555.png" alt="Pasted image 20260306212555.png" />
The script has a basic dataview resolver that parses <code class="language-plaintext highlighter-rouge">FROM #tag</code> and <code class="language-plaintext highlighter-rouge">WHERE</code> clauses and renders them as markdown tables. It works for simple queries but anything with computed columns like the progress bar or cover embeds needs more work. You can see how it currently renders at <a href="https://darshanmakwana412.github.io/bookshelf/">darshanmakwana412.github.io/bookshelf</a></p>

<p><img src="/assets/images/posts/Pasted%20image%2020260306202538.png" alt="Pasted image 20260306202538.png" /></p>]]></content><author><name>Darshan Makwana</name></author><category term="strategy" /><category term="philosophy" /><category term="obsidian" /><category term="life" /><summary type="html"><![CDATA[One of the biggest hurdles I faced in consistently maintaining a blog like this site is having to manually copy paste my notes into my github.io directory as markdown files. This friction compounded over time and I would end up with a backlog of drafts that never made it to this site. So I decided to tinker around this a bit and create a more automated solution]]></summary></entry><entry><title type="html">2 ways to bet on a Trillion Dollar Market</title><link href="https://darshanmakwana412.github.io/2026/02/two_ways_to_bet_on_a_trillion_dollar_market/" rel="alternate" type="text/html" title="2 ways to bet on a Trillion Dollar Market" /><published>2026-02-17T00:00:00+05:30</published><updated>2026-02-17T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/02/two_ways_to_bet_on_a_trillion_dollar_market</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/02/two_ways_to_bet_on_a_trillion_dollar_market/"><![CDATA[<p>I was listening to <a href="https://www.youtube.com/watch?v=n1E9IZfvGMA">Dario Amodei’s interview with dwarkesh patel</a> and found his insights into how anthropic plans their capex investments and path to profitability quite fascinating. They need to balance their risks into how much compute to build for the next 2 years in advance based on current demands because the data centers take 2 years to build. If they overestimate their demand then they won’t have enough profit in the next years and will go bankrupt while if they underestimate it they won’t be able to match the demand and will risk losing their customers to their competitors, this is what he calls their cone of uncertainty. 
This sentiment felt weird to me because openai seems to be aggressively bullish on their capex investments, in fact sam altman disclosed they will be <a href="https://tomtunguz.com/openai-hardware-spending-2025-2035/">spending $1 trillion on compute infra across microsoft, oracle, nvidia and coreweave between 2025 and 2035</a> while also <a href="https://openai.com/index/cerebras-partnership/">partnering with cerebras</a>, so why do these 2 AI companies have completely different capex investment strategies?</p>

<p>The business model of anthropic relies on building things that enterprises will pay for and using that to build a path to profitability. As dario said anthropic has 10X’d its revenue every year since 2023, but that can only continue for so long since GDP is finite, and once the majority of the value is captured growth will start showing diminishing returns. If growth slows down to 5X and not 10X and they purchased compute based on a 10X multiplier prediction they will go bankrupt. They are cautious with capex, focusing on only a few things and getting them right</p>

<p>Throughout our history whenever a technological revolution occurs it brings an enormous amount of value generation with it, which can be measured in terms of productivity, GDP increase, increase in standard of living, etc. But dario mentions that AI hasn’t fully diffused as its effects are yet to be seen in economic growth<sup><a href="#fn1" id="ref1">1</a></sup>. This also means that there is a lot of value that is untouched and yet to be generated, and openai seems to be in a strategy where they want to dominate the market and capture all this value. They seem to be like facebook in this sense, aggressively grow, capture the entire market and thus capture all the value that will be generated by the technology, and a path of profitability will emerge later. This is pretty much consistent with their investments across every possible field where AI is yet to emerge</p>
<ul>
  <li>They <a href="https://www.bloomberg.com/news/articles/2025-05-06/openai-reaches-agreement-to-buy-startup-windsurf-for-3-billion">wanted to acquire Windsurf for $3 billion</a> to own the developer IDE stack (they got out competed by google at the last minute)</li>
  <li>They <a href="https://openai.com/index/openai-to-acquire-neptune/">acquired Neptune</a> to strengthen their internal model training infrastructure</li>
  <li>They <a href="https://openai.com/sam-and-jony/">merged with io Products</a> bringing Jony Ive into the fold to build hardware products</li>
  <li>They <a href="https://openai.com/index/introducing-prism/">launched Prism</a>, a collaborative and AI assisted latex editor</li>
  <li>They <a href="https://openai.com/index/openai-for-healthcare/">introduced OpenAI for Healthcare</a></li>
  <li>They <a href="https://openai.com/index/gpt-5-lowers-protein-synthesis-cost/">used GPT 5 to autonomously design cell synthesis recipes</a></li>
  <li>They <a href="https://openai.com/index/chatgpt-shopping-research/">built shopping research into ChatGPT</a></li>
  <li>They <a href="https://openai.com/index/investing-in-merge-labs/">invested in building Human Brain Computer Interfaces</a></li>
  <li>and just this month they acqui-hired<sup><a href="#fn2" id="ref2">2</a></sup> <a href="https://techcrunch.com/2026/02/15/openclaw-creator-peter-steinberger-joins-openai/">Peter Steinberger and OpenClaw</a> to drive the next generation of personal agents</li>
  <li>and many other things which I might (definitely) have missed</li>
</ul>

<p>They are trying to get a hold of everything that AI will have an impact on</p>

<p>Anthropic is trying to survive by focusing on only a few bets, while openai needs at least some of their many bets to work out to eventually get the payouts. I don’t know which strategy will work out for either of them, but it would surely be fun to come back to this note after 5, 10, or 20 years and figure out how things eventually turned out for them. This brings me to another thing I found fascinating from the interview, which is the <a href="https://www.darioamodei.com/essay/the-adolescence-of-technology">“country of geniuses”</a>, which by dario’s predictions will take 2 years to arrive in the best scenario and 10 years in the worst case. So if the combined market is going to be so enormous, then I guess it does not matter which strategy you are using as long as you don’t die (by die I mean you go bankrupt); there will be enough value to capture for everyone</p>

<p><img src="/assets/images/posts/ai_summit_ai_summit.png" alt="ai_summit_ai_summit.png" /></p>

<hr />

<div class="footnotes">
<p id="fn1"><sup>1</sup> There is an argument to be made here as to why this diffusion has been slow; in fact it is quite fast compared to other technological revolutions, but it appears slow because the improvements in model capabilities are far exceeding what adoption can keep up with. If you were an enterprise and someone used to do workflow x or y, you realize the models can now do it, but someone still has to program them to do it, provide the relevant context around it, handle the edge cases where someone else was correlated with workflow x and y via z; you update the context and its harness and realize that the model is now smart enough to handle it on its own, but someone has to now test it, and yada yada the list goes on, and also don't forget by the time you did all this <a href="https://www.anthropic.com/news/claude-opus-4-6">Opus 4.6</a> dropped which can now one shot workflow x and y via z : ) <a href="#ref1">↩</a></p>

<p id="fn2"><sup>2</sup> Acqui-hire is a relatively new term that was extensively used in 2025 (<a href="https://groq.com/newsroom/groq-and-nvidia-enter-non-exclusive-inference-technology-licensing-agreement-to-accelerate-ai-inference-at-global-scale">nvidia acqui-hired groq</a>, <a href="https://www.forbes.com/sites/janakirammsv/2025/06/23/meta-invests-14-billion-in-scale-ai-to-strengthen-model-training/">Meta acqui-hired Scale AI</a>, <a href="https://www.cnbc.com/2025/07/14/cognition-to-buy-ai-startup-windsurf-days-after-google-poached-ceo.html">Google acqui-hired windsurf</a>). This happens when companies want to acquire a competitor but cannot go through the legal route because of antitrust lawsuits and the DOJ, so instead they hire away all the founders and key people (and the domain expertise) who built the underlying technology stack and replicate it in their business. The company is then left as a husk just to survive, with its employees getting unfair returns on their stock options; again quite fascinating to read, <a href="https://en.wikipedia.org/wiki/Acqui-hiring">https://en.wikipedia.org/wiki/Acqui-hiring</a> <a href="#ref2">↩</a></p>
</div>]]></content><author><name>Darshan Makwana</name></author><category term="ai" /><category term="business" /><category term="strategy" /><category term="anthropic" /><category term="openai" /><summary type="html"><![CDATA[I was listening to Dario Amodei’s interview with dwarkesh patel and found his insights into how anthropic plans their capex investments and path to profitability quite fascinating. They need to balance their risks into how much compute to build for the next 2 years in advance based on current demands because the data centers take 2 years to build. If they overestimate their demand then they won’t have enough profit in the next years and will go bankrupt while if they underestimate it they won’t be able to match the demand and will risk losing their customers to their competitors, this is what he calls their cone of uncertainty. This sentiment felt weird to me because openai seems to aggressively bullish on their capex investments, infact sam altman disclosed they will be spending $1 trillion on compute infra across microsoft, oracle, nvidia and coreweave between 2025 and 2035 while also partnerring with cerebras, so why do these 2 AI companies have completely different capex investment strategies?]]></summary></entry><entry><title type="html">It takes very high agency</title><link href="https://darshanmakwana412.github.io/2026/02/it-takes-high-agency/" rel="alternate" type="text/html" title="It takes very high agency" /><published>2026-02-08T00:00:00+05:30</published><updated>2026-02-08T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/02/it-takes-high-agency</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/02/it-takes-high-agency/"><![CDATA[<p>It takes very high agency</p>
<ul>
  <li>To realize when you are in the wrong place</li>
  <li>To realize the world is drifting away from the old rules and new rules are being written; the rules you learned no longer apply</li>
  <li>To realize the rewards tremendously outweigh the costs of consistently putting effort into updating your mental models</li>
  <li>To realize that people don’t act on incentives as much as they think</li>
  <li>To see the world with clarity and take it as it is and act on it</li>
  <li>To realize that games of money, power, and hierarchy are not worth playing, and to avoid people who are playing them</li>
  <li>To realize that conviction is a hell of a powerful drug; it makes you do things that you would never do otherwise</li>
</ul>]]></content><author><name>Darshan Makwana</name></author><category term="life" /><category term="philosophy" /><summary type="html"><![CDATA[It takes very high agency To realize when you are in the wrong place To realize the world is drifting apart from the old rules and new rules are being written, the old rules which you learned no longer apply To realize the rewards tremendously outweigh the costs of consistently putting efforts on updating your mental models To realize that people don’t act with incentives as much as they think To see the world with clarity and take it as it is and act on it Games of money, power, hierarchies are not worth playing and avoid people who are playing them Conviction is a hell of power drug, it makes you do things that you would never do otherwise]]></summary></entry><entry><title type="html">How to Forge Your Conviction</title><link href="https://darshanmakwana412.github.io/2026/01/how-to-forge-your-conviction/" rel="alternate" type="text/html" title="How to Forge Your Conviction" /><published>2026-01-25T00:00:00+05:30</published><updated>2026-01-25T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/01/how-to-forge-your-conviction</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/01/how-to-forge-your-conviction/"><![CDATA[<p>In the later years of my undergraduate studies I was lucky enough to have surrounded myself with mentors and peers who taught me not to chase after <a href="https://www.joanwestenberg.com/thin-desires-are-eating-your-life/">thin desires</a>, to <a href="https://armeet.bearblog.dev/becoming-the-machine/">not become the machine</a>, to cultivate confidence, invest time in forging relationships, invest in updating your <a href="https://fs.blog/mental-models/">mental models</a> and be <a href="https://en.wikipedia.org/wiki/Equanimity">equanimious</a>. 
By giving me frequent feedback they let me learn far more quickly than I could have starting out on my own. Not so surprisingly, an analogy of this exists in agentic coding tools like <a href="https://claude.com/product/claude-code">claude code</a>, <a href="https://opencode.ai/">opencode</a>, <a href="https://openai.com/codex/">codex</a>, etc., where agents get the execution traces of programs as context, allowing them to reflect on their actions and perform much better. When you do this enough times you build a world model of the people you interact with, either consciously or unconsciously; thought processes like “How would vignesh think through these trade offs?” or “How would nikhil approach this?” start going around in your head when you confront difficult choices in your life</p>

<p>A couple of years passed by and I invariably had to face the fact that I no longer had access to my mentors and had to rely on my own mental models; we had split ways and I chose the local optimum, becoming part of a team where I come across as a very technically opinionated engineer in absolute terms. My team’s primary tasks were cost optimization and codebase improvement, and I was assigned to improve our infrastructure for STT models. This meant I had to make a couple of high stakes decisions on my own, but because the stakes are higher it takes a long time to see results; all of these decisions require conviction: a belief that your ideas are worth spending the time and effort on. You need to be very confident before you actually commit to your actions. I had a lot of ideas to work on, but I no longer had access to the verifiers/reward models that I had in the past. I started spending a significant amount of time thinking through each of my ideas and actually implemented each of them. Some of them turned out to be failures, like the <a href="/2025/10/confidence-aware-router/">confidence aware router</a> that used an LLM’s internal activation signals to assess the model’s confidence in answering a query, but some of them worked surprisingly well, like building a better scheduler for STT queries based on expected completion time. The important thing that happened, consciously or unconsciously, is that I started building world models of the problems and not of my mentors. I believe this cycle, repeated across multiple iterations, is what develops conviction</p>

<p>I have to do a lot more thinking when making architectural decisions myself rather than relying on the more experienced engineers in my team; it turns out this is very effective for internalising the mistakes and learnings into my decision making process so that I never repeat them. Also unsurprisingly, another similar analogy exists when training LLMs: progress stagnates with SFT, where you train the model on full generations of human conversation, because imitation can only take you so far (oppenheimer!), and you have to do RL post training where you enable the model to learn which of its generations actually worked and which ones did not</p>

<p>Before the world can realize your ideas, you have to realize them in your subjective reality; this belief is not something that arrives as a eureka but has to be forged. You have to earn the right to your conviction. This kind of conviction can’t be learned from mentors, people or books, it has to be learned from one’s own mistakes. Now it turns out this is very hard to do. It’s very hard to actually sit down, face the chaos in your head and churn it into clarity and conviction you can act on. People by nature take the path of least resistance; don’t be like them. It is very easy to build things at the surface or have a surface level understanding, but the right infrastructure, mental models or investments compound in weird and interesting patterns that give rise to long term moats. It is by doing hard things that you go deeper and gain a massive advantage over others</p>]]></content><author><name>Darshan Makwana</name></author><category term="conviction" /><category term="mental models" /><category term="world models" /><summary type="html"><![CDATA[In the later years of my undergraduate studies I was lucky enough to have surrounded myself with mentors and peers who taught me not to chase after thin desires, to not become the machine, to cultivate confidence, invest time in forging relationships, invest in updating your mental models and be equanimious. By giving me frequent feedback I was able to learn more quickly than I could have starting out on my own, not so unsurprisingly an analogy of this exists in agentic coding models like claude code, opencode, codex, etc where agents get execution trace of the programs as context allowing them to reflect on their actions and perform much better. 
When you do this enough times you build a world model of the people that you interact with either consciously or unconsciously, some thought processes like “How would vignesh think through these trade offs?”, “How would nikhil approach this?”, etc starts to go around when you confront difficult choices in your life]]></summary></entry><entry><title type="html">Strassen’s matmul with AVX 512 kernel</title><link href="https://darshanmakwana412.github.io/2026/01/strassen-matrix-multiplication/" rel="alternate" type="text/html" title="Strassen’s matmul with AVX 512 kernel" /><published>2026-01-18T00:00:00+05:30</published><updated>2026-01-18T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/01/strassen-matrix-multiplication</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/01/strassen-matrix-multiplication/"><![CDATA[<p>In the previous <a href="/2026/01/17/avx512-matrix-multiplication.html">matmul with avx512 and loop tiling</a> note we managed to build a cpu kernel that achieves 92% of peak FLOPS. I wanted to check if we can do better by using Strassen’s algorithm. <a href="https://en.wikipedia.org/wiki/Strassen_algorithm">Strassen’s algorithm</a> from 1969 showed that matrix multiplication can be done in $O(n^{2.807})$ by trading multiplications for additions. The main idea of this post is to use strassens for high level recursions and fallback to our highly optimized kernel for lower levels</p>

<p><strong>Table of Contents:</strong></p>
<ul>
  <li><a href="#the-standard-algorithm">The Standard Algorithm</a></li>
  <li><a href="#strassens-algorithm">Strassen’s Algorithm</a></li>
  <li><a href="#implementation">Implementation</a></li>
  <li><a href="#choosing-the-recursion-depth">Choosing the Recursion Depth</a></li>
  <li><a href="#benchmarks">Benchmarks</a></li>
</ul>

<h2 id="the-standard-algorithm">The Standard Algorithm</h2>

<p>Standard matrix multiplication for $C = A \times B$ where all matrices are $n \times n$ computes:</p>

\[C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}\]

<p>This requires $n^3$ multiplications and $n^3$ additions for a total of $2n^3$ floating point operations. For decades this was assumed optimal until Strassen showed otherwise</p>

<h2 id="strassens-algorithm">Strassen’s Algorithm</h2>

<p>Strassen observed that multiplying two $2 \times 2$ matrices:</p>

\[\begin{bmatrix} C_{11} &amp; C_{12} \\ C_{21} &amp; C_{22} \end{bmatrix} = 
\begin{bmatrix} A_{11} &amp; A_{12} \\ A_{21} &amp; A_{22} \end{bmatrix}
\begin{bmatrix} B_{11} &amp; B_{12} \\ B_{21} &amp; B_{22} \end{bmatrix}\]

<p>normally requires 8 multiplications (each $C_{ij}$ needs 2 products). But with clever grouping we can do it with only 7 multiplications at the cost of more additions</p>

<p>The key insight is that additions are cheap compared to multiplications, especially when the “elements” are themselves large submatrices. If we partition $n \times n$ matrices into four $n/2 \times n/2$ blocks, we can recursively apply this trick. At each level we replace 8 recursive calls with 7, giving the recurrence:</p>

\[T(n) = 7 T(n/2) + O(n^2)\]

<p>The $O(n^2)$ term comes from the matrix additions. Solving this recurrence gives $T(n) = O(n^{\log_2 7}) = O(n^{2.807})$</p>

<p>Strassen’s algorithm computes seven intermediate products M1 through M7:</p>

\[\begin{aligned}
M_1 &amp;= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &amp;= (A_{21} + A_{22}) B_{11} \\
M_3 &amp;= A_{11} (B_{12} - B_{22}) \\
M_4 &amp;= A_{22} (B_{21} - B_{11}) \\
M_5 &amp;= (A_{11} + A_{12}) B_{22} \\
M_6 &amp;= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &amp;= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}\]

<p>Then the output blocks are:</p>

\[\begin{aligned}
C_{11} &amp;= M_1 + M_4 - M_5 + M_7 \\
C_{12} &amp;= M_3 + M_5 \\
C_{21} &amp;= M_2 + M_4 \\
C_{22} &amp;= M_1 - M_2 + M_3 + M_6
\end{aligned}\]

<p>Each $M_i$ requires one matrix multiplication and 0-2 matrix additions/subtractions. The final assembly requires 8 additions. Total: 7 multiplications and 18 additions instead of 8 multiplications and 4 additions</p>

<h2 id="implementation">Implementation</h2>

<p>We need helper functions for matrix addition, subtraction, and copying. These are straightforward but need to handle different strides since submatrices have the parent’s stride:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="nf">addMat</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">int</span> <span class="n">C_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">A_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">int</span> <span class="n">B_stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">C_stride</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">A_stride</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">B_stride</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="n">subMat</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">int</span> <span class="n">C_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">A_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">int</span> <span class="n">B_stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">C_stride</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">A_stride</span><span class="p">]</span> <span class="o">-</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">B_stride</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="n">loadMat</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">int</span> <span class="n">C_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">A_stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">C_stride</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">A_stride</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main Strassen function partitions A and B into quadrants, allocates temporary matrices for M1-M7 and two scratch buffers T1/T2, computes each product recursively, then assembles C:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="nf">matmul_kernel</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">int</span> <span class="n">stride</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2) schedule(dynamic, 1)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">ic</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">ic</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">ic</span><span class="o">+=</span> <span class="n">MC</span><span class="p">)</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">jc</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">jc</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">jc</span><span class="o">+=</span><span class="n">NC</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">Mb</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">MC</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">ic</span><span class="p">);</span>
        <span class="kt">int</span> <span class="n">Nb</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">NC</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">jc</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">kc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">kc</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">kc</span> <span class="o">+=</span> <span class="n">KC</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">Kb</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">KC</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">kc</span><span class="p">);</span>

            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">ib</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">ib</span> <span class="o">&lt;</span> <span class="n">Mb</span><span class="p">;</span> <span class="n">ib</span> <span class="o">+=</span> <span class="mi">16</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">jb</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">jb</span> <span class="o">&lt;</span> <span class="n">Nb</span><span class="p">;</span> <span class="n">jb</span> <span class="o">+=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span>

                    <span class="n">__m512</span> <span class="n">psum</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{};</span>

                    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">blocki</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span><span class="n">ic</span> <span class="o">+</span> <span class="n">ib</span><span class="p">)</span> <span class="o">+</span> <span class="n">kc</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
                    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">blockj</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span><span class="n">jc</span> <span class="o">+</span> <span class="n">jb</span><span class="p">)</span> <span class="o">+</span> <span class="n">kc</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>

                    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">Kb</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>

                        <span class="n">__m512</span> <span class="n">b0</span> <span class="o">=</span> <span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">blockj</span> <span class="o">+</span> <span class="n">k</span> <span class="o">*</span> <span class="n">stride</span><span class="p">);</span>
                        <span class="n">__m512</span> <span class="n">b1</span> <span class="o">=</span> <span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">blockj</span> <span class="o">+</span> <span class="n">k</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">NUM_LOADS</span><span class="p">);</span>
                        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">ik</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">ik</span><span class="o">&lt;</span><span class="mi">16</span><span class="p">;</span> <span class="n">ik</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
                            <span class="n">__m512</span> <span class="n">a</span> <span class="o">=</span> <span class="n">_mm512_set1_ps</span><span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="n">blocki</span> <span class="o">+</span> <span class="n">ik</span> <span class="o">+</span> <span class="n">k</span> <span class="o">*</span> <span class="n">stride</span><span class="p">));</span>
                            <span class="n">psum</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span> <span class="o">=</span> <span class="n">_mm512_fmadd_ps</span><span class="p">(</span>
                                <span class="n">b0</span><span class="p">,</span>
                                <span class="n">a</span><span class="p">,</span>
                                <span class="n">psum</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span>
                            <span class="p">);</span>
                            <span class="n">psum</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span> <span class="o">=</span> <span class="n">_mm512_fmadd_ps</span><span class="p">(</span>
                                <span class="n">b1</span><span class="p">,</span>
                                <span class="n">a</span><span class="p">,</span>
                                <span class="n">psum</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span>
                            <span class="p">);</span>
                        <span class="p">}</span>

                    <span class="p">}</span>

                    <span class="c1">// Accumulate the finished 16x32 tile of partial sums into C</span>
                    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">ik</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">ik</span><span class="o">&lt;</span><span class="mi">16</span><span class="p">;</span> <span class="n">ik</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
                        <span class="kt">float</span> <span class="o">*</span><span class="n">loc_ptr</span> <span class="o">=</span> <span class="n">C</span> <span class="o">+</span> <span class="p">(</span><span class="n">ic</span> <span class="o">+</span> <span class="n">ib</span> <span class="o">+</span> <span class="n">ik</span><span class="p">)</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">jc</span> <span class="o">+</span> <span class="n">jb</span><span class="p">;</span>
                        <span class="n">_mm512_store_ps</span><span class="p">(</span>
                            <span class="n">loc_ptr</span><span class="p">,</span>
                            <span class="n">_mm512_add_ps</span><span class="p">(</span><span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">loc_ptr</span><span class="p">),</span> <span class="n">psum</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">ik</span><span class="p">])</span>
                        <span class="p">);</span>
                        <span class="n">_mm512_store_ps</span><span class="p">(</span>
                            <span class="n">loc_ptr</span> <span class="o">+</span> <span class="n">NUM_LOADS</span><span class="p">,</span>
                            <span class="n">_mm512_add_ps</span><span class="p">(</span><span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">loc_ptr</span> <span class="o">+</span> <span class="n">NUM_LOADS</span><span class="p">),</span> <span class="n">psum</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">ik</span><span class="p">])</span>
                        <span class="p">);</span>
                    <span class="p">}</span>

                <span class="p">}</span>
            <span class="p">}</span>

        <span class="p">}</span>

    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// Recursive Strassen multiply: C += A * B on n x n blocks with row</span>
<span class="c1">// stride `stride`; falls back to the SIMD kernel once level reaches MAX_DEPTH</span>
<span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="n">strassenMatmul</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">int</span> <span class="n">stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">level</span><span class="p">,</span> <span class="kt">int</span> <span class="n">MAX_DEPTH</span>
<span class="p">)</span> <span class="p">{</span>

    <span class="n">assert</span><span class="p">((</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="s">"n must be a multiple of 2"</span><span class="p">);</span>

    <span class="k">if</span><span class="p">(</span><span class="n">level</span> <span class="o">&gt;=</span> <span class="n">MAX_DEPTH</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">matmul_kernel</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">stride</span><span class="p">);</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Sub-block pointers; note A12/A21 offsets are mirrored relative to B's,</span>
    <span class="c1">// consistent with A being addressed transposed (as in the kernel's loads)</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A11</span> <span class="o">=</span> <span class="n">A</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A12</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A21</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">);</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A22</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
  
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B11</span> <span class="o">=</span> <span class="n">B</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B12</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">);</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B21</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B22</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>

    <span class="kt">float</span> <span class="o">*</span><span class="n">M1</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M2</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M3</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M4</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M5</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M6</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M7</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>

    <span class="c1">// Zero-initialize the seven (n/2)x(n/2) product buffers in parallel</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">M1</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M2</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M3</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M4</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M5</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M6</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M7</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Scratch buffers for the (n/2)x(n/2) operand sums and differences</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">T1</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">T2</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>

    <span class="c1">// M1 = (A11 + A22) * (B11 + B22)</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M1</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M2 = (A21 + A22) * B11</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M2</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M3 = A11 * (B12 - B22)</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M3</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M4 = A22 * (B21 - B11)</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M4</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M5 = (A11 + A12) * B22</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M5</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M6 = (A21 - A11) * (B11 + B12)</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M6</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M7 = (A12 - A22) * (B21 + B22)</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M7</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// Assemble C from M1-M7</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// C11 = M1 + M4 - M5 + M7</span>
        <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">M1</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M4</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">M5</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M7</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
        <span class="c1">// C12 = M3 + M5</span>
        <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="p">(</span><span class="n">j</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)]</span> <span class="o">+=</span> <span class="n">M3</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M5</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
        <span class="c1">// C21 = M2 + M4</span>
        <span class="n">C</span><span class="p">[(</span><span class="n">i</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">M2</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M4</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
        <span class="c1">// C22 = M1 - M2 + M3 + M6</span>
        <span class="n">C</span><span class="p">[(</span><span class="n">i</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="p">(</span><span class="n">j</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)]</span> <span class="o">+=</span> <span class="n">M1</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">M2</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M3</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M6</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
    <span class="p">}</span>

    <span class="n">free</span><span class="p">(</span><span class="n">M1</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M2</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M3</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M4</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M5</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M6</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M7</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">T1</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">T2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The base case when <code class="language-plaintext highlighter-rouge">level &gt;= MAX_DEPTH</code> calls our optimized AVX-512 kernel from the previous post. Submatrices are addressed using pointer arithmetic: in the row-major layout, A12 is at offset <code class="language-plaintext highlighter-rouge">n/2</code> (right one block of columns) and A21 is at offset <code class="language-plaintext highlighter-rouge">(n/2) * stride</code> (down one block of rows).</p>

<h2 id="choosing-the-recursion-depth">Choosing the Recursion Depth</h2>

<p>The recursion depth MAX_DEPTH controls where we switch from Strassen to the direct kernel. Too shallow and we barely benefit from the reduced complexity; too deep and the overhead of allocating M1-M7 and performing the extra additions dominates.</p>

<p>At each Strassen level we allocate nine temporary matrices of $(n/2)^2$ floats each: M1 through M7 plus the two scratch buffers T1 and T2. For MAX_DEPTH = 3 on an 8192×8192 matrix the leaf problems are 1024×1024. The memory overhead at the top level is:</p>

\[9 \times \frac{8192^2}{4} \times 4 \text{ bytes} \approx 604 \text{ MB}\]

<p>The recursion also needs n to be divisible by $2^{\text{MAX_DEPTH}} \times 32$ to ensure the leaf problems are multiples of our 16×32 register blocking. I pad matrices to the nearest valid size:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">Px</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">NUM_LOADS</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>  <span class="c1">// 32 * 2^MAX_DEPTH</span>
<span class="kt">int</span> <span class="n">Py</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">NUM_LOADS</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

<span class="n">nxp</span> <span class="o">=</span> <span class="p">((</span><span class="n">nx</span> <span class="o">+</span> <span class="n">Px</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Px</span><span class="p">)</span> <span class="o">*</span> <span class="n">Px</span><span class="p">;</span>
<span class="n">nyp</span> <span class="o">=</span> <span class="p">((</span><span class="n">ny</span> <span class="o">+</span> <span class="n">Py</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Py</span><span class="p">)</span> <span class="o">*</span> <span class="n">Py</span><span class="p">;</span>
<span class="n">nxp</span> <span class="o">=</span> <span class="n">nyp</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">max</span><span class="p">(</span><span class="n">nxp</span><span class="p">,</span> <span class="n">nyp</span><span class="p">);</span>  <span class="c1">// keep square for simplicity</span>
</code></pre></div></div>

<p>For MAX_DEPTH = 3 this means padding to multiples of $32 \times 2^3 = 256$.</p>

<h2 id="benchmarks">Benchmarks</h2>

<p>Benchmarked on Intel Xeon W-2295 (18 cores, 3.2 GHz under AVX-512):</p>

<table>
  <thead>
    <tr>
      <th>n</th>
      <th>Direct Kernel</th>
      <th>Strassen (depth=2)</th>
      <th>Strassen (depth=3)</th>
      <th>Speedup (depth=3 vs direct)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2048</td>
      <td>0.025 s</td>
      <td>0.028 s</td>
      <td>0.031 s</td>
      <td>0.81×</td>
    </tr>
    <tr>
      <td>4096</td>
      <td>0.189 s</td>
      <td>0.178 s</td>
      <td>0.162 s</td>
      <td>1.17×</td>
    </tr>
    <tr>
      <td>8192</td>
      <td>1.48 s</td>
      <td>1.21 s</td>
      <td>1.08 s</td>
      <td>1.37×</td>
    </tr>
    <tr>
      <td>16384</td>
      <td>11.7 s</td>
      <td>8.9 s</td>
      <td>7.6 s</td>
      <td>1.54×</td>
    </tr>
  </tbody>
</table>

<p>For small n the allocation and addition overhead makes Strassen slower; the crossover happens around n ≈ 3000 to 4000 on this hardware. At n = 16384, Strassen with depth 3 is 1.54× faster.</p>

<p>The theoretical speedup from depth d is $(8/7)^d$. For d = 3 that’s 1.49×, close to our measured 1.54×. The slight advantage over theory comes from better cache behavior when operating on smaller submatrices.</p>

<p>Strassen’s also has numerical implications: the extra additions and subtractions accumulate rounding error, which showed up while I was benchmarking the kernels. Could such rounding errors be contained by choosing the recursion depth adaptively at each split?</p>]]></content><author><name>Darshan Makwana</name></author><category term="optimization" /><category term="cpp" /><category term="simd" /><category term="avx512" /><category term="algorithms" /><summary type="html"><![CDATA[In the previous matmul with avx512 and loop tiling note we managed to build a cpu kernel that achieves 92% of peak FLOPS. I wanted to check if we can do better by using Strassen’s algorithm. Strassen’s algorithm from 1969 showed that matrix multiplication can be done in $O(n^{2.807})$ by trading multiplications for additions. The main idea of this post is to use strassens for high level recursions and fallback to our highly optimized kernel for lower levels]]></summary></entry></feed>