<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://darshanmakwana412.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://darshanmakwana412.github.io/" rel="alternate" type="text/html" /><updated>2026-05-06T00:41:54+05:30</updated><id>https://darshanmakwana412.github.io/feed.xml</id><title type="html">Darshan Makwana</title><subtitle>code for my public webpage: https://darshanmakwana412.github.io</subtitle><author><name>Darshan Makwana</name></author><entry><title type="html">After 3 months of standing desk</title><link href="https://darshanmakwana412.github.io/2026/05/standing-desk/" rel="alternate" type="text/html" title="After 3 months of standing desk" /><published>2026-05-06T00:00:00+05:30</published><updated>2026-05-06T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/05/standing-desk</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/05/standing-desk/"><![CDATA[<p>For the last two months I have had to change the environment in which I do my long hours of deep work. One part of that change was a standing desk. It was not really a desk per se, just a platform on top of which I could keep my laptop and start coding away. The platform was just tall enough that I couldn’t sit and work, and just short enough that I had to bend my knees, so I added a wooden plank on top of it to level it up.</p>

<p>So I just ran with this setup. The first couple of days were not bad; my legs were already in pretty good shape since I have a regular running habit. I wasn’t feeling fatigued or lethargic at all. If anything, I was more aware of my physical body and of how much energy it had to exert on tasks I used to do while sitting. That awareness was slightly distracting, but not enough to stop me from getting any work done. After the first week it became natural to do long hours of work in this new setup of mine.</p>

<p>I have observed that the speed at which I approach and finish things has slightly increased, and I have become more proactive and more communicative. I now love standing while working. I keep two 10L water jugs alongside my desk and have been experimenting with doing bicep curls/hammer curls/triceps extensions with them at every 30-minute interval. I have also experimented with adding pushups to this combo, and it has been working out fantastically for me. I think there is a clear connection between productivity and putting slight pressure on the body to keep it in order: every time we lift weights or make our muscles work under load, we unconsciously remind our mind that we are in control of our physical selves and that our actions do have a meaningful impact on the world we live in.</p>

<p>I remain energetic even after long hours of deep work, and I don’t get tired after a long day the way I usually did before. I can’t say whether the gamified setup of exercise + work or the standing desk itself is the cause, but I can confirm there is definitely a correlation, and it has been working out fantastically for me.</p>

<p>I gave a couple of interviews over the last month and cleared a majority of them while standing. I now speak more energetically and aggressively when I am standing. I have also noticed that when I don’t get any exercise, or my body does not undergo any form of activity, I simply cannot concentrate, and if I am in a chair I end up spending my day changing positions.</p>]]></content><author><name>Darshan Makwana</name></author><category term="productivity" /><category term="health" /><category term="lifestyle" /><summary type="html"><![CDATA[For the last two months I have had to change the environment in which I do my long hours of deep work. One part of that change was a standing desk. It was not really a desk per se, just a platform on top of which I could keep my laptop and start coding away. The platform was just tall enough that I couldn’t sit and work, and just short enough that I had to bend my knees, so I added a wooden plank on top of it to level it up.]]></summary></entry><entry><title type="html">Gaussian Splatting for Dummies</title><link href="https://darshanmakwana412.github.io/2026/04/gaussian-splatting/" rel="alternate" type="text/html" title="Gaussian Splatting for Dummies" /><published>2026-04-12T00:00:00+05:30</published><updated>2026-04-12T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/04/gaussian-splatting</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/04/gaussian-splatting/"><![CDATA[<p>Gaussian Splatting is a fascinating scene reconstruction technique introduced by <a href="https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/">INRIA</a>, and last year I had a lot of fun tinkering with it while on my semex.
I recently discovered some of my notes on it and decided to digitize them this weekend. Along the way I reimplemented the forward rasterization pass in Rust and figured it would be fun to write a tutorial explaining gaussian splatting to everyone, so here it is</p>
<h2 id="what-is-a-gaussian-splat">what is a gaussian splat?</h2>

<p>a 3D Gaussian splat is an oriented ellipsoid in space that carries a color and an opacity. you can think of it as a fuzzy colored blob. a scene is made of hundreds of thousands of these blobs, and when you look at them from a particular viewpoint, they overlap and blend to form the final image.</p>

<p>We represent each gaussian with these attributes:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Splat</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">pos</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>      <span class="c1">// center position in world space</span>
    <span class="k">pub</span> <span class="n">scale</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>    <span class="c1">// size along each local axis</span>
    <span class="k">pub</span> <span class="n">rot</span><span class="p">:</span> <span class="n">Quat</span><span class="p">,</span>      <span class="c1">// orientation as a unit quaternion</span>
    <span class="k">pub</span> <span class="n">color</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>    <span class="c1">// RGB color (already decoded from spherical harmonics)</span>
    <span class="k">pub</span> <span class="n">opacity</span><span class="p">:</span> <span class="nb">f32</span><span class="p">,</span>   <span class="c1">// how opaque this blob is, in [0, 1]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>scales are stored in log space, opacities as logits, colors as spherical harmonics coefficients, and quaternions are normalized to unit length, which keeps each attribute within its valid range</p>
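<p>as a sketch, the decoding step looks like this in plain rust (plain arrays instead of the glam types used later, and the function names are mine):</p>

```rust
// decode raw stored splat parameters into their usable ranges
// (illustrative sketch; a real loader reads these fields from the .ply)
fn decode_scale(log_scale: [f32; 3]) -> [f32; 3] {
    // scales are stored in log space; exp() maps them back to positive sizes
    [log_scale[0].exp(), log_scale[1].exp(), log_scale[2].exp()]
}

fn decode_opacity(logit: f32) -> f32 {
    // opacities are stored as logits; the sigmoid maps them into [0, 1]
    1.0 / (1.0 + (-logit).exp())
}

fn normalize_quat(q: [f32; 4]) -> [f32; 4] {
    // quaternions are normalized to unit length so they encode pure rotations
    let n = (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]).sqrt();
    [q[0] / n, q[1] / n, q[2] / n, q[3] / n]
}
```

<p>a logit of 0 decodes to opacity 0.5, and a log-scale of 0 decodes to a scale of 1.</p>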

<p><img src="/assets/images/posts/Pasted image 20260412233548.png" alt="Pasted image 20260412233548.png" /></p>

<p>spherical harmonic (SH) coefficients are just a frequency-domain representation of a color function defined over the unit sphere. why spherical harmonics? because in the real world, the color of a surface depends on the viewing direction, and SH coefficients encode this view-dependent appearance compactly.</p>

<p>SH functions are organized in bands (like octaves in music). as you go higher up in the bands you have more coefficients, which capture finer details. the INRIA 3DGS format stores up to band 3 (48 coefficients per splat for RGB)</p>

<p>To decode band 0: the band-0 SH basis function is $Y_0^0 = \frac{1}{2\sqrt{\pi}} \approx 0.282$, and the conversion from SH coefficient to RGB is:</p>

\[\text{color} = \text{clamp}\left(0.5 + C_0 \cdot f_{dc},\ 0,\ 1\right)\]

<p>where $C_0 = Y_0^0$ and $f_{dc}$ is the 3-component DC coefficient from the file.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">const</span> <span class="n">SH_C0</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="mf">0.28209479177387814</span><span class="p">;</span>

<span class="k">pub</span> <span class="k">fn</span> <span class="nf">sh_band0_to_rgb</span><span class="p">(</span><span class="n">f_dc</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Vec3</span> <span class="p">{</span>
    <span class="p">(</span><span class="nn">Vec3</span><span class="p">::</span><span class="nf">splat</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span> <span class="o">+</span> <span class="n">SH_C0</span> <span class="o">*</span> <span class="n">f_dc</span><span class="p">)</span><span class="nf">.clamp</span><span class="p">(</span><span class="nn">Vec3</span><span class="p">::</span><span class="n">ZERO</span><span class="p">,</span> <span class="nn">Vec3</span><span class="p">::</span><span class="n">ONE</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="the-forward-pass-pipeline">the forward pass pipeline</h2>

<p>the forward pass turns a list of 3D Gaussians + a camera into a 2D image. here is an overview of the rendering pipeline:</p>

<p><img src="/assets/images/posts/Pasted image 20260412233610.png" alt="Pasted image 20260412233610.png" /></p>

<h2 id="step-1-projecting-splats">Step 1: Projecting Splats</h2>

<h3 id="11-building-the-3d-covariance-matrix">1.1: building the 3D covariance matrix</h3>

<p>for each splat given the raw <code class="language-plaintext highlighter-rouge">(scale, rotation)</code> pairs we need to construct a 3D covariance matrix $\Sigma$ that describes the shape and orientation of the Gaussian in world space. the formula is:</p>

\[\Sigma = R \cdot S \cdot S^T \cdot R^T\]

<p>where R is the 3×3 rotation matrix from the quaternion, and S is a diagonal matrix of scales. if we let M = R·S, this simplifies to:</p>

\[\Sigma = M \cdot M^T\]

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">r_mat</span> <span class="o">=</span> <span class="nn">Mat3</span><span class="p">::</span><span class="nf">from_quat</span><span class="p">(</span><span class="n">s</span><span class="py">.rot</span><span class="p">);</span>
<span class="k">let</span> <span class="n">s_mat</span> <span class="o">=</span> <span class="nn">Mat3</span><span class="p">::</span><span class="nf">from_diagonal</span><span class="p">(</span><span class="n">s</span><span class="py">.scale</span><span class="p">);</span>
<span class="k">let</span> <span class="n">m</span> <span class="o">=</span> <span class="n">r_mat</span> <span class="o">*</span> <span class="n">s_mat</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov3d</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">m</span><span class="nf">.transpose</span><span class="p">();</span>
</code></pre></div></div>

<hr />

<p><strong>Note: why do we decompose the covariance this way?</strong></p>

<p>Covariance matrices have physical meaning only when they are <strong>positive semi-definite</strong>. gradient descent cannot easily be constrained to produce valid matrices, but by expressing the covariance as $M \cdot M^T$ it is guaranteed to be positive semi-definite, since a matrix of the form $A A^T$ always is. this is a reparametrization trick: we optimize <code class="language-plaintext highlighter-rouge">scale</code> and <code class="language-plaintext highlighter-rouge">rotation</code> separately, which are unconstrained, and the covariance we derive from them is always valid</p>

<hr />

<p>what does this matrix actually look like? for a splat with <code class="language-plaintext highlighter-rouge">scale = (0.1, 0.05, 0.02)</code> and identity rotation:</p>

\[\Sigma = \text{diag}(0.1^2,\; 0.05^2,\; 0.02^2) =
\begin{pmatrix}
0.01 &amp; 0 &amp; 0 \\
0 &amp; 0.0025 &amp; 0 \\
0 &amp; 0 &amp; 0.0004
\end{pmatrix}\]

<p>with identity rotation, it is just the squared scales on the diagonal, an axis-aligned ellipsoid</p>
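<p>here is the same construction written out with plain 3×3 arrays instead of glam (the quaternion layout is (x, y, z, w), and the helper names are mine). with the identity rotation it reproduces the diagonal matrix above:</p>

```rust
// standard rotation matrix from a unit quaternion q = (x, y, z, w)
fn quat_to_mat3(q: [f32; 4]) -> [[f32; 3]; 3] {
    let (x, y, z, w) = (q[0], q[1], q[2], q[3]);
    [
        [1.0 - 2.0 * (y * y + z * z), 2.0 * (x * y - w * z), 2.0 * (x * z + w * y)],
        [2.0 * (x * y + w * z), 1.0 - 2.0 * (x * x + z * z), 2.0 * (y * z - w * x)],
        [2.0 * (x * z - w * y), 2.0 * (y * z + w * x), 1.0 - 2.0 * (x * x + y * y)],
    ]
}

// Sigma = (R * S) * (R * S)^T, symmetric positive semi-definite by construction
fn cov3d(q: [f32; 4], scale: [f32; 3]) -> [[f32; 3]; 3] {
    let r = quat_to_mat3(q);
    // M = R * S: multiplying by a diagonal matrix scales the columns of R
    let mut m = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            m[i][j] = r[i][j] * scale[j];
        }
    }
    // Sigma = M * M^T
    let mut cov = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                cov[i][j] += m[i][k] * m[j][k];
            }
        }
    }
    cov
}
```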

<h3 id="12-transforming-into-view-space">1.2: transforming into view space</h3>

<p>the 3D covariance we just computed lives in world space. to project it onto the camera’s image plane, we first need to rotate it into view space, the coordinate system where the camera is at the origin, looking down −z</p>

<p>for the splat center, this is just a matrix-vector multiply with the 4×4 view matrix:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">p_view4</span> <span class="o">=</span> <span class="n">view</span> <span class="o">*</span> <span class="nn">Vec4</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">s</span><span class="py">.pos.x</span><span class="p">,</span> <span class="n">s</span><span class="py">.pos.y</span><span class="p">,</span> <span class="n">s</span><span class="py">.pos.z</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">p_view</span> <span class="o">=</span> <span class="nn">Vec3</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">p_view4</span><span class="py">.x</span><span class="p">,</span> <span class="n">p_view4</span><span class="py">.y</span><span class="p">,</span> <span class="n">p_view4</span><span class="py">.z</span><span class="p">);</span>
<span class="k">if</span> <span class="n">p_view</span><span class="py">.z</span> <span class="o">&gt;</span> <span class="o">-</span><span class="n">znear</span> <span class="p">||</span> <span class="n">p_view</span><span class="py">.z</span> <span class="o">&lt;</span> <span class="o">-</span><span class="n">zfar</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">zc</span> <span class="o">=</span> <span class="o">-</span><span class="n">p_view</span><span class="py">.z</span><span class="p">;</span>
</code></pre></div></div>

<p>note <code class="language-plaintext highlighter-rouge">zc = -p_view.z</code>. our view space is right-handed with the camera looking down <strong>−z</strong>, so points in front of the camera have negative z. we use <code class="language-plaintext highlighter-rouge">zc</code> (positive in front) as the depth for sorting and projection.</p>

<p>for the covariance, we rotate it by the 3×3 part of the view matrix W:</p>

\[\Sigma_{view} = W \cdot \Sigma \cdot W^T\]

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">w_mat</span> <span class="o">=</span> <span class="nn">Mat3</span><span class="p">::</span><span class="nf">from_mat4</span><span class="p">(</span><span class="n">view</span><span class="p">);</span>
<span class="k">let</span> <span class="n">w_mat_t</span> <span class="o">=</span> <span class="n">w_mat</span><span class="nf">.transpose</span><span class="p">();</span>

<span class="k">let</span> <span class="n">cov3d_view</span> <span class="o">=</span> <span class="n">w_mat</span> <span class="o">*</span> <span class="n">cov3d</span> <span class="o">*</span> <span class="n">w_mat_t</span><span class="p">;</span>
</code></pre></div></div>

<p>this is just the standard basis-change formula for a covariance matrix. the shape of the ellipsoid does not change; in fact we are only re-expressing it in the camera’s coordinate system.</p>

<h3 id="13-projecting-to-2d">1.3: projecting to 2D</h3>

<p>now we have a 3D Gaussian in view space and we need to project it onto the 2D image plane. the projection is perspective, which means a 3D Gaussian does not project to an exact 2D Gaussian because perspective is a nonlinear transform. but we can locally linearize it using the Jacobian of the projection function, and the result is close enough.</p>

<p>the projection function maps a 3D point (x, y, z) in view space to pixel coordinates (u, v):</p>

<p>\(u = f_x \cdot \frac{x}{z_c} + c_x\)
\(v = f_y \cdot \frac{y}{z_c} + c_y\)</p>

<p>where $f_x, f_y$ are the focal lengths and $c_x, c_y$ are the principal point (image center). for simplicity of calculation we can also assume $f_x = f_y$</p>

<p>the Jacobian J of this projection evaluated at the splat center is:</p>

\[J = \begin{bmatrix} \frac{f_x}{z_c} &amp; 0 &amp; \frac{f_x \cdot x_v}{z_c^2} \\ 0 &amp; \frac{f_y}{z_c} &amp; \frac{f_y \cdot y_v}{z_c^2} \\ 0 &amp; 0 &amp; 0 \end{bmatrix}\]

<p>the structure of this matrix is very sparse: only 4 of the 9 entries are nonzero. we could do the full $J \Sigma_{view} J^T$ with two 3×3 matrix multiplies (~54 scalar multiplies), but we only need the top-left $2\times 2$ of the result, since the third row of $J$ is all zeros and the first two rows of $J$ each have only two nonzero entries. so instead of two full matrix multiplies, we can compute the 2D covariance with ~20 scalar multiplies by expanding the product by hand:</p>

<hr />

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">c</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">cov3d_view</span><span class="p">;</span>
<span class="k">let</span> <span class="n">inv_zc</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">zc</span><span class="p">;</span>
<span class="k">let</span> <span class="n">inv_zc2</span> <span class="o">=</span> <span class="n">inv_zc</span> <span class="o">*</span> <span class="n">inv_zc</span><span class="p">;</span>

<span class="k">let</span> <span class="n">j00</span> <span class="o">=</span> <span class="n">fx</span> <span class="o">*</span> <span class="n">inv_zc</span><span class="p">;</span>
<span class="k">let</span> <span class="n">j02</span> <span class="o">=</span> <span class="n">fx</span> <span class="o">*</span> <span class="n">xv</span> <span class="o">*</span> <span class="n">inv_zc2</span><span class="p">;</span>
<span class="k">let</span> <span class="n">j11</span> <span class="o">=</span> <span class="n">fy</span> <span class="o">*</span> <span class="n">inv_zc</span><span class="p">;</span>
<span class="k">let</span> <span class="n">j12</span> <span class="o">=</span> <span class="n">fy</span> <span class="o">*</span> <span class="n">yv</span> <span class="o">*</span> <span class="n">inv_zc2</span><span class="p">;</span>

<span class="c1">// Row 0 of J * C: [j00*c00 + j02*c20, j00*c01 + j02*c21, j00*c02 + j02*c22]</span>
<span class="k">let</span> <span class="n">t0x</span> <span class="o">=</span> <span class="n">j00</span> <span class="o">*</span> <span class="n">c</span><span class="py">.x_axis.x</span> <span class="o">+</span> <span class="n">j02</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.x</span><span class="p">;</span>
<span class="k">let</span> <span class="n">t0y</span> <span class="o">=</span> <span class="n">j00</span> <span class="o">*</span> <span class="n">c</span><span class="py">.y_axis.x</span> <span class="o">+</span> <span class="n">j02</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.y</span><span class="p">;</span>
<span class="k">let</span> <span class="n">t0z</span> <span class="o">=</span> <span class="n">j00</span> <span class="o">*</span> <span class="n">c</span><span class="py">.x_axis.z</span> <span class="o">+</span> <span class="n">j02</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.z</span><span class="p">;</span>

<span class="c1">// Row 1 of J * C: [j11*c10 + j12*c20, j11*c11 + j12*c21, j11*c12 + j12*c22]</span>
<span class="k">let</span> <span class="n">t1y</span> <span class="o">=</span> <span class="n">j11</span> <span class="o">*</span> <span class="n">c</span><span class="py">.y_axis.y</span> <span class="o">+</span> <span class="n">j12</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.y</span><span class="p">;</span>
<span class="k">let</span> <span class="n">t1z</span> <span class="o">=</span> <span class="n">j11</span> <span class="o">*</span> <span class="n">c</span><span class="py">.y_axis.z</span> <span class="o">+</span> <span class="n">j12</span> <span class="o">*</span> <span class="n">c</span><span class="py">.z_axis.z</span><span class="p">;</span>

<span class="c1">// 2D cov = (J*C) * J^T, top-left 2x2:</span>
<span class="k">let</span> <span class="n">cov2d_00</span> <span class="o">=</span> <span class="n">t0x</span> <span class="o">*</span> <span class="n">j00</span> <span class="o">+</span> <span class="n">t0z</span> <span class="o">*</span> <span class="n">j02</span> <span class="o">+</span> <span class="n">eps2d</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov2d_01</span> <span class="o">=</span> <span class="n">t0y</span> <span class="o">*</span> <span class="n">j11</span> <span class="o">+</span> <span class="n">t0z</span> <span class="o">*</span> <span class="n">j12</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov2d_11</span> <span class="o">=</span> <span class="n">t1y</span> <span class="o">*</span> <span class="n">j11</span> <span class="o">+</span> <span class="n">t1z</span> <span class="o">*</span> <span class="n">j12</span> <span class="o">+</span> <span class="n">eps2d</span><span class="p">;</span>
</code></pre></div></div>

<p>notice the eps2d on the diagonal entries. that is a small dilation (default 0.3) added for numerical stability; it ensures the 2D covariance is strictly positive definite (not just semi-definite), which means it is always invertible.</p>

<hr />

<p><strong>Note: why the eps2d trick works</strong></p>

<p>by construction, the 2D covariance $JCJ^T$ is only positive semi-definite ($A^T A$ form). but we need to invert it later (for evaluating the Gaussian at each pixel). a singular matrix is not invertible.</p>

<p>adding eps2d to the diagonal means adding $\lambda I$ to the matrix. for any vector x:</p>

\[x^T \cdot (A^T A + \lambda I) \cdot x = \|Ax\|^2 + \lambda \|x\|^2 &gt; 0\]

<p>this is strictly positive for any nonzero x, which is the definition of positive definite, invertible, with all eigenvalues strictly positive</p>
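<p>a quick numeric illustration (helper names are mine): a perfectly degenerate 2×2 covariance has zero determinant and cannot be inverted, but after the eps2d dilation its determinant is strictly positive:</p>

```rust
// determinant of a 2x2 matrix
fn det2(m: [[f32; 2]; 2]) -> f32 {
    m[0][0] * m[1][1] - m[0][1] * m[1][0]
}

// the eps2d dilation: add eps * I to the matrix
fn add_eps(m: [[f32; 2]; 2], eps: f32) -> [[f32; 2]; 2] {
    [[m[0][0] + eps, m[0][1]], [m[1][0], m[1][1] + eps]]
}
```

<p>for example, [[1, 1], [1, 1]] has det 0; after adding eps2d = 0.3 the det is 1.3·1.3 - 1 = 0.69, so the matrix is invertible.</p>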

<hr />

<p>next we invert the $2\times 2$ covariance. for a $2\times 2$ matrix the inverse has a closed form:</p>

\[\begin{bmatrix} a &amp; b \\ b &amp; d \end{bmatrix}^{-1} = \frac{1}{ad - b^2} \begin{bmatrix} d &amp; -b \\ -b &amp; a \end{bmatrix}\]

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">det</span> <span class="o">=</span> <span class="n">cov2d_00</span> <span class="o">*</span> <span class="n">cov2d_11</span> <span class="o">-</span> <span class="n">cov2d_01</span> <span class="o">*</span> <span class="n">cov2d_01</span><span class="p">;</span>
<span class="k">if</span> <span class="n">det</span> <span class="o">&lt;=</span> <span class="mf">0.0</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">inv_det</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">det</span><span class="p">;</span>
<span class="k">let</span> <span class="n">cov2d_inv</span> <span class="o">=</span> <span class="nn">Mat2</span><span class="p">::</span><span class="nf">from_cols</span><span class="p">(</span>
    <span class="nn">Vec2</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">cov2d_11</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">,</span> <span class="o">-</span><span class="n">cov2d_01</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">),</span>
    <span class="nn">Vec2</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="o">-</span><span class="n">cov2d_01</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">,</span> <span class="n">cov2d_00</span> <span class="o">*</span> <span class="n">inv_det</span><span class="p">),</span>
<span class="p">);</span>
</code></pre></div></div>

<p>and the screen position is a standard perspective divide:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">sx</span> <span class="o">=</span> <span class="n">fx</span> <span class="o">*</span> <span class="n">xv</span> <span class="o">*</span> <span class="n">inv_zc</span> <span class="o">+</span> <span class="n">cx</span><span class="p">;</span>
<span class="k">let</span> <span class="n">sy</span> <span class="o">=</span> <span class="n">fy</span> <span class="o">*</span> <span class="n">yv</span> <span class="o">*</span> <span class="n">inv_zc</span> <span class="o">+</span> <span class="n">cy</span><span class="p">;</span>
</code></pre></div></div>

<p>at this point, for each splat we have: screen position <code class="language-plaintext highlighter-rouge">(sx, sy)</code>, depth <code class="language-plaintext highlighter-rouge">zc</code>, and the inverse 2D covariance matrix <code class="language-plaintext highlighter-rouge">cov2d_inv</code>. this is everything we need to evaluate the Gaussian at any pixel on screen.</p>
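<p>to make that concrete, evaluating the gaussian at a pixel is just a quadratic form with the inverse covariance (a sketch with plain 2×2 arrays; the name is mine):</p>

```rust
// falloff of the projected gaussian at a pixel offset (dx, dy) from the
// splat center: exp(-0.5 * d^T * cov2d_inv * d), in (0, 1]
fn gaussian_weight(dx: f32, dy: f32, inv: [[f32; 2]; 2]) -> f32 {
    // the two off-diagonal terms are equal because the covariance is symmetric
    let power = -0.5 * (inv[0][0] * dx * dx + 2.0 * inv[0][1] * dx * dy + inv[1][1] * dy * dy);
    power.exp()
}
```

<p>multiplying this weight by the splat’s opacity gives its alpha contribution to the pixel.</p>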

<h2 id="step-2-computing-bounding-boxes">step 2: computing bounding boxes</h2>

<p>in principle we would now have to evaluate the splat at every pixel on the screen, because a 2D gaussian has infinite support, it never truly reaches zero. instead, we compute a bounding box that encloses the region where the gaussian has any visible effect and only evaluate the pixels inside it</p>

<p>the original 3DGS code computes the two eigenvalues $\lambda_1$, $\lambda_2$ of the 2D covariance (the variances along the two principal axes of the ellipse), takes $r = 3\sqrt{\max(\lambda_1, \lambda_2)}$ (the 3-sigma rule, covering 99.7% of the Gaussian), and uses a circle of that radius as the bounding box</p>

<p>this is simple but wasteful. when a Gaussian is elongated (one eigenvalue much larger than the other), the bounding circle will also include a lot of empty space</p>

<p><img src="/assets/images/posts/Pasted image 20260412233627.png" alt="Pasted image 20260412233627.png" /></p>
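<p>for reference, the bounding-circle radius of the classic approach can be sketched like this (closed-form eigenvalues of the symmetric 2×2 covariance [[a, b], [b, d]]; the function name is mine):</p>

```rust
// classic 3DGS bounding radius: 3 sigma along the longest principal axis
fn bounding_radius(a: f32, b: f32, d: f32) -> f32 {
    // eigenvalues of the symmetric matrix [[a, b], [b, d]] are mid +/- half
    let mid = 0.5 * (a + d);
    let half = (0.25 * (a - d) * (a - d) + b * b).sqrt();
    let lambda_max = mid + half;
    3.0 * lambda_max.sqrt()
}
```

<p>for cov2d = [[4, 0], [0, 1]] this gives a radius of 6, i.e. a 12×12 bounding square, while per-axis extents of $3\sqrt{\Sigma_{00}} = 6$ and $3\sqrt{\Sigma_{11}} = 3$ would give a tighter 12×6 box.</p>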

<p>We can create a tighter bounding box by observing that:</p>

<ol>
  <li>
    <p>the extent along each axis is $k\sqrt{\Sigma_{ii}}$ where $\Sigma_{ii}$ is the diagonal entry of the 2D covariance for that axis. for an elongated ellipse, the short-axis extent is much smaller than the long-axis extent, so the bounding box is tighter.</p>
  </li>
  <li>
    <p>the classic $3\sigma$ rule is conservative. a faint splat (low opacity) does not need $3\sigma$ because its contribution drops below the visibility threshold much sooner. the cutoff distance $k$ can be computed from the point where the splat’s contribution falls below a threshold $\tau$</p>
  </li>
</ol>

<p>based on this, the alpha a splat contributes to a pixel at offset $\mathbf{d}$ from its center is</p>

\[\alpha = \text{opacity} \cdot \exp(-\tfrac{1}{2} \cdot \mathbf{d}^T \Sigma^{-1} \mathbf{d})\]

<p>we want $\alpha \geq \tau$, which rearranges to:</p>

\[\mathbf{d}^T \Sigma^{-1} \mathbf{d} \leq 2 \ln\left(\frac{\text{opacity}}{\tau}\right) = k^2\]

<p>for a near-opaque splat (opacity = 1) with $\tau = 1/255$, $k^2 = 2\ln(255) \approx 11.1$, which is close to the $3\sigma$ value of 9. for a faint splat (opacity = 0.1), $k^2 = 2\ln(25.5) \approx 6.5$, so the box and the computation shrink substantially.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">s</span><span class="py">.opacity</span> <span class="o">&lt;=</span> <span class="n">alpha_threshold</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">k2</span> <span class="o">=</span> <span class="p">(</span><span class="mf">2.0</span> <span class="o">*</span> <span class="p">(</span><span class="n">s</span><span class="py">.opacity</span> <span class="o">/</span> <span class="n">alpha_threshold</span><span class="p">)</span><span class="nf">.ln</span><span class="p">())</span><span class="nf">.min</span><span class="p">(</span><span class="n">max_k2</span><span class="p">);</span>
<span class="k">if</span> <span class="o">!</span><span class="p">(</span><span class="n">k2</span> <span class="o">&gt;</span> <span class="mf">0.0</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">rx_f</span> <span class="o">=</span> <span class="p">(</span><span class="n">k2</span> <span class="o">*</span> <span class="n">cov2d_00</span><span class="p">)</span><span class="nf">.sqrt</span><span class="p">();</span>
<span class="k">let</span> <span class="n">ry_f</span> <span class="o">=</span> <span class="p">(</span><span class="n">k2</span> <span class="o">*</span> <span class="n">cov2d_11</span><span class="p">)</span><span class="nf">.sqrt</span><span class="p">();</span>
<span class="k">if</span> <span class="o">!</span><span class="n">rx_f</span><span class="nf">.is_finite</span><span class="p">()</span> <span class="p">||</span> <span class="o">!</span><span class="n">ry_f</span><span class="nf">.is_finite</span><span class="p">()</span> <span class="p">||</span> <span class="n">rx_f</span> <span class="o">&lt;</span> <span class="mf">1.0</span> <span class="p">||</span> <span class="n">ry_f</span> <span class="o">&lt;</span> <span class="mf">1.0</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">x0</span> <span class="o">=</span> <span class="p">(</span><span class="n">sx</span> <span class="o">-</span> <span class="n">rx_f</span><span class="p">)</span><span class="nf">.floor</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">y0</span> <span class="o">=</span> <span class="p">(</span><span class="n">sy</span> <span class="o">-</span> <span class="n">ry_f</span><span class="p">)</span><span class="nf">.floor</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">x1</span> <span class="o">=</span> <span class="p">(</span><span class="n">sx</span> <span class="o">+</span> <span class="n">rx_f</span><span class="p">)</span><span class="nf">.ceil</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
<span class="k">let</span> <span class="n">y1</span> <span class="o">=</span> <span class="p">(</span><span class="n">sy</span> <span class="o">+</span> <span class="n">ry_f</span><span class="p">)</span><span class="nf">.ceil</span><span class="p">()</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>

<span class="c1">// Clip to framebuffer.</span>
<span class="k">let</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">x0</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">y0</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="k">let</span> <span class="n">x1</span> <span class="o">=</span> <span class="n">x1</span><span class="nf">.min</span><span class="p">(</span><span class="n">w_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">let</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">y1</span><span class="nf">.min</span><span class="p">(</span><span class="n">h_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">if</span> <span class="n">x0</span> <span class="o">&gt;</span> <span class="n">x1</span> <span class="p">||</span> <span class="n">y0</span> <span class="o">&gt;</span> <span class="n">y1</span> <span class="p">{</span>
    <span class="k">return</span> <span class="nb">None</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Also, because each splat is independent of the others, the projection computation is embarrassingly parallel. we use rayon’s <code class="language-plaintext highlighter-rouge">par_iter().filter_map()</code> to project all splats across all cores:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">project</span><span class="p">(</span>
    <span class="n">splats</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Splat</span><span class="p">],</span>
    <span class="n">camera</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">OrbitCamera</span><span class="p">,</span>
    <span class="n">params</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">RenderParams</span><span class="p">,</span>
    <span class="n">pool</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="nn">rayon</span><span class="p">::</span><span class="n">ThreadPool</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Projected</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="c1">// ... precompute view matrix, intrinsics, etc (once per frame) ...</span>

    <span class="k">let</span> <span class="n">do_project</span> <span class="o">=</span> <span class="p">||</span> <span class="p">{</span>
        <span class="n">splats</span>
            <span class="nf">.par_iter</span><span class="p">()</span>
            <span class="nf">.filter_map</span><span class="p">(|</span><span class="n">s</span><span class="p">|</span> <span class="p">{</span>
                <span class="c1">// ... all the math above, returning Some(Projected) or None ...</span>
            <span class="p">})</span>
            <span class="nf">.collect</span><span class="p">()</span>
    <span class="p">};</span>

    <span class="k">match</span> <span class="n">pool</span><span class="nf">.as_ref</span><span class="p">()</span> <span class="p">{</span>
        <span class="nf">Some</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="n">p</span><span class="nf">.install</span><span class="p">(</span><span class="n">do_project</span><span class="p">),</span>
        <span class="nb">None</span> <span class="k">=&gt;</span> <span class="nf">do_project</span><span class="p">(),</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>we can now pack everything into a <code class="language-plaintext highlighter-rouge">Projected</code> struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Projected</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">screen</span><span class="p">:</span> <span class="n">Vec2</span><span class="p">,</span>       <span class="c1">// pixel center (sx, sy)</span>
    <span class="k">pub</span> <span class="n">depth</span><span class="p">:</span> <span class="nb">f32</span><span class="p">,</span>         <span class="c1">// zc (positive in front)</span>
    <span class="k">pub</span> <span class="n">cov2d_inv</span><span class="p">:</span> <span class="n">Mat2</span><span class="p">,</span>    <span class="c1">// inverse 2D covariance</span>
    <span class="k">pub</span> <span class="n">bbox</span><span class="p">:</span> <span class="p">[</span><span class="nb">i32</span><span class="p">;</span> <span class="mi">4</span><span class="p">],</span>     <span class="c1">// inclusive: x0, y0, x1, y1</span>
    <span class="k">pub</span> <span class="n">color</span><span class="p">:</span> <span class="n">Vec3</span><span class="p">,</span>        <span class="c1">// RGB</span>
    <span class="k">pub</span> <span class="n">opacity</span><span class="p">:</span> <span class="nb">f32</span><span class="p">,</span>       <span class="c1">// [0, 1]</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="step-3-depth-sort">step 3: depth sort</h2>

<p>For alpha compositing we use <a href="https://en.wikipedia.org/wiki/Painter's_algorithm">painter’s algorithm in reverse</a>: the projected splats must be sorted by depth before compositing. the order matters because if a close splat occludes a far one, the close splat must be composited first so it carries more weight in the blend. because all depths are positive (we culled anything behind the camera), the bit pattern of an <code class="language-plaintext highlighter-rouge">f32</code> preserves float ordering when reinterpreted as a <code class="language-plaintext highlighter-rouge">u32</code>, which lets us use <code class="language-plaintext highlighter-rouge">depth.to_bits()</code> as a sort key.</p>
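<p>a quick standalone sanity check of this bit trick (illustrative only, not part of the renderer):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn main() {
    // For non-negative f32 values, the IEEE-754 bit pattern sorts in
    // the same order as the float itself.
    let depths: Vec&lt;f32&gt; = vec![3.5, 0.25, 1e6, 0.0, 42.0];

    // Sort one copy by float comparison, another by the u32 bit pattern.
    let mut by_float = depths.clone();
    by_float.sort_by(|a, b| a.partial_cmp(b).unwrap());

    let mut by_bits = depths.clone();
    by_bits.sort_by_key(|d| d.to_bits());

    assert_eq!(by_float, by_bits);
}
</code></pre></div></div>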

<p>for the actual sorting we use a simple 2-pass 16-bit radix sort. for our input sizes (~100k–200k elements) it is faster than comparison-based sorting (like Rust’s <code class="language-plaintext highlighter-rouge">sort_unstable_by_key</code>) since it runs in O(n) time with small constants. 2 passes of 16 bits worked better than 4 passes of 8 bits: fewer passes means fewer traversals of the data, and the 65536-entry histograms (256KB each) fit comfortably in my laptop’s L2 cache. at 200k splats this is consistently 2 times faster than Rust’s comparison sort.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">sort_by_depth</span><span class="p">(</span><span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">Projected</span><span class="p">],</span> <span class="n">scratch</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">ScratchBuffers</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">n</span> <span class="o">=</span> <span class="n">projected</span><span class="nf">.len</span><span class="p">();</span>
    <span class="k">if</span> <span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">1</span> <span class="p">{</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.reserve</span><span class="p">(</span><span class="n">n</span><span class="nf">.saturating_sub</span><span class="p">(</span><span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.capacity</span><span class="p">()));</span>
    <span class="k">unsafe</span> <span class="p">{</span> <span class="n">scratch</span><span class="py">.sort_aux</span><span class="nf">.set_len</span><span class="p">(</span><span class="n">n</span><span class="p">);</span> <span class="p">}</span>
    <span class="k">let</span> <span class="n">aux</span> <span class="o">=</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">scratch</span><span class="py">.sort_aux</span><span class="p">;</span>

    <span class="c1">// Both histograms can be computed in a single pass over the keys.</span>
    <span class="c1">// Stack-allocated to avoid heap alloc.</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">counts_lo</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">counts_hi</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
    <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">projected</span><span class="nf">.iter</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">k</span> <span class="o">=</span> <span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">();</span>
        <span class="n">counts_lo</span><span class="p">[(</span><span class="n">k</span> <span class="o">&amp;</span> <span class="mi">0xFFFF</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
        <span class="n">counts_hi</span><span class="p">[(</span><span class="n">k</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Pass 1: sort by low 16 bits</span>
    <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">offsets</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0u32</span><span class="p">;</span>
        <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">65536</span> <span class="p">{</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
            <span class="n">sum</span> <span class="o">+=</span> <span class="n">counts_lo</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">projected</span><span class="nf">.iter</span><span class="p">()</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">bucket</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">()</span> <span class="o">&amp;</span> <span class="mi">0xFFFF</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">aux</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1">// Pass 2: sort by high 16 bits</span>
    <span class="p">{</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">offsets</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">65536</span><span class="p">];</span>
        <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0u32</span><span class="p">;</span>
        <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">65536</span> <span class="p">{</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
            <span class="n">sum</span> <span class="o">+=</span> <span class="n">counts_hi</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">aux</span><span class="nf">.iter</span><span class="p">()</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">bucket</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">()</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="n">offsets</span><span class="p">[</span><span class="n">bucket</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="n">projected</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>when a thread pool is available and there are at least 50k splats, we hand the work to rayon’s parallel comparison sort instead:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">sort_by_depth_parallel</span><span class="p">(</span>
    <span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">Projected</span><span class="p">],</span>
    <span class="n">scratch</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">ScratchBuffers</span><span class="p">,</span>
    <span class="n">pool</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="nn">rayon</span><span class="p">::</span><span class="n">ThreadPool</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="n">projected</span><span class="nf">.len</span><span class="p">()</span> <span class="o">&lt;</span> <span class="mi">50_000</span> <span class="p">||</span> <span class="n">pool</span><span class="nf">.is_none</span><span class="p">()</span> <span class="p">{</span>
        <span class="nf">sort_by_depth</span><span class="p">(</span><span class="n">projected</span><span class="p">,</span> <span class="n">scratch</span><span class="p">);</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">pool</span><span class="nf">.as_ref</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">()</span><span class="nf">.install</span><span class="p">(||</span> <span class="p">{</span>
        <span class="n">projected</span><span class="nf">.par_sort_unstable_by_key</span><span class="p">(|</span><span class="n">p</span><span class="p">|</span> <span class="n">p</span><span class="py">.depth</span><span class="nf">.to_bits</span><span class="p">());</span>
    <span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="step-4-tile-binning">step 4: tile binning</h2>

<p>we now have a depth-sorted list of projected splats. the naive approach to compositing would be: for each splat, iterate every pixel in its bounding box and accumulate color. this works but it is cache-hostile because different splats touch overlapping pixel regions in unpredictable order.</p>

<p>the original 3DGS implementation divides the image into tiles (16×16 pixel blocks) and builds an index that tells each tile exactly which splats overlap it. then each tile composites only its own splats, in order, touching only its own pixels.</p>

<p>the binning uses a two-pass count-then-scatter approach. in pass 1, for each splat we count how many tiles its bbox touches, building a per-tile count array. we then prefix-sum the counts into offsets (so <code class="language-plaintext highlighter-rouge">offsets[i]</code> = start of tile i’s bucket). in pass 2, we scatter each splat’s index into its tile buckets, advancing a per-tile cursor. because we iterate splats in depth-sorted order, each tile’s bucket ends up depth-sorted automatically, so no additional per-tile sort is needed.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">bin_splats</span><span class="p">(</span>
    <span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Projected</span><span class="p">],</span>
    <span class="n">width</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">height</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">bins</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">TileBins</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">num_tiles_x</span> <span class="o">=</span> <span class="p">((</span><span class="n">width</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">+</span> <span class="n">TILE_W</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles_y</span> <span class="o">=</span> <span class="p">((</span><span class="n">height</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">+</span> <span class="n">TILE_H</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles</span> <span class="o">=</span> <span class="p">(</span><span class="n">num_tiles_x</span> <span class="o">*</span> <span class="n">num_tiles_y</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>

    <span class="c1">// offsets is used first as a count array, then prefix-summed in place.</span>
    <span class="n">bins</span><span class="py">.offsets</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">bins</span><span class="py">.offsets</span><span class="nf">.resize</span><span class="p">(</span><span class="n">num_tiles</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="c1">// Pass 1: count tile touches per splat.</span>
    <span class="k">for</span> <span class="n">p</span> <span class="k">in</span> <span class="n">projected</span> <span class="p">{</span>
        <span class="k">let</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">tx0</span> <span class="o">=</span> <span class="p">(</span><span class="n">x0</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty0</span> <span class="o">=</span> <span class="p">(</span><span class="n">y0</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">tx1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty1</span> <span class="o">=</span> <span class="p">(</span><span class="n">y1</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="n">tx0</span> <span class="o">&gt;</span> <span class="n">tx1</span> <span class="p">||</span> <span class="n">ty0</span> <span class="o">&gt;</span> <span class="n">ty1</span> <span class="p">{</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">ty</span> <span class="k">in</span> <span class="n">ty0</span><span class="o">..=</span><span class="n">ty1</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">row</span> <span class="o">=</span> <span class="p">(</span><span class="n">ty</span> <span class="o">*</span> <span class="n">num_tiles_x</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">tx</span> <span class="k">in</span> <span class="n">tx0</span><span class="o">..=</span><span class="n">tx1</span> <span class="p">{</span>
                <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">row</span> <span class="o">+</span> <span class="n">tx</span> <span class="k">as</span> <span class="nb">usize</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1">// Prefix sum: offsets[i] = start of bucket i.</span>
    <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">1</span><span class="o">..=</span><span class="n">num_tiles</span> <span class="p">{</span>
        <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
    <span class="p">}</span>
    <span class="k">let</span> <span class="n">total</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">num_tiles</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>

    <span class="n">bins</span><span class="py">.splat_indices</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">bins</span><span class="py">.splat_indices</span><span class="nf">.resize</span><span class="p">(</span><span class="n">total</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

    <span class="c1">// Cursor starts at each bucket's begin and advances as we scatter.</span>
    <span class="n">bins</span><span class="py">.cursor</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="n">bins</span><span class="py">.cursor</span><span class="nf">.extend_from_slice</span><span class="p">(</span><span class="o">&amp;</span><span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="o">..</span><span class="n">num_tiles</span><span class="p">]);</span>

    <span class="c1">// Pass 2: scatter splat indices into their tile buckets.</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span> <span class="k">in</span> <span class="n">projected</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span>
        <span class="k">let</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">tx0</span> <span class="o">=</span> <span class="p">(</span><span class="n">x0</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty0</span> <span class="o">=</span> <span class="p">(</span><span class="n">y0</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.max</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">tx1</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span> <span class="o">/</span> <span class="n">TILE_W</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">let</span> <span class="n">ty1</span> <span class="o">=</span> <span class="p">(</span><span class="n">y1</span> <span class="o">/</span> <span class="n">TILE_H</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">num_tiles_y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
        <span class="k">if</span> <span class="n">tx0</span> <span class="o">&gt;</span> <span class="n">tx1</span> <span class="p">||</span> <span class="n">ty0</span> <span class="o">&gt;</span> <span class="n">ty1</span> <span class="p">{</span>
            <span class="k">continue</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">for</span> <span class="n">ty</span> <span class="k">in</span> <span class="n">ty0</span><span class="o">..=</span><span class="n">ty1</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">row</span> <span class="o">=</span> <span class="p">(</span><span class="n">ty</span> <span class="o">*</span> <span class="n">num_tiles_x</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">for</span> <span class="n">tx</span> <span class="k">in</span> <span class="n">tx0</span><span class="o">..=</span><span class="n">tx1</span> <span class="p">{</span>
                <span class="k">let</span> <span class="n">tile</span> <span class="o">=</span> <span class="n">row</span> <span class="o">+</span> <span class="n">tx</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
                <span class="k">let</span> <span class="n">pos</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.cursor</span><span class="p">[</span><span class="n">tile</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
                <span class="n">bins</span><span class="py">.splat_indices</span><span class="p">[</span><span class="n">pos</span><span class="p">]</span> <span class="o">=</span> <span class="n">idx</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">;</span>
                <span class="n">bins</span><span class="py">.cursor</span><span class="p">[</span><span class="n">tile</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>let me show what the data structure looks like with a small example. say we have 4 splats (A, B, C, D); after depth sorting we get something like this:</p>

<p><img src="/assets/images/posts/Pasted image 20260412233641.png" alt="Pasted image 20260412233641.png" /></p>
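<p>to make the index layout concrete, here is a minimal standalone sketch of the count–prefix-sum–scatter pipeline on made-up data (4 depth-sorted splats over 2 tiles); the arrays mirror <code class="language-plaintext highlighter-rouge">offsets</code> and <code class="language-plaintext highlighter-rouge">splat_indices</code> above:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// CSR-style tile index: offsets[t]..offsets[t+1] is tile t's bucket in
// splat_indices. Splat/tile overlaps here are made up for illustration.
fn main() {
    // 4 depth-sorted splats; for each, the tiles its bbox touches.
    let touches: [&amp;[usize]; 4] = [&amp;[0], &amp;[0, 1], &amp;[1], &amp;[0]];
    let num_tiles = 2;

    // Pass 1: count touches per tile (shifted by one for the prefix sum).
    let mut offsets = vec![0u32; num_tiles + 1];
    for tiles in &amp;touches {
        for &amp;t in *tiles {
            offsets[t + 1] += 1;
        }
    }
    for i in 1..=num_tiles {
        offsets[i] += offsets[i - 1];
    }

    // Pass 2: scatter splat indices using a per-tile cursor.
    let mut cursor: Vec&lt;u32&gt; = offsets[..num_tiles].to_vec();
    let mut splat_indices = vec![0u32; offsets[num_tiles] as usize];
    for (idx, tiles) in touches.iter().enumerate() {
        for &amp;t in *tiles {
            splat_indices[cursor[t] as usize] = idx as u32;
            cursor[t] += 1;
        }
    }

    // tile 0 holds splats {0, 1, 3}, tile 1 holds {1, 2}; each bucket
    // comes out depth-sorted because we scattered in sorted order.
    assert_eq!(offsets, vec![0, 3, 5]);
    assert_eq!(splat_indices, vec![0, 1, 3, 1, 2]);
}
</code></pre></div></div>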

<h2 id="step-5-alpha-compositing">step 5: alpha compositing</h2>

<p>now for each tile, we iterate its splats front-to-back, and for each pixel in the splat’s bounding box (clipped to the tile), we evaluate the 2D Gaussian and blend the color.</p>

<p>for a pixel at position $(p_x, p_y)$ and a splat centered at $(s_x, s_y)$, the displacement is $\mathbf{d} = (p_x - s_x, p_y - s_y)$. the Gaussian exponent is:</p>

\[\text{power} = -\tfrac{1}{2} \cdot \mathbf{d}^T \cdot \Sigma^{-1} \cdot \mathbf{d}\]

<p>expanding for a 2×2 symmetric inverse covariance with entries $(a, b; b, d)$:</p>

\[\text{power} = -\tfrac{1}{2} \cdot (a \cdot dx^2 + 2b \cdot dx \cdot dy + d \cdot dy^2)\]

<p>the alpha for this splat at this pixel is:</p>

\[\alpha = \text{opacity} \cdot \exp(\text{power})\]

<p>we now composite front-to-back. the framebuffer stores <code class="language-plaintext highlighter-rouge">(rgb_accum, alpha_accum)</code> per pixel, both starting at zero. for each splat, the contribution is:</p>

\[T = 1 - \alpha_{accum}\]

\[\text{contrib} = T \cdot \alpha\]

\[\text{rgb}_{accum} \mathrel{+}= \text{contrib} \cdot \text{color}\]

\[\alpha_{accum} \mathrel{+}= \text{contrib}\]

<p>here $T$ is the transmittance, i.e. how much light can still pass through after the splats already composited. once <code class="language-plaintext highlighter-rouge">alpha_accum</code> reaches the saturation threshold (0.999), the pixel is considered opaque and we skip further splats on it.</p>
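<p>the update rules above can be checked with a tiny standalone sketch (made-up alphas and colors, not the renderer’s real types):</p>

```rust
// Front-to-back compositing of (alpha, rgb) pairs, already depth sorted.
fn composite(splats: &[(f32, [f32; 3])]) -> ([f32; 3], f32) {
    let (mut rgb, mut alpha_accum) = ([0.0f32; 3], 0.0f32);
    for &(alpha, color) in splats {
        let t = 1.0 - alpha_accum; // transmittance: light still passing through
        let contrib = t * alpha;
        for c in 0..3 {
            rgb[c] += contrib * color[c];
        }
        alpha_accum += contrib;
        if alpha_accum >= 0.999 {
            break; // pixel saturated, skip the rest
        }
    }
    (rgb, alpha_accum)
}

fn main() {
    // a red splat in front of a blue one, both alpha 0.5:
    // the front one contributes 0.5, the back one (1 - 0.5) * 0.5 = 0.25
    let (rgb, a) = composite(&[(0.5, [1.0, 0.0, 0.0]), (0.5, [0.0, 0.0, 1.0])]);
    assert!((a - 0.75).abs() < 1e-6);
    assert!((rgb[0] - 0.5).abs() < 1e-6 && (rgb[2] - 0.25).abs() < 1e-6);
    println!("rgb={rgb:?} alpha={a}");
}
```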

<p>the inner pixel loop is the hottest code in the entire rasterizer. to make it as fast as possible, we hoist row-constant terms outside the inner (column) loop. for a fixed row $p_y$, $d_y = p_y - s_y$ is constant, so we precompute every row-dependent term up front and the inner loop reduces to evaluating a simple quadratic in $dx$ with all coefficients precomputed:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">composite_splat</span><span class="p">(</span><span class="n">p</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">Projected</span><span class="p">,</span> <span class="n">fb</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[(</span><span class="n">Vec3</span><span class="p">,</span> <span class="nb">f32</span><span class="p">)],</span> <span class="n">w</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">params</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">RenderParams</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>

    <span class="c1">// Extract the inverse covariance matrix elements.</span>
    <span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="n">p</span><span class="py">.cov2d_inv.x_axis.x</span><span class="p">;</span> <span class="c1">// (0,0)</span>
    <span class="k">let</span> <span class="n">b</span> <span class="o">=</span> <span class="n">p</span><span class="py">.cov2d_inv.x_axis.y</span><span class="p">;</span> <span class="c1">// (0,1) = (1,0) since symmetric</span>
    <span class="k">let</span> <span class="n">d</span> <span class="o">=</span> <span class="n">p</span><span class="py">.cov2d_inv.y_axis.y</span><span class="p">;</span> <span class="c1">// (1,1)</span>

    <span class="c1">// Coefficients for the decomposed quadratic.</span>
    <span class="k">let</span> <span class="n">dx_coeff</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">a</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">dy_coeff</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">d</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">cross_coeff</span> <span class="o">=</span> <span class="o">-</span><span class="n">b</span><span class="p">;</span> <span class="c1">// -0.5 * 2 * b</span>

    <span class="k">let</span> <span class="n">saturation</span> <span class="o">=</span> <span class="n">params</span><span class="py">.saturation</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">alpha_threshold</span> <span class="o">=</span> <span class="n">params</span><span class="py">.alpha_threshold</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">opacity</span> <span class="o">=</span> <span class="n">p</span><span class="py">.opacity</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">color</span> <span class="o">=</span> <span class="n">p</span><span class="py">.color</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">sx</span> <span class="o">=</span> <span class="n">p</span><span class="py">.screen.x</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">sy</span> <span class="o">=</span> <span class="n">p</span><span class="py">.screen.y</span><span class="p">;</span>

    <span class="k">for</span> <span class="n">py</span> <span class="k">in</span> <span class="n">y0</span><span class="o">..=</span><span class="n">y1</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">py</span> <span class="k">as</span> <span class="nb">f32</span> <span class="o">-</span> <span class="n">sy</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">row_base</span> <span class="o">=</span> <span class="n">dy_coeff</span> <span class="o">*</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">row_slope</span> <span class="o">=</span> <span class="n">cross_coeff</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
        <span class="k">let</span> <span class="n">row_offset</span> <span class="o">=</span> <span class="n">py</span> <span class="k">as</span> <span class="nb">usize</span> <span class="o">*</span> <span class="n">w</span><span class="p">;</span>

        <span class="k">for</span> <span class="n">px</span> <span class="k">in</span> <span class="n">x0</span><span class="o">..=</span><span class="n">x1</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">row_offset</span> <span class="o">+</span> <span class="n">px</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">cell</span> <span class="o">=</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">fb</span><span class="p">[</span><span class="n">idx</span><span class="p">];</span>
            <span class="k">if</span> <span class="n">cell</span><span class="na">.1</span> <span class="o">&gt;=</span> <span class="n">saturation</span> <span class="p">{</span>
                <span class="k">continue</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">dx</span> <span class="o">=</span> <span class="n">px</span> <span class="k">as</span> <span class="nb">f32</span> <span class="o">-</span> <span class="n">sx</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">power</span> <span class="o">=</span> <span class="n">dx_coeff</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">row_slope</span> <span class="o">*</span> <span class="n">dx</span> <span class="o">+</span> <span class="n">row_base</span><span class="p">;</span>
            <span class="k">if</span> <span class="n">power</span> <span class="o">&gt;</span> <span class="mf">0.0</span> <span class="p">{</span>
                <span class="k">continue</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">alpha</span> <span class="o">=</span> <span class="p">(</span><span class="n">opacity</span> <span class="o">*</span> <span class="nf">fast_exp</span><span class="p">(</span><span class="n">power</span><span class="p">))</span><span class="nf">.min</span><span class="p">(</span><span class="mf">0.999</span><span class="p">);</span>
            <span class="k">if</span> <span class="n">alpha</span> <span class="o">&lt;</span> <span class="n">alpha_threshold</span> <span class="p">{</span>
                <span class="k">continue</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">t</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">cell</span><span class="na">.1</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">contrib</span> <span class="o">=</span> <span class="n">t</span> <span class="o">*</span> <span class="n">alpha</span><span class="p">;</span>
            <span class="n">cell</span><span class="na">.0</span> <span class="o">+=</span> <span class="n">contrib</span> <span class="o">*</span> <span class="n">color</span><span class="p">;</span>
            <span class="n">cell</span><span class="na">.1</span> <span class="o">+=</span> <span class="n">contrib</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="tiled-parallel-compositing">tiled parallel compositing</h3>

<p>now because each tile owns a disjoint set of pixels, we can composite all tiles in parallel with zero synchronization, no atomics and no locks. this is very effective on the CPU.</p>

<p>we use rayon’s <code class="language-plaintext highlighter-rouge">into_par_iter</code> over tile indices, and raw-pointer writes to give each tile direct access to its pixel rectangle without violating Rust’s <code class="language-plaintext highlighter-rouge">&amp;mut</code> aliasing rules:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="nf">composite_tiled</span><span class="p">(</span>
    <span class="n">projected</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Projected</span><span class="p">],</span>
    <span class="n">bins</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">TileBins</span><span class="p">,</span>
    <span class="n">fb</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[(</span><span class="n">Vec3</span><span class="p">,</span> <span class="nb">f32</span><span class="p">)],</span>
    <span class="n">width</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">height</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">params</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">RenderParams</span><span class="p">,</span>
    <span class="n">pool</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="nn">rayon</span><span class="p">::</span><span class="n">ThreadPool</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">w</span> <span class="o">=</span> <span class="n">width</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">h_i</span> <span class="o">=</span> <span class="n">height</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">w_i</span> <span class="o">=</span> <span class="n">width</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles_x</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.num_tiles_x</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">num_tiles</span> <span class="o">=</span> <span class="n">bins</span><span class="nf">.num_tiles</span><span class="p">();</span>
    <span class="k">let</span> <span class="n">fb_ptr</span> <span class="o">=</span> <span class="nf">FbPtr</span><span class="p">(</span><span class="n">fb</span><span class="nf">.as_mut_ptr</span><span class="p">());</span>

    <span class="k">let</span> <span class="n">do_composite</span> <span class="o">=</span> <span class="p">||</span> <span class="p">{</span>
        <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="n">num_tiles</span><span class="p">)</span><span class="nf">.into_par_iter</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(</span><span class="k">move</span> <span class="p">|</span><span class="n">tile_idx</span><span class="p">|</span> <span class="p">{</span>
            <span class="k">let</span> <span class="n">fbp</span> <span class="o">=</span> <span class="n">fb_ptr</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">tile_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">tile_idx</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">%</span> <span class="n">num_tiles_x</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">tile_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">tile_idx</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">)</span> <span class="o">/</span> <span class="n">num_tiles_x</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">px0</span> <span class="o">=</span> <span class="n">tile_x</span> <span class="o">*</span> <span class="n">TILE_W</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">py0</span> <span class="o">=</span> <span class="n">tile_y</span> <span class="o">*</span> <span class="n">TILE_H</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">px1</span> <span class="o">=</span> <span class="p">(</span><span class="n">px0</span> <span class="o">+</span> <span class="n">TILE_W</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">w_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
            <span class="k">let</span> <span class="n">py1</span> <span class="o">=</span> <span class="p">(</span><span class="n">py0</span> <span class="o">+</span> <span class="n">TILE_H</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span><span class="nf">.min</span><span class="p">(</span><span class="n">h_i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
            <span class="k">if</span> <span class="n">px0</span> <span class="o">&gt;</span> <span class="n">px1</span> <span class="p">||</span> <span class="n">py0</span> <span class="o">&gt;</span> <span class="n">py1</span> <span class="p">{</span>
                <span class="k">return</span><span class="p">;</span>
            <span class="p">}</span>

            <span class="k">let</span> <span class="n">start</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">tile_idx</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">let</span> <span class="n">end</span> <span class="o">=</span> <span class="n">bins</span><span class="py">.offsets</span><span class="p">[</span><span class="n">tile_idx</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">;</span>
            <span class="k">if</span> <span class="n">start</span> <span class="o">==</span> <span class="n">end</span> <span class="p">{</span>
                <span class="k">return</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="k">let</span> <span class="n">splat_ids</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">bins</span><span class="py">.splat_indices</span><span class="p">[</span><span class="n">start</span><span class="o">..</span><span class="n">end</span><span class="p">];</span>

            <span class="k">for</span> <span class="o">&amp;</span><span class="n">sid</span> <span class="k">in</span> <span class="n">splat_ids</span> <span class="p">{</span>
                <span class="k">let</span> <span class="n">p</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="n">projected</span><span class="nf">.get_unchecked</span><span class="p">(</span><span class="n">sid</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">};</span>
                <span class="k">let</span> <span class="p">[</span><span class="n">bx0</span><span class="p">,</span> <span class="n">by0</span><span class="p">,</span> <span class="n">bx1</span><span class="p">,</span> <span class="n">by1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span><span class="py">.bbox</span><span class="p">;</span>
                <span class="k">let</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">bx0</span><span class="nf">.max</span><span class="p">(</span><span class="n">px0</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">by0</span><span class="nf">.max</span><span class="p">(</span><span class="n">py0</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">x1</span> <span class="o">=</span> <span class="n">bx1</span><span class="nf">.min</span><span class="p">(</span><span class="n">px1</span><span class="p">);</span>
                <span class="k">let</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">by1</span><span class="nf">.min</span><span class="p">(</span><span class="n">py1</span><span class="p">);</span>
                <span class="k">if</span> <span class="n">x0</span> <span class="o">&gt;</span> <span class="n">x1</span> <span class="p">||</span> <span class="n">y0</span> <span class="o">&gt;</span> <span class="n">y1</span> <span class="p">{</span>
                    <span class="k">continue</span><span class="p">;</span>
                <span class="p">}</span>
                <span class="k">unsafe</span> <span class="p">{</span>
                    <span class="nf">composite_splat_region</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">fbp</span><span class="na">.0</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">x0</span><span class="p">,</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">params</span><span class="p">);</span>
                <span class="p">}</span>
            <span class="p">}</span>
        <span class="p">});</span>
    <span class="p">};</span>

    <span class="k">match</span> <span class="n">pool</span><span class="nf">.as_ref</span><span class="p">()</span> <span class="p">{</span>
        <span class="nf">Some</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="n">p</span><span class="nf">.install</span><span class="p">(</span><span class="n">do_composite</span><span class="p">),</span>
        <span class="nb">None</span> <span class="k">=&gt;</span> <span class="nf">do_composite</span><span class="p">(),</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<hr />
<p><strong>Note: the <code class="language-plaintext highlighter-rouge">FbPtr</code> trick</strong></p>

<p>Rust’s borrow checker does not let multiple threads hold <code class="language-plaintext highlighter-rouge">&amp;mut</code> references to the same buffer. but we know that each tile writes to a disjoint pixel rectangle; the geometry guarantees it. so we wrap the raw pointer in a <code class="language-plaintext highlighter-rouge">Send + Sync</code> newtype and give each tile direct access:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">FbPtr</span><span class="p">(</span><span class="o">*</span><span class="k">mut</span> <span class="p">(</span><span class="n">Vec3</span><span class="p">,</span> <span class="nb">f32</span><span class="p">));</span>
<span class="k">unsafe</span> <span class="k">impl</span> <span class="nb">Send</span> <span class="k">for</span> <span class="n">FbPtr</span> <span class="p">{}</span>
<span class="k">unsafe</span> <span class="k">impl</span> <span class="nb">Sync</span> <span class="k">for</span> <span class="n">FbPtr</span> <span class="p">{}</span>
</code></pre></div></div>

<p>this is the one place in the rasterizer where we need <code class="language-plaintext highlighter-rouge">unsafe</code>. the safety argument is spatial: tile <code class="language-plaintext highlighter-rouge">(tx, ty)</code> only writes to pixels <code class="language-plaintext highlighter-rouge">[tx*16..(tx+1)*16, ty*16..(ty+1)*16]</code>, and no two tiles share the same <code class="language-plaintext highlighter-rouge">(tx, ty)</code>.</p>

<hr />

<h2 id="optimizations">optimizations</h2>

<p>Here are some additional optimizations that make the rasterizer fast enough for realtime use on the CPU.</p>

<h3 id="fast-approximate-exp">fast approximate exp</h3>

<p>the <code class="language-plaintext highlighter-rouge">exp()</code> function in the inner loop is by far the most expensive operation and dominates everything else. if we don’t need full libm precision, the Schraudolph trick replaces it by reinterpreting a scaled float as an IEEE 754 bit pattern:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">fast_exp</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">f32</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">f32</span> <span class="p">{</span>
    <span class="c1">// clamp: anything below about -87 underflows f32 anyway</span>
    <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="nf">.max</span><span class="p">(</span><span class="o">-</span><span class="mf">87.0</span><span class="p">);</span>
    <span class="c1">// 12102203 ~= 2^23 / ln(2), 1065353216 = 127 &lt;&lt; 23 (the exponent bias)</span>
    <span class="k">let</span> <span class="n">v</span> <span class="o">=</span> <span class="p">(</span><span class="mf">12102203.0f32</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="mf">1065353216.0</span><span class="p">)</span> <span class="k">as</span> <span class="nb">i32</span><span class="p">;</span>
    <span class="nn">f32</span><span class="p">::</span><span class="nf">from_bits</span><span class="p">(</span><span class="n">v</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
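<p>a quick standalone sanity check of these constants against libm’s <code class="language-plaintext highlighter-rouge">exp</code> (with this bias constant the worst-case relative error works out to about +6%, which is plenty for a Gaussian falloff feeding an 8-bit framebuffer):</p>

```rust
// Standalone copy of fast_exp, compared against the exact std exp.
fn fast_exp(x: f32) -> f32 {
    let x = x.max(-87.0);
    // 12102203 ~= 2^23 / ln(2), 1065353216 = 127 << 23 (the exponent bias)
    let v = (12102203.0f32 * x + 1065353216.0) as i32;
    f32::from_bits(v as u32)
}

fn main() {
    // exact at x = 0 by construction
    assert!((fast_exp(0.0) - 1.0).abs() < 1e-6);
    // relative error stays under ~6-7% across the range the rasterizer uses
    for &x in &[-0.1f32, -1.0, -3.0, -8.0] {
        let (approx, exact) = (fast_exp(x), x.exp());
        let rel = ((approx - exact) / exact).abs();
        assert!(rel < 0.07, "x={x}: relative error {rel}");
    }
    println!("fast_exp(-1.0) = {} vs exact {}", fast_exp(-1.0), (-1.0f32).exp());
}
```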

<h3 id="row-level-early-out">row-level early-out</h3>

<p>in the tiled compositor, before entering the inner pixel loop for a row, we check whether the peak of the Gaussian along that row is below the visibility threshold. the Gaussian exponent along a row is a concave quadratic in dx:</p>

\[\text{power}(dx) = \text{dx\_coeff} \cdot dx^2 + \text{row\_slope} \cdot dx + \text{row\_base}\]

<p>the peak of this quadratic (at the vertex) is:</p>

\[\text{row\_peak} = \text{row\_base} + \frac{\text{row\_slope}^2}{2 \cdot a}\]

<p>if even the peak is below <code class="language-plaintext highlighter-rouge">ln(alpha_threshold / opacity)</code>, the entire row is invisible and we skip it:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">row_peak_cutoff</span> <span class="o">=</span> <span class="p">(</span><span class="n">alpha_threshold</span> <span class="o">/</span> <span class="n">opacity</span><span class="p">)</span><span class="nf">.ln</span><span class="p">();</span>
<span class="k">let</span> <span class="n">inv_a</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">a</span><span class="p">;</span>

<span class="k">for</span> <span class="n">py</span> <span class="k">in</span> <span class="n">y0</span><span class="o">..=</span><span class="n">y1</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">dy</span> <span class="o">=</span> <span class="n">py</span> <span class="k">as</span> <span class="nb">f32</span> <span class="o">-</span> <span class="n">sy</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">row_base</span> <span class="o">=</span> <span class="n">dy_coeff</span> <span class="o">*</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">row_slope</span> <span class="o">=</span> <span class="n">cross_coeff</span> <span class="o">*</span> <span class="n">dy</span><span class="p">;</span>
    <span class="k">let</span> <span class="n">row_peak</span> <span class="o">=</span> <span class="n">row_base</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">row_slope</span> <span class="o">*</span> <span class="n">row_slope</span> <span class="o">*</span> <span class="n">inv_a</span><span class="p">;</span>
    <span class="k">if</span> <span class="n">row_peak</span> <span class="o">&lt;</span> <span class="n">row_peak_cutoff</span> <span class="p">{</span>
        <span class="k">continue</span><span class="p">;</span>    <span class="c1">// entire row is below threshold, skip it</span>
    <span class="p">}</span>
    <span class="c1">// ... inner pixel loop ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>a subtlety: the peak is at the vertex of the quadratic, not at $dx = 0$. using <code class="language-plaintext highlighter-rouge">row_base</code> alone (which is the value at $dx = 0$) would miss peaks that sit off-center and could incorrectly skip visible rows for elongated, tilted Gaussians.</p>
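<p>this is easy to verify numerically; a standalone sketch with made-up inverse-covariance entries that brute-forces the row maximum and checks it never exceeds the vertex formula:</p>

```rust
// Peak of power(dx) = dx_coeff*dx^2 + row_slope*dx + row_base along one row,
// using the vertex formula row_base + row_slope^2 / (2a).
fn row_peak(a: f32, b: f32, d: f32, dy: f32) -> f32 {
    let row_base = -0.5 * d * dy * dy;
    let row_slope = -b * dy;
    row_base + 0.5 * row_slope * row_slope / a
}

fn main() {
    // made-up entries for an elongated, tilted Gaussian (a*d - b*b > 0)
    let (a, b, d) = (0.30f32, 0.25, 0.40);
    let dy = 3.0f32; // some row offset from the splat center
    let peak = row_peak(a, b, d, dy);

    // brute-force the quadratic along the row; it must never exceed the peak
    let (dx_coeff, row_slope, row_base) = (-0.5 * a, -b * dy, -0.5 * d * dy * dy);
    let mut max_power = f32::NEG_INFINITY;
    for i in -160..=160 {
        let dx = i as f32 * 0.1;
        max_power = max_power.max(dx_coeff * dx * dx + row_slope * dx + row_base);
    }
    assert!(max_power <= peak + 1e-5);
    // row_base (the value at dx = 0) underestimates the true peak when b != 0
    assert!(peak > row_base);
    println!("row_base={row_base} peak={peak} brute_force_max={max_power}");
}
```

<p>with these entries the vertex sits at $dx = -2.5$, well off-center, so the <code class="language-plaintext highlighter-rouge">row_base</code>-only check would wrongly skip a visible row.</p>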

<h3 id="scratch-buffer-reuse">scratch buffer reuse</h3>

<p>the radix sort needs an auxiliary buffer the same size as the input. allocating this every frame would be wasteful. instead we allocate it once and reuse it across frames via a <code class="language-plaintext highlighter-rouge">ScratchBuffers</code> struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">ScratchBuffers</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">sort_aux</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Projected</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">tiles</span><span class="p">:</span> <span class="n">TileBins</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
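<p>the reuse leans on the fact that <code class="language-plaintext highlighter-rouge">Vec::clear</code> drops the length but keeps the capacity; a tiny standalone check of that behavior (hypothetical frame loop, not the renderer’s types):</p>

```rust
// Simulate a frame loop that clears and refills a scratch Vec, recording
// its capacity after every frame.
fn warm_capacities(frames: usize, n: usize) -> Vec<usize> {
    let mut sort_aux: Vec<u64> = Vec::new();
    let mut caps = Vec::new();
    for _ in 0..frames {
        sort_aux.clear();      // len -> 0, capacity untouched
        sort_aux.resize(n, 0); // allocates only on the first frame
        caps.push(sort_aux.capacity());
    }
    caps
}

fn main() {
    let caps = warm_capacities(3, 1000);
    assert!(caps[0] >= 1000);
    // capacity never changes after frame 0: zero per-frame heap allocation
    assert_eq!(caps[0], caps[1]);
    assert_eq!(caps[1], caps[2]);
    println!("{caps:?}");
}
```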

<p>the <code class="language-plaintext highlighter-rouge">sort_aux</code> vec grows to fit the first frame and never shrinks; this enables us to do zero per-frame heap allocation once the system has warmed up.</p>]]></content><author><name>Darshan Makwana</name></author><category term="splatting" /><category term="rendering" /><category term="3d" /><category term="reconstruction" /><summary type="html"><![CDATA[Gaussian Splatting is a fascinating scene reconstruction technique introduced by INRIA and last year I had a lot of fun tinkering with it while on my semex. I recently discovered some of my notes related to it and decided to digitize it this weekend, along the way I reimplemented the forward rasterization pass in rust and decided it would be fun to write a tutorial explaining gaussian splatting to everyone, so here it is what is a gaussian splat?]]></summary></entry><entry><title type="html">What Happens When You SFT a Human on an LLM</title><link href="https://darshanmakwana412.github.io/2026/04/sft-a-human-on-an-llm/" rel="alternate" type="text/html" title="What Happens When You SFT a Human on an LLM" /><published>2026-04-08T00:00:00+05:30</published><updated>2026-04-08T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/04/sft-a-human-on-an-llm</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/04/sft-a-human-on-an-llm/"><![CDATA[<p>We have spent the last few years training LLMs on human conversation data with the explicit goal of making them sound like us, but we have also been training ourselves on them in return. I have spent a lot of hours talking to claude over the past couple of months, my juniors grew up doing their assignments with gpt, and there are kids in undergrad right now whose first real intellectual sparring partner was a language model instead of another human. This is not a one way street. 
The gradients are running in both directions, they are doing SGD on us and we are doing something like SFT on them, and the effect is small enough per interaction that you don’t notice it until one day you start hearing “Yes absolutely I can help you with that” from friends and you realize you have never heard them say that phrase in your life. There is a <a href="https://news.ycombinator.com/item?id=47673541">USC study</a> that measures this drift in spoken youtube videos after chatgpt, words like delve and meticulous and underscore spiking in frequency in actual human speech, not copy pasted text but mouths, and the comments under the post are full of people confessing that they now say “you’re absolutely right” out loud in real conversations, which is maybe the most claude coded sentence a human being can utter and also the kind of thing you would only notice if someone else said it to you first</p>

<p>The usual response I get when I bring this up with friends is that tools have always shaped the people using them, the printing press did it, radio did it, standardized spelling did it, twitter did it, so what is the big deal about this one. And I think the big deal is that the loop is closed now and the loop is short. Previous tools did not update, the printing press in 1500 was the same printing press in 1600, twitter is roughly the same twitter it was five years ago. But claude 4.6 is not claude 3.5 and it will not be claude 5 in six months, and every retraining cycle grounds on an updated human interaction, and the 2026 internet is written by humans who have spent the last two years being subtly finetuned by the 2024 version of the model, which was itself trained on humans that had been talking to the 2022 version. The model distribution and the human distribution are idk sort of converging, and maybe at one point the reference you would use to even measure the drift, stops existing because nobody is untouched anymore. I don’t know if this converges into something stable or diverges into something strange but either way the concept of “what unassisted human thinking sounds like” quietly diminishes, and I don’t think we are going to notice the day that happens</p>

<p>I heard the other day that people have stopped open sourcing work that they would have happily published a few years ago, specifically because they know it will end up inside a training corpus and the edge they got from writing it will evaporate. The funny second order thing here is that the models got good in the first place because people were generous in public, and the models being good is now the exact reason people are not being generous in public anymore. The commons is being fed into a thing that makes continuing to contribute to the commons feel like a bad trade, and I don’t see any version of that where the commons ends up richer</p>

<p>The other thought which I came across while writing this post is a quiet shift in what humans are actually for in this whole arrangement. I met a second year undergrad a couple weeks ago building a startup, smart kid, and he walked me through his workflow with real pride. He maintains this enormous knowledge base about his company and the agent acts as the CEO in his framing, and he is the hands, the agent drafts the email and he clicks send, the agent preps the talking points and he goes and says them to the investor, the agent reads the contract and tells him where to sign, and I could not figure out in the moment how to tell him that from where I was sitting he had just described himself as a peripheral, an io device with legs, a face and a bank account rented out to something else’s cognition. And the same shape is showing up in people who route feedback and delegation and even performance reviews through a model, so your manager is not writing your review and your response will also not be written by you, and the two models are going to have a little conversation about your career while the two humans pretend to have opinions in the meeting afterwards. The core limiting factor of most industries used to be intelligence, and the story of the last century is basically the story of that factor getting commoditized in stages, first physical labour, then access to knowledge, and now the parsing of that knowledge itself, which is the thing we used to call being smart. What you cannot hand off yet is the legal identity, the bank account, the face on the zoom call, the body that walks into the room, and so that is quietly what humans are being optimized into in this new stack. Not the thinker anymore, the thing that makes the agent’s output count as a real action in the real world, maybe humans will become just an extended version of tools to interact with the world</p>

<p>What if all this becomes the norm and we all shrug and get along with it. In an earlier post I wrote about <a href="/2026/03/claude-code-chronicles/">asking claude to apply for my visa</a> and I framed it as a fun anecdote because it genuinely was, but there is a darker version of that same story and it is the one where my juniors are firing up a claude and asking it to apply to 400 jobs a night while recruiters on the other side are firing up their own claude to screen the same 400 applications, and the only humans in the loop are the ones who wrote the prompts on either end. I already see it happening, my juniors are not landing interviews and I do not think they are less capable than I was at their age, the signal to noise ratio of the application pool has just collapsed because the noise is free to generate and the signal costs the same it always did. I do not want to live in a world where two models negotiate the hiring decision in the middle while the humans on both sides wait around to be told the outcome, and I am not sure if we can get a vote on this</p>]]></content><author><name>Darshan Makwana</name></author><category term="ai" /><category term="llm" /><category term="philosophy" /><category term="society" /><summary type="html"><![CDATA[We have spent the last few years training LLMs on human conversation data with the explicit goal of making them sound like us, but we have also been training ourselves on them in return. I have spent a lot of hours talking to claude over the past couple of months, my juniors grew up doing their assignments with gpt, and there are kids in undergrad right now whose first real intellectual sparring partner was a language model instead of another human. This is not a one way street. 
The gradients are running in both directions, they are doing SGD on us and we are doing something like SFT on them, and the effect is small enough per interaction that you don’t notice it until one day you start hearing “Yes absolutely I can help you with that” from friends and you realize you have never heard them say that phrase in your life. There is a USC study that measures this drift in spoken youtube videos after chatgpt, words like delve and meticulous and underscore spiking in frequency in actual human speech, not copy pasted text but mouths, and the comments under the post are full of people confessing that they now say “you’re absolutely right” out loud in real conversations, which is maybe the most claude coded sentence a human being can utter and also the kind of thing you would only notice if someone else said it to you first]]></summary></entry><entry><title type="html">Claude Code Chronicles</title><link href="https://darshanmakwana412.github.io/2026/03/claude-code-chronicles/" rel="alternate" type="text/html" title="Claude Code Chronicles" /><published>2026-03-29T00:00:00+05:30</published><updated>2026-03-29T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/03/claude-code-chronicles</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/03/claude-code-chronicles/"><![CDATA[<p>A couple of months back I was facing a kernel panic “Unable to mount root fs” while rebooting my ubuntu in normal mode, I found a quick workaround via recovery mode by dropping to a root shell and patching some systemd files to boot with an older kernel version. This is how I had been restarting my computer since then. Today I noticed my brightness keys were somehow not functioning at all, I dropped into a root shell and asked claudecode to fix it, left for lunch and then came back an hour later to find all my terminals gone and my browser closed</p>

<p>I opened up a new terminal and ran <code class="language-plaintext highlighter-rouge">claude -c</code> and started reading what claude essentially did, so it fixed the brightness issue, but then it also noticed that I was in recovery mode, questioned why I was in recovery mode, ran some bash commands and later deduced that <code class="language-plaintext highlighter-rouge">initramfs</code> files were never generated and thus NVMe storage drivers couldn’t be loaded in normal mode so the kernel would always panic, it fixed the issue by generating the files and then rebooted my computer, the last command which was registered in my <code class="language-plaintext highlighter-rouge">~/.bash_history</code> was <code class="language-plaintext highlighter-rouge">reboot</code></p>

<hr />

<p>Asked claude code to apply for my visa application (inside a headed browser so I can monitor alongside). Used a <a href="https://github.com/vercel-labs/agent-browser">browser cli skill</a> for interacting with chromium which has all my profiles and session ids logged in. Requires credentials for the visa portal log in. Claude starts searching for credentials, opened my obsidian vault, grepped for files with visa in the name, there was indeed a file where I had stored my visa credentials. Took the credentials and then logged in as usual and started filling my application. I guess claude’s <a href="https://code.claude.com/docs/en/memory">automemory</a> feature remembered how I use obsidian and store my notes and probably figured I might have stored them somewhere for him to use</p>]]></content><author><name>Darshan Makwana</name></author><category term="claudecode" /><category term="ai" /><category term="productivity" /><category term="anthropic" /><summary type="html"><![CDATA[A couple of months back I was facing a kernel panic “Unable to mount root fs” while rebooting my ubuntu in normal mode, I found a quick workaround via recovery mode by dropping to a root shell and patching some systemd files to boot with an older kernel version. This is how I had been restarting my computer since then. 
Today I noticed my brightness keys were somehow not functioning at all, I dropped into a root shell and asked claudecode to fix it, left for lunch and then came back an hour later to find all my terminals gone and my browser closed]]></summary></entry><entry><title type="html">Quantization, Floating Points and TurboQuant</title><link href="https://darshanmakwana412.github.io/2026/03/quantization-float-points-turboquant/" rel="alternate" type="text/html" title="Quantization, Floating Points and TurboQuant" /><published>2026-03-28T00:00:00+05:30</published><updated>2026-03-28T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/03/quantization-float-points-turboquant</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/03/quantization-float-points-turboquant/"><![CDATA[<p>A lot of effort is spent to make LLM inference cheaper and more performant. <a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization">Quantization</a> is the standard way to do this, where we reduce a model’s size by representing its parameters with fewer bits so they take up less memory and move faster through the memory hierarchy. The progression from 32-bit -&gt; mixed precision -&gt; 16-bit -&gt; 8-bit -&gt; 4-bit formats has been one of the most impactful practical developments in LLM inference</p>
<h2 id="floating-point-formats">Floating Point Formats</h2>

<p>A <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic">floating point</a> number consists of a sign bit, $E$ exponent bits and $M$ mantissa bits. If $e$ is the value of the exponent bits (potentially <a href="https://en.wikipedia.org/wiki/Exponent_bias">biased</a>) and $m$ is the value of the mantissa bits, the represented value is</p>

\[f = \text{sign} \cdot 2^e \cdot \left(1 + \frac{m}{2^M}\right)\]

<p>The exponent determines the rough scale of the number and the mantissa determines the precise value within that scale. Standard <a href="https://en.wikipedia.org/wiki/Single-precision_floating-point_format">float32</a> uses $E = 8, M = 23$ for 32 bits total. This is the reference precision for most LLM training</p>

<p>For 16-bit inference it has become popular to use <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a> ($E = 8, M = 7$) over traditional <a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">float16</a> ($E = 5, M = 10$). The key reason is that bfloat16 preserves the same exponent range as float32, so quantizing from float32 to bfloat16 is straightforward, we can just truncate the mantissa. Having a wider dynamic range matters more than fine grained precision for ML workloads where gradients and activations can span several orders of magnitude.</p>

<h2 id="nvfp4-and-the-limits-of-scalar-quantization">NVFP4 and the Limits of Scalar Quantization</h2>

<p>Things get interesting when we go below 8 bits, with 4 bits we can only represent 16 distinct values. At that point it is not even obvious that anything useful can be preserved. <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">NVFP4</a> is NVIDIA’s answer to this, and it pushes scalar quantization to the extreme</p>

<p>NVFP4 is not really a standalone data type. It is a format for an entire <a href="https://docs.pytorch.org/docs/stable/tensors.html">tensor</a>. Each element is stored as a 4-bit float ($E = 2, M = 1$) but the tensor also carries an 8-bit scaling factor for every 16 elements and a single 32-bit scaling factor for the whole tensor. The per-block scale captures the local magnitude distribution and the per-tensor scale captures the global one. Together they compensate for the limited range of 4 raw bits.</p>

<p>The overhead works out to about 0.5 extra bits per element (the 8-bit scale amortized over 16 values), bringing effective storage to 4.5 bits per weight. NVIDIA built hardware support for this into the <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">Blackwell architecture</a> so the complexity is abstracted away from CUDA kernel developers.</p>
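<p>To make the per-block scaling idea concrete, here is a toy version of it (not NVIDIA’s actual implementation: the FP8 encoding of the block scales and the global FP32 scale are omitted, and the names are mine):</p>

```python
import numpy as np

# magnitudes representable by a 4-bit E2M1 float, the NVFP4 element format
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blocks(x, block=16):
    """Snap each 16-element block to scale * E2M1 grid, where scale is
    chosen so the block's max magnitude maps to the grid's max (6.0)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        amax = np.abs(chunk).max()
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        # nearest grid point for each scaled magnitude, sign restored after
        idx = np.abs(np.abs(chunk) / scale - FP4_GRID[:, None]).argmin(axis=0)
        out[i:i + block] = np.sign(chunk) * FP4_GRID[idx] * scale
    return out
```

<p>Values that already sit on the scaled grid survive exactly; everything else lands on the nearest of the 16 representable values in its block, and because each block carries its own scale a block of small weights is not crushed by a large weight elsewhere in the tensor.</p>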
<h2 id="turboquant">TurboQuant</h2>

<p>TurboQuant is a recently popular vector quantization algorithm. It takes a vector of numbers and produces a quantized version that uses less memory. At its core the idea is pretty simple, before quantizing, apply a random rotation in the n-dimensional vector space the vector lives in, and during dequantization apply the inverse rotation. This rotation is not learned, not input-dependent, not sampled from some special distribution. It is just a random orthogonal transformation. And it dramatically improves quantization quality</p>
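<p>The rotation trick is easy to sketch. This is a generic illustration of rotate → quantize → rotate back with a plain uniform scalar quantizer, not TurboQuant’s actual quantizer:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n):
    """Random orthogonal matrix: QR of a Gaussian matrix, with signs
    fixed so the result is uniformly distributed over rotations."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

def quantize_uniform(x, bits=4):
    """Plain uniform scalar quantizer over the vector's own range."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((x - lo) / step) * step

n = 64
R = random_rotation(n)
x = rng.standard_normal(n)
x_hat = R.T @ quantize_uniform(R @ x)   # rotate, quantize, rotate back
err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
```

<p>Because R is orthogonal it preserves norms and dot products exactly, so all of the reconstruction error comes from the scalar quantizer; the rotation’s job is to spread the vector’s energy evenly across coordinates so no single outlier coordinate dominates the quantization range.</p>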

<p>TurboQuant then uses a second correction step to eliminate bias in the attention block computation, I haven’t read about it entirely yet but they use a <a href="https://arxiv.org/abs/2504.19874">Quantized Johnson-Lindenstrauss</a> transform to preserve dot products accurately, and provide theoretical guarantees to support it</p>

<p><strong>References:</strong></p>

<ol>
  <li><a href="https://arxiv.org/abs/2504.19874">TurboQuant: Online Vector Quantization (Zandieh et al. 2025)</a></li>
  <li><a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">NVFP4 for Efficient Low-Precision Inference - NVIDIA</a></li>
</ol>]]></content><author><name>Darshan Makwana</name></author><category term="llm" /><category term="quantization" /><category term="ml" /><category term="inference" /><summary type="html"><![CDATA[A lot of effort is spent to make LLM inference cheaper and more performant. Quantization is the standard way to do this, where we reduce a model’s size by representing its parameters with fewer bits so they take up less memory and move faster through the memory hierarchy. The progression from 32-bit -&gt; mixed precision -&gt; 16-bit -&gt; 8-bit -&gt; 4-bit formats has been one of the most impactful practical developments in LLM inference Floating Point Formats]]></summary></entry><entry><title type="html">A System of Journaling</title><link href="https://darshanmakwana412.github.io/2026/03/a-system-of-journaling/" rel="alternate" type="text/html" title="A System of Journaling" /><published>2026-03-06T00:00:00+05:30</published><updated>2026-03-06T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/03/a-system-of-journaling</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/03/a-system-of-journaling/"><![CDATA[<p>One of the biggest hurdles I faced in consistently maintaining a blog like this site is having to manually copy paste my notes into my github.io directory as markdown files. This friction compounded over time and I would end up with a backlog of drafts that never made it to this site. So I decided to tinker around this a bit and create a more automated solution</p>

<h1 id="the-markdown-era">The Markdown Era</h1>

<p>Before any of this I used to maintain a single markdown file named <code class="language-plaintext highlighter-rouge">journal.md</code> to log everything like passwords, things I am currently working on, upcoming deadlines, calendar. The system was dead simple, I had several sections in the file like</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gu">## ToDo</span>
<span class="p">-</span> [ ] ...
<span class="p">-</span> [ ] ...

<span class="gu">## Deadlines</span>
<span class="p">-</span> ...
<span class="p">-</span> ...
  
<span class="gu">## some stuff</span>
<span class="p">-</span> ...
<span class="p">-</span> ...
</code></pre></div></div>

<p>Whenever I had to log something I used to create a section for it in the journal and add a couple of checkpoints and bullets around it for context, and when something became a priority it moved to the top of the #ToDo section. This kind of worked because I wasn’t logging a lot of things, but once I started journaling this system just wasn’t going to scale</p>

<h1 id="age-of-google-docs">Age of Google Docs</h1>

<p>So then to scale this I started maintaining a google doc named “Agency” and used it as a journal where I jotted down my daily thoughts, reflections, notes from the books I am reading, learnings, plans for new year resolutions and everything that usually happens around ones life</p>

<p><img src="/assets/images/posts/Pasted%20image%2020260306195036.png" alt="Pasted image 20260306195036.png" /></p>

<p>I used this method of journaling extensively in 2025 but it grew so enormous in size, over time reaching more than 80 tabs in a single google doc, that it was becoming hard for me to refactor and review everything. I broke it down into categories of months and added cross links between tabs so each month had a single tab and other pages were linked from a single index page. This worked for a while but when the number of tabs reached 150+ this system still broke down. I had to remember the title of the tab where I had stored some info which I wanted to revisit. The lack of a global search feature to search across tabs and across tab titles made it even worse</p>

<p>This was around the same time that I got my hands on <a href="https://cursor.com/">cursor</a> which is an AI assisted IDE. Cursor was the tool which exposed me to AI agents and their scaffolding around codebases. It was my first hand experience of feeling the <a href="https://www.anthropic.com/research/labor-market-impacts">impact of AI in the labour market</a>. This was also around the same time that I wanted to tinker with the idea of using these agents to traverse this mesh of thoughts that I have and render them transparent so <a href="https://docs.google.com/document/d/19-ajYTp2hwOW9WcirY9OIoSvMdfyivup8LMI92PkS20/edit?tab=t.0">I can ask them clarifying questions on them</a>. This meant going back to the <a href="2026-03-6-a-system-of-journaling#^878953">markdown era</a> but with the organizational system that I had already developed for myself around google docs</p>

<h1 id="enter-obsidian">Enter Obsidian</h1>

<p>I discovered <a href="https://obsidian.md/">obsidian</a> around the same time. It’s a markdown based editor that comes with a lot of functionality and plugins addressing exactly the reasons I was considering leaving google docs. It has builtin support for search across notes, the ability to cross link notes with <code class="language-plaintext highlighter-rouge">[wikilinks](wikilinks)</code>, templates, calendar support, workspace layouts and awesome themes. I have customized my obsidian to look like the dark version of <a href="https://www.lesswrong.com/">lesswrong</a> hehe</p>

<table>
  <thead>
    <tr>
      <th><img src="/assets/images/posts/Pasted%20image%2020260306202438.png" alt="Pasted image 20260306202438.png" /></th>
      <th><img src="/assets/images/posts/Pasted%20image%2020260306202538.png" alt="Pasted image 20260306202538.png" /></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td> </td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>It also comes with a graph viewer which renders your entire vault as an interconnected graph where connections naturally emerge when you cross link and cross reference things. I just started using obsidian a month ago and have been absolutely loving it</p>

<h1 id="publishing-system">Publishing System</h1>

<p>So now I had a great writing environment but the original problem remained, how do I get stuff from obsidian to my github pages site without manually copying things around? I needed a way to sync my vault content to a remote repository where it gets rendered as a static site</p>

<p>This site uses <a href="https://jekyllrb.com/">Jekyll</a> which is a static site generator that github pages natively supports. This site uses a custom css theme derived from <a href="https://github.com/jekyll/minima">minima</a>. Jekyll expects markdown files with YAML frontmatter in a specific directory structure (<code class="language-plaintext highlighter-rouge">_posts/</code> for blog posts, root for standalone pages) while Obsidian writes markdown with its own conventions: <code class="language-plaintext highlighter-rouge">[wikilinks](wikilinks)</code> for cross references, <code class="language-plaintext highlighter-rouge">![embeds](/assets/images/embeds)</code> for images, and <a href="https://github.com/blacksmithgu/obsidian-dataview">dataview</a> queries for dynamic content. So I also had to ensure that this conversion process is handled correctly</p>

<h1 id="trying-quartz-syncer">Trying Quartz Syncer</h1>

<p>My first attempt was <a href="https://github.com/jackyzha0/quartz">Quartz</a>, a batteries included static site generator that transforms markdown content into fully functional websites. Quartz already handles obsidian flavored markdown so the translation problem goes away. All I needed was a way to sync content from obsidian to the remote quartz repository</p>

<p>I tried <a href="https://github.com/saberzero1/quartz-syncer">quartz-syncer</a>, an obsidian plugin built exactly for this. I set everything up following the <a href="https://saberzero1.github.io/quartz-syncer-docs/Guides/GitHub-Setup">documentation</a>, added my github repo, and the connection showed successful
<img src="/assets/images/posts/Pasted%20image%2020260306212144.png" alt="Pasted image 20260306212144.png" />
I wrote a dummy page with <code class="language-plaintext highlighter-rouge">publish: true</code> in the frontmatter and it correctly showed up as unpublished in the publication center. When I hit publish it said publication successful. But there were no commits in my repo. Nothing actually happened
<img src="/assets/images/posts/Pasted%20image%2020260306212204.png" alt="Pasted image 20260306212204.png" />
I <a href="https://github.com/saberzero1/quartz-syncer/issues/110">opened an issue</a> about this, tried giving all repo access to my classic token thinking permission access could be the reason, but it still failed to create commits. I spent a good amount of time debugging this and tweaking settings but couldn’t get it to work</p>

<h1 id="writing-my-own-script">Writing My Own Script</h1>

<p>At that point I decided to just write it myself. The requirements were simple enough:</p>
<ol>
  <li>Crawl specific folders and files from my vault based on a config</li>
  <li>Convert obsidian markdown to jekyll compatible markdown</li>
  <li>Handle image embeds and wikilinks</li>
  <li>Push to the upstream repo</li>
</ol>

<p>This is the system design of the entire setup that I put together
<img src="/assets/images/posts/Pasted%20image%2020260306212258.png" alt="Pasted image 20260306212258.png" />
I put together a <a href="https://gist.github.com/darshanmakwana412/287883670407b5f8880d159c45ac6571">python script</a> that does exactly this. The setup is driven by a <code class="language-plaintext highlighter-rouge">publish.md</code> config file in the vault root where I specify the vault path, the site repo path and which paths to sync:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">site_repo</span><span class="pi">:</span> <span class="s">/path/to/darshanmakwana412.github.io</span>
<span class="na">obsidian_vault</span><span class="pi">:</span> <span class="s">/path/to/obsidian</span>
<span class="na">paths</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">posts/</span>
  <span class="pi">-</span> <span class="s">bookmarks.md</span>
  <span class="pi">-</span> <span class="s">birding.md</span>
  <span class="pi">-</span> <span class="s">Bookshelf.md</span>
<span class="nn">---</span>
</code></pre></div></div>

<p>Directories ending with <code class="language-plaintext highlighter-rouge">/</code> are synced as jekyll blog posts into <code class="language-plaintext highlighter-rouge">_posts/</code>, standalone files are synced as pages at the root. The script handles the obsidian to jekyll translation: image embeds are rewritten to standard markdown image links under <code class="language-plaintext highlighter-rouge">/assets/images/</code>, <code class="language-plaintext highlighter-rouge">[wikilinks](wikilinks)</code> become standard markdown links, and it copies over all the image assets to the right places</p>
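<p>The translation step is basically a couple of regex passes. Here is a simplified sketch of the idea; the target path layout and slug rule are illustrative assumptions, not the gist’s actual code:</p>

```python
import re

def obsidian_to_jekyll(text: str) -> str:
    """Minimal sketch of the obsidian -> jekyll markdown translation
    (hypothetical helper; the real rules live in the publish script)."""
    # ![[image.png]] -> ![image.png](/assets/images/posts/image.png)
    text = re.sub(
        r"!\[\[([^\]|]+)\]\]",
        lambda m: f"![{m.group(1)}](/assets/images/posts/"
                  f"{m.group(1).replace(' ', '%20')})",
        text,
    )
    # [[Note Title|label]] -> [label](/note-title/), [[Note Title]] likewise
    def link(m):
        target, _, label = m.group(1).partition("|")
        slug = target.strip().lower().replace(" ", "-")
        return f"[{label or target}](/{slug}/)"
    return re.sub(r"\[\[([^\]]+)\]\]", link, text)
```

<p>The image pass runs first so its output, which no longer contains double brackets, is not touched by the wikilink pass.</p>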

<p>Running <code class="language-plaintext highlighter-rouge">./publish.py sync</code> crawls the vault and syncs everything. <code class="language-plaintext highlighter-rouge">./publish.py push</code> commits and pushes to remote. <code class="language-plaintext highlighter-rouge">./publish.py publish</code> does both in sequence. There is also a <code class="language-plaintext highlighter-rouge">./publish.py watch</code> mode that uses <a href="https://github.com/gorakhargosh/watchdog">watchdog</a> to detect file changes in the vault and auto syncs, which is pretty nice when you are actively writing and want to see changes reflected quickly. So I usually just keep this script running in the background all the time</p>

<h1 id="some-glue-work">Some Glue Work</h1>

<p>The one thing that still needs work is <a href="https://github.com/blacksmithgu/obsidian-dataview">dataview</a> parsing. Dataview is an obsidian plugin that lets you query your vault like a database. I use it on my bookshelf page to render a table of books I am currently reading:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TABLE WITHOUT ID
	link(file.link, title) AS Book,
	author AS Author,
	"&lt;progress value='" + chapters_read + "' max='" + total_chapters + "'&gt;&lt;/progress&gt; " + chapters_read + "/" + total_chapters AS Progress,
	notes as Notes,
	embed(link(meta(cover).path)) AS Cover
FROM #book
WHERE status = "reading"
SORT file.mtime DESC
</code></pre></div></div>

<p><img src="/assets/images/posts/Pasted%20image%2020260306212555.png" alt="Pasted image 20260306212555.png" />
The script has a basic dataview resolver that parses <code class="language-plaintext highlighter-rouge">FROM #tag</code> and <code class="language-plaintext highlighter-rouge">WHERE</code> clauses and renders them as markdown tables. It works for simple queries but anything with computed columns like the progress bar or cover embeds needs more work. You can see how it currently renders at <a href="https://darshanmakwana412.github.io/bookshelf/">darshanmakwana412.github.io/bookshelf</a></p>

<p><img src="/assets/images/posts/Pasted%20image%2020260306202538.png" alt="Pasted image 20260306202538.png" /></p>]]></content><author><name>Darshan Makwana</name></author><category term="strategy" /><category term="philosophy" /><category term="obsidian" /><category term="life" /><summary type="html"><![CDATA[One of the biggest hurdles I faced in consistently maintaining a blog like this site is having to manually copy paste my notes into my github.io directory as markdown files. This friction compounded over time and I would end up with a backlog of drafts that never made it to this site. So I decided to tinker around this a bit and create a more automated solution]]></summary></entry><entry><title type="html">2 ways to bet on a Trillion Dollar Market</title><link href="https://darshanmakwana412.github.io/2026/02/two_ways_to_bet_on_a_trillion_dollar_market/" rel="alternate" type="text/html" title="2 ways to bet on a Trillion Dollar Market" /><published>2026-02-17T00:00:00+05:30</published><updated>2026-02-17T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/02/two_ways_to_bet_on_a_trillion_dollar_market</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/02/two_ways_to_bet_on_a_trillion_dollar_market/"><![CDATA[<p>I was listening to <a href="https://www.youtube.com/watch?v=n1E9IZfvGMA">Dario Amodei’s interview with dwarkesh patel</a> and found his insights into how anthropic plans their capex investments and path to profitability quite fascinating. They need to balance their risks into how much compute to build for the next 2 years in advance based on current demands because the data centers take 2 years to build. If they overestimate their demand then they won’t have enough profit in the next years and will go bankrupt while if they underestimate it they won’t be able to match the demand and will risk losing their customers to their competitors, this is what he calls their cone of uncertainty. 
This sentiment felt weird to me because openai seems to be aggressively bullish on their capex investments, in fact sam altman disclosed they will be <a href="https://tomtunguz.com/openai-hardware-spending-2025-2035/">spending $1 trillion on compute infra across microsoft, oracle, nvidia and coreweave between 2025 and 2035</a> while also <a href="https://openai.com/index/cerebras-partnership/">partnering with cerebras</a>, so why do these 2 AI companies have completely different capex investment strategies?</p>

<p>The business model of anthropic relies on building things that enterprises will pay for and using that to build a path to profitability. As dario said anthropic has 10X’d its revenue every year since 2023, but that can only continue for so long since GDP is finite, and once the majority of the value is captured growth will start showing diminishing returns. If growth slows down to 5X and not 10X and they purchased compute based on a 10X multiplier prediction they will go bankrupt. They are cautious with capex, focusing on only a few things and getting them right</p>

<p>Throughout our history whenever a technological revolution occurs it brings an enormous amount of value generation with it, which can be measured in terms of productivity, GDP increase, increase in standard of living, etc. But dario mentions that AI hasn’t fully diffused as its effects are yet to be seen in economic growth<sup><a href="#fn1" id="ref1">1</a></sup>. This also means that there is a lot of value that is untouched and yet to be generated, and openai seems to be in a strategy where they want to dominate the market and capture all this value. They seem to be like facebook in this sense, aggressively grow, capture the entire market and thus capture all the value that will be generated by the technology, and a path of profitability will emerge later. This is pretty much consistent with their investments across every possible field where AI is yet to emerge</p>
<ul>
  <li>They <a href="https://www.bloomberg.com/news/articles/2025-05-06/openai-reaches-agreement-to-buy-startup-windsurf-for-3-billion">wanted to acquire Windsurf for $3 billion</a> to own the developer IDE stack (they got out competed by google at the last minute)</li>
  <li>They <a href="https://openai.com/index/openai-to-acquire-neptune/">acquired Neptune</a> to strengthen their internal model training infrastructure</li>
  <li>They <a href="https://openai.com/sam-and-jony/">merged with io Products</a> bringing Jony Ive into the fold to build hardware products</li>
  <li>They <a href="https://openai.com/index/introducing-prism/">launched Prism</a>, a collaborative and AI assisted latex editor</li>
  <li>They <a href="https://openai.com/index/openai-for-healthcare/">introduced OpenAI for Healthcare</a></li>
  <li>They <a href="https://openai.com/index/gpt-5-lowers-protein-synthesis-cost/">used GPT 5 to autonomously design cell synthesis recipes</a></li>
  <li>They <a href="https://openai.com/index/chatgpt-shopping-research/">built shopping research into ChatGPT</a></li>
  <li>They <a href="https://openai.com/index/investing-in-merge-labs/">invested in building Human Brain Computer Interfaces</a></li>
  <li>and just this month they acqui-hired<sup><a href="#fn2" id="ref2">2</a></sup> <a href="https://techcrunch.com/2026/02/15/openclaw-creator-peter-steinberger-joins-openai/">Peter Steinberger and OpenClaw</a> to drive the next generation of personal agents</li>
  <li>and many other things which I might (definitely) have missed</li>
</ul>

<p>They are trying to get a hold of everything that AI will have an impact on</p>

<p>Anthropic is trying to survive by focusing on only a few bets, while openai needs at least some of their many bets to work out to eventually get the payouts. I don’t know which strategy will work out for either of them, but it would surely be fun to come back to this note after 5, 10, or 20 years and figure out how things eventually turned out for them. This brings me to another thing I found fascinating from the interview, which is the <a href="https://www.darioamodei.com/essay/the-adolescence-of-technology">“country of geniuses”</a>, which by dario’s predictions will take 2 years to arrive in the best scenario and 10 years in the worst case. So if the combined market is going to be so enormous, then I guess it does not matter which strategy you are using as long as you don’t die (by die I mean you go bankrupt); there will be enough value to capture for everyone</p>

<p><img src="/assets/images/posts/ai_summit_ai_summit.png" alt="ai_summit_ai_summit.png" /></p>

<hr />

<div class="footnotes">
<p id="fn1"><sup>1</sup> There is an argument to be made here as to why this diffusion has been slow; in fact it is quite fast compared to other technological revolutions, but it appears slow because the improvements in model capabilities are far exceeding what adoption can keep up with. If you were an enterprise and someone used to do workflow x or y, you realize the models can now do it, but someone still has to program them to do it, provide the relevant context around it, handle the edge cases where someone else was correlated with workflow x and y via z; you update the context and its harness and realize that the model is now smart enough to handle it on its own, but someone has to now test it, and yada yada the list goes on, and also don't forget by the time you did all this <a href="https://www.anthropic.com/news/claude-opus-4-6">Opus 4.6</a> dropped which can now one shot workflow x and y via z : ) <a href="#ref1">↩</a></p>

<p id="fn2"><sup>2</sup> Acqui-hire is a relatively new term that was extensively used in 2025 (<a href="https://groq.com/newsroom/groq-and-nvidia-enter-non-exclusive-inference-technology-licensing-agreement-to-accelerate-ai-inference-at-global-scale">nvidia acqui-hired groq</a>, <a href="https://www.forbes.com/sites/janakirammsv/2025/06/23/meta-invests-14-billion-in-scale-ai-to-strengthen-model-training/">Meta acqui-hired Scale AI</a>, <a href="https://www.cnbc.com/2025/07/14/cognition-to-buy-ai-startup-windsurf-days-after-google-poached-ceo.html">Google acqui-hired windsurf</a>). This happens when companies want to acquire a competitor but cannot go through the legal route because of antitrust lawsuits and the DOJ, so instead they hire away all the founders and key people (and the domain expertise) who built the underlying technology stack and replicate it in their business. The company is then left as a husk just to survive, with its employees getting unfair returns on their stock options; again quite fascinating to read, <a href="https://en.wikipedia.org/wiki/Acqui-hiring">https://en.wikipedia.org/wiki/Acqui-hiring</a> <a href="#ref2">↩</a></p>
</div>]]></content><author><name>Darshan Makwana</name></author><category term="ai" /><category term="business" /><category term="strategy" /><category term="anthropic" /><category term="openai" /><summary type="html"><![CDATA[I was listening to Dario Amodei’s interview with dwarkesh patel and found his insights into how anthropic plans their capex investments and path to profitability quite fascinating. They need to balance their risks into how much compute to build for the next 2 years in advance based on current demands because the data centers take 2 years to build. If they overestimate their demand then they won’t have enough profit in the next years and will go bankrupt while if they underestimate it they won’t be able to match the demand and will risk losing their customers to their competitors, this is what he calls their cone of uncertainty. This sentiment felt weird to me because openai seems to aggressively bullish on their capex investments, infact sam altman disclosed they will be spending $1 trillion on compute infra across microsoft, oracle, nvidia and coreweave between 2025 and 2035 while also partnerring with cerebras, so why do these 2 AI companies have completely different capex investment strategies?]]></summary></entry><entry><title type="html">It takes very high agency</title><link href="https://darshanmakwana412.github.io/2026/02/it-takes-high-agency/" rel="alternate" type="text/html" title="It takes very high agency" /><published>2026-02-08T00:00:00+05:30</published><updated>2026-02-08T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/02/it-takes-high-agency</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/02/it-takes-high-agency/"><![CDATA[<p>It takes very high agency</p>
<ul>
  <li>To realize when you are in the wrong place</li>
  <li>To realize the world is drifting away from the old rules and new rules are being written; the rules you learned no longer apply</li>
  <li>To realize the rewards tremendously outweigh the costs of consistently putting effort into updating your mental models</li>
  <li>To realize that people don’t act on incentives as much as they think</li>
  <li>To see the world with clarity and take it as it is and act on it</li>
  <li>To realize that games of money, power, and hierarchy are not worth playing, and to avoid people who are playing them</li>
  <li>To realize that conviction is a hell of a powerful drug; it makes you do things that you would never do otherwise</li>
</ul>]]></content><author><name>Darshan Makwana</name></author><category term="life" /><category term="philosophy" /><summary type="html"><![CDATA[It takes very high agency To realize when you are in the wrong place To realize the world is drifting apart from the old rules and new rules are being written, the old rules which you learned no longer apply To realize the rewards tremendously outweigh the costs of consistently putting efforts on updating your mental models To realize that people don’t act with incentives as much as they think To see the world with clarity and take it as it is and act on it Games of money, power, hierarchies are not worth playing and avoid people who are playing them Conviction is a hell of power drug, it makes you do things that you would never do otherwise]]></summary></entry><entry><title type="html">How to Forge Your Conviction</title><link href="https://darshanmakwana412.github.io/2026/01/how-to-forge-your-conviction/" rel="alternate" type="text/html" title="How to Forge Your Conviction" /><published>2026-01-25T00:00:00+05:30</published><updated>2026-01-25T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/01/how-to-forge-your-conviction</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/01/how-to-forge-your-conviction/"><![CDATA[<p>In the later years of my undergraduate studies I was lucky enough to have surrounded myself with mentors and peers who taught me not to chase after <a href="https://www.joanwestenberg.com/thin-desires-are-eating-your-life/">thin desires</a>, to <a href="https://armeet.bearblog.dev/becoming-the-machine/">not become the machine</a>, to cultivate confidence, invest time in forging relationships, invest in updating your <a href="https://fs.blog/mental-models/">mental models</a> and be <a href="https://en.wikipedia.org/wiki/Equanimity">equanimious</a>. 
By giving me frequent feedback they let me learn far more quickly than I could have starting out on my own. Not so surprisingly, an analogy of this exists in agentic coding tools like <a href="https://claude.com/product/claude-code">claude code</a>, <a href="https://opencode.ai/">opencode</a>, <a href="https://openai.com/codex/">codex</a>, etc., where agents get the execution traces of programs as context, allowing them to reflect on their actions and perform much better. When you do this enough times you build a world model of the people you interact with, either consciously or unconsciously; thought processes like “How would vignesh think through these trade offs?” or “How would nikhil approach this?” start going around in your head when you confront difficult choices in your life</p>

<p>A couple of years passed by and I invariably had to face the fact that I no longer had access to my mentors and had to rely on my own mental models; we had split ways and I chose the local optimum, becoming part of a team where I come across as a very technically opinionated engineer in absolute terms. My team’s primary tasks were cost optimization and codebase improvement, and I was assigned to improve our infrastructure for STT models. This meant I had to make a couple of high stakes decisions on my own, but because the stakes are higher it takes a long time to see results; all of these decisions require conviction: a belief that your ideas are worth spending the time and effort on. You need to be very confident before you actually commit to your actions. I had a lot of ideas to work on, but I no longer had access to the verifiers/reward models that I had in the past. I started spending a significant amount of time thinking through each of my ideas and actually implemented each of them. Some of them turned out to be failures, like the <a href="/2025/10/confidence-aware-router/">confidence aware router</a> that used an LLM’s internal activation signals to assess the model’s confidence in answering a query, but some of them worked surprisingly well, like building a better scheduler for STT queries based on expected completion time. The important thing that happened, consciously or unconsciously, is that I started building world models of the problems and not of my mentors. I believe this cycle, repeated across multiple iterations, is what develops conviction</p>

<p>I have to do a lot more thinking when making architectural decisions myself rather than relying on the more experienced engineers in my team; it turns out this is very effective for internalising the mistakes and learnings into my decision making process so that I never repeat them. Also unsurprisingly, another similar analogy exists when training LLMs: progress stagnates with SFT, where you train the model on full generations of human conversation, because imitation can only take you so far (oppenheimer!), and you have to do RL post training where you enable the model to learn which of its generations actually worked and which ones did not</p>

<p>Before the world can realize your ideas, you have to realize them in your subjective reality; this belief is not something that arrives as a eureka but has to be forged. You have to earn the right to your conviction. This kind of conviction can’t be learned from mentors, people or books, it has to be learned from one’s own mistakes. Now it turns out this is very hard to do. It’s very hard to actually sit down, face the chaos in your head and churn it into clarity and conviction you can act on. People by nature take the path of least resistance; don’t be like them. It is very easy to build things at the surface or have a surface level understanding, but the right infrastructure, mental models or investments compound in weird and interesting patterns that give rise to long term moats. It is by doing hard things that you go deeper and gain a massive advantage over others</p>]]></content><author><name>Darshan Makwana</name></author><category term="conviction" /><category term="mental models" /><category term="world models" /><summary type="html"><![CDATA[In the later years of my undergraduate studies I was lucky enough to have surrounded myself with mentors and peers who taught me not to chase after thin desires, to not become the machine, to cultivate confidence, invest time in forging relationships, invest in updating your mental models and be equanimious. By giving me frequent feedback I was able to learn more quickly than I could have starting out on my own, not so unsurprisingly an analogy of this exists in agentic coding models like claude code, opencode, codex, etc where agents get execution trace of the programs as context allowing them to reflect on their actions and perform much better. 
When you do this enough times you build a world model of the people that you interact with either consciously or unconsciously, some thought processes like “How would vignesh think through these trade offs?”, “How would nikhil approach this?”, etc starts to go around when you confront difficult choices in your life]]></summary></entry><entry><title type="html">Strassen’s matmul with AVX 512 kernel</title><link href="https://darshanmakwana412.github.io/2026/01/strassen-matrix-multiplication/" rel="alternate" type="text/html" title="Strassen’s matmul with AVX 512 kernel" /><published>2026-01-18T00:00:00+05:30</published><updated>2026-01-18T00:00:00+05:30</updated><id>https://darshanmakwana412.github.io/2026/01/strassen-matrix-multiplication</id><content type="html" xml:base="https://darshanmakwana412.github.io/2026/01/strassen-matrix-multiplication/"><![CDATA[<p>In the previous <a href="/2026/01/17/avx512-matrix-multiplication.html">matmul with avx512 and loop tiling</a> note we managed to build a cpu kernel that achieves 92% of peak FLOPS. I wanted to check if we can do better by using Strassen’s algorithm. <a href="https://en.wikipedia.org/wiki/Strassen_algorithm">Strassen’s algorithm</a> from 1969 showed that matrix multiplication can be done in $O(n^{2.807})$ by trading multiplications for additions. The main idea of this post is to use strassens for high level recursions and fallback to our highly optimized kernel for lower levels</p>

<p><strong>Table of Contents:</strong></p>
<ul>
  <li><a href="#the-standard-algorithm">The Standard Algorithm</a></li>
  <li><a href="#strassens-algorithm">Strassen’s Algorithm</a></li>
  <li><a href="#implementation">Implementation</a></li>
  <li><a href="#choosing-the-recursion-depth">Choosing the Recursion Depth</a></li>
  <li><a href="#benchmarks">Benchmarks</a></li>
</ul>

<h2 id="the-standard-algorithm">The Standard Algorithm</h2>

<p>Standard matrix multiplication for $C = A \times B$ where all matrices are $n \times n$ computes:</p>

\[C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}\]

<p>This requires $n^3$ multiplications and $n^3$ additions for a total of $2n^3$ floating point operations. For decades this was assumed optimal until Strassen showed otherwise</p>

<h2 id="strassens-algorithm">Strassen’s Algorithm</h2>

<p>Strassen observed that multiplying two $2 \times 2$ matrices:</p>

\[\begin{bmatrix} C_{11} &amp; C_{12} \\ C_{21} &amp; C_{22} \end{bmatrix} = 
\begin{bmatrix} A_{11} &amp; A_{12} \\ A_{21} &amp; A_{22} \end{bmatrix}
\begin{bmatrix} B_{11} &amp; B_{12} \\ B_{21} &amp; B_{22} \end{bmatrix}\]

<p>normally requires 8 multiplications (each $C_{ij}$ needs 2 products). But with clever grouping we can do it with only 7 multiplications at the cost of more additions</p>

<p>The key insight is that additions are cheap compared to multiplications, especially when the “elements” are themselves large submatrices. If we partition $n \times n$ matrices into four $n/2 \times n/2$ blocks, we can recursively apply this trick. At each level we replace 8 recursive calls with 7, giving the recurrence:</p>

\[T(n) = 7 T(n/2) + O(n^2)\]

<p>The $O(n^2)$ term comes from the matrix additions. Solving this recurrence gives $T(n) = O(n^{\log_2 7}) = O(n^{2.807})$</p>

<p>Strassen’s algorithm computes seven intermediate products M1 through M7:</p>

\[\begin{aligned}
M_1 &amp;= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &amp;= (A_{21} + A_{22}) B_{11} \\
M_3 &amp;= A_{11} (B_{12} - B_{22}) \\
M_4 &amp;= A_{22} (B_{21} - B_{11}) \\
M_5 &amp;= (A_{11} + A_{12}) B_{22} \\
M_6 &amp;= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &amp;= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}\]

<p>Then the output blocks are:</p>

\[\begin{aligned}
C_{11} &amp;= M_1 + M_4 - M_5 + M_7 \\
C_{12} &amp;= M_3 + M_5 \\
C_{21} &amp;= M_2 + M_4 \\
C_{22} &amp;= M_1 - M_2 + M_3 + M_6
\end{aligned}\]

<p>Each $M_i$ requires one matrix multiplication and 0-2 matrix additions/subtractions. The final assembly requires 8 additions. Total: 7 multiplications and 18 additions instead of 8 multiplications and 4 additions</p>

<h2 id="implementation">Implementation</h2>

<p>We need helper functions for matrix addition, subtraction, and copying. These are straightforward but need to handle different strides since submatrices have the parent’s stride:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="nf">addMat</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">int</span> <span class="n">C_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">A_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">int</span> <span class="n">B_stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">C_stride</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">A_stride</span><span class="p">]</span> <span class="o">+</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">B_stride</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="n">subMat</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">int</span> <span class="n">C_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">A_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span> <span class="kt">int</span> <span class="n">B_stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">C_stride</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">A_stride</span><span class="p">]</span> <span class="o">-</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">B_stride</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="n">loadMat</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="kt">int</span> <span class="n">C_stride</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="kt">int</span> <span class="n">A_stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">C_stride</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o">*</span> <span class="n">A_stride</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main Strassen function partitions A and B into quadrants, allocates temporary matrices for M1-M7 and two scratch buffers T1/T2, computes each product recursively, then assembles C:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="nf">matmul_kernel</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">int</span> <span class="n">stride</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="cp">#pragma omp parallel for collapse(2) schedule(dynamic, 1)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">ic</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">ic</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">ic</span><span class="o">+=</span> <span class="n">MC</span><span class="p">)</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">jc</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">jc</span><span class="o">&lt;</span><span class="n">n</span><span class="p">;</span> <span class="n">jc</span><span class="o">+=</span><span class="n">NC</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">Mb</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">MC</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">ic</span><span class="p">);</span>
        <span class="kt">int</span> <span class="n">Nb</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">NC</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">jc</span><span class="p">);</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">kc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">kc</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">kc</span> <span class="o">+=</span> <span class="n">KC</span><span class="p">)</span> <span class="p">{</span>
            <span class="kt">int</span> <span class="n">Kb</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">min</span><span class="p">(</span><span class="n">KC</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">kc</span><span class="p">);</span>

            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">ib</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">ib</span> <span class="o">&lt;</span> <span class="n">Mb</span><span class="p">;</span> <span class="n">ib</span> <span class="o">+=</span> <span class="mi">16</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">jb</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">jb</span> <span class="o">&lt;</span> <span class="n">Nb</span><span class="p">;</span> <span class="n">jb</span> <span class="o">+=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span>

                    <span class="n">__m512</span> <span class="n">psum</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{};</span>

                    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">blocki</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span><span class="n">ic</span> <span class="o">+</span> <span class="n">ib</span><span class="p">)</span> <span class="o">+</span> <span class="n">kc</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
                    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">blockj</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span><span class="n">jc</span> <span class="o">+</span> <span class="n">jb</span><span class="p">)</span> <span class="o">+</span> <span class="n">kc</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>

                    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">Kb</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>

                        <span class="n">__m512</span> <span class="n">b0</span> <span class="o">=</span> <span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">blockj</span> <span class="o">+</span> <span class="n">k</span> <span class="o">*</span> <span class="n">stride</span><span class="p">);</span>
                        <span class="n">__m512</span> <span class="n">b1</span> <span class="o">=</span> <span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">blockj</span> <span class="o">+</span> <span class="n">k</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">NUM_LOADS</span><span class="p">);</span>
                        <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">ik</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">ik</span><span class="o">&lt;</span><span class="mi">16</span><span class="p">;</span> <span class="n">ik</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
                            <span class="n">__m512</span> <span class="n">a</span> <span class="o">=</span> <span class="n">_mm512_set1_ps</span><span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="n">blocki</span> <span class="o">+</span> <span class="n">ik</span> <span class="o">+</span> <span class="n">k</span> <span class="o">*</span> <span class="n">stride</span><span class="p">));</span>
                            <span class="n">psum</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span> <span class="o">=</span> <span class="n">_mm512_fmadd_ps</span><span class="p">(</span>
                                <span class="n">b0</span><span class="p">,</span>
                                <span class="n">a</span><span class="p">,</span>
                                <span class="n">psum</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span>
                            <span class="p">);</span>
                            <span class="n">psum</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span> <span class="o">=</span> <span class="n">_mm512_fmadd_ps</span><span class="p">(</span>
                                <span class="n">b1</span><span class="p">,</span>
                                <span class="n">a</span><span class="p">,</span>
                                <span class="n">psum</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">ik</span><span class="p">]</span>
                            <span class="p">);</span>
                        <span class="p">}</span>

                    <span class="p">}</span>

                    <span class="c1">// Accumulate the finished 16x32 tile of partial sums into C</span>
                    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">ik</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">ik</span><span class="o">&lt;</span><span class="mi">16</span><span class="p">;</span> <span class="n">ik</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
                        <span class="kt">float</span> <span class="o">*</span><span class="n">loc_ptr</span> <span class="o">=</span> <span class="n">C</span> <span class="o">+</span> <span class="p">(</span><span class="n">ic</span> <span class="o">+</span> <span class="n">ib</span> <span class="o">+</span> <span class="n">ik</span><span class="p">)</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">jc</span> <span class="o">+</span> <span class="n">jb</span><span class="p">;</span>
                        <span class="n">_mm512_store_ps</span><span class="p">(</span>
                            <span class="n">loc_ptr</span><span class="p">,</span>
                            <span class="n">_mm512_add_ps</span><span class="p">(</span><span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">loc_ptr</span><span class="p">),</span> <span class="n">psum</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">ik</span><span class="p">])</span>
                        <span class="p">);</span>
                        <span class="n">_mm512_store_ps</span><span class="p">(</span>
                            <span class="n">loc_ptr</span> <span class="o">+</span> <span class="n">NUM_LOADS</span><span class="p">,</span>
                            <span class="n">_mm512_add_ps</span><span class="p">(</span><span class="n">_mm512_load_ps</span><span class="p">(</span><span class="n">loc_ptr</span> <span class="o">+</span> <span class="n">NUM_LOADS</span><span class="p">),</span> <span class="n">psum</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="n">ik</span><span class="p">])</span>
                        <span class="p">);</span>
                    <span class="p">}</span>

                <span class="p">}</span>
            <span class="p">}</span>

        <span class="p">}</span>

    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// Recursive Strassen multiply: C += A * B on n x n blocks with row</span>
<span class="c1">// stride `stride`; falls back to the SIMD kernel once level reaches MAX_DEPTH</span>
<span class="k">static</span> <span class="kr">inline</span>
<span class="kt">void</span> <span class="n">strassenMatmul</span><span class="p">(</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">C</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">int</span> <span class="n">stride</span><span class="p">,</span>
    <span class="kt">int</span> <span class="n">level</span><span class="p">,</span> <span class="kt">int</span> <span class="n">MAX_DEPTH</span>
<span class="p">)</span> <span class="p">{</span>

    <span class="n">assert</span><span class="p">((</span><span class="n">n</span> <span class="o">%</span> <span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="s">"n must be a multiple of 2"</span><span class="p">);</span>

    <span class="k">if</span><span class="p">(</span><span class="n">level</span> <span class="o">&gt;=</span> <span class="n">MAX_DEPTH</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">matmul_kernel</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">stride</span><span class="p">);</span>
        <span class="k">return</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Sub-block pointers; note A12/A21 offsets are mirrored relative to B's,</span>
    <span class="c1">// consistent with A being addressed transposed (as in the kernel's loads)</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A11</span> <span class="o">=</span> <span class="n">A</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A12</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A21</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">);</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">A22</span> <span class="o">=</span> <span class="n">A</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
  
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B11</span> <span class="o">=</span> <span class="n">B</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B12</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">);</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B21</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">B22</span> <span class="o">=</span> <span class="n">B</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">*</span> <span class="n">stride</span><span class="p">;</span>

    <span class="kt">float</span> <span class="o">*</span><span class="n">M1</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M2</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M3</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M4</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M5</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M6</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">M7</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>

    <span class="c1">// Zero-initialize the seven (n/2)x(n/2) product buffers in parallel</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">M1</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M2</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M3</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M4</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M5</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M6</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
        <span class="n">M7</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">2</span> <span class="p">)</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="c1">// Scratch buffers for the (n/2)x(n/2) operand sums and differences</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">T1</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
    <span class="kt">float</span> <span class="o">*</span><span class="n">T2</span> <span class="o">=</span> <span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="p">)</span><span class="n">aligned_alloc</span><span class="p">(</span><span class="n">ALIGNED_BYTES</span><span class="p">,</span> <span class="n">n</span> <span class="o">*</span> <span class="n">n</span> <span class="o">/</span> <span class="mi">4</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>

    <span class="c1">// M1 = (A11 + A22) * (B11 + B22)</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M1</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M2 = (A21 + A22) * B11</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M2</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M3 = A11 * (B12 - B22)</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M3</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M4 = A22 * (B21 - B11)</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M4</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M5 = (A11 + A12) * B22</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">loadMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M5</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M6 = (A21 - A11) * (B11 + B12)</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B11</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M6</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// M7 = (A12 - A22) * (B21 + B22)</span>
    <span class="n">subMat</span><span class="p">(</span><span class="n">T1</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">A12</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">A22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">addMat</span><span class="p">(</span><span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">B21</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">B22</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">);</span>
    <span class="n">strassenMatmul</span><span class="p">(</span><span class="n">M7</span><span class="p">,</span> <span class="n">T1</span><span class="p">,</span> <span class="n">T2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">level</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

    <span class="c1">// Assemble C from M1-M7</span>
    <span class="cp">#pragma omp parallel for collapse(2)
</span>    <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">j</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">j</span><span class="o">&lt;</span><span class="n">n</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// C11 = M1 + M4 - M5 + M7</span>
        <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">M1</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M4</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">M5</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M7</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
        <span class="c1">// C12 = M3 + M5</span>
        <span class="n">C</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="p">(</span><span class="n">j</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)]</span> <span class="o">+=</span> <span class="n">M3</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M5</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
        <span class="c1">// C21 = M2 + M4</span>
        <span class="n">C</span><span class="p">[(</span><span class="n">i</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="n">M2</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M4</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
        <span class="c1">// C22 = M1 - M2 + M3 + M6</span>
        <span class="n">C</span><span class="p">[(</span><span class="n">i</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">stride</span> <span class="o">+</span> <span class="p">(</span><span class="n">j</span> <span class="o">+</span> <span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)]</span> <span class="o">+=</span> <span class="n">M1</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">M2</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M3</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">]</span> <span class="o">+</span> <span class="n">M6</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">+</span><span class="n">j</span><span class="p">];</span>
    <span class="p">}</span>

    <span class="n">free</span><span class="p">(</span><span class="n">M1</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M2</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M3</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M4</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M5</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M6</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">M7</span><span class="p">);</span>
    <span class="n">free</span><span class="p">(</span><span class="n">T1</span><span class="p">);</span> <span class="n">free</span><span class="p">(</span><span class="n">T2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The base case when <code class="language-plaintext highlighter-rouge">level &gt;= MAX_DEPTH</code> calls our optimized AVX-512 kernel from the previous post. Submatrices are addressed using pointer arithmetic: in the row-major layout, A12 is at offset <code class="language-plaintext highlighter-rouge">n/2</code> (right one block of columns) and A21 is at offset <code class="language-plaintext highlighter-rouge">(n/2) * stride</code> (down one block of rows).</p>

<h2 id="choosing-the-recursion-depth">Choosing the Recursion Depth</h2>

<p>The recursion depth MAX_DEPTH controls where we switch from Strassen to the direct kernel. Too shallow and we barely benefit from the reduced complexity; too deep and the overhead of allocating M1-M7 and performing the extra additions dominates.</p>

<p>At each Strassen level we allocate nine temporary matrices of $(n/2)^2$ floats each: M1 through M7 plus the two scratch buffers T1 and T2. For MAX_DEPTH = 3 on an 8192×8192 matrix the leaf problems are 1024×1024. The memory overhead at the top level is:</p>

\[9 \times \frac{8192^2}{4} \times 4 \text{ bytes} \approx 604 \text{ MB}\]

<p>The recursion also needs n to be divisible by $2^{\text{MAX_DEPTH}} \times 32$ to ensure the leaf problems are multiples of our 16×32 register blocking. I pad matrices to the nearest valid size:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">Px</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">NUM_LOADS</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>  <span class="c1">// 32 * 2^MAX_DEPTH</span>
<span class="kt">int</span> <span class="n">Py</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">NUM_LOADS</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">MAX_DEPTH</span><span class="p">);</span>

<span class="n">nxp</span> <span class="o">=</span> <span class="p">((</span><span class="n">nx</span> <span class="o">+</span> <span class="n">Px</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Px</span><span class="p">)</span> <span class="o">*</span> <span class="n">Px</span><span class="p">;</span>
<span class="n">nyp</span> <span class="o">=</span> <span class="p">((</span><span class="n">ny</span> <span class="o">+</span> <span class="n">Py</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">Py</span><span class="p">)</span> <span class="o">*</span> <span class="n">Py</span><span class="p">;</span>
<span class="n">nxp</span> <span class="o">=</span> <span class="n">nyp</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">max</span><span class="p">(</span><span class="n">nxp</span><span class="p">,</span> <span class="n">nyp</span><span class="p">);</span>  <span class="c1">// keep square for simplicity</span>
</code></pre></div></div>

<p>For MAX_DEPTH = 3 this means padding to multiples of $32 \times 2^3 = 256$.</p>

<h2 id="benchmarks">Benchmarks</h2>

<p>Benchmarked on Intel Xeon W-2295 (18 cores, 3.2 GHz under AVX-512):</p>

<table>
  <thead>
    <tr>
      <th>n</th>
      <th>Direct Kernel</th>
      <th>Strassen (depth=2)</th>
      <th>Strassen (depth=3)</th>
      <th>Speedup (depth=3 vs direct)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2048</td>
      <td>0.025 s</td>
      <td>0.028 s</td>
      <td>0.031 s</td>
      <td>0.81×</td>
    </tr>
    <tr>
      <td>4096</td>
      <td>0.189 s</td>
      <td>0.178 s</td>
      <td>0.162 s</td>
      <td>1.17×</td>
    </tr>
    <tr>
      <td>8192</td>
      <td>1.48 s</td>
      <td>1.21 s</td>
      <td>1.08 s</td>
      <td>1.37×</td>
    </tr>
    <tr>
      <td>16384</td>
      <td>11.7 s</td>
      <td>8.9 s</td>
      <td>7.6 s</td>
      <td>1.54×</td>
    </tr>
  </tbody>
</table>

<p>For small n the allocation and addition overhead makes Strassen slower; the crossover happens around n ≈ 3000 to 4000 on this hardware. At n = 16384, Strassen with depth 3 is 1.54× faster.</p>

<p>The theoretical speedup from depth d is $(8/7)^d$. For d = 3 that’s 1.49×, close to our measured 1.54×. The slight advantage over theory comes from better cache behavior when operating on smaller submatrices.</p>

<p>Strassen’s also has numerical implications: the extra additions and subtractions accumulate rounding error, which showed up while I was benchmarking the kernels. Could such rounding errors be contained by choosing the recursion depth adaptively at each split?</p>]]></content><author><name>Darshan Makwana</name></author><category term="optimization" /><category term="cpp" /><category term="simd" /><category term="avx512" /><category term="algorithms" /><summary type="html"><![CDATA[In the previous matmul with avx512 and loop tiling note we managed to build a cpu kernel that achieves 92% of peak FLOPS. I wanted to check if we can do better by using Strassen’s algorithm. Strassen’s algorithm from 1969 showed that matrix multiplication can be done in $O(n^{2.807})$ by trading multiplications for additions. The main idea of this post is to use strassens for high level recursions and fallback to our highly optimized kernel for lower levels]]></summary></entry></feed>