About

Darshan

Hi, I’m Darshan. I am currently an AI engineer at Riverline AI. Before this I was an ML engineer at Sprinklr, working on real-time voice AI systems and improving their ASR inference. I was also a Research Assistant at Aalto Vision Lab, working on 3D Gaussian Splatting during my semester exchange in Finland, and before that I did a summer research internship at TU Munich on quadruped robotics.

I graduated from IIT Bombay in 2025 with a B.Tech in Mechanical Engineering and a dual minor in Computer Science and Data Science. My interests span efficient ML serving systems, 3D reconstruction and endurance sports. I write semi-regularly on this site about things I read, learn, build and occasionally bump my head against. You can find my CV here. I am also on Twitter and GitHub.

Engineering Work

Let me share two instances where I tackled and solved some of the most ambitious engineering problems I have ever faced.

When I had just joined Sprinklr, I was assigned the task of cost optimization for their releases. In the voice team, the majority of releases were ASR models, and we were using proprietary fine-tunes of the Whisper model. I realized that 90% of the cost came from hosting the Whisper model. I started experimenting with ways to improve the infrastructure: I tried quantization, ran GPU benchmarks from T4 to H100 to find the most effective cost-to-request ratios, found some bugs in the way they were allocated, and so on.

While I was doing quantization, I wanted to see how much improvement I was getting at the CUDA kernel level, so I profiled the execution trace of the entire Whisper forward pass in vLLM while serving, then visualized it in the Perfetto UI. I noticed something strange: the individual decoding steps took very little time compared to the e2e latency of the entire forward pass. This was surprising because the audio clips I was benchmarking on were short (2s, 4s, 5s and 7s, under a simulated Poisson distribution, because this matches our client distribution). I read the stack traces and concluded that 80% of the e2e time was being consumed by the Whisper encoder, which was also strange. Then I started reading the vLLM engine code and realized that the audio is always padded to 30s, even if the clip is only 5s long. That means the encoder spends most of its time computing representations of padding tokens: for a 5s clip, 25 of the 30 seconds are padding, so 83% of the encoder computation is wasted for us, and since the encoder accounts for 80% of the e2e time, about 66% of the e2e latency can be removed if we change this logic.

I then spent the next couple of weeks patching the vLLM engine to remove the redundant padding and shipped it to production, which resulted in almost a 50% reduction in latency. That means we can effectively double the number of requests we serve per GPU and halve the number of pods. This was a big success in the voice team, and the VP of engineering congratulated me on it.

One regret I have is that we failed to capitalize on it by open-sourcing it. Around 1 April, Cohere Transcribe open-sourced something similar around padding removal in the vLLM engine, which made me realize that agency and speed are everything in this competitive landscape. If you are not the first, then you probably have to be better than the first by a wide margin to capitalize.
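The padding arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope model, not the profiling code or the vLLM patch: it assumes encoder cost grows linearly with input length (attention is not strictly linear, so treat the numbers as rough), and the function names are hypothetical.

```python
# Rough model of compute wasted on padding when every clip is padded to 30 s.
# Assumption: encoder cost is proportional to input length.

WHISPER_WINDOW_S = 30.0  # fixed input window that clips are padded to


def wasted_encoder_fraction(audio_s: float, window_s: float = WHISPER_WINDOW_S) -> float:
    """Fraction of encoder work spent on padding for a single clip."""
    if not 0 < audio_s <= window_s:
        raise ValueError("audio length must be in (0, window_s]")
    return (window_s - audio_s) / window_s


def e2e_saving(audio_s: float, encoder_share: float) -> float:
    """Upper bound on the e2e latency removed by dropping the padding.

    encoder_share is the fraction of e2e time spent in the encoder,
    as measured from a profile (80% in the text above).
    """
    return encoder_share * wasted_encoder_fraction(audio_s)


if __name__ == "__main__":
    for clip_s in (2.0, 4.0, 5.0, 7.0):
        print(
            f"{clip_s:>4.0f}s clip: "
            f"{wasted_encoder_fraction(clip_s):.0%} of encoder work is padding, "
            f"up to {e2e_saving(clip_s, 0.8):.0%} of e2e latency"
        )
```

For a 5s clip this gives roughly five sixths of the encoder work as padding and about two thirds of the e2e latency as the theoretical ceiling, which is in line with the figures above. The measured gain in production was lower than that ceiling, as expected for an upper bound.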

But leaving that aside, while I was patching the vLLM engine to integrate padding removal, I was reading the scheduler logic and the scheduler logs, and I observed that a lot of requests of 10s were getting blocked by requests of 1s. This was before I had the vLLM engine patched. Two weeks later, when I had just shipped the patch to production, I revisited this and had an insight: what if we use the number of tokens that are going to be generated, estimated in advance, as a proxy for how much computation a request will need? The intuition is simple: audio length is roughly proportional to the number of text tokens that will be generated, because there is a limit to how much information you can pack into audio while it still remains comprehensible. With this insight I implemented a priority-based scheduler inside the vLLM scheduler by patching the whole engine. It took around a month to understand the entire vLLM codebase and patch it (this was before I was using Cursor or Claude Code). Then I benchmarked it, and to my surprise it worked far better than I had initially expected. After 2 months we also shipped this priority-based scheduler patch to production. After 3 months my manager encouraged me to publish a paper on the scheduler logic. After 5 months the paper was accepted at CAO @ ICLR.
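To make the scheduling idea concrete, here is a minimal sketch of a queue that orders requests by their estimated output length. It is not the vLLM scheduler or the patch described above. The class, the `TOKENS_PER_SECOND` constant and the shortest-estimate-first ordering are all assumptions made for illustration; the text only says that predicted token count was used as the priority signal.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical conversion factor from audio seconds to output text tokens.
TOKENS_PER_SECOND = 3.0


@dataclass(order=True)
class QueuedRequest:
    est_tokens: float  # primary sort key: estimated decode work
    seq: int           # tie-breaker: arrival order (FIFO among equals)
    request_id: str = field(compare=False)
    audio_s: float = field(compare=False)


class LengthAwareQueue:
    """Priority queue keyed on estimated output tokens (assumed: smallest first)."""

    def __init__(self, tokens_per_second: float = TOKENS_PER_SECOND) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()
        self._tps = tokens_per_second

    def push(self, request_id: str, audio_s: float) -> None:
        # Audio length stands in for the number of tokens to be generated.
        est_tokens = audio_s * self._tps
        item = QueuedRequest(est_tokens, next(self._counter), request_id, audio_s)
        heapq.heappush(self._heap, item)

    def pop(self) -> str:
        # Return the id of the request with the smallest estimate.
        return heapq.heappop(self._heap).request_id

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = LengthAwareQueue()
    for rid, seconds in [("a", 10.0), ("b", 1.0), ("c", 5.0), ("d", 1.0)]:
        queue.push(rid, seconds)
    print([queue.pop() for _ in range(len(queue))])
```

A pure shortest-first policy can starve long requests under sustained load, so a real scheduler would likely add aging or a wait-time cap on top of this.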

Whose work I look up to and why

In general I look up to people like Andrej Karpathy, Neel Nanda, Jeff Dean, Richard Hamming and David Patterson. I always try to read and internalize their tweets, posts, blogs, papers and talks. In my opinion all of them have an incredible ability to see through the chaos in the world, arrive at conclusions with clarity and strong conviction, and then present them succinctly in a way that is digestible to a general audience outside their field. All of them have made a very strong impression on my world model and on how I think and reason through things. Karpathy predicted a lot of things around AI before they happened, Neel mentored some of the pioneers of mech interp, and similar cases can be made for Dean, Hamming and Patterson.

One of the quotes from Hamming’s “You and Your Research” which has stuck with me forever goes something like this:

If you want to do great work you clearly must work on important problems

Outside of work

I run (slowly building up to a full marathon before the year is out), photograph birds, read more than I sleep, and collect things I find interesting on the internet. If any of this resonates with you, or you’d like to chat about ML serving, Gaussian splatting, voice agents or anything else, feel free to reach out at darshanmakwana412@gmail.com