Hi, guys. Been a while since a proper blog post - honestly I just fell out of the habit. I figured today would be a good day to come back, because I just hard focused for five hours and fixed a super niche bug that has been causing flakiness in our end-to-end test suite.
The issue was that we have an "edit mode" that is toggled via a switch, and it's on a page that has lazily loaded components via HTMX. The edit mode is just those components, but editable. Switching to edit mode, updating a value, and switching back would SOMETIMES (about 20% of the time) fail because the updated value was not on the page.
I had tried to figure out this issue before, but ended up pretty off the mark. I started out right - I looked through the issue and the context, and tried to puzzle out potential causes. After finding two, I wrote fixes for them, and verified the tests seemed to pass. Obviously, the mistake here is that I could not reproduce the failure, and thus could not actually verify that the fixes worked.
Over the weekend, I tried to reproduce the issue using the first root cause, and found out that although my fix worked, the tests still failed when I ran them for a hundred iterations. I decided to adopt this into my workflow and construct a minimal repro for any suspected root cause.
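For what it's worth, the "run it a hundred times" check can be as simple as a loop. Here's a rough Python sketch of the idea, not what our suite actually runs: `count_failures` and `make_flaky_check` are names made up for illustration, and the flaky check is a stand-in with a seeded random failure rate rather than a real end-to-end test.

```python
import random
from typing import Callable


def count_failures(check: Callable[[], bool], iterations: int = 100) -> int:
    """Run `check` repeatedly and return how many runs failed."""
    failures = 0
    for _ in range(iterations):
        if not check():
            failures += 1
    return failures


def make_flaky_check(failure_rate: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a flaky test: fails with the given probability."""
    rng = random.Random(seed)  # seeded so repeated runs behave the same way
    return lambda: rng.random() >= failure_rate


if __name__ == "__main__":
    # A check that fails roughly 20% of the time, like the flaky test described above.
    flaky = count_failures(make_flaky_check(0.2, seed=1))
    # A check that never fails, i.e. what a real fix should look like.
    fixed = count_failures(make_flaky_check(0.0, seed=1))
    print(f"flaky: {flaky}/100 failed, fixed: {fixed}/100 failed")
```

The point is that a single green run says very little about a test that only fails some of the time. A fix only counts once the failure count over many iterations drops to zero.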
Today, I set out to do this for the other root cause I had found - POST requests were not being sent upon updates to the form. I proved this by adding a visual success indicator to the component, which didn't show up, but again only sometimes. Even better, when I made Playwright take screenshots of the page for testing, I found that the test would pass with the screenshots and fail without them.
At this point, I hit a hard knowledge gap - I knew that we had hx-trigger on the element, and I knew that it was supposed to fire on the change event and send a POST request to the backend, except it wasn't.
I knew the next step would be to dig into the HTMX source or logs from the app, but I have like zero knowledge of JavaScript. Thankfully, I had just set up OpenCode Zen and had access to Opus 4.6. I gave it my research so far, and told it to look into the HTMX impl and figure out exactly why the POST wasn't triggering.
~15 mins and $8 later, Mr Claude found the issue. Playwright was filling in the form fields, triggering the change event, BEFORE HTMX had added the event listener to the element. This happens because HTMX's processNode() runs after a 20ms delay (by default) - it was a race condition between the two, so no POST request was ever sent to the server and I was really confused for a few weeks. It presumably also explains the screenshot thing: taking screenshots slowed Playwright down enough for the listener to be attached first.
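The race is easier to see in a toy model. Below is a minimal Python sketch, not HTMX or Playwright code: `Element`, `simulate`, and the delay values are all hypothetical stand-ins. One task attaches a listener after a 20ms delay (the figure mentioned above), and another fires the change event after a configurable delay.

```python
import asyncio
from typing import Callable, List


class Element:
    """Tiny stand-in for a DOM element; not a real browser or HTMX object."""

    def __init__(self) -> None:
        self.listeners: List[Callable[[], None]] = []
        self.posts_sent = 0

    def add_listener(self, callback: Callable[[], None]) -> None:
        self.listeners.append(callback)

    def dispatch_change(self) -> None:
        # Only listeners attached *before* this call get to run.
        for callback in list(self.listeners):
            callback()

    def send_post(self) -> None:
        # Stands in for the POST request the change event should trigger.
        self.posts_sent += 1


async def simulate(fill_delay_s: float, attach_delay_s: float = 0.02) -> int:
    """Return how many POSTs were sent when the change fires after fill_delay_s."""
    element = Element()

    async def library_init() -> None:
        # Assumed behaviour: the listener is attached only after a short delay.
        await asyncio.sleep(attach_delay_s)
        element.add_listener(element.send_post)

    async def test_driver() -> None:
        # The test tool fills the field and fires the change event.
        await asyncio.sleep(fill_delay_s)
        element.dispatch_change()

    await asyncio.gather(library_init(), test_driver())
    return element.posts_sent


if __name__ == "__main__":
    print("fill immediately:", asyncio.run(simulate(fill_delay_s=0.0)))
    print("fill after 100ms:", asyncio.run(simulate(fill_delay_s=0.1)))
```

When the change fires immediately, nothing is listening yet, so the event is lost and no POST is counted. When it fires after 100ms (a guess at the kind of slowdown an extra screenshot might add), the listener is already attached and the POST goes through.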
Regardless, that's two tests that won't randomly fail and make me re-run CI checks to merge a PR now. It was a pretty valuable learning process - I have debugged many things in my life but this is the first time I have really documented it. It's nice to be able to see my pitfalls in hindsight and learn from them in the future - I'll never give a root cause as definite again without verifying it with detailed reproduction steps first. Anyway, that's all for today. See ya tomorrow (hopefully)!