Inverse Random Test Auditing
A new random testing technique for the LLM age
Abstract

I’ve developed a new random software testing technique that I believe will become a staple of my software testing going forward. It is both helpful for agentic AI (LLM) software development, where robust software verification is more important than ever, and implemented with the help of LLMs.

Intro
Several months ago I started a side project to improve the testing of my more-Vim-than-Vim editor Composiphrase (that happens to be implemented in Emacs). As I tend to do, I got it 90% of the way to the finish line, then let it sit for several months as I moved on to other things. But I finally “finished” it over the weekend. While talking with some friends about it, I decided that I should share it more broadly. The technique is easy to use, and while I haven’t yet thoroughly validated the idea, early signs tell me that it is useful.
Background story time
Back in the old days of approximately one year ago, I wrote my editor as quickly as possible in limited time so that I could start using it, and have a demo to communicate the main ideas. But in doing this, I didn’t really test it much, and I’ve only recommended it as “demo-ware” to show the idea, not as an editor that I necessarily recommend others use (although I do use it). I’ve wanted to improve the testing so that I can more confidently refactor and fix bugs without introducing new ones (or outright breaking important functionality, since I’m obviously not going to manually test everything on every change). Most of the code I wrote revolves around a zoo of “text objects”, and functions for moving and editing based on them. I had a particular vision for how I wanted to write tests for these, to make it easy to read the tests and validate that they are correct. So I created a new testing library for Emacs editing commands: carettest.
The initial goal of carettest was to make a test format that is easy to read and validate for this particular domain. But I also have a background in random testing1. So it seemed natural to also give it a generator mode, where it could generate random inputs, run the function under test, then use the inputs and result to generate a (passing) test case.
Random-o, random-o, whyfore art thou, random-o
So what good is automatic generation of random test cases that simply capture the current behavior and pass?
Most interestingly2, if you have a bunch of test cases, you can review them and see if they look correct. Random generation of tests can generate weird cases that you didn’t think of. But reviewing hundreds or thousands of generated tests is a lot of boring, tedious work. I don’t have time for that.
But you know who does? Our friend3, the LLM.
So the key new idea is to turn the usual arrangement on its head: rather than having a test oracle that can tell us whether a random test passes or fails, we use no oracle at all, because every generated test passes by construction4. We already know the answer; instead, we use an LLM to examine the question. Given the documentation of the function under test, LLMs can tirelessly audit passing test cases and tell us whether each one matches the documented behavior. Then each test case determined to be incorrect can be inverted — its assertion can simply be negated, or, more practically, edited (into a failing test case) by revising it to check for the desired behavior instead of the (passing) actual behavior.
Of course, LLMs are nondeterministic and can be faulty. So give the LLM agent instructions to move the suspicious tests into a folder for human review.
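A minimal sketch of the audit-and-invert step, with everything hypothetical: `ask_llm` is a placeholder for whatever chat-completion call you use, and the `fake_llm` stub exists only so the example runs without a real model.

```python
def audit_test(test_src, docstring, ask_llm):
    """Ask an LLM whether a passing, behavior-capturing test matches the
    documented behavior. Returns True when the test looks suspicious."""
    prompt = (
        "Function documentation:\n" + docstring + "\n\n"
        "A passing test recording the function's ACTUAL behavior:\n"
        + test_src + "\n\n"
        "Does the recorded behavior match the documentation? "
        "Answer MATCHES or SUSPICIOUS, then explain."
    )
    return ask_llm(prompt).strip().upper().startswith("SUSPICIOUS")

def invert_test(test_src):
    # Simplest possible inversion: negate the first equality assertion.
    # More practically, rewrite the expected value to the documented behavior.
    return test_src.replace("==", "!=", 1)

# Stub standing in for a real LLM call, for demonstration only.
def fake_llm(prompt):
    return "SUSPICIOUS: the documentation says motion stops at punctuation."

test = "assert word_start('foo.bar', 7) == 0"
if audit_test(test, "Move point to the start of the current word.", fake_llm):
    print(invert_test(test))  # the negated, now-failing test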
TL;DR, the clear explanation that maybe should have been first, but that would ruin the narrative
So to be clear, what is “Inverse Random Test Auditing”?
- generate random test inputs
- call the function under test with that input
- use the input and result to write an automated test. I.e., output actual code that says “the test passes if you input this and the output is that”, where “input” and “output” can generally include that monster called “state”. Since you are writing a test program, you get to be fancy and call your test case generator a metaprogram.
- use your new generator to create what we in the industry call “a whole lot of tests”
- have an LLM (in “agent” form or otherwise) read the test case, along with the documentation for the function and any other context you want to give it (or that an agent wants to give it, if you are using an AI agent to drive this process), and flag interesting cases
- probably do some human review if what you’re testing really matters
- flip the passing test into a failing test: either automatically, by negating the condition, or by rewriting the test result to match the fuzzy natural language description in the documentation or specification (manually, or automatically with an LLM).
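Gluing the triage part of those steps together might look like the following Python sketch. The `kept`/`suspicious` folder layout is my own invention for illustration, not a prescribed one; `is_suspicious` stands in for the LLM audit above.

```python
import pathlib

def sort_generated_tests(tests, is_suspicious, out_dir):
    """Write each generated test into kept/ (looks consistent with the
    docs) or suspicious/ (flagged by the audit, awaiting human review)."""
    out = pathlib.Path(out_dir)
    for bucket in ("kept", "suspicious"):
        (out / bucket).mkdir(parents=True, exist_ok=True)
    counts = {"kept": 0, "suspicious": 0}
    for i, src in enumerate(tests):
        bucket = "suspicious" if is_suspicious(src) else "kept"
        (out / bucket / f"test_case_{i}.py").write_text(src)
        counts[bucket] += 1
    return counts
```

The suspicious folder is where the human review (and eventual inversion into failing tests) happens.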
Super serious evaluation section of the paper^H^H^H^H^H blog post5
I just published the library and did a single initial run of test auditing over the weekend. And then I ran out of tokens, so I don’t have very many results yet. But the test audit did find bad tests that demonstrate known bugs in my editing functions that I haven’t bothered to write tests for. And some more tests that I haven’t yet had time to review, but that probably include bugs. And… also some tests that were perfectly fine, but that Claude flagged as suspicious. But improving my lazy prompt eliminated that first round of false positives.
Finding bugs in code that I wrote in a rush with only basic, manual, happy-path testing (and known to be buggy from experience) isn’t exactly a groundbreaking result. But I think this is a pretty promising approach that I’m going to continue to use. And so I decided to name it.
Oh wait, I’m actually going to write a related work section?
I didn’t find any related work in some low-effort searching before writing this, but after writing the first draft, I gave it to Claude Code and asked it to look for related work, and it came up with:
AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests
@misc{khatib2026assertflipreproducingbugsinversion,
  title={AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests},
  author={Lara Khatib and Noble Saji Mathews and Meiyappan Nagappan},
  year={2026},
  eprint={2507.17542},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2507.17542},
}
and
@misc{lee2025metamonfindinginconsistenciesprogram,
  title={METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries},
  author={Hyeonseok Lee and Gabin An and Shin Yoo},
  year={2025},
  eprint={2502.02794},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2502.02794},
}
Let’s see, what does METAMON do? It … uses an LLM to compare passing generated tests with the documentation to identify bugs. Dang it, how dare they scoop me… like 6 months before I had the idea and 12 months before I decided to write a blog post about it!
And AssertFlip? Based on also not fully reading the paper, it’s in a similar vein, but they only scooped me by like a month before I decided to write this, so should I be more or less upset about that?
Well, at least my name for the process is novel.
Coda, D.S. Al
Agentic coding is a pretty big deal. But it demands much higher standards of software verification. It’s made me more grateful than ever before for all of the advanced or unusual testing techniques I learned about in grad school, hanging out with academics who care a lot about software verification. I’ll be thinking more about new and better ways of testing and other aspects of software verification and validation.
That said, the bar for better software testing is typically pretty low, even for “industry-grade” software that billions depend on. You don’t need a PhD to find simple ways to improve testing even in critical software infrastructure. Testing is so often neglected, done as an afterthought, and done poorly.
But in the age of agentic software development, it’s time to reverse that trend.
1. One of my major projects during my PhD was Xsmith, a DSL for writing compiler fuzzers. ↩
2. Ok, in asking the LLM for editing advice, it actually gave good advice here that I was burying the lede by talking seriously about other ways in which this can be useful before getting to the main point. But I still want to say those things, and this is my blog, LLM, you can’t tell me what to do. So besides using Inverse Random Test Auditing, cheaply generating hundreds or thousands of test cases can be useful for temporarily having tons of extra tests to validate that refactoring preserves behavior. (You could instead do differential testing, but maybe there are reasons you don’t want to refer back to the old version.) Or you can do it as a post-hoc way to seed a test corpus if you didn’t write any tests, or maybe if your predecessor or institution at large didn’t. ↩
3. … or maybe foe. I find LLMs and surrounding technology incredibly useful, but I have SO MANY worries about our zeitgeist-laden AI friends. I’ve just deleted like 2 paragraphs from this footnote listing practically endless things about modern AI that I find very concerning. But for now let’s keep it, uh, positive. Banging out side projects with agentic coding tools is legitimately awesome! ↩
4. Of course, occasionally a random test may trigger an unexpected crash or critical exception. So it can still double as a traditional fuzzer to some degree. ↩
5. I asked an LLM to critique this blog post, and got «"Super serious evaluation section of the paper^H^H^H^H^H blog post" is fun but the ^H joke is dated and requires explanation for most readers today.»… Okay, boomer. Er, zoomer? But after I read that advice, I decided to make it even funnier by exhaustively explaining the joke in this footnote. But somewhere along the line while explaining about the ascii table and VT100 terminals, I decided it was too much (and more importantly taking too much time) and deleted it. But actually I’m only using the ^H joke because I wanted a strikethrough and the markdown processor I’m using for this doesn’t support strikethrough, and at this point I don’t want to rewrite to another format or something. ↩
6. I did some light editing based on LLM feedback. But I tend to not like most suggestions I get from LLMs about editing. They just tend to be some mix of wrong, or just… not what I would say or want to say. Anyway, here are some gems from AI review: «Credibility: The subtitle claims “A new random testing technique” but then the related work section reveals it isn’t new. Even handled with humor, this could feel like a bait-and-switch to some readers.» It felt like a bait-and-switch to me, too, when I found that my novel idea was less novel than I had hoped. But I wrote the subtitle before I knew about the related work, and I like it, so I’m not going to change it. «The tone swings a lot (technical → sarcastic → serious → footnote tangents). Works for a personal blog, but some readers will find it hard to follow the technical thread.» Thank you, Claude, for validating that it will work for my personal blog. ↩
7. I had better wrap this up. I told myself that I wanted to sit down and write a blog post in one evening all at once, and now it’s well past bed time. You can discern a rough order in which I actually wrote things based on the fact that I started writing a straight blog post but then as it got later I became increasingly sarcastic. ↩