EP. 07: Why Reasoning Agents Change Everything in Defense | "In the Arena" Podcast

Reasoning agents have quietly crossed a threshold. In just the past few months, models gained the ability to choose and chain tools, turning them from text summarizers into systems that can actually reason, act, and adapt. For defense missions, that shift is enormous.

Vannevar CEO Brett Granberg explains why the o1 → o3 leap changed everything, how a task that once required 40 analysts a month now runs in 20 minutes, and how to balance fast experimentation with long-term product strategy. We discuss what’s real today, what’s coming next, and what it takes to win when the ground shifts every four weeks.

We cover:

  • Why tool use is the step-change
  • Concrete defense use cases
  • The playbook for building with agents: model-agnostic, tool-first, and mission-driven
  • How tech-enabled services could upend billions in prime contracts
  • The unsolved problems: model evaluation, UX, and hallucinations
  • What an “agent-native” team looks like, and how to build one

Watch on YouTube

Listen wherever you get your podcasts.

Episode Transcript

Hayley

Today, we're talking about agents in defense. I'll start by asking you what you think the big inflection point has been around the uptake of AI in defense.

Brett

I think we still have a long way to go in terms of uptake, but I think the fundamental technology change that's happened has been reasoning agents, say, between December and March. These new foundation models got really good at using tools to either retrieve data or take some action. And I think that has opened up the possibility for what mission problems can actually be solved, just in general.

Hayley

Yeah. So zooming out a little bit, what are reasoning agents?

Brett

Prior to reasoning agents, large language models were, you know, the thing in the news. What large language models basically were, or how people were using them, was finding really creative ways to take subsets of data and jam them into the context of these models and then use that to generate text summaries. That was not every use case, but a large number of use cases in enterprises were geared around that, because that's what they were good at. The change in December with models like o1 was that you took some of the same concepts, but it got to the point where people were able to provide tools to these models, and the models were able to use those tools in ways that were previously not possible. And when I say tools, I mean a couple of things. One is that these models were able to decide what tools to use, a tool being a way to either gather data or to write data or do some action. These models were now able to decide what tools to use in a pretty useful way, get results from those tools back, and then iteratively call themselves to adjust the task they were doing to get better and better results. And that tool use concept is the major change. That's a step function change, in my opinion, from kind of traditional large language model stuff to reasoning agents.
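To make that tool-use loop concrete, here is a minimal sketch of the pattern Brett describes: the agent picks a tool, gets the result back, and iterates. The Agent class, the select_action helper, and the toy tools are illustrative placeholders rather than any vendor's API; a real reasoning model would make the tool choice itself.

```python
# Minimal sketch of the tool-use loop described above. The model "judgment"
# is stubbed out in select_action(); a real reasoning agent would decide
# which tool to call next based on the task and prior results.
from dataclasses import dataclass, field

@dataclass
class Agent:
    tools: dict                      # name -> callable that gathers data or takes an action
    history: list = field(default_factory=list)

    def select_action(self, task: str):
        # Placeholder for the model's judgment: here we simply call each tool
        # once and then stop, standing in for an iterative model decision.
        for name in self.tools:
            if not any(h["tool"] == name for h in self.history):
                return name
        return None

    def run(self, task: str) -> list:
        while (tool_name := self.select_action(task)) is not None:
            result = self.tools[tool_name](task)           # call the chosen tool
            self.history.append({"tool": tool_name, "result": result})
        return self.history                                # iteratively built context

# Example: two toy tools, one that retrieves data and one that takes an action.
agent = Agent(tools={
    "search_reports": lambda task: f"3 documents matching '{task}'",
    "write_summary":  lambda task: f"draft summary for '{task}' saved",
})
print(agent.run("counter-hypersonic perception"))
```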

Hayley

To compare to bots, is the distinction that bots historically would just kind of use rules-based logic, for lack of a better term, to execute tasks? And my understanding from what you said is that reasoning agents actually almost have judgment from LLMs?

Brett

Yeah, yes. So for bots, I think when people say bots, there are two types of things people think about. One is sort of disinformation, you know, Russian election-meddling-type bots, and the other is maybe more like what people call robotic process automation. So, things where if you display a computer screen to these RPA tools, they know how to drag the thing from Excel to Outlook or whatever and do some task. The old-school way of doing those tasks was very rules-based and deterministic, so you would write rules for both the RPA systems and for disinformation bots to say, hey, here are a combination of things that you can do, combine them in ways that you think are interesting, and then output will come out. But that whole model is kind of deterministic: given the input, you know what the output is gonna be. Bots probably got weirder with large language models. But I think this is a different paradigm, where these models are non-deterministic. You provide an input and you do not know which tools it's gonna select, how it's gonna reason, how it's gonna structure the output at the end of the task. It's able to make those steps with judgment, and iteratively, to produce output. And that is a fundamentally different paradigm, I think.

Hayley

What has changed in the past year that all of a sudden made these reasoning agents deployable?

Brett

Yeah. I think it was really just o1. So there are a few tech companies that have poured billions of dollars into getting humans to basically provide feedback for reinforcement learning: did this model accomplish the goal for me or not? And the foundation models, for example o1, got to a point where the reasoning ability and the ability to use tools were just sufficiently good that when ChatGPT and others started deploying these models, they were now able to be used in a reasonable way. Whereas reasoning and iterative calling of large language models has been a thing for many years now, nobody used them or really took them seriously, because the reasoning component just wasn't there. So it was huge investment from some of these tech companies into these foundation models that made these viable. And really, in my opinion, it's only the December-to-March window, the o1 to o3 models, that made the change here.

Hayley

And what type of mission problem suddenly became solvable with reasoning agents?

Brett

I think we're gonna be able to rethink a lot of different tasks across the different directorates within a DOD unit with reasoning agents in mind. Everything from planning to operational assessment, intelligence analysis, data collection, probably things around logistics. There's just a whole bunch of really ambiguous tasks that you were not really able to write rules-based systems to solve that I think, in the next twelve months, you're gonna be able to actually have agent-based systems solve. Not all the way end to end. I think what you're gonna see is you're gonna be building these workflow tools that allow people to do their job two or three times more effectively and efficiently. But it's still gonna be very human-centric and workflow-centric. And so having that user and domain knowledge of what you're actually trying to accomplish is gonna be really important.

Hayley

To ground this in practice a little bit, what was the first, or the first handful, of mission problems you saw where agents were obviously the right answer and did what you're describing?

Brett

The thing that we're figuring out is that it's not necessarily obvious to us, honestly, initially, what's gonna work with agents. So the key thing is just trying a bunch of things very quickly and iterating on them, as with most tech product development, I guess. But the first use case that was really interesting for us was perception management. I'm gonna use a Russia example. Let's say that the US is trying to deter Russia from invading more of Europe, for example Poland. A key thing that you might care about as the US is: do the Russians think that we have some way to defeat their hypersonic missiles, like the Kalibr missile systems that they use quite effectively in Ukraine? So managing the perception, what the Russians think about our ability to defeat those missile systems, specifically what technology programs we have and how the Russians are thinking about those weapons or technology programs, is really important for that mission. The way you would solve this before is you'd have usually one analyst, well, it probably should have been teams of analysts, but usually one analyst who would hand-jam a really complicated boolean query to try to cover, in Cyrillic and in English, every variation of how the Russians might discuss some counter-hypersonic missile system. That would return a bunch of results, some in different languages, probably a lot of Russian, so you'd have a translation step of taking all that text data and turning it into English. Then you'd have an analyst go through and read something like thousands of results in some cases, depending on how important the problem was. And then they would be doing analysis steps of, okay, have there been any significant changes in how the Russians have thought about our counter-hypersonic missile systems in the last twelve or twenty-four months? If so, what caused those changes? Did we do anything? Was there any reporting that the Russians were picking up on? What's the baseline Russian opinion now? So there are just a lot of analysis tasks that need to be done in that problem, and that whole process can take a month for somebody who's pretty skilled. With agents, we took that problem, we took some of the workflow components that we know analysts do on every one of those analyses, and we wrote an agent for each one of those components. Then we had them all run in parallel. And we didn't do it just for one system, we actually did it for something like 40 systems. We just ran it all, and it took about twenty minutes to run those 40 systems. So you can very quickly generate, in this example, an assessment of what Russia thinks about every one of our major military systems in, like, half an hour. It's not a 100% solution, but that would have taken something like 40 analysts a month to do.
So that's the kind of thing where you might have a guess of what's gonna work before you try something, but there are kind of no rules. Nobody really knows how these things are gonna work, and it's pretty greenfield right now in terms of experimentation. The defense primes and most of the incumbents are not actually experimenting with any of this stuff. And that's great for us, but, you know. Yeah.
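A rough sketch of the fan-out Brett describes, assuming the workflow is broken into components with one agent per component, run in parallel across roughly 40 systems. The component names, the run_agent stub, and the thread pool are hypothetical stand-ins for whatever agent framework is actually used.

```python
# Sketch of the parallel per-system, per-component fan-out described above.
from concurrent.futures import ThreadPoolExecutor

COMPONENTS = ["query_generation", "translation", "change_detection", "baseline_assessment"]
SYSTEMS = [f"system_{i:02d}" for i in range(40)]   # e.g. ~40 military systems of interest

def run_agent(component: str, system: str) -> str:
    # Placeholder: a real implementation would call a reasoning model with the
    # tools and prompt for this workflow component.
    return f"{component} result for {system}"

def assess(system: str) -> dict:
    # Run every workflow component for one system and collect the pieces.
    return {c: run_agent(c, system) for c in COMPONENTS}

with ThreadPoolExecutor(max_workers=8) as pool:
    assessments = dict(zip(SYSTEMS, pool.map(assess, SYSTEMS)))

print(len(assessments), "assessments produced")
```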

Hayley

So much there. I wanna dig into the product strategy. But before I do, I wanna pick on one thing that you said about it not being a 100%

Brett

Yeah.

Hayley

Accurate solution. So what do you do with that five or 10 or 20%, to audit it? Is it possible for humans to know which 10 or 20% that is? And what does that look like?

Brett

Yeah. There are a couple pieces of that problem that people are still figuring out; there aren't widely accepted solutions yet. But the two things that are really important there are model evaluation and UX: how do you expose the information you're giving to users in a way that lets them figure out what's right quickly, and also figure out what's wrong and then edit it quickly? For model evaluation, the trick with these foundation models is that there definitely is hallucination, but literally every two to six weeks there's a new model that, for some of our workflows, beats all the other models. So you have to be really adaptable about getting these new models in, and those lower the hallucination rate. The only way to measure that, though, is if you have some sort of model evaluation process. That's really tricky for agent models, because the workflows you're covering are often so ambiguous and non-deterministic in nature that trying to pick a subset of, say, 30 examples is not really sufficient. The idea is: good human-generated output on these 30 examples looks like this; run the old model against those examples and score it in some way; run the new model against them and score it; if the new model is better, use that model. That's actually really hard to do from an agent standpoint. But when you're thinking about reducing hallucination rates, you're kind of messing with the tools and the foundation model and how you're setting up these prompts, everything, in order to iteratively drive down what that error rate looks like. That's an unsolved problem; we're still trying to figure out how to solve it on our end. But it's important. And then the other piece is just UX. As much as you can, you force the model to provide its reasoning and cite sources. You can do things like multi-agent setups where you have analyst agents being checked by what people call a checker agent, and they're checking for hallucinations specifically. So I, the checker agent, am gonna look at every bullet that this analyst agent is providing to me, and I'm gonna go check whether the source they're providing is real and whether that source matches the content of the bullet. So you can get fancier and fancier systems for working on hallucination. At the same time, the technology is changing so fast that anything you implement in either of those two camps is probably not going to be relevant in two or three months. So there's a balance with agents right now. Things are changing so quickly that you have to decide what technology bets you're gonna make and which ones you're not gonna make, because you know it's gonna change in a month or two anyway. It's kind of a balancing act right now.
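Here is a minimal sketch of the analyst/checker pattern mentioned above, where a second agent re-checks every bullet against its cited source. Both agents are mocked with placeholder data and a toy support check; a real checker would ask a model whether the source actually backs the claim.

```python
# Sketch of the analyst/checker pattern: the checker flags bullets whose cited
# source is missing or does not support the claim.
def analyst_agent(question: str) -> list[dict]:
    # Placeholder analyst output: bullets with claimed sources.
    return [
        {"claim": "Sentiment shifted in early 2024", "source_id": "doc-17"},
        {"claim": "New program announced", "source_id": "doc-99"},   # bad citation
    ]

SOURCES = {"doc-17": "early 2024 commentary shows a shift in sentiment"}

def checker_agent(bullets: list[dict]) -> list[dict]:
    checked = []
    for b in bullets:
        text = SOURCES.get(b["source_id"])
        # Real version: ask a model "does this source support this claim?"
        supported = text is not None and "shift" in text
        checked.append({**b, "supported": supported})
    return checked

for bullet in checker_agent(analyst_agent("Russian perception of system X")):
    print(bullet)
```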

Hayley

Yeah. I guess, how do you balance maintaining that flexibility with building for specific workflows that are urgent?

Brett

Yeah. So the key thing, and this is not credit to me, this is some really good senior engineers we have on the team: we invested a bunch in infrastructure for people to do a few different things. One is allowing any engineering team at the company to implement any agent that anyone else has built at the company. That means you can have non-agent-expert engineers just take something, build an application, and prototype it really quick. That was really, really key. Basically providing good agent tooling for what we call forward deployed engineers, people that prototype with customers super quickly. That's a good idea; it was not my idea, but it's a very good idea. The other thing that's important for infrastructure is allowing any engineer in the company to write their own agents and tools. We collect a lot of mission-relevant data at Vannevar, but a lot of it is very heterogeneous; it's covering different mission use cases for different reasons. So enabling any engineer to take any dataset, write a tool for a dataset they know well, and deploy that across any of the models or agents that we have matters. Because the foundation models are changing every two to four weeks, you need to be able to swap models in and out basically as fast as possible. So we created infrastructure to do that, so that swapping a model in doesn't require you to rewrite all the stuff you did for tools. Being able to swap models in and out was really important. So I think investing in that infrastructure is probably the most important thing for companies that are experimenting with agents.
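A small sketch of what that model-agnostic layer might look like: tools are registered once in a shared registry, and the model behind the agent can be swapped without touching the tools. The provider names, register_tool decorator, and call_model body are illustrative, not Vannevar's actual infrastructure or any real SDK.

```python
# Sketch of a shared tool registry plus a swappable model layer.
from typing import Callable

TOOL_REGISTRY: dict[str, Callable[[str], str]] = {}

def register_tool(name: str):
    def wrap(fn):
        TOOL_REGISTRY[name] = fn       # any team registers a tool once, everyone reuses it
        return fn
    return wrap

@register_tool("rf_detections")
def rf_detections(query: str) -> str:
    return f"RF detections matching {query!r}"

def call_model(provider: str, prompt: str, tools: dict) -> str:
    # Placeholder: in practice this dispatches to whichever foundation model is
    # currently best for the workflow; swapping providers changes nothing above.
    return f"[{provider}] answered {prompt!r} using tools {sorted(tools)}"

for provider in ("model_a", "model_b"):           # swap models in and out freely
    print(call_model(provider, "ships near port X", TOOL_REGISTRY))
```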

Hayley

What does it look like to build that infrastructure to push prototypes to users without an engineer in the loop manually doing it?

Brett

Okay. So this is a really weird time for product managers specifically. One of the things that's changed, along with being able to take agents and your own data and write really cool tools to build applications with, is that in parallel you also have a big change happening in how engineering works and how product management works. Now you can take a completely nontechnical product manager, for example myself, and have that person vibe code using Claude Code or whatever, pick any IDE, and create a clickable prototype with real data in, like, four hours, with no engineering knowledge. That was not a thing six months ago. You know what I mean? And that creates this new world, though not for production systems. For production systems, you want engineers to think about them and build them the right way. But for testing ideas and prototyping, you maybe don't need an engineer to spend a week doing something; you maybe can take a nontechnical person and give them four hours. The trick there, if you're building those types of systems, for us was, again, not my idea, but also a really great idea: we built ways for any nontechnical person to push a prototype that they built and deploy it behind our own authentication system, such that anyone with one of our user accounts or credentials could go log in and click on that application. And that button-push deploy, for a nontechnical person to push a prototype in, like, a minute, is pretty crazy. That's very, very useful. And then the other thing that we're still trying to get right is enabling nontechnical folks, in the same way that our forward deployed engineers are able to hit any of our agent APIs and use any of the tools with any combination of agents, to do that same thing. Doing that for nontechnical folks, specifically product people, is kind of a game changer. And that's now possible in ways that none of this was possible probably even three months ago, I would say.

Hayley

At Vannevar, you oversee and own product. What, broadly speaking, is the right product strategy when the landscape with agents is changing constantly?

Brett

Yeah.

Hayley

And you have engineers and product people all working on the problem set. But how do you even design that?

Brett

Yeah. I think, and maybe I'm totally wrong, but my current hypothesis on the right product strategy for agents is: number one, mission knowledge. Actually understanding the problem that you're trying to solve and iterating on that problem with users as quickly as possible is still the number one most important thing for any new product development effort. And especially with agents, where your ability to impact different missions is actually pretty broad now, biasing anybody who's working on a zero-to-one product to be hyper-focused on what problem you're solving, and whether you understand that problem better than anybody else, is fundamentally the most important thing. So for product strategy, winning on agents in defense is going to look like having a team of people that understand the user and mission workflows better than other people and are creative at implementing agents to solve those problems and shipping quickly. The second thing for agent strategy, and the first one is probably pretty uncontroversial, the second one maybe is slightly more controversial, though I don't think it should be, I just don't hear people talking about it as much, is tools. I think the foundation models, being able to swap foundation models in and out quickly, is really important, but I think those are gonna become commodity in the same way that cloud compute, like AWS or GCP, is commodity. What's gonna matter is the tools that you're building on top of those foundation models. And there are two types of tools that I think about. There are data tools: actually collecting the right data for whatever problem you're trying to solve, which I think is really important, and writing tools that agents can use to get at that data. And the second type of tool that I think is gonna be important is task tools: telling agents how to write to a database, for example. We built our first tool that wrote to one of our products maybe three or four months ago. I think we're just scratching the surface of what we can do there, but that's a huge step change in how you think about product development and what UI and UX look like. Having the agents be able to actually go take actions in your product is going to be really important. That's all to say, investing in tools for agents is probably the second most important thing, and that comes down to having really, really good data, better data than other people have, and then more creative ways to farm out tasks to agents with those tools. And then the last thing, which I guess is related to the first, is UX. Nobody knows what the right user paradigm is for agents. OpenAI doesn't know. Nobody really knows. The one thing that I think is definitely true is that it's probably not a chat thing. I don't think a chat bar solves all the problems for the government. Most software is written deterministically, kind of like what we were talking about with the bots: you know what the inputs are gonna look like, you know what the outputs are gonna look like, you kind of know all the steps that are gonna happen in between, and then you just write a UI that facilitates that.
That's not how agents work. You don't know what the inputs are gonna be, and you only kind of know what the outputs are gonna be. Sometimes you can know exactly, because you can force the model to do exactly what you want, but it's a non-deterministic system. So the normal software UI constructs don't really apply, I don't think. I think you're gonna see a lot of invention happening there, and it's gonna be pretty hard. It's gonna take a while for people to figure that out.
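As one way to picture the "task tool" idea from the answer above (telling an agent how to write to a database), here is a hedged sketch in which the write action is wrapped so a non-deterministic model can only produce well-formed records. The schema, the in-memory SQLite store, and the tool name are assumptions for illustration.

```python
# Sketch of a write-action "task tool": the tool validates whatever the agent
# hands it before anything touches the database.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assessments (system TEXT, finding TEXT)")

def write_assessment_tool(payload: str) -> str:
    """Tool exposed to the agent: accepts a JSON string, validates it, writes it."""
    record = json.loads(payload)
    if set(record) != {"system", "finding"}:
        return "rejected: expected keys 'system' and 'finding'"
    conn.execute("INSERT INTO assessments VALUES (?, ?)", (record["system"], record["finding"]))
    return "written"

# The agent (mocked here) produces the payload; the tool enforces the contract.
print(write_assessment_tool(json.dumps({"system": "system_07", "finding": "no change"})))
print(conn.execute("SELECT * FROM assessments").fetchall())
```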

Hayley

Yeah. Two questions there. One, just on the last thing you said, what is your vision for that nondeterministic input relating to Vannevar's sensing grid, and how do those interplay with each other?

Brett

Do you mean the physical sensors? Okay. So we built physical RF sensors that we put in places to do things like ship detection: is there a certain ship that's maybe doing something malicious in an area where it shouldn't be? That's a dataset for us. We have these sensors in place all around the world, we're constantly picking up radio frequency data from those sensors, and we're classifying those RF signals: is this a push-to-talk radio? Is this an AIS signal? Is this maybe something more exquisite that might be a military ship, for example? That's just another dataset that you can write an agent tool on top of, and you can imagine doing things, and these are some of the things we prototyped. And again, we haven't solved this, so I don't know if this is gonna work. But you can have the same agent have access to those RF detections, which have maybe a latitude and longitude component, some metadata on the detection, and maybe a timestamp. You can take that data and overlay it with a map dataset that has locations of interest, for example every known foreign military base, naval base or otherwise, around the world. You can overlay that with AIS data if you want. You can overlay that with public reporting on what's going on. So let's say you're interested in sensing around, I'll just use Taiwan as an example. Let's say you're interested in vessels doing things around Taiwan. You can go collect data from people reporting, officially or otherwise, on what they're seeing in the maritime landscape around Taiwan. You can marry up all those datasets, write tools for all of them, give them to an agent, and then you can do things like: hey, I have sensed this RF thing in this location of interest, can you characterize that activity for me? Not just looking at what the RF metadata says, but go look for public reporting, go tell me if there was a commercial ship that had its AIS, whatever, that went to a bad port and then happened to be at this location at the same time. Maybe it's not the same vessel that we detected, but maybe it was co-traveling with this vessel. Can that tell me something about what's going on? You can layer in all of these. It's the same idea as the perception management use case: instead of one analyst staring at the same problem for forty hours, how do you enable an agent to do, say, ten hours of work on this maritime sensing use case, go look at all the public reporting, go follow all the ship tracks, and distill that down into a five-minute query? That's kind of what we're talking about. And that's now possible. Literally, this was not possible six months ago; it's now possible. If you have the right data and you're providing the tools to the agent in the right way, you can do all this crazy stuff that you couldn't do before.
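A toy sketch of the layered-dataset idea in that answer: one lookup per dataset (locations of interest, AIS tracks, open-source reporting) feeding a characterization of a single RF detection. All of the data and the characterize step are invented placeholders; a real agent would call these as tools and write the narrative itself.

```python
# Sketch of fusing several datasets around one RF detection.
from math import dist

LOCATIONS_OF_INTEREST = {"naval_base_A": (25.0, 121.5)}
AIS_TRACKS = [{"vessel": "M/V Example", "pos": (25.01, 121.52)}]
REPORTING = ["Local reporting notes unusual maritime activity near naval_base_A"]

def nearby_location(pos, radius=0.1):
    return next((name for name, p in LOCATIONS_OF_INTEREST.items() if dist(pos, p) < radius), None)

def co_located_vessels(pos, radius=0.1):
    return [t["vessel"] for t in AIS_TRACKS if dist(pos, t["pos"]) < radius]

def characterize(rf_detection: dict) -> dict:
    # A reasoning agent would call these as tools and reason over the results;
    # here we just assemble the same layers deterministically.
    loc = nearby_location(rf_detection["pos"])
    return {
        "detection": rf_detection,
        "location_of_interest": loc,
        "co_located_ais": co_located_vessels(rf_detection["pos"]),
        "open_source": [r for r in REPORTING if loc and loc in r],
    }

print(characterize({"signal": "push-to-talk", "pos": (25.005, 121.51), "time": "2025-01-01T00:00Z"}))
```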

Hayley

More follow-up on that. I wanna hit one other thing you said in your three-part answer, which is model agnosticism, and why that's important.

Brett

Yeah. The main reason is the models change every four weeks, pretty much. That's been our experience for the last six months: there's been something pretty crazy that's happened with the foundation models that has caused us to have to rewrite or rethink how we're using these agents. What tools are we giving them? How are we telling them which tools to use? There's a whole bunch of different problems you come up against. But the trick is that the models are changing so quickly, and it's not like one model provider is winning right now; they're all pretty close. So being fast on, okay, what is state of the art for the tasks we're working on, what's different about the current foundation model I'm using versus previous ones, and how do we take advantage of that as fast as possible, is not the whole game, but it's a decent amount of the game. To use a recent example, GPT-5 came out about two weeks ago. The main difference for enterprises, or for people like us building government software, is that GPT-5 is super good at parallel tool calls. If you tell it, hey, go find me, using the RF ship thing, mentions of what's happening around Taiwan, then instead of running one query, waiting for the answer, modifying the query, and running another query, it'll decide, alright, I'm gonna run five queries simultaneously, I'm gonna hit these sets of tools while I'm doing it. It'll kick off five queries, get all that data back, and then decide to use another tool. It's just way more sophisticated at tool calling, and it runs these massive parallel queries. That's great, but then you have to think about how you're setting up the tools to enable the agent to call the right tool, and if it's running parallel queries, how do you make sure it's running them in a smart way that doesn't destroy the context window, for example, which is a common problem for some of these reasoning agents. So if you're able to stay on top of it, swap models, understand what the benefits and pros and cons are for each model quickly, and then take full advantage of them, every month you're able to unlock a little bit more of what you can do and cover use-case-wise. For example, that maritime use case I described requires calling a whole bunch of different tools, and it requires doing them in parallel in some cases. So that workflow I described was not possible three weeks ago, but it was possible two weeks ago, you know?
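To illustrate the parallel-tool-call point, here is a small sketch, under the assumption that several queries fire at once and each result gets compressed before it goes back into the model's context so the fan-out doesn't blow up the context window. The search_tool, the compress step, and the character budget are all hypothetical.

```python
# Sketch of parallel tool calls with per-result compression before the results
# re-enter the agent's context.
from concurrent.futures import ThreadPoolExecutor

MAX_CHARS_PER_RESULT = 400     # illustrative budget, not a real model limit

def search_tool(query: str) -> str:
    # Placeholder data tool; imagine a long document dump per query.
    return f"results for {query!r}: " + "x" * 5000

def compress(result: str) -> str:
    # Real version: a cheap summarization pass; here we just truncate.
    return result[:MAX_CHARS_PER_RESULT]

queries = [
    "Taiwan strait vessel reporting",
    "AIS gaps near naval_base_A",
    "port calls flagged last 30 days",
]
with ThreadPoolExecutor() as pool:
    tool_results = [compress(r) for r in pool.map(search_tool, queries)]

context_payload = "\n\n".join(tool_results)     # what would be handed back to the model
print(len(context_payload), "characters returned to the agent")
```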

Hayley

Yeah. Wild.

Brett

Yeah. You just have to stay on top of it. That's part of the name of the game right now. Yeah.

Hayley

Three weeks ago, sometimes it's not possible; two weeks ago, it was. Things are gonna change so much in the next few weeks, months, and then I can't even fathom, years from now, the step changes that will occur. What value do you think Vannevar will see out of this technology in the near term, and then taking a long-term view?

Brett

Yeah. I think the near-term focus for us is: how do we take our current products and just make them way better? Because what agents can do touches a lot of the things we've been focused on for the last six years. For example, data collection: how do you use agents to be smarter about how you collect data? Obviously anything data-processing related, and then UI and workflow: how do you enable a specific UI or workflow more effectively? Just taking agents and trying to fit them into existing products is not the move. It's more about how you fundamentally rebuild these current products with this core technology in mind, to make them not 10% better but three times better. You know what I mean? And that's super hard. Just doing that is really hard and complicated, but short term, we should figure that out. Also in the short term, but with more of a long-term benefit to us, is experimenting with these new workflows, for example perception management or planning, some of these things that are fundamentally different from how we built some of our core products. That's super important, because my theory is that agents are providing an opportunity to go after way more mission use cases and own pretty significant chunks of the mission set for DOD in ways that were not previously possible. So trying to bet on these little mission areas and trying to figure out, before other people do, how to best use these, what data matters, what tools to use, what's the right UI and UX. That's the short term, leading into the long term of just trying to own those problem sets.

Hayley

Yeah. I think a lot of the focus around agents and AI in general has been around startups and disruption and, like you said, moving very quickly, within a two or three week time frame, to kind of meet the moment. But what do you think the proliferation of agents means for more traditional, more services-based defense primes?

Brett

Yeah. There are a bunch of companies, two examples are Booz Allen and CACI, that basically take analysts and sell them by the hour to the government, I don't know if that's a fair way of describing it or not. These are billions of dollars of contracts in DOD. Think of them as consulting or services shops: I'm gonna sell the government 100 analysts, they're gonna work in government spaces, they're gonna do it for a year, and we're gonna sell that program for $100,000,000. And it's no technology; it's purely people sitting at desks doing the work. I think there are probably huge chunks of those contracts, and entire contracts in certain areas, that are not going to look the same. One of the theses we have as a company is: can we take 30 analysts and build workflow tooling for them to make them more productive, using agents to do what I'm describing, take forty hours of work and shrink it into, whatever, half a day, something like that? And can we use those 30 analysts to then go and win these contracts, not selling at $100,000,000 but selling at, say, $50,000,000? So the government pays less money but gets better mission output, by verticalizing, that's what people call it, verticalizing that problem, with us actually employing the analysts and building the technology that powers those analysts. If that model works, and I don't know if it's going to work, there's some precedent of it working in, for example, legal and other spaces, but if it works, I think you're going to see these services-focused primes vastly decrease in size, because they usually can't build technology.

Hayley

Software specifically.

Brett

I think some of the primes, yes. Some of the primes build good hardware, and some of them don't build good hardware either. But yeah, I think they're gonna have a really tough time if that model works. That model might not work, but I would like to find out if it works, you know? Yeah.

Hayley

Yeah. Absolutely. What do you think the time horizon on something like that is? Again, like, with how quickly things are evolving, is that months, years?

Brett

I think in five years, if that model actually works, most of the services contracts within the government are not gonna look the same. I think you're gonna have new companies that are considered the primes for those types of contracts. And I don't think they're gonna be seen as services contracts; I think they're gonna be seen as tech-enabled services, more focused on the outcome versus the input. Outcome meaning achieving certain mission objectives, versus the input meaning, I'm gonna give you 100 analysts. The way those contracts are structured oftentimes leads to incentivized underperformance, I guess; I don't really know how to describe it. So I think in five years you're gonna see that. In like six to twelve months, you're gonna get pretty good data on whether that five-year reality is gonna happen, because there are gonna be companies like us that are just gonna start trying to do these things, and I think you'll see pretty quick whether this model is gonna work or not. At the same time, the foundation models are getting better, and if you're able to figure out the workflow and the UI and UX, the odds of this reality happening, I think, are higher. But it's too early to really say how it's gonna play out.

Hayley

There's no shortage of companies and startups that are working on AI broadly and then agents specifically. How would you categorize the different approaches?

Brett

Yeah. For the broader agent setup, you have companies that are foundation-model focused. These are the OpenAIs, and, you know, Facebook, sorry, Meta, is dumping a bunch of money into training foundation models.

Hayley

Talk about a pivot. Yeah.

Brett

Yeah. So that's all really important work. It also costs billions of dollars, and those companies need to succeed; it's really important that they succeed. So you have companies that are focused on that. Then you have companies that are focused on the software layer: how do I build software tools that use agents to make people more effective at certain tasks, for example website design, like Lovable and Vercel? Some of these companies are in the vein of just making people more productive at a specific task. And then you have vertical companies, and we're probably gonna be a combination of software plus vertical, that ask: how do I not just provide software, but end-to-end solve the complete problem with agents as a core part of that workflow? So using the services contract example, instead of 100 analysts that have no tools to do their job, can we have 30 analysts that are Vannevar-hired and trained but are supercharged with technology that makes them faster, more effective, and geared toward the workflows they're doing? That, I think, is gonna be a class of companies. So yeah, you've got foundation model companies, you have software companies that are usually pretty broad in the use cases they're trying to cover, and then you have vertical companies that are focusing on single problems and trying to go super, super deep. I think you're probably gonna see big companies in all three. I think all are valid approaches.

Hayley

I would love to talk about the mission set, specifically as it pertains to national security. What are some examples, even notional, where you think AI agents could meaningfully support mission operators?

Brett

This is what's gonna be figured out over the next few years, I would say. My guess is that intel analysis, which is kind of everything I'm talking about, I think that 100% is going to be pretty agent-forward in terms of what you can do to make analyst jobs easier. That's an easy one. On the operations side, operational planning, I think, is going to be a thing agents are gonna help a lot with. You're not gonna be able to automate the whole process, but I think you're going to enable US military planners to make better plans by using agents to surface and test concepts that they maybe otherwise wouldn't have the ability to test. So I think operational planning is going to be a thing. Operational assessment too: we did a thing, did it work, did it achieve, for example, the perception objectives, taking that perception management Russia use case, that we wanted in that operation? There's a lot more, but I think those are some of the ones we're thinking about working on in the near term.

Hayley

What is actually available to users right now?

Brett

For all the stuff I talked about, these are things we've built. We have, for example, models like GPT-5 with tools that are custom-built for different workflows, and UI and UX attached to those workflows. The development work for us over the last three or four months has really been testing and experimenting with a whole bunch of different ideas and then trying to productize the ones that are good ideas. I think that's what we're going to continue to do over the next twelve months, and my guess is that's where a lot of people are at. The thing that matters right now is how fast you are at experimentation and trying things. That's how we're focused, at least, and probably other people are too.

Hayley

We talked a little bit about model agnosticism. How do you think about model selection when it comes to particular mission sets?

Brett

Yeah. So this goes back to the model evaluation concept I was talking about, which people haven't figured out yet. In a perfect world, and this world does not exist, but if there were such a world, you would have a menu of tools, again, think datasets or actions that the agent can take. You'd have a big menu of models, call it 10 tools and 10 models, and you'd have a whole bunch of test use cases that accurately cover all the different variations of things you want for that workflow. And you would be running experiments where you just test every relevant permutation of those models with those tools against the test use cases, judge the results, and use that to select the right combination of model plus tools. The winning combination, basically.
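Here is a minimal sketch of that idealized evaluation loop: every relevant model/tool combination runs against a small test set and gets scored. The models, toolsets, test cases, and the toy scoring rule are placeholders; in practice, as Brett notes below, the judging is the hard, largely manual part.

```python
# Sketch of sweeping model x toolset permutations over a fixed test set.
from itertools import product

MODELS = ["model_a", "model_b"]
TOOLSETS = {"search_only": ["search"], "search_plus_write": ["search", "write"]}
TEST_CASES = [
    {"task": "summarize reporting on system_07", "expected_keyword": "system_07"},
    {"task": "flag sentiment change since 2024", "expected_keyword": "2024"},
]

def run_workflow(model: str, tools: list[str], task: str) -> str:
    # Placeholder for actually running the agent with this model + toolset.
    return f"{model} with {tools} handled: {task}"

def score(output: str, case: dict) -> int:
    # Toy judge: real evaluation would use human review or a grader model.
    return int(case["expected_keyword"] in output)

results = {}
for model, (ts_name, tools) in product(MODELS, TOOLSETS.items()):
    results[(model, ts_name)] = sum(score(run_workflow(model, tools, c["task"]), c) for c in TEST_CASES)

best = max(results, key=results.get)
print("scores:", results, "| best combination:", best)
```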

Hayley

Yeah.

Brett

The problem, in my opinion, comes in defining the test and evaluation cases, because you're usually trying to achieve something fairly broad and, again, non-deterministic in nature. I don't necessarily know what the user input is gonna be. So if you're defining, say, ten, twenty, 30 examples, it's not sufficient. In some cases it is, if the difference in model performance is very large, but it's not really sufficient for these close optimization decisions about what the right answer is, because again, you're trying to cover something so non-deterministic and broad. So our approach is to try to do that the best we can, and then have a human expert spend an enormous amount of time going in and testing different permutations of models, and then providing that to the engineering team to make a judgment call. I would say none of that has been figured out; we haven't figured it out yet. I think it's sort of an unsolved problem. That's how you would ideally do it: the perfect scenario, you've got all your test cases, you have all your permutations, you try a bunch of things and check that the output is the right thing. In practice, I think it's super manual and expert-validated right now. That's our experience, at least.

Hayley

You mentioned one or two. Are there any other unsolved problems right now with agent technology and reasoning agents?

Brett

Yeah. So model evaluation and UI/UX are the two core ones, and then the hallucination piece that you mentioned is the third. It's tied to both of those pieces, model evaluation and UI/UX, but how do you construct a workspace so that even if you know you're getting to an answer, an 80% answer, really quickly, and you know there's 20% that needs to be changed, you provide that information to users in a way that lets them quickly understand what they need to modify to get to the 100% answer? In some cases you may not need the 100% answer, but that's the unsolved UI/UX problem. I think those are the core three that we, at least, are wrestling with when we're implementing these things.

Hayley

As this new technology develops and as Vannevar figures out, you know, on a weekly or monthly basis, new ways of deploying it, obviously our team is growing. But among the engineers and researchers out there, basically no one is agent-native right in this moment, because it's brand new. So how do you think about building out the team to encompass that capability set when it's so new?

Brett

Yeah. I think step one is to start with the current team. The approach we took when we first started working on agents was to start with a super small set of people and just try to prove to ourselves, hey, are these things as big a deal as we think they're gonna be? So we started with two very small test cases. The first test case was actually just taking a dataset we had that we thought was gonna be useful, turning it into a tool, exposing a reasoning agent to that tool, and then just seeing what outputs we got. And when we did that, we saw an order of magnitude improvement on search and summarization, some of the tasks that some of our core products have a lot of. That led us to conviction that this is a thing, and then we took that and built that perception management prototype I was talking about. And that's a special kind of team, you know, to do zero-to-one work where we don't know what we're doing, we just gotta figure it out. But once you have those initial proof points, you use them to galvanize the rest of the company around, hey, these are gonna be a thing. The thing you wanna focus on internally is: there are skeptics of any new technology, that's just a fact of human nature. And what I found for us is that forcing people who are skeptics to build something with an agent is the most effective and fast way to get people

Hayley

Yeah. Like show, don't tell.

Brett

You know, on the train, and realizing, like, oh my god, and then thinking creatively about how to do things, like, oh, I can experiment with whatever I want here. So you're trying to first start internally and get people experimenting and testing things as fast as possible. And then, as you're doing that, you're learning what skill sets you need, what things matter, what things don't matter. For us, at least on the engineering side, there are three pretty different skill sets that matter for agents. One is that infrastructure team. I have a really important team of people building all the tooling for all the engineers and non-engineers to actually use these things quickly and effectively: swap out models, write their own tools, deploy these things and applications super fast. That's a certain special kind of engineer. Then you have forward deployed engineers who just wanna hack things. They don't wanna work on infrastructure, they don't wanna work on stability for products, they just wanna hack prototypes together. They are really important for agents, because we don't actually know what the right answer is, so you need people hacking things very quickly to get to that answer. And then you have product engineers: once you have something you have conviction on, you wanna double down and scale it into an actual product that works. So you have those three subsets of people, and you need to hire for all of them. They're all unique and important. And then separately, you need product people that are also kinda weird and down to vibe code and test things and do things that product managers are typically not asked to do at a startup. So I think you need a combination of all of those people here.
