Study Says AI Makes Developers Slower? F1 Movie Review, Coding and Testing for AI, Free Perplexity and Free Ideas

Matthew Carey (00:00)
Yo, how you doing?

Wilhelm Klopp (00:02)
Good morning, good morning.

Matthew Carey (00:03)
It's afternoon for me. Mate, how's my audio?

Wilhelm Klopp (00:05)
You sound very crisp and actually you look very good as well. You get studio lighting as well on top of your upgraded microphone.

Matthew Carey (00:13)
No, I just downloaded the Mac app and now I can record in 4K and I still look around but now I can record in 4K.

Wilhelm Klopp (00:21)
no way. Yeah, I see that we've become a video podcast now. You've been putting up some clips in some places, which is exciting.

Matthew Carey (00:27)
Yeah, we probably need to centralise it. I just repurposed an old YouTube channel I had from some Chris Froome clips from six years ago that were on there. But it already had a couple of subscribers, so I was like, we'll just use this one and then we can have a fake subscriber count to start with.

Wilhelm Klopp (00:33)
How cool.

Nice, nice, nice, nice.

Let's reuse that audience, baby. That's exciting.

Matthew Carey (00:47)
Yeah, exactly. I'm sure there's

a massive crossover between tech bros and cyclists. The videos that were on there were like turbo trainer bike videos. So you sit on the turbo trainer and you could watch someone ride the Tour de France. Yeah.

Wilhelm Klopp (00:59)
No

way! man, did not realize you did that. That actually brings me to something else I was going to ask you, but before I say that, I think we should have a Twitter account for sure. That would be cool. Then we can tag it. Exactly, yeah, yeah.

Matthew Carey (01:09)
Yeah, it'd be nice to tag the pod, yeah. Yeah, that's, and then

we can, then you can put it in your bio, mate. You can fill your bio with even more stuff. I have so much stuff.

Wilhelm Klopp (01:15)
Exactly.

The bio is an important

place. Do you remember there was this project years and years ago? I think Twitter shut it down. Maybe Elon shut it down. It was like a, it would track like changes in people's Twitter bios.

and it would send you like a weekly digest of every change that's happened to like the Twitter bio of someone you followed. It was fascinating because Twitter bios are really, like, our identity in a way, right? So if someone changes where they live or where they work or like how they talk about what they do, it's like kind of meaningful. It was great.

Matthew Carey (01:46)
That's quite a lot. Like, you are really stalking at that point.

Wilhelm Klopp (01:50)
It was amazing. It's all in public,

Matthew Carey (01:52)
That's like, yeah, I don't know. I don't know if I

like that. Yeah, I guess, I guess like there's some really fun stuff like that. The one that I really enjoyed was tracking Elon's private jets. Did you see that one? There was this guy, a computer science student, who just pulled all the data from Elon's jets every like 20 minutes and then just tracked them on a map of the world. I thought that was so funny. He really didn't like that.

Wilhelm Klopp (02:02)
Mmm, mm-hmm, mm-hmm, mm-hmm. Yeah, yeah, yeah.

Dude, that was very good.

I actually remember, was looking at something similar last summer. It was like a little Garmin app so that you could track when you're cycling in Richmond Park, like which plane is the one you're looking at? Like where is it coming from? What's the airline? What's the type? I should revive that.

Matthew Carey (02:29)
Dude, you should do that. I feel like a lot of people would use that. also, okay, because I'm, okay, so I met someone previously, friend of, actually, ace girlfriend of mine, at a party, and he could tell you whatever plane was going overhead, he literally could tell you where it was going.

Wilhelm Klopp (02:35)
Mm.

Just from the, ⁓ wait.

Matthew Carey (02:47)
from like looking at it,

like the, yeah, he knew like all the flight maps, because he lived near Heathrow, so he literally could tell you where it was going. And it was the roguish skill, but it was so impressive. And okay, and then I was in a car with someone quite recently. No, he did it, he did it. Dude, he did it, because we were in Chiswick, and the planes come over quite a lot.

Wilhelm Klopp (02:51)
No, come on.

Uh-huh.

That's almost hard to believe because there's a lot of, there's like a flight every 90 seconds.

Right. Yeah. ⁓

Matthew Carey (03:12)
And he could tell

you, he could tell you wherever it was going, where it had come from, like how many people were on it, as in like the capacity side, like all about the actual plane. Like he could recognise the difference between like the different Airbuses and Boeings and stuff. It was actually just wild. And I was there like drinking Aperol, being like, is he tripping? Am I tripping? What's going on? Like, who is the one that's out of the ordinary here? I feel like I'm kind of out of the ordinary for not being able to do this now. But I'm really glad you brought me back down to earth.

Wilhelm Klopp (03:21)
Hahaha

Hahaha!

That's

Matthew Carey (03:41)
No, I was in a car with someone

Wilhelm Klopp (03:41)
One day.

Matthew Carey (03:41)
recently and he was saying that he would really like an app that as you're going past like things, you could just take a picture of them and it would tell you what it was. And I did this recently with Perplexity and this like 5G antenna that they keep on popping up around West Yorkshire where my mum lives. I just took a picture of one of them and was like, what is this? Cause they look kind of futuristic, they're these like white spires. And it was like, it's probably a 5G antenna. And I was like, I've never seen one of these before.

Wilhelm Klopp (03:54)
Mm.

Matthew Carey (04:07)
I feel like there's a lot of stuff where you go past and you're like, what is this? And you kind of want like enhanced, yeah, like you kind of want like enhanced Google search, but by image. And I know like Android fanboys have had this for ages. They can literally just take a picture and it Googles the picture. Like reverse image searches the picture. But I kind of want something a bit smarter than that, like some like LLM to like extract some data about it and then search it for me, like more of a, I don't know, legit thing.

Wilhelm Klopp (04:11)
Yeah, I remember seeing one of those pop up in film.

Mm-hmm.

Mmm.

Yeah, yeah,

Check like the OpenStreetMap or like the Ordnance Survey data or whatever.

Matthew Carey (04:37)
Yeah. You

can do it with Perplexity. Like I have Perplexity search, I'm on Pro, and you can just take a picture of stuff and it works out what it is. Like it works out where you are. It's, like, scary good.

Wilhelm Klopp (04:42)
Okay.

Interesting. Okay, yeah, actually let's talk about Perplexity for a second, because I also... I think it's a purchase that I'm not sure I'm still paying for, but at some point I was trying it out because it feels like the CEO is such a great storyteller and is able to hype it up so well. And they launched like a browser now or something, right? I haven't tried that out yet. But otherwise, Perplexity seems like it's kind of a well-known thing in the tech world, but it's never like the top of the App Store charts or it's

It hasn't reached worldwide mass adoption or whatever. But what do you find it most useful for? Because I was chatting to a friend here, actually a friend from secondary school who moved to SF, which is very exciting for both of us. He says it's really great at finding sources, good sources for the information it gives you. So he uses it at work a lot for research that really he needs to back up.

Matthew Carey (05:41)
cool. Yeah, I guess there's a task that sits somewhere in between like deep research and Google and I use it for that task. So it's like not I want to know a lot about a topic and it's not like I need to find something that I know what it is. It's like somewhere in the middle. But to caveat this, I pretty much only use Perplexity because I think I got a year for free.

Wilhelm Klopp (05:53)
Right, that's so smart

Mm-hmm. ⁓

Matthew Carey (06:06)
and I have no idea when that's ending and I probably need to check. The browser though — we should actually do a, we should not do a plug, we should do an ask the audience. If anyone has any — I think it's called Perplexity Comet — if anyone has any browser invites, please DM me. I would love a browser invite. I'd love to try a new browser.

Wilhelm Klopp (06:15)
Mmm.

of course,

it's like invite only.

Someone was saying that, you know how we've had this massive wave of like VS Code forks and that's worked like really well for a lot of the people doing the forking, and someone was saying we'll have the same now for Chromium forks. We'll have just like tons and tons of people, because the browser, just like the IDE, makes a lot of sense — it's kind of an established entry point that you can then build like a very rich AI experience around.

Matthew Carey (06:46)
Yeah, yeah, I see that. I think the code base of Chromium is probably more hectic than the code base of VS Code, maybe. So I don't know, I reckon it's more work to put your own spin on the Chromium browser than it is to release a VS Code fork with a plugin pre-installed.

Wilhelm Klopp (06:57)
I'll spoil it.

Interesting. Yeah, I can buy that. What were you about to say?

Matthew Carey (07:13)
But

yeah, was gonna say that the browser company, you know, they pivoted from their first browser to their second browser. Which I just find hysterical. I never got onto the Arc train. I know people who loved it, but I never got onto it because, yeah, my laptop just wasn't good enough, I don't think, my last one. And so it would just get crazy hot with a bunch of tabs open. And yeah, I don't know. I never really got into it.

Wilhelm Klopp (07:19)
Mm-hmm. Mm-hmm, mm-hmm, mm-hmm.

Same actually, yeah.

Matthew Carey (07:38)
And I had a vertical monitor at the time, which I'd optimized because it was vertical. And so I kind of wanted tabs at the top rather than tabs on the side, because tabs on the side was really annoying because then I had no screen. Yeah.

Wilhelm Klopp (07:52)
Oh, then you had like 10, yeah, yeah, gotcha. That's cool. I didn't know

you had a vertical monitor. That's some elite shit. I tried that at some point, but it didn't last super long.

Matthew Carey (08:02)
I had it at uni because I was coding like one file Python scripts basically. And so the vertical monitor was pretty useful. Like if you have one like thousand line file and that's all you're working on and you haven't worked out that you could split your functions up, then the vertical thing suddenly becomes quite useful.

Wilhelm Klopp (08:10)
Nice.

Mm-hmm.

That's amazing. Yeah, I remember that. To be honest, like, having multiple files that you're dealing with, it's annoying. You know, like, I can see why past us was like, yeah, no, a single file is what a code base should look like.

Matthew Carey (08:35)
I quite like it, like the logical separation. I'm a big fan. And then each file is like independently tested. I like it, if I'm honest.

Wilhelm Klopp (08:43)
But then you have to import

them into each other and that gets messy. Obviously, I'm also a fan of multiple files.

Matthew Carey (08:47)
No, but if your functions stay really small,

like I don't want to do the whole like, can we say his name? Can we say Uncle Bob's name? I don't want to do the whole thing like that because like the whole clean code thing where you have to have like three line functions. But if you do that, you can't do that many imports because you don't have enough real estate. No, but you had that at GitHub, right? Three line functions.

Wilhelm Klopp (08:58)
Mmm.

on

Ha ha ha ha ha.

The GitHub code

base was like that, yeah, Well, one line functions, two line functions, three lines was a bit sus actually. I mean, there were a couple of 10 line functions as well. Yeah, yeah, yeah, no, was a, it has a lot of...

Matthew Carey (09:19)
What was that?

You got called out for that.

Wilhelm Klopp (09:25)
it was sometimes like if you wanted to like see like follow like a call chain like all the way through, sometimes you would have to like jump through like eight one or two line functions to get to like the place where you actually want to be across eight different files in completely different parts of the code base, which.

Matthew Carey (09:37)
Ugh, see that, that is a bit grim.

Yeah, that is a bit grim.

Wilhelm Klopp (09:43)
was all,

the nice thing was that they were all named really well. So like, that makes it a bit easier to understand. But yeah, it's a lot of indirection, and especially when you're jumping across like completely different files that are in completely different places, it's a bit mad.

Matthew Carey (09:54)
Yeah.

That's one thing I really like about having test runners integrated into the ecosystem of the language. So one thing that Zig does really well is it's pretty standard to have a file in Zig that is a function and then a test underneath, testing the above function. So you co-locate your tests in the same file. And that's like pretty standard.

Wilhelm Klopp (10:06)
Mm.

⁓ cool.

Interesting.

Matthew Carey (10:22)
And

they can do that because it's compiled, right? So you just don't include the tests in the final bundle — not even bundle, the final code, source code. Yeah, so they can do that. But I've been starting to do that a little bit with Bun, because you can do that with Bun, which is kind of jokes because Bun is written in Zig. And I feel like it takes a lot of the language ideas from Zig, even though it's JavaScript.

Wilhelm Klopp (10:27)
Mm-hmm.

and the artifact.

That doesn't make sense, so.

Matthew Carey (10:46)
Like you can do bun test and then just test an ordinary file. And if that file has got tests in it, they'll run. And I think that's cool because you can co-locate, and then LLMs love the co-location as well.
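For reference, here's a minimal sketch of the co-location pattern Matthew describes, using Bun's built-in test runner. The file and function names are illustrative, and exactly how bun test discovers tests in ordinary source files may vary by Bun version:

```ts
// math.ts — implementation and its tests co-located in one file (illustrative names)
import { test, expect } from "bun:test";

export function add(a: number, b: number): number {
  return a + b;
}

// These run under `bun test` (e.g. `bun test math.ts`); exactly which files the
// runner picks up can vary by Bun version and naming convention.
test("add sums two numbers", () => {
  expect(add(2, 3)).toBe(5);
});
```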

Wilhelm Klopp (10:54)
That's so interesting.

Yeah, yeah, I bet. Is it like, um, you have one kind of very important, useful-for-readability test co-located, and you have more tests separately? Because often you can have like a hundred tests for one function, right? And most of them are like random edge cases that are kind of irrelevant to the day-to-day work of someone who'd be looking at the...

Matthew Carey (11:10)
In Zig you would have them in the same file. In Zig it would all be in the same file.

Yeah,

yeah, this is true. I feel like we come from, I feel like if you're unit testing though, and you are keeping stuff very functional and small, your tests don't have to be that large, right? Like they're like a couple of inputs tested against a couple of outputs maximum. It should be like a one-to-one match really. And then they don't end up being that big. I don't know.

Wilhelm Klopp (11:38)
I see, see, yeah, interesting.

Mm-hmm.

Matthew Carey (11:47)
That design — I think co-location, maybe not of tests, but definitely of similar functions. Similar functions should be in similar areas of the code base. So I have this really annoying thing where if you put services separate to controllers, I hate it. I just put them all in the same place because...

Wilhelm Klopp (11:57)
Mm-hmm.

Right. So you hate

all Rails apps.

Matthew Carey (12:08)
Yeah, I was gonna say this. This is a Rails... This is a Rails thing, isn't it? They love that type of thing. Speaking of Rails, did you hear DHH on Lex Fridman? I wasn't gonna watch the whole thing because it's like six hours, but I listened to the first little bit.

Wilhelm Klopp (12:11)
You

I have not been able

to listen to Lex Fridman anymore. There was I think some critique from some famous blogger that was like, I have read the transcript of a six-hour Lex Fridman podcast and I learned nothing new. He just asks like extremely shallow softball questions and if you are at all...

Matthew Carey (12:36)
They are the worst questions. They're the worst questions. It's like, so

how do you feel about this programming language? ⁓ I love Ruby. Ruby's great. It's the best thing ever. It's like great, obviously. Nice.

Wilhelm Klopp (12:43)
Yeah, yeah, yeah, yeah. So

I feel like I've been, I don't know, I've been turned off Lex Fridman by the... I think they are sometimes interesting to listen to and he gets amazing guests on obviously. But no, long story short, haven't listened to this one.

Matthew Carey (13:00)
Yeah, I think he is incredibly boring and dull. No offense to him. He's made an extremely successful podcast. Not entirely sure how. I normally skip his questions and just listen to the responses because DHH is a great speaker, right? Like he is a stunning speaker and it's obvious like how he's been so successful and influential in software engineering. Like Rails is cool, right?

Wilhelm Klopp (13:14)
Mm.

Yeah, yeah,

Matthew Carey (13:30)
It's done a huge thing. You know, StackOne's a Rails shop. You know we're a Rails shop.

Wilhelm Klopp (13:29)
Yeah, and it's amazing and I've... 100%, yeah. And yeah, in just so many ways.

I did not know that, no, I thought you were a TypeScript shop.

Matthew Carey (13:39)
Well, we

started as a Rails shop. We've been slowly stripping out all the Ruby. But I saw a really funny thread on Twitter, which was like, oh, why do TypeScript devs complain that auth is so hard? Just use Rails, just use Devise. Devise is pretty old and antiquated,

Wilhelm Klopp (13:43)
Yeah.

You

Mm. Mm-hmm.

Matthew Carey (14:04)
needs a lot of work.

And so it's funny, it does just work, but it also just works with like a million security vulnerabilities. So I don't think, yeah, that's a good enough reason.

Wilhelm Klopp (14:11)
Yeah, yeah, yeah. It's really interesting. I

feel like it's very easy to forget in this amazing world of software that there's a lot of different ways to...

do things and a lot of different needs that different people have. And like if one thing works really well for you, that doesn't mean it's going to work amazingly well for everything else. And also I think it's true actually for like everyone getting kind of different mileage out of like the AI tools. Like software engineering is a field. There's actually a lot of different tasks that we do like day to day. And I think it can feel like I used to kind of have this just intuitive belief that like, yeah, you could pick up programming and like

a week or like a month or I don't know certainly two months right because it's like in your head it all feels quite straightforward but then you realize

as you maybe see someone who's like super new, you realize like, wow, yeah, I've encountered this issue that they're seeing for the first time, like a million times before. And like, there's actually so much varied work between like reading through a big code base to like writing tests to like knowing what like random HTTP headers mean to like, I don't know, some DNS thing. Like there is just like so much actually in this field and it's easy to...

it's easy to forget both how much I guess breadth there is, but also like how varied the day-to-day tasks can be. And I think especially that last point is,

Matthew Carey (15:24)
I actually

have such a good example of this. ⁓ I was setting up a new React code base and I was just like, I'll just use the, like an SPA, I'll just use the React Vite template. Like you're probably meant to. This feels like the most stable thing in React Vite, right? Let's go. I pasted the command in like npm i, all good.

Wilhelm Klopp (15:29)
Mm.

Mm-hmm.

Matthew Carey (15:51)
pnpm dev, immediate crash, crypto is not a function or something. And I was just like, oh my God. I was like, oh, maybe it's my version of Node. So I installed the latest version of Node. I was like, no, the template is just broken. It is literally just broken. It wasn't me. So I just — and obviously I worked out why it was broken pretty quickly... oh, I didn't work out why, but I just downgraded Vite.

Wilhelm Klopp (16:08)
Right.

Matthew Carey (16:19)
to version six or something rather than version seven. I just did a full major version downgrade and it fixed it. And I was like, sick. I'm probably missing loads of features now, but it works for me. But imagine you were like brand new to frontend dev and you just install the template, run the template, and the template doesn't even run. Like, seriously?

Wilhelm Klopp (16:30)
Yeah, that's brutal.

Brutal, absolutely brutal.

I mean, man, I was trying to figure out some, I don't know, how to do some React thing inside Cloudflare Pages, Cloudflare Workers in January when we had this little hackathon at my house. And I was just like, let me try Next. No, you don't want to be using Next. You want to be using... I don't know. It was just a bunch of people who really knew, including you, what to use with whatever else. It took two hours and I was just working my way through different iterations of

framework combinations like, no, you want to be using Astro, whatever that is for this use case. no, you want to be using like, okay, go on, lay it on me. What's the hottest?

Matthew Carey (17:12)
Dude, Astro is sick actually. I am sorry, I just

touched my mic. Astro is sick. I migrated my blog from Next.js on SST to Astro on Cloudflare. And I haven't got any performance numbers at all, but it feels snappier. I don't get an email every month now from AWS saying they're charging me like 50p. And...

Wilhelm Klopp (17:25)

You

Matthew Carey (17:37)
I don't know, it just feels really good. And also, it got rid of so much boilerplate because Astro handles all of the markdown to static site generation for you. And you can do server rendering of some interactive stuff if you want to. So it's really cool. Like I was a big fan. It's kind of weird that it's not TSX files, it's like Astro files, but to me, it looks like React. I don't know if it is React. I don't know what the difference is. I mean, Claude Code helped me out a lot and then I had to read the docs to do the final setup.

Wilhelm Klopp (17:59)
Mm-hmm.

Matthew Carey (18:06)
but it was super nice. It was a very nice experience. All my Tailwind, everything just worked out of the box. Yeah, it was cool.

Wilhelm Klopp (18:11)
Yeah, that's really interesting.

Dude, I had an idea. We talked for a while about having like a free ideas segment on this show because there is so much to build at the moment and most of it we're not going to have the time for or whatever. We have our own priorities. But one thing I was thinking about is it would be cool if there was kind of like a lovable but for hackers or like for people like us. So basically the idea is like something like lovable or like Claude Artifacts.

bit like too abstracted away from us, right? Like really, we want to have like a bit more control and also probably we want to deploy it like on our own site or on our own like domain and probably also have a bit of a say over like the tech choices so that we can like make our own tweaks like without using AI. So how cool would it be like if there was some service, I don't know, you could just deploy on your own like sub domain or something and then you register all of your API keys, like all of your favorite stack and then you have the same kind of prompt

interface. It's just like, hey, dear Lovable, Claude Artifacts, make me this little app. But it deploys on your own domain using all your favorite tech choices. It already has OAuth providers configured. It has all of the right API keys configured, or third party APIs you want to access, for your personal Google Drive or something like that. And then

And then, yeah, obviously you own all the like, it ejects and you, or it doesn't even need to eject. It's just like a code base. It's just backed by like a GitHub repo where everything lives. What do think?

Matthew Carey (19:39)
I mean, that sounds like a great idea.

Wilhelm Klopp (19:40)
Back to the show.

Matthew Carey (19:43)
We bleeped that bit out,

Wilhelm Klopp (19:44)
Whoa. Yeah, you should actually bleep it all out. It'll make it even more. ⁓ Just bleep. That's wild. No, I, someone should build it because for, I think our, even between the two of us, the tech choices would be quite different. I'll give you an example. I wanted to build, I was quite interested in,

Matthew Carey (19:49)
Should I just bleep, bleep? Yeah, no, I told Will something very secret. But that is a great idea. Someone should build that.

Wilhelm Klopp (20:06)
Like, what are the hardest problems in, like, SWE-bench Verified that no one has been able to solve. And I tried it in both Claude Artifacts and in Lovable, and Lovable did a much better job. And it's the perfect kind of micro project, right? Like a little microsite. It's just like, I want an online little explorer thing that loads in the data from GitHub, because all the traces, or all the problems that are solved, are kind of provided by the different, like...

I don't know, like by all the different people who attempt the benchmark. It's all in the GitHub repo. And I just wanted to have a little browser, like filter, search, sort, whatever. Lovable did a better job than Claude Artifacts. Claude Artifacts just, like, didn't work at all. But then even in Lovable, it was just super janky and like barely really worked. So, like, I don't know, I'm just not ever gonna look at that again, I think. Whereas if I can decide my own tech that I wanted to use...

Matthew Carey (20:55)
Yeah.

Wilhelm Klopp (20:59)
and like give it my own GitHub API key or whatever, I don't know, like this kind of stuff. Then it just becomes a lot more reliable. Partially actually this was inspired because someone — there was a little Hacker News post the other day that was like, hey, should we just split Hacker News into people who want to talk about AI and LLMs and agents and people who don't. And then he just whipped up a little thing that was like, here is

the Hacker News front page with a bunch of keywords like AI, LLM, agent, filter it out, here you go. And it was on his own domain. So it was just hosted on his own page. And it was just the most beautiful, simple thing. And I was like, that is actually great. That is the right form factor for these things. Yeah, yeah, he just used AI to build it. Yeah.
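The post itself isn't reproduced here, but the shape of that kind of filter is small. A hedged sketch using the public Algolia Hacker News search API — the keyword list and output format are illustrative, not the original author's code:

```ts
// hn-filter.ts — rough sketch: fetch the HN front page and hide AI-flavoured stories.
const KEYWORDS = ["ai", "llm", "agent", "gpt"];

interface Hit {
  title: string;
  url: string | null;
  objectID: string;
}

async function filteredFrontPage(): Promise<Hit[]> {
  // Algolia's HN search API exposes the current front page via the front_page tag.
  const res = await fetch("https://hn.algolia.com/api/v1/search?tags=front_page");
  const data = (await res.json()) as { hits: Hit[] };
  // Word-boundary match so "ai" doesn't accidentally catch words like "maintain".
  return data.hits.filter(
    (hit) => !KEYWORDS.some((kw) => new RegExp(`\\b${kw}\\b`, "i").test(hit.title)),
  );
}

filteredFrontPage().then((hits) => {
  for (const hit of hits) {
    console.log(hit.title, hit.url ?? `https://news.ycombinator.com/item?id=${hit.objectID}`);
  }
});
```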

Matthew Carey (21:36)
Did he use AI?

So he was in the pro camp, I'm guessing. I feel like this is, one, yeah, I mean, it's a cool use case, but two, isn't that product feedback for the platform? I know Hacker News don't care, right? But for SWE-bench,

Wilhelm Klopp (21:43)
Yeah, it's a bit ironic, I guess,

Matthew Carey (21:59)
I feel like something like Hugging Face should have a search over SWE-bench, and like a semantic search over SWE-bench. Isn't that literally what they're designed for? Like searching over datasets. Yeah.

Wilhelm Klopp (22:00)
Mmm.

⁓ Yeah, I agree.

I agree. I guess this is like, yeah, maybe they have datasets, but they don't have like a fine-grained understanding of like evals or like runs or... yeah. Anyway.

Matthew Carey (22:22)
Yeah, I guess.

but you want to talk, there's loads more you want to talk about. You want to talk about how AI is slowing down developers. I saw this, you sent me this.

Wilhelm Klopp (22:25)
Dude, yeah.

Yeah,

OK, yeah, I really want to talk about this one. So this is fascinating. So basically, there was this study by METR. They did kind of a big splash on Twitter saying, all right, I'll just read the tweet. A randomized controlled trial to see how much AI coding tools speed up experienced open-source developers.

The results surprised us. Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't. So that's obviously quite surprising. We're all like, AI speeds things up. Why would it actually slow things down? There is a lot to the study. They seem like quite good people, like kind of a well-designed study, et cetera, et cetera. But I think I've always had that perspective that I just mentioned earlier, which is like,

There is actually a huge variety of coding tasks. I think for some, AI can help a lot; for others, AI is probably less helpful. But then two days ago there was actually someone else who kind of quoted the study and looked into it a little bit. And none other than Emmett Shear, the co-founder of Twitch, who, by the way,

Matthew Carey (23:31)
yeah, is he the co-founder of

He was on the board of, didn't he become the chairman of OpenAI or something for about 20 minutes?

Wilhelm Klopp (23:37)
I was going to say,

I feel like he was the CEO of OpenAI or something for like 20 minutes, back when the board events, the OpenAI board events, went down. But yeah, so he looked into this a little bit and, I guess long story short, what he's kind of found is that...

is that the developers who participated in the study, all of them have like experience with LLMs, but only one of them actually had experience using a proper coding agent like Cursor or Claude Code. And his point is that prompting LLMs in ChatGPT, where it's like, you know, you ask a question, you get something back, is a very, very different thing — very different thing — to using a real coding agent where you write up a task and then the thing goes out and does it.

Matthew Carey (24:19)
Yeah, yeah. 100%.

Wilhelm Klopp (24:22)
And then there's a bit of back and forth with the authors of the paper. But I think it's a reasonable point. But the thing that I would be most interested in discussing with you, which we've kind of talked about a little bit previously, is how do you actually get proficient at using an agentic tool? Because it feels like the skill ceiling is quite high. And Emmet himself in this tweet thread, he even talks about how...

Matthew Carey (24:45)
Yeah.

Wilhelm Klopp (24:48)
He had some tweet saying like, yeah, I am very modestly proficient and it took me weeks. So he's saying it like... I think Thorsten from Amp, Sourcegraph, has been saying this for a while as well, that it's a real skill to use these agents well, and it's something that needs to be, yeah, practiced, and you get better over time. And obviously none of us have any patience anymore.

Matthew Carey (25:11)
I mean, Jarred, Jarred.

Yeah, I mean, Jarred Sumner from Bun, Mitchell Hashimoto of Terraform and HashiCorp fame — they all are like, LLMs are awesome, but it took me months. Like I think, I think Mitchell Hashimoto, when he was talking about Ghostty, he's talked a lot about how he's used LLMs huge amounts, but they started off not very helpful.

I think that I really hated the workflow of generating something in ChatGPT and then trying to incorporate it into your project and then running it and realizing it doesn't work. That was awful. I, like, never want to go back to that sort of time. And yeah, I really hope this wasn't what the study was mentioning. I mean, I didn't read into it in that much depth. The other side of it that I could see...

Wilhelm Klopp (25:57)
Was it? No, actually. Yeah,

yeah, no. I think it was kind of more — I think the problem was more that they were using stuff like Cursor and they just couldn't get that much value out of it, basically. Yeah.

Matthew Carey (26:06)
okay. So they hadn't like worked out how

to prompt it. They hadn't worked out when to call it quits, because there's the whole doom loop situation that you have to understand, and you have to go back and then change your inputs — exactly what you were talking about in the last episode. Yeah.

Wilhelm Klopp (26:11)
Right.

Yeah, yeah, yeah. Right.

Right. exactly, I

think the stuff that you described is exactly the skills that you need, right? Like you need to know which task to use it for, which task not to use it for. You need to know how to write the prompt well. This is why, like, Amp has, you know, this thing where enter doesn't send a message, it just goes to the next line, to encourage you to write a bigger prompt.

Matthew Carey (26:36)
I actually love that UX. I love that. I love that. I love it. They did so well there.

It makes so much sense. Cause the amount of times I'm like, Cursor, just do this — and then I actually normally press enter and then I'm like, no. So I actually just write back in the text box, continually write back into it, press enter, double enter, because then that restores the checkpoint and then sends it again.

Wilhelm Klopp (26:48)
Hmm.

Mmm.

Matthew Carey (27:02)
So I reckon I spend double in Cursor what I need to, because I press enter way more than necessary. Yeah, that's not good. But I'd also be really interested to see how experienced the developers were in general. I think they had five years, right? The developers that were in the study. I think I saw that in the headline. Because I can't wait for the study that comes out that's like the difference between

Wilhelm Klopp (27:07)
interesting.

⁓ hmm. Yeah, I wish I'd...

Matthew Carey (27:28)
people sort of my age or a couple of years younger, who had a few years' experience before LLMs and then a few years afterwards, and whether the ones that have embraced it differ from the ones that haven't. Then I would like to see people who are very young now, who have just had LLMs, and I would like to see people who are older, who've had like 15 or 20 years' experience without LLMs, and then the ones who've adopted it and the ones that haven't. Because I think...

Wilhelm Klopp (27:48)
Yeah, that's fascinating. ⁓

Matthew Carey (27:55)
you need all of those categories before you start making an informed decision. Because I think one of them will be properly, like, out of distribution. Like I could imagine the people my age who adopted LLMs at the beginning — depending on how good they were when they adopted LLMs — you could be like very out of distribution. Like you could have not...

Wilhelm Klopp (28:18)
Like, way better,

you mean?

Matthew Carey (28:20)
Yeah, yeah. So like, if they had an intrinsic knowledge of the field that they were working in before they adopted LLMs, I think they'd have got way better. But if they didn't, my hunch is that they're going to have really struggled, maybe even more so than the — I don't know what's after Gen Z, like Gen Alpha — maybe even more than those people, having had LLMs pretty much all the way through their time coding, right?

Wilhelm Klopp (28:44)
Yeah, yeah.

Matthew Carey (28:45)
because they've learned

to harness it from the beginning and they've learned with the model, whereas other people later on, maybe they didn't learn with the model and then they actually got, they learned about a bunch of bad habits or a bunch of ways that their company did something in that moment. And now they're fighting with a language model because they don't know the reason why they did what they did previously. And they're holding onto the only ground truth that they have, which maybe is not a good ground truth.

Wilhelm Klopp (29:02)
Thank

Right, yep, yep, man, it's so interesting.

Right, right, yeah. And the thing that we can all do is try and get better at the skill of prompting the agent, right? So I'll give you one example actually. I have been doing this big refactor for Kolo, because I found quite a big performance improvement, at least in theory, which is nice because Kolo can slow down your code a bunch when it's active.

And it involves, like — there's a Rust extension to Kolo — so it, you know, it can get a bit complicated. It's kind of weird code. You kind of need to know it somewhat well. But it's also very well tested.

I had this experience with Claude Code a couple of times last week where I was like, here's the thing that's wrong. Go and adjust it and then run the tests to verify. But it would start running the full tests, but then it would gradually run fewer tests until it ran just the tests that were passing. And then it would stop and be like, hey, all the tests pass, we've done a great job here. And then it would exit.

Matthew Carey (30:09)
What?

Wilhelm Klopp (30:12)
And this just kept happening. And I'm just like, I don't know how to get it to not do that. And to be honest, like, I think as far as people go in terms of the skill of prompting agents, I don't think I'm very good. Like I wish I had more naivety or whatever. I think I've often gone back to just, okay, let me just write this code myself — which I think is totally legit. And the AI can still help a lot in

Matthew Carey (30:12)
Yeah.

I don't know.

Wilhelm Klopp (30:39)
in like lots of other ways, but like you know how we were talking about like editing inputs versus editing outputs like I'm definitely not at the level where I can just like edit the prompt I give to the LLM and have that then do the right thing. I will for sure edit like the outputs of the LLM and like make it so that it's how I like it.

Matthew Carey (30:57)
It's a scale though, right? Like if you can see where something has gone wrong — like, directionally wrong — it's kind of easier to edit the input. Whereas if it's something minorly stylistic that you can just Cursor Tab all the way through, then it's easier to edit the output. So I reckon I make that — it's probably what I wanted to say last time actually, that phrase — I think I probably make that trade-off in my head every time.

Wilhelm Klopp (31:10)
Mm-hmm.

Mmm.

Matthew Carey (31:21)
It's one of the fun things about working with LLMs: I'm less thinking about the code, I'm more thinking about how I want to make the code do what I want it to do. I'm less thinking about the syntax, which I really enjoy. I think coding with models is super fun. Yeah.

Wilhelm Klopp (31:21)
Yep, right.

But so do you still look at the code at the end and do you still care that it conforms to... Okay, okay.

Matthew Carey (31:38)
yeah, 100%.

Yeah, yeah, and like I have some really strict linting rules. I think really strict linting rules are good. Especially in TypeScript code bases, where there is the structure and there is the tooling now. The tooling's really easy to set up. There should be no excuse for not having good linting rules. The thing I hate most is when you find a company...

Wilhelm Klopp (31:46)
Mm-hmm. Or like, yeah.

Matthew Carey (32:03)
and they're doing really cool stuff and you're like — maybe it's open source and they have a template and you download the template, it's kind of what I was saying earlier — and you open the template and my default linting rules in VS Code, in Biome, in my extension, are going off everywhere. And you're like, holy shit, man, come on. And it's like real simple stuff: like, nothing is typed properly, everything is any, or there's a bunch of TS-ignore errors, and you're just like, dude,

This is your public facing template. At least make this useful. At least make this good with LLMs, like a good starting point. So yeah, that's my little gripe.

Wilhelm Klopp (32:35)
Right. Yeah.

Interesting, so if you, so just to spend like another couple minutes of this, like if you, if we had to come up with some rule or like some ideas, some best practices for the skill of like prompting an agent to do the right thing, like what would be in that list of like best practices?

Matthew Carey (32:59)
I don't know if I have a huge amount of best practices for prompting. I have some good stuff with just, like, working in a code base. I've been thinking a lot about generating code. Like, I guess my whole job at StackOne — I was hired to generate code — and my most recent project, the whole project, is built from the ground up to be generatable.

Wilhelm Klopp (33:04)
Mm-hmm.

Matthew Carey (33:14)
And I think I've talked about this previously, but everything is typed like super, super well. Everything is co-located. Like the good examples are co-located next to each other. The bits that you don't want to change are far away from each other. So when something changes, like, so for instance, a lot of the specs and the structures of the code base are created in types. And I have a bunch of type-only packages,

Wilhelm Klopp (33:20)
Yeah, yeah, Mm-hmm, mm-hmm.

Right. Yeah, that's powerful.

Matthew Carey (33:40)
that live in another part of the code base or even in another code base so the model can't change them. And then you just have really strict linting rules. And if it has to conform, then you suddenly have almost verifiable code.
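A minimal TypeScript sketch of that pattern — the names are invented for illustration, not StackOne's actual code. The point is just that when generated code has to satisfy types it can't edit, the type checker and linter become cheap verifiers in the agent loop:

```ts
// sketch.ts — illustrative only. In the setup Matthew describes, ConnectorSpec would
// live in a separate type-only package the agent can't edit; it's inlined here to keep
// the sketch self-contained and compilable.

/** The "spec" the generated code must conform to. */
export interface ConnectorSpec {
  /** Stable identifier for the integration, e.g. "example-hris". */
  id: string;
  /** Fetch one page of records; the paging token is opaque to callers. */
  list(pageToken?: string): Promise<{ records: Record<string, unknown>[]; next?: string }>;
}

/**
 * Generated code: the explicit type annotation means type-checking (plus strict
 * lint rules) flags any drift from the spec during the agent's feedback cycle.
 */
export const exampleConnector: ConnectorSpec = {
  id: "example-hris",
  async list(pageToken) {
    // A real connector would call the upstream API; an empty page keeps this compilable.
    return { records: [], next: pageToken ? undefined : "page-2" };
  },
};
```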

Wilhelm Klopp (33:50)
That's good, yeah, yeah. And so all of this means, right, that in the agentic loop, like the agent can see when the linting rules are like broken and can like fix them. And that's part of its like feedback cycle.

Matthew Carey (33:59)
Yeah. So

just really boring stuff like that. Yeah. And just try not to do too many things that aren't like normally accepted in the world of open source, right? Like if you have crazy names for stuff that don't actually mean those things in English and you go down a wild rabbit hole, like that's chill, but the model is going to then struggle and you're going to have to, you can do that. You just have to write some good rules files or at least some structure.

Wilhelm Klopp (34:05)
Yeah.

Yep.

Matthew Carey (34:28)
I've seen some — I can't think which codebase it was, but I saw a mad codebase that had, like, Claude.md files at every level. I was just like, that's actually nuts. But yeah, I mean, if it works for them... Because I feel like some parts of a codebase you're going to want to generate more than others. And those parts, if it's a consistent thing you want to generate, then why not have a Claude.md file, or at least a readme, that tells you how to build that thing? Like,

Wilhelm Klopp (34:35)
Mmm. ⁓

Nuts good? Nuts good, right? Yeah.

Yep, yep, yep, yep.

Matthew Carey (34:54)
Previously you might have had

it in docs, but now you just co-located with the code.

Wilhelm Klopp (34:58)
That's good, yeah. One thing that also seems very helpful, and I think that should be in a lot of these longer running prompts, is some way for an agent to actually verify their work. So some tests that they can run, or maybe even just running the code that they've written is a good starting point for seeing if it matches their expectations.

is great until Cloud Code decides it's just gonna start running a very small subset of my tests until day one.

Matthew Carey (35:22)
Yeah, that's really funny. That's

really funny. I've had Claude... like, Claude seems to tend to want to run linting and build. I don't think I even ask it to run that. Every time it finishes, it runs build — or at least opencode does. Like every time it finishes, it runs build. And how beautiful is opencode? It's so pretty.

Wilhelm Klopp (35:42)
we've been.

yeah?

Matthew Carey (35:48)
The system theme, where it matches your terminal colors. It's so good because I have it in Ghostty, and Ghostty is beautiful out of the box. And so now opencode just has the same colors as Ghostty by default. It looks so good, dude.

Wilhelm Klopp (35:57)
Uh-huh.

That's cool. I

need to give it another shot. I tried using it once just inside the VS Code integrated terminal and it was super laggy. And I was like, damn.

Matthew Carey (36:10)
Yeah, no, no, that's tough.

I had this problem where, the first time I tried to use it, I downloaded the wrong opencode, because there are two. And that's actually kind of tough, because the original one that everyone was working on, that one was acquired by Charm. And then there's a fork, which is now the one I'm talking about, that's maintained by the people from SST. And...

Wilhelm Klopp (36:18)
⁓ no, no, yeah, yeah.

Mm-hmm.

Matthew Carey (36:34)
the,

I accidentally downloaded the original one. And when you run the original one, there's no automatic setup of keys or anything. It just says, like, agent not installed. And you're like, oh, great. So I think you have to write a config file manually and stuff, but I couldn't find the docs for it. And I just immediately churned — like, immediately I was like, I can't be bothered with this. And then about three weeks later...

Wilhelm Klopp (36:42)
Right.

Mm.

How do you think that naming thing

will resolve? sorry, no, go on, go on. Keep going.

Matthew Carey (36:59)
Just about three weeks later, I tried the one by SST and ran opencode, and it went through this whole picker of which model do you want to use — like, Claude, sign in with the Max plan. I was like, oh my God, I love this so much. It's so nice. Yeah.

Wilhelm Klopp (37:07)
Mm.

That's awesome.

Yeah, I wonder how their naming dispute will resolve. What will actually happen?

Matthew Carey (37:19)
I

don't know, but I'm very bullish on this type of thing. I was thinking about it more, and the more I think about it, the more I think that Vercel is going to do one as well. Because it just makes sense. Like they'll run it through their AI gateway. Like they'll distribute traffic over a bunch of keys eventually. They'll do token costs plus slightly extra eventually. It will all be charged to a Vercel account. Like I feel like they're going to want to own this.

Wilhelm Klopp (37:30)
I see.

Mm-hmm.

Matthew Carey (37:47)
And everyone is going to try and own this, because it's so flexible. I've got Claude Code running in a Cloudflare container for some work — yeah, for some workflows that we want to do. And I just have an API, right? Or like an RPC function that I can just ping and send a prompt, and it pulls a repo and adds a thing to the repo. And it's so clean, and it's all done, like,

Wilhelm Klopp (37:51)
Yeah.

congrats!

Yeah, that's great.

Matthew Carey (38:13)
on... just in a container. Like you could run a VPS to do it, you could do it however you want. It's just so flexible. Like you can run that thing wherever you have Linux installed. Yeah, I'm just like, wow.
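He doesn't share the actual setup, but the core of that kind of workflow can be a small script inside the container that clones a repo and calls Claude Code's non-interactive mode. A hedged sketch; the repo URL, paths, and prompt handling are illustrative rather than the real StackOne workflow:

```ts
// run-task.ts — illustrative "prompt in, changes out" workflow script of the kind
// you might run inside a container. Assumes git and the `claude` CLI are installed
// in the image and authenticated via environment variables.
import { execFileSync } from "node:child_process";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

export function runCodingTask(repoUrl: string, prompt: string): string {
  // Shallow-clone the target repo into a scratch directory.
  const workdir = mkdtempSync(join(tmpdir(), "agent-"));
  execFileSync("git", ["clone", "--depth", "1", repoUrl, workdir], { stdio: "inherit" });

  // `claude -p` runs a single non-interactive prompt and prints the result;
  // exact flags and behaviour can differ between Claude Code versions.
  const output = execFileSync("claude", ["-p", prompt], {
    cwd: workdir,
    encoding: "utf8",
  });

  return output; // a caller could then commit, push, or post the diff somewhere
}
```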

Wilhelm Klopp (38:14)
Yep.

Yeah.

Yeah, I agree. Yeah.

One thing that I wish for, that I'm worried Claude Code will never have, and that is a really cool idea from, again, I think Amp — which I've mentioned like a million times; I think I'm going to take it for a fresh spin sometime this week — is you get another model. You get o3 to review the output — the Oracle. Exactly. Yeah.

Matthew Carey (38:43)
Is this the Oracle? The Oracle? Is this chatting about the Oracle? Oh dude, this is so good.

So I have a question for Amp.

Wilhelm Klopp (38:49)
Because I think that would

have caught the issue that I had with Claude. If it just looks at the diff, the first message and the last message, then it would be like, oh dude, you're not running the full tests. Or if Claude is stuck for some reason in some local minimum, then it...

it can perhaps tell — I think o3 to me still feels smarter than Opus 4 — and then it can maybe unblock it and go to the next level.

Matthew Carey (39:20)
Dude, Amp is

like so OP. I feel like they are living in the future. I tried it. It's my favorite coding editor — coding extension, 100%. I have a few minor gripes. One of them is it's just way too fucking slow. Like the actual UI of it is way too slow, and the sub-agent experience doesn't feel as good as it could do. I feel like it could feel so much better. I feel like if you had multiple concurrent agents, which I haven't been able to produce,

Wilhelm Klopp (39:33)
Hmm, interesting. ⁓ really?

Matthew Carey (39:46)
I thought that could feel better.

Wilhelm Klopp (39:47)
⁓ okay,

that's really good to know. So you haven't been able to do that. I've been trying to get Claude Code to do that and it just doesn't do that. Yeah.

Matthew Carey (39:51)
I haven't been able to do it yet. I haven't been able to do it. It

was really annoying. I was like, this surely should be quite straightforward, but I haven't been able to do that. And then other things I haven't been able to do... Okay, so it runs off Claude most of the time — so enable it, like, let me sign up with Anthropic with the Max plan. I just don't have the money to be paying for Amp all the time like this. At least let me proxy my Claude requests. Like I know you want to own it, but dude, it's just too expensive. And then the other thing — I don't mind

Wilhelm Klopp (39:58)
Mm-hmm.

Yes, yes.

Yeah,

that's a very fair point.

Matthew Carey (40:20)
I don't mind paying

for o3, through you guys, for o3. I don't mind paying extra. Like I'm happy to pay extra, but the majority of my requests are going straight to Anthropic. So just let me sign in with the Max plan. I really don't even mind if you take a tiny percentage for each token. I just... I managed to spend something like $80 or something in one morning. And like, I feel like that is probably too much for what I produced. Like,

Wilhelm Klopp (40:32)
Yep. Yep.

Mm-hmm.

Matthew Carey (40:49)
It was cool, right?

And everyone's going to say like, like compared to your salary, like it's like nothing, ⁓ like compared to the fact, yeah, it was a Saturday. like compared to my salary, I shouldn't have been working. So it would have been nice if that didn't end up costing loads of money. So I don't know. I don't know.

Wilhelm Klopp (40:54)
Yeah, it does make a difference. And clearly, yeah.

You

Matthew Carey (41:11)
It was so cool though. What they're doing is really good. I love the branding. I love the developer marketing. Like they're all advocates for their own products. Like everything they're doing just screams, this is going to work, and they're managing to do it. The team — yeah, I mean, it feels like it's bigger than the opencode team. They're shipping more innovatively. Like opencode, they're doing the stuff they're doing really well, but Amp — even something as simple as the

Wilhelm Klopp (41:17)
Yeah.

Matthew Carey (41:36)
return key, right? It's just so good. It's so clean.

Wilhelm Klopp (41:40)
Yeah, the Zed V1 of the LLM experience in Zed had the same thing. And then the V2 kind of went back to enter sends the message. It's definitely interesting, right? Because clearly most products have enter sends the message, but if what the agent is really about is writing a long, thoughtful message, then...

Matthew Carey (41:46)
Okay.

Yeah, I don't like enter-to-send-message.

Wilhelm Klopp (42:03)
Yeah, maybe it is like the better paradigm. But also it's interesting because obviously Thorsten from Sourcegraph, you know, he loves writing well. So it's like the perfect kind of thing for him to be doing. So that's great. But it does also feel like surely this is not where things will end up, right? Like I think we're in a world now where very few people write like long-form

written content, long-form thoughtful written content. Like none of us have the attention span for this kind of stuff anymore, and kudos to the people who do — they are for sure better people.

Matthew Carey (42:33)
I don't think it has to be formatted. If you are struggling to write a long thing, then use voice — use like Wispr Flow, dictation, use some sort of transcription model. Like that makes sense to me, that a long... okay. So thinking of the recent advances in coding agents, a big one was to-do lists, right? Where the model can tick off the to-do list.

Wilhelm Klopp (42:50)
Yeah.

Matthew Carey (42:57)
And my brother — we were chatting on the weekend — was like, dude, have you seen Cursor can, like, create its own to-do list and then tick it off? It's so wild. And I was like, yeah man, you're about two weeks behind — like, all the other coding agents have had this before. And he was just like, Matt, you're too online. You're too online. He just looked at me with this expression of disappointment. But I'm taking that as a win. I think with those to-do lists, you incentivize those longer questions and

Wilhelm Klopp (43:13)
Hahaha

Matthew Carey (43:25)
bigger tasks. And that's where we want to go, right? Like everyone's metric of success is how long can the agent loop continue creating useful stuff? And can it be 20 minutes? Can it be 30 minutes? Can it be an hour, or how many tokens? Maybe not the time, because also K2, you know, the new model — that new model that's like really fast on Groq. That is another one of my gripes. Like if my coding agent is too slow, then I don't want to be sat in front of it. Like I have better things to do than watch AI churn.

Wilhelm Klopp (43:28)
Right, yeah, yeah, yeah.

Yeah, although this is also a wild metric. So this is... Yeah. Yeah.

You know.

Matthew Carey (43:54)
I would prefer to use a quicker model as long as it meets a quality threshold.

Wilhelm Klopp (43:58)
Yeah, I agree. And this whole... I think we both listened to the same podcast with Adam Wolff, who like leads up Claude Code in some capacity,

who was saying that — I think, yeah, Anthropic also hinted at this when Claude 4 was released, both Sonnet and Opus — that a key metric they care about internally is how long the agent can do tasks by itself. And Opus could do, I don't know, several hours or half a day or whatever. But when he said this, I was like, okay, interesting. It is a bit of a strange metric, right? Because ideally, you want the work to be done quickly. You don't want it to go on forever. And I've waited for Claude Code to make

very similar edits across a couple dozen files, and it's like one by one by one. I guess it's great that it could keep going for that long without stopping. Although that seems, yeah, I don't know. It's just a bit of an awkward metric, because those edits could have been done way more quickly.

Matthew Carey (44:50)
Local,

yeah, we're falling into like a local maximum, whatever it is. Like, it's just like something that's like, like a much smaller, everyone's taking a much smaller picture and they're like, at the moment, what do I want? I want the model to be able to run for longer because I want to be able to do more stuff. But actually, what does the user want? The user wants stuff to be done yesterday. So if it was done, if it was quicker, it's closer to yesterday, I guess.

Wilhelm Klopp (44:55)
Yeah, right, right.

Yeah.

Right, right. I think it's like a clean story to tell to investors, right? It used to be that LLMs just responded to you straight away and then they were done, and now they can work for a day and, like, be productive. It's like, okay, yeah, they can work really, really slowly, doing insane, unnecessary tool calls. But like, that's... yeah.

Matthew Carey (45:26)

Maybe it's actually like a go-to-market plan. Maybe it's like, we can charge per hour of your virtual employees' time. Maybe the anthropomorphizing of models — I mean, it probably is, right? Like, 11x sells you virtual assistants or virtual employees. They don't sell you virtual autocomplete.

Wilhelm Klopp (45:42)
Mmm.

Mm-hmm.

Yeah.

Right, right, right. Yeah, yeah, yeah.

Matthew Carey (45:58)
Like they're not selling you that.

They're trying to like make it feel more human because it feels more competent, more like personable, more. And if it's like this human can go away and work for eight hours, we conjure up images of like lots of work being done in eight hours of tireless, hard labor. But yeah, really slowly.

Wilhelm Klopp (46:05)
Mm-hmm.

Mm-hmm.

Yeah, that's

a really good point. It makes it a bit, yeah, I guess maybe I'm being a bit pedantic, but ⁓ that could be a good story.

Matthew Carey (46:25)
Yeah. dude,

I watched the F1 movie last night.

Wilhelm Klopp (46:29)
Oh my god, yeah, you mentioned you were gonna talk about this. Tell me more. It's advertised super well everywhere. And then also somehow one of a, there's a song from it that made it into my Discover Weekly and I loved it.

Matthew Carey (46:42)
Seriously? Wild. Yeah, the music's good. The music's very, like, intense. I think if I just listened to the music, I'd think I was watching Batman. It doesn't feel like racing car music. It's very American. It's like an American spin on F1. It's the old timer coming back.

Wilhelm Klopp (46:53)
Hahaha.

Matthew Carey (47:03)
But the really cool thing they did is they put all of the F1 drivers, the team principals, like the real people, into the film. And they also had the reporter from Drive to Survive as like the main reporter in the film. So it felt like you knew everyone, like no one needed an introduction apart from the two main characters, obviously. But like literally no one needed an introduction; Toto Wolff is like in there. But it also means that some of the acting is quite rubbish, because they're not professional actors.

Wilhelm Klopp (47:15)
That's awesome.

Mm.

Huh, interesting.

Matthew Carey (47:32)
And so they have

to do some lines where it's like, that's not how he would have said it. I've seen him in interviews. He looks stressed. That's not the same way. Anyway, but because they're acting as themselves, it is easier, right? It is easier.

Wilhelm Klopp (47:39)
Right. Yeah, yeah, yeah. That's wild.

Do you think it would be enjoyable for someone like me who's never watched Drive to Survive?

Matthew Carey (47:52)
Yeah, I think so. Like, I'm not a big F1 fan at all. I watched a bit of Drive to Survive initially 'cause I thought they were cool stories. And then it got really over-dramatised. And this film is like the culmination of that, the culmination of the dramatisation of F1 into this crazy, action-packed thing where they're crashing every lap, they're doing it on purpose to bend the rules. And it's like, yeah.

Wilhelm Klopp (47:55)
Mm.

Matthew Carey (48:18)
I don't know how much of it actually happens, but it was a good film. And Brad Pitt looks amazing. He's like my dad's age and he looks so good. He could easily be 40.

Wilhelm Klopp (48:23)
Nice.

Shout out to your dad, who I'm sure also looks phenomenally well.

Matthew Carey (48:33)
Yeah, I mean, my dad looks great, but Brad Pitt,

look, Brad Pitt could be 40. Even younger, dude. He looks so good. I mean, he's got a bit of wrinkly eyes, and it is quite unfortunate because for a lot of the film you're just seeing him in a helmet. So you're literally just looking at his wrinkly eyes, and that is like the least young-looking part of him, which I think is quite funny.

Wilhelm Klopp (48:40)
Yeah, that's impressive.

That's really funny. By the way, did you guys do anything for Bastille Day on Monday?

Matthew Carey (49:00)
Shit, maybe that's why I was in the bad books. No, I'm kidding. No, we didn't, no.

Wilhelm Klopp (49:03)
Hahahaha

I was reminded of it because I ran into a friend actually randomly outside the gym, which is so cool. Like, you definitely run into people a lot more here in SF, which is really nice. And then there was just a guy with a massive French flag who walked by and we were like, wait, what's going on? Oh, okay. Today is the day.

Matthew Carey (49:24)
They take

it very seriously, like very seriously. It's cool.

Wilhelm Klopp (49:28)
Yeah,

it seemed well celebrated here.

Matthew Carey (49:32)
Do you have a lot of Frenchies?

Wilhelm Klopp (49:33)
You hear it, there's definitely one WeWork which seems like it's all French people. So I think there's a strong French population, but I don't know if it's actually that out of the ordinary.

Matthew Carey (49:43)
Yeah, because

in central London, like zone one, I can't get on a tube, I can't walk down the street without hearing French. I can't sit in a cafe without hearing French. It's wild, which is great for me because I'm constantly trying to eavesdrop and improve my French. But I do find it quite funny. And then when her parents come over, they're always like,

Wilhelm Klopp (49:49)
Mm.

Matthew Carey (50:05)
Her mum just scowls every time she hears French, because she's like, I'm in England, why am I listening to French? And she's French, you know? But it's just very funny. I hang around French people a lot though, because my bosses are all French and a lot of my colleagues are French.

Wilhelm Klopp (50:12)
That's all.

Yeah, yeah,

yeah. I think it's a similar thing. I've just got a bunch of friends who are kind of French or French adjacent. Yeah.

Matthew Carey (50:24)
There's a lot of

like French engineers though who go into computer science. I mean they produce a lot of engineering talent. Some really, I mean it's held in super high regard over there.

Wilhelm Klopp (50:28)
Mmm.

Mmm.

did you see the story that Apple might acquire Mistral? That they are considering acquiring Mistral? That someone was urging them to?

Matthew Carey (50:41)
I saw a tweet, like, as a piss-take. I didn't know if it was real, and I don't know if it might have been like a commentary tweet. I didn't pay much attention to it, but that's really funny. Mistral have been shipping recently. I follow one of their devrels, and apparently their new coding model is really good, like on the curve of cost to scores on some benchmarks or something. It's like really good. I don't know what that means anymore, and I haven't tried it yet. So.

Wilhelm Klopp (50:52)
Have they?

Matthew Carey (51:09)
caveat, but probably should try it, you know. speaking of new models, go on.

Wilhelm Klopp (51:12)
I am really enjoying,

I was just gonna say the opposite of what you said. It feels like the summer so far has been a little bit slower in terms of AI news, and I'm really enjoying it. It's actually really nice. I feel like I have to expend less energy trying to shield the brain from all these announcements.

Matthew Carey (51:28)
Yeah, it is nice. Like, I do feel like I've managed to create some workflows recently that work on Claude, and I'm happy with them just working on Claude. I haven't had FOMO to try getting it to work on anything else. I've seen some tweets from Dax where he's trying to rewrite system prompts to make it work with Gemini, struggling with tool calls on Gemini, because, like, obviously...

I know people who use this product want Gemini, but I'm in the lucky situation where I can just 100% use the model that I decide to use. And if I use Claude and it all just works, I'm very happy. But the first time I felt a hint of FOMO was when I saw K2, the Moonshot model, just being so fast. I was like, oh, if that is good for coding, that is a game changer.

Wilhelm Klopp (51:54)
Hmm.

Totally, yeah.

Mmm.

Matthew Carey (52:20)
I want the model that I sit in front of, I don't mind if it's a little bit worse, if it's light speed fast.

Wilhelm Klopp (52:20)
Mm-hmm.

Yeah, yeah, yeah. And also, especially things like, it feels like when you're using Claude Code, there's a core tool calling loop, which is just like, ls, grep, list everything in this directory or whatever. All this stuff.

Matthew Carey (52:41)
That's quick.

Wilhelm Klopp (52:41)
Yeah, exactly, right? It feels like for that stuff... and it's really surprisingly slow sometimes. I mean, what's the slow part here? Like, deciding to run grep? It feels like deciding to run grep shouldn't be that hard, like a lighter model can do that. So yeah, I'm also hoping for that.

Matthew Carey (52:52)
Yeah, sometimes. Yeah.

Okay, maybe in our free

ideas section: maybe you haven't tried it, but OpenCode has this really funky... actually, I apologize for shitting on OpenCode earlier when we were talking about Amp. It has this really cool tab. You press tab and you switch between plan and build mode. So in plan mode, it has no access to any tools that can write data; you can just read, gather context, and understand the code base. And then you go into build mode and you can ask it to execute.

Wilhelm Klopp (53:18)
Okay.

Mm.

Matthew Carey (53:30)
But you flick between the two with tab, which is super nice, because you can type out some stuff and be like, do I want it to actually do this? No, just tab and then press enter. I feel like there's a cool mode there where instead of plan and build, it's like context and action. Because I actually want the planning, I just don't necessarily want the planning to happen at the beginning. I want the planning to happen during the action, or during the build phase.

Wilhelm Klopp (53:32)
just saying.

Mm.

I see, see. Yeah, yeah, yeah.

Matthew Carey (53:57)
But I would like to develop some context on the problem that I'm talking about, and I would like that to happen really fast. So if you could switch between models there, that would be super cool. Because you could have the fast model getting context and then the slower model that's slightly better actually writing to the code base.
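
Roughly what that could look like, as a minimal sketch: a fast model with read-only tools gathers context, then a stronger model gets write access. The `Tool` shape, model names, and `runAgent` helper here are illustrative, not OpenCode's actual implementation.

```typescript
// Minimal sketch of a "context then action" loop: a fast model gathers
// context with read-only tools, then a stronger model does the writes.
// Model names, the Tool shape, and runAgent are illustrative placeholders.

type Tool = {
  name: string;
  readOnly: boolean;
  run: (input: string) => Promise<string>;
};

const tools: Tool[] = [
  { name: "grep", readOnly: true, run: async (q) => `matches for ${q}` },
  { name: "read", readOnly: true, run: async (p) => `contents of ${p}` },
  { name: "write", readOnly: false, run: async (p) => `wrote ${p}` },
];

// Stand-in for a real agent loop that would call an LLM and execute tool calls.
async function runAgent(opts: { model: string; tools: Tool[]; prompt: string }): Promise<string> {
  const names = opts.tools.map((t) => t.name).join(", ");
  return `[${opts.model}] worked on "${opts.prompt}" using: ${names}`;
}

async function handleTask(task: string): Promise<string> {
  // Phase 1: "context" — cheap, fast model, read-only tools only.
  const context = await runAgent({
    model: "fast-model",
    tools: tools.filter((t) => t.readOnly),
    prompt: `Gather the context needed for: ${task}`,
  });

  // Phase 2: "action" — slower, stronger model, full tool access (can write).
  return runAgent({
    model: "strong-model",
    tools,
    prompt: `Context:\n${context}\n\nNow do: ${task}`,
  });
}
```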

Wilhelm Klopp (54:02)
Yep. Yep.

Ooh, this reminds me of...

It's also great when the agent asks you questions back for things that you hadn't considered, that were genuinely ambiguous in your initial prompt. That is amazing with Deep Research, because it does that consistently. Apparently o3 in general is good at asking questions back. With a lot of previous models, I used to actually have this as part of my ChatGPT system prompt: always ask me three questions back about things you don't understand. And it never gave me good questions. Not once. I stopped reading them.

Matthew Carey (54:39)
Yeah.

That's gonna be really

bad. That's gonna be really bad for automated tasks, right? This is where we might get some model divergence: models that are RLHF'd to just continue no matter what, and models that are RLHF'd to get some feedback from the user. And I wonder where everyone's gonna go with that, because now we have OpenCode and also Amp, we forgot to mention, but Amp can be run via the CLI. Speaking of running stuff via the CLI, we'll finish this point. Then...

Wilhelm Klopp (54:50)
Mm.

Right.

Mm-hmm.

Matthew Carey (55:10)
you might have some divergence between models that want to ask you stuff back and models that just continue with the task to fruition, because that is a different model at the end of the day.

Wilhelm Klopp (55:18)
Is it? Because asking questions back is just a tool call to the human.

Matthew Carey (55:21)
Yeah, but imagine the human's not there.

Wilhelm Klopp (55:23)
I see, I So like in training, you would need the human there is what you're saying.

Matthew Carey (55:28)
Yeah, you're gonna have to provide

data that has the human there when it asks questions back. And so people are gonna have to put in a lot of effort to build that capability. But what I'm saying is, for a lot of my... for instance, my use case in Claude Code, that automated code generation use case, I don't want the model to ask questions. I want the model to run to fruition and be done. If the model asks a question, it's gonna stop my run. And so I would much prefer to have like a 90% outcome

Wilhelm Klopp (55:35)
I see.

Right.

Matthew Carey (55:56)
for it not asking the question, or even a 50% outcome not asking the question, than asking a question and then stopping the run. So maybe that's my bad UX. Maybe I'm not optimized. Maybe I'm being screwed by the bitter lesson. Maybe the future is things that can ask you back.

Wilhelm Klopp (56:00)
Mm.

Interesting. That's really interesting

because I for sure would want it to ask me back because what I dislike the most is when you get something that's kind of broken in ways that you don't understand and now you need to untangle it because of some stuff you didn't provide upfront.

Matthew Carey (56:25)
It's a lot of investment for the labs though to generate that training code to fine tune it to do that.

Wilhelm Klopp (56:32)
But in the shape of the problem, it's not too different a shape to a conversation with the user. Already, I think the fact that you can reply to these things is different to how they're trained.
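
To make the "it's just a tool call to the human" framing concrete, here's a rough sketch of an `ask_user` tool in an OpenAI-style function schema. All names here are made up for illustration; the point is that interactive and headless runs can share the same agent, and only differ in whether this tool is offered (or what it returns).

```typescript
// Sketch: "asking the human" as just another tool the agent can call.
// The schema is OpenAI-function-call-shaped, but the names are made up here.
const askUserTool = {
  type: "function" as const,
  function: {
    name: "ask_user",
    description:
      "Ask the human a clarifying question when the request is genuinely ambiguous.",
    parameters: {
      type: "object",
      properties: {
        question: { type: "string", description: "The clarifying question." },
      },
      required: ["question"],
    },
  },
};

// In interactive use, the tool call blocks on a real answer; in a headless
// automated run you either don't register the tool at all, or it returns a
// canned "no human available, make your best assumption" response.
async function handleAskUser(question: string, interactive: boolean): Promise<string> {
  if (!interactive) {
    return "No human available; make a reasonable assumption and continue.";
  }
  console.log(`Agent asks: ${question}`);
  return "(answer typed by the human would go here)";
}
```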

Matthew Carey (56:39)
Yeah.

Wilhelm Klopp (56:46)
But I want to do a little experiment.

I'm making a note of this actually, to just try and get o3 to ask me some questions back. Maybe I'll do this as part of my Amp exploration.

Matthew Carey (56:52)
Okay, last thing. Yeah, I

mean Amp is so cool. They're doing bits there. I love the product sense. I'm a little bit confused about what happened to their old coding agent, though. Did Cody just die? Did they get so ingrained into an architecture that they just had to ditch it and start again? Is that hence the rename?

Wilhelm Klopp (57:11)
I, yeah, I really don't know too much about Cody. But I assume, yeah, I don't know. Maybe they serve slightly different purposes.

Matthew Carey (57:19)
Yeah, maybe. Okay, can I tell you my story about observability provider now?

Wilhelm Klopp (57:25)
yeah, yeah, I had that on the agenda as well. Yeah, yeah, please do. The lore. We've been trying to get to this for weeks.

Matthew Carey (57:28)
The lore.

I don't know if this is lore, and it's probably not lore, because I actually did a whole talk about this at AI Engineer London. Okay, so the story goes: I'm running some evals. Should I mention the provider's name?

Wilhelm Klopp (57:38)
Nice.

Go on.

Mm-hmm.

Matthew Carey (57:46)
Okay, so I'm running some evals with a recently minted unicorn: LangChain. LangSmith. And I am having a great time. I'm making some prompt improvements. It's going really well. It was back in, like, the RAG era, where I was trying to work out how many pieces of context, how many chunks, I wanted to return from a particular retrieval call.

And so I built a training set and then I was like, I'm gonna run this training set at k equals, I think it was like two, three, four, five, eight, 10, 15, 20, something like that. All those different amounts of context being returned just to see like, where does performance plateau? Do I get some problems when I start over stuffing the context window and all that sort of stuff?
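
For the curious, that sweep is roughly this shape: run the same test set at each k and watch where the score plateaus. The `retrieve`, `answer`, and `grade` functions below are placeholders for a real RAG pipeline and judge, not LangSmith's API.

```typescript
// Sketch of the k-sweep: run the same eval set at several retrieval sizes and
// see where quality plateaus. retrieve/answer/grade are placeholders for the
// real RAG pipeline and judge; this is not LangSmith's API.

type Example = { question: string; expected: string };

async function retrieve(question: string, k: number): Promise<string[]> {
  return Array.from({ length: k }, (_, i) => `chunk ${i} for "${question}"`);
}

async function answer(question: string, chunks: string[]): Promise<string> {
  return `answer to "${question}" using ${chunks.length} chunks`;
}

async function grade(predicted: string, expected: string): Promise<number> {
  return predicted.includes(expected) ? 1 : 0; // stand-in for an LLM judge
}

async function sweepK(dataset: Example[]): Promise<void> {
  const ks = [2, 3, 4, 5, 8, 10, 15, 20];
  for (const k of ks) {
    let total = 0;
    for (const ex of dataset) {
      const chunks = await retrieve(ex.question, k);
      const prediction = await answer(ex.question, chunks);
      total += await grade(prediction, ex.expected);
    }
    console.log(`k=${k}: score ${(total / dataset.length).toFixed(2)}`);
  }
}
```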

And so I press run, bearing in mind I've been running evals all morning, no stress. And I press run on my laptop and I go for lunch. Right, lovely falafel wrap from down the road, super, super nice. Come back maybe like 40 minutes later. No, who am I kidding? It was an hour later.

Wilhelm Klopp (58:41)
us.

Hahaha!

Matthew Carey (58:50)
French company, guys, French company. Coming back an hour later, and I load up LangSmith to be like, oh, I wonder what happened, I wonder which one was correct. I was a bit excited, and I just get access denied. And I'm like, ooh, ooh dear, ooh dear. Interesting. What have I done? And then I'm about to write an email to someone at LangSmith, or LangChain even, being like, guys, I'm getting access denied.

Like, what's going on? I asked in my internal Slack first, to be like, has anyone changed the password? Whatever. But I was getting my access denied as like a text at the top, like a proper legit access denied. And so I'm sending a Slack message to my colleague, and then I'm like, I'll just hop in the LangChain Slack to see if they're having an outage or whatever. Hop into the LangChain Slack.

And it's carnage. There's just lines of people saying, I have this many agents in production, they're all down, why is the LangSmith platform not working, when are you going to be back up, what do I say to my customers? And underneath all of them, Harrison Chase, the CEO of LangChain, is just like, guys, we're working on it. We're working really hard. We're working on it. We're working really hard. We've...

established the root cause, we're pushing out a fix, access should be restored, all good. And it was, I think, they were down for quite a long time. I don't want to put a particular number on it, because it was way longer than it should have been. Like, they were still down when I was trying to get onto it when I got back from lunch. And I'm like, okay, LangSmith's down, that's annoying, can't look at my evals.

Wilhelm Klopp (1:00:22)
Mm-hmm.

Mm-hmm.

Matthew Carey (1:00:26)
So I start doing another thing, and I was just like, before I review a PR, I'll just pop off an email to someone at LangSmith being like, I've seen you've had an outage, is that why I can't get onto my session? And then, having reviewed my PR, I get an email back from someone at LangChain, and I've screenshotted this email, I'm gonna frame it, it's an amazing email. It's like, dear Matt, thank you so much for your email.

Wilhelm Klopp (1:00:43)
You

Matthew Carey (1:00:47)
Blah, blah, blah, blah, blah. Your organization, this... oh, we had an outage the last couple of hours. It started at this time, UK time, interestingly the time that I went for lunch. And we believe it was linked to your account. We're still diving deeper into the problems, but it appears that a large job run was made

Wilhelm Klopp (1:00:59)
Hahaha.

Matthew Carey (1:01:10)
which overwhelmed some of our servers. And I was like, holy shit, I found arbitrage. My laptop has managed to overwhelm AWS, or whatever it's hosted on. I was like, that's so cool. So my laptop, running just the amount of evals that I could run based on my OpenAI rate limits, managed to bring down

Wilhelm Klopp (1:01:26)
That's amazing.

Mm-hmm.

Matthew Carey (1:01:35)
the whole LangSmith platform.

Wilhelm Klopp (1:01:36)
So how many,

any idea why? Like what is it, like a crazy amount of requests or something?

Matthew Carey (1:01:41)
Yeah, I have no idea. I actually want to look at the email a little bit more in depth because it was so fun.

I'm going to find it. But it was quite a lot of requests. I think I had quite a big dataset. Not a huge dataset for machine learning, but quite a big test dataset. And yeah, my org was blocked. Thanks for reaching out, apologies for the inconvenience. You may have noticed that we've had a significant outage this morning. It appears that the root cause was tied to a batch upload from your organization, which then exhausted all resources on our side. We're still trying to dig into how or why this happened.

Wilhelm Klopp (1:01:46)
going.

Mm-hmm. Mm-hmm.

Matthew Carey (1:02:08)
But part of the workaround to get our service back up and running was temporarily blocking traffic from your organisation. We believe this happened during a period of feedback submission related to your jobs, possibly triggered by an automation rule you had running or an experiment you were conducting. It was an experiment. In any case, this block is no longer needed and you should be able to access your instance. It was a really nice email. We're still investigating the exact root cause and we're monitoring in case we see a recurrence, and we'll let you know. But yeah.

Wilhelm Klopp (1:02:15)
Ha ha ha.

Matthew Carey (1:02:37)
Just so funny. Just so funny. I guess that's the perils of scaling really quickly. But really cool, they reached a billion dollars. Did you see that?

Wilhelm Klopp (1:02:39)
wild.

I did not, no, no, Yeah, shout out to them, congrats. No small feat.

Matthew Carey (1:02:47)
Yeah, well, they raised a billion

dollars. It's very, very cool. Yeah, so I don't know what that means, but people love LangSmith. It's actually a really good platform. Really good platform. Anyway, did you have anything you wanted? What is "differentiate yourself", before we call it a day?

Wilhelm Klopp (1:02:51)
Right, right. Not revenue yet.

interesting.

You said, you added that one. That's one of yours.

Matthew Carey (1:03:08)
Oh dear, did I? Okay. Okay, yeah, let's talk about it next time. I don't have very clear thoughts around that.

Wilhelm Klopp (1:03:11)
but we can chat about it next time.

Yeah,

there's a lot of interesting stuff still on the agenda, like... oh wait, should we not mention... we've gone the whole episode without mentioning the word. Let's skip it for next time. You know which word I mean. It's the future.

Matthew Carey (1:03:27)
Let's leave the word. Let's leave the word.

Oh my god. Yeah, I know we're stretching it, dude, but I'm actually quite proud of us. I think I should go celebrate after this that we haven't mentioned the word.

Yeah, the future. Anyway, it's been an absolute pleasure. Sorry, I just went off on a rant for the last like 10 minutes. I might have to cut that down. That's horrific.

Wilhelm Klopp (1:03:41)
Cool.

No, no,

I actually think that, yeah, I mean, edit as much as you want, but I feel like it's nice to get the raw, raw us, you know.

Matthew Carey (1:03:55)
Yeah. I mean, I do just think though, putting your prompts, which is what all these people were having issues with... like, LangSmith is an observability platform and an eval platform. Putting all of your prompts...

Wilhelm Klopp (1:03:57)
Yeah.

Matthew Carey (1:04:08)
Like, those things should just drain, right? They shouldn't matter. If it's not up, it shouldn't bring down your main application. If Datadog goes down, my main application should be fine, right? I shouldn't need active connections to Datadog to service my main application. And that's the same thing, but they've encouraged this product use case where you store your prompts inside your eval platform, because people who are less technical can now edit them.

Wilhelm Klopp (1:04:15)
Right.

Yeah, yeah.

Yes, yes, yes, yes.

Hmm. I see.

Matthew Carey (1:04:35)
What it means is now you've lost Git, you've lost any type of review, and you've put like a real active dependency on your production workload. Come on, guys. I think it's such an anti-pattern.

Wilhelm Klopp (1:04:44)
Yeah, that's a bit

much. Yeah, that's not great. You want, like, yeah, that's tough. You want some good caching or some really good guarantees from the provider.

Matthew Carey (1:04:55)
Yeah, I think one of the reasons why they say you should do this as well is they'll cache your prompts, they will make this quick. But they won't make this quick. Let's be honest, you're still calling a server, in some cases halfway around the world, and you're doing it at runtime, and it's not a server that you have any sort of control over. And it's the same server that's processing my, like, millions of random experiment requests, apparently. Yeah, cool. Anyway, that was a rant.
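
The kind of defence being gestured at here, as a minimal sketch: keep a committed copy of the prompt in the repo as the source of truth, and treat the hosted/managed version as a best-effort override with a short timeout and a cache. The URL, TTL, and cache shape below are made up for illustration.

```typescript
// Sketch: don't hard-depend on a prompt platform at request time.
// A committed prompt is the fallback; the remote copy is an optional override.
// URL, TTL, and cache shape are placeholders.

const FALLBACK_PROMPT = "You are a helpful assistant for ..."; // lives in Git, reviewed like code

let cached: { prompt: string; fetchedAt: number } | null = null;
const TTL_MS = 5 * 60 * 1000;

async function getPrompt(): Promise<string> {
  if (cached && Date.now() - cached.fetchedAt < TTL_MS) return cached.prompt;
  try {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), 500); // fail fast if the platform is slow
    const res = await fetch("https://prompts.example.com/my-prompt", {
      signal: controller.signal,
    });
    clearTimeout(timer);
    if (!res.ok) throw new Error(`status ${res.status}`);
    cached = { prompt: await res.text(), fetchedAt: Date.now() };
    return cached.prompt;
  } catch {
    // Platform down or unreachable: the app keeps working on the committed prompt.
    return cached?.prompt ?? FALLBACK_PROMPT;
  }
}
```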

Wilhelm Klopp (1:05:02)
Hmm.

Mm-hmm. Mm-hmm. Mm-hmm.

start to my Wednesday, end of your Wednesday. You're out next week, but what are you doing actually?

Matthew Carey (1:05:23)
Yeah, it's bad enough.

I'm going paragliding

we did a course last year for a week and then I didn't do any paragliding for a whole year and that course was in the mountains and I want to learn how to fly on the cliffs in the wind. So yeah, we're going to do something on the cliffs in the wind and then hopefully afterwards I'll actually do some paragliding this year because I'm probably a bit rusty and so I won't have to do another course.

for the foreseeable future, which would be really nice. But yeah, it's not really the type of thing that you want to do badly. It's quite binary, whether it goes well or not. So yeah.

Wilhelm Klopp (1:05:54)
That's amazing. Well, enjoy.

Yes, yes, yes,

yes, yes. Dude, I'll miss you next week, but see you in two weeks. And enjoy, enjoy the paragliding. Send lots of pics.

Matthew Carey (1:06:04)
be fun.

See you in two weeks. Yeah, see you in two

weeks, dude. Do have any big things on?

Wilhelm Klopp (1:06:11)
I don't think so. Just a lot of coding, locking in, building a certain kind of server.

Matthew Carey (1:06:16)
locking in the SF grind the

the SF summer grind

Wilhelm Klopp (1:06:22)
That's right. Yeah.

Matthew Carey (1:06:24)
you

didn't want to say the word. Is that what that was? Is that you not saying the word?

Wilhelm Klopp (1:06:26)
Yeah. Correct.

Well, I told you this already on Monday, but I got my first random-person-on-the-internet customer for the triathlon brand that we're building. I think I mentioned this, but that's kind of like huge, it's kind of a big moment.

Matthew Carey (1:06:38)
Dude, yeah, that's huge

news. You should actually shout it out. Shout it out properly here and we'll put the link in the show notes.

Wilhelm Klopp (1:06:46)
It's tritonclub.com if you do triathlons and you want to show off your race achievements, order yourself a sweater. Lots of cool things planned for the future of Triton, but it's really, really fun to have a first random customer who just saw it, loved it, bought it.

Matthew Carey (1:07:00)
of your awesome lifestyle business. And what's the selling point? Because there is a unique selling point. Go on.

Wilhelm Klopp (1:07:06)
there's a lot of triathlon swag that's specific to a single triathlon you did, but with this one you can kind of show off, in sort of the style of a North Korean general, who wears a very showy uniform, like all the different triathlons you've done across all the different distances. you can...

be proud of the hard work you've put in, not for a single race, but for many different races.

Matthew Carey (1:07:26)
So if

I've only done a sprint triathlon, would I get like half a line? And if I've done like a full distance, do I get like the full like medal on the front? How does it work?

Wilhelm Klopp (1:07:34)
Yeah, that kind of, yeah,

there's, it's kind of like that, yeah. So there's a little achievement system and you can put in all your different races, and then you'll see a preview of the design.

Matthew Carey (1:07:43)
So it generates

the design for you based on your, on the things you've done. Okay. Almost like a better trophy.

Wilhelm Klopp (1:07:47)
Exactly,

That's right, yeah. And you can wear it out to the family barbecue, the airport lounge, the Monday morning business meeting. All the people who don't know that you're a triathlete. Actually, I think in general there's not a lot of good casual triathlon wear. There's a lot of race gear and a lot of t-shirts you can wear on a run. Yeah, how could you... Yeah, exactly. But then no one knows. Are you just a runner?

Matthew Carey (1:07:53)
Nah, it's really cool.

and when you spot someone else with it. Yeah.

Well this is the casual wear, isn't it?

The Garmin watch,

Wilhelm Klopp (1:08:18)
Are you just a,

Matthew Carey (1:08:18)
the Garmin watch, yeah.

Wilhelm Klopp (1:08:20)
yeah, yeah. Anyway, that's enough of a shout out. We can talk about it more next time.

Matthew Carey (1:08:24)
It's really cool though. What's the tech stack?

Wilhelm Klopp (1:08:26)
It actually runs on Cloudflare and it's React.

Matthew Carey (1:08:29)
What's the front end?

Straight React.

Wilhelm Klopp (1:08:32)
Straight React.

Matthew Carey (1:08:34)
Wow, I love it, I love it, love it. Is it a Python backend?

Wilhelm Klopp (1:08:37)
No, actually, it's all in Workers at the moment. But I think it should be... I just love working with a database with Django. I've actually found migrations and stuff quite hard in Cloudflare, in D1. It's just a little bit painful. It works for now, but... I think migrations are just really nice in Django. And then I tried to use Drizzle, but that...

Matthew Carey (1:08:52)
yeah? How come? How?

Yeah, Drizzle's good, man.

Wilhelm Klopp (1:09:02)
I really... it seemed not great for migrations. I don't know, there were just tons of errors I couldn't get past for migrations.

Matthew Carey (1:09:08)
I just do, like, Drizzle.

Okay. So the one who has a really good blog post on this, and then we will end this because, my God, so much editing... the one who has a really good blog post on this is Boris. Boris, ex-Baselime. He's using Durable Objects for like a per-user database. But you don't have to go that crazy, and you can just use Drizzle. But the main thing is to include a migrations directory in your Drizzle config.

Wilhelm Klopp (1:09:13)
Ha ha.

Matthew Carey (1:09:32)
Sorry, in your D1 config in Wrangler, and then it will automatically pick up the migrations. He goes through all of the setup and I've had it pretty painless. I think I have a... Maybe we need to do some pairing. Message me next time you get stuck on that sort of thing. Because... Yeah, let's do it. Right, catch you in a bit, dude. Big love. Bye!
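
For reference, roughly the setup being described: point drizzle-kit's `out` directory and Wrangler's D1 `migrations_dir` at the same folder, so `wrangler d1 migrations apply` runs whatever `drizzle-kit generate` emits. This is a minimal sketch; the database name, binding, and schema path are placeholders, and Boris's post covers the full setup.

```typescript
// drizzle.config.ts — minimal sketch; schema path and database name are placeholders.
// The matching wrangler.toml entry (shown here as a comment) points migrations_dir
// at the same folder, so Wrangler's D1 migrations pick up what drizzle-kit generates:
//
//   [[d1_databases]]
//   binding = "DB"
//   database_name = "my-db"
//   database_id = "<your-d1-database-id>"
//   migrations_dir = "drizzle/migrations"
//
// Then: `npx drizzle-kit generate` to emit the SQL migrations, and
// `npx wrangler d1 migrations apply my-db` to run them against D1.

import { defineConfig } from "drizzle-kit";

export default defineConfig({
  dialect: "sqlite",              // D1 is SQLite under the hood
  schema: "./src/db/schema.ts",   // placeholder path to your Drizzle schema
  out: "./drizzle/migrations",    // same folder as migrations_dir above
});
```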

Wilhelm Klopp (1:09:44)
We can do some pairing. Aw, thank you.

Let's do it. Cool. See you later. Bye. Love.
