AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 15 hours ago

Affidavit@lemmy.world · 50 minutes ago

“…for multi-step tasks”

TheGrandNagus@lemmy.world · edit-2 14 hours ago

LLMs are an interesting tool to fuck around with, but I see things that are hilariously wrong often enough to know that they should not be used for anything serious. Shit, they probably shouldn’t be used for most things that are not serious either.

It’s a shame that by applying the same “AI” naming to a whole host of different technologies, LLMs being limited in usability - yet hyped to the moon - is hurting other more impressive advancements.

For example, speech synthesis is improving so much right now, which has been great for my sister who relies on screen reader software.

Being able to recognise speech in loud environments, or removing background noise from recordings, is improving loads too.

As are things like pattern/image analysis, which appear very promising in medical analysis.

All of these get branded as “AI”. A layperson might not realise that they are completely different branches of technology, and therefore reject useful applications of “AI” tech, because they’ve learned not to trust anything branded as AI, due to being let down by LLMs.

snooggums@lemmy.world · 14 hours ago

LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn’t need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

TeddE@lemmy.world · 13 hours ago

Because the tech industry hasn’t had a real hit of its favorite poison “private equity” in too long.

The industry has played the same playbook since at least 2006. Likely before, but that’s when I personally started seeing it. My take is that they got addicted to the dotcom bubble and decided they can and should recreate the magic every 3-5 years or so.

This time it’s AI, last it was crypto, and we’ve had web 2.0, 3.0, and a few others I’m likely missing.

But yeah, it’s sold like a panacea every time, when really it’s revolutionary for like a handful of tasks.

rottingleaf@lemmy.world · 13 hours ago

That’s because they look like “talking machines” from various sci-fi. Normies feel as if they are touching the very edge of progress. The rest of our life and the Internet kinda don’t give that feeling anymore.

NarrativeBear@lemmy.world · 13 hours ago

Just had a search yesterday on the App Store and Google Play Store to see what new “productivity apps” are around. Pretty much every app now has AI somewhere in its name.

Punkie@lemmy.world · 12 hours ago

I’d compare LLMs to a junior executive. Probably gets the basic stuff right, but check and verify for anything important or complicated. Break tasks down into easier steps.

NarrativeBear@lemmy.world · 14 hours ago

The ones being implemented into emergency call centers are better though? Right?

TeddE@lemmy.world · 14 hours ago

Yes! We’ve gotten them up to 94% wrong at the behest of insurance agencies.

lepinkainen@lemmy.world · 13 hours ago

Wrong 70% doing what?

I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

Same with code: any free model can easily generate simple scripts and utilities with maybe a 10% error rate, definitely not 70%.

CodeBlooded@programming.dev · 12 hours ago

I’m far more efficient with AI tools as a programmer. I love it! 🤷‍♂️

esc27@lemmy.world · edit-2 10 hours ago

30% might be high. I’ve worked with two different agent creation platforms. Both require a huge amount of manual correction to work anywhere near accurately. I’m really not sure what the LLM actually provides other than some natural language processing.

Before human correction, the agents I’ve tested were right 20% of the time, wrong 30%, and failed entirely 50%. To fix them, a human has to sit behind the curtain and manually review conversations and program custom interactions for every failure.

In theory, once it is fully set up and all the edge cases are fixed, it will provide 24/7 support in a convenient chat format. But that takes a lot more man-hours than the hype suggests…

Weirdly, ChatGPT does a better job than a purpose-built, purchased agent.

atticus88th@lemmy.world · 10 hours ago

this study was written with the assistance of an AI agent.

FenderStratocaster@lemmy.world · 13 hours ago

I tried to order food at a Taco Bell drive-through the other day and they had an AI thing taking your order. I was so frustrated that I couldn’t order something that was on the menu that I just drove to the window instead. The guy that worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

If you want to use AI, I’m not going to use your services or products unless I’m forced to. Looking at you, Xfinity.

kinsnik@lemmy.world · 14 hours ago

I haven’t used AI agents yet, but my job is kinda pushing for them. But I have used the Google one that creates audio podcasts, just to play around, since my coworkers were using it to “learn” new things. I fed it some of my own writing and created the podcast. It was fun; it was an audio overview of what I wrote. About 80% was cool analysis, but 20% was straight out of nowhere bullshit (which I know because I wrote the original texts that the audio was talking about). I can’t believe that people are using this for subjects that they have no knowledge of. It is a fun toy for a few minutes (which is not worth the cost to the environment anyway).

lemmy_outta_here@lemmy.world · 14 hours ago

Rookie numbers! Let’s pump them up!

To match their tech bro hypers, they should be wrong at least 90% of the time.

Melvin_Ferd@lemmy.world · 12 hours ago

How often do tech journalists get things wrong?