A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

  • shortwavesurfer@lemmy.zip
    3 months ago

    I’m going to have to read this, because my knee-jerk answer is that it depends on what type of wash you want to give the car. If you want to give the car an actual wash at the car wash, you’re going to have to drive it. But if you’re wanting to wash it at home, then it doesn’t matter how far away the car wash is, because you can just walk out your front door and grab your water hose and soap and shit.

    • 🌞 Alexander Daychilde 🌞@lemmy.world
      3 months ago

      if you’re wanting to wash it at home,

      The AI should absolutely understand the implication that you want to wash your car at the car wash, not at home. The prompt is clear about that, even though it is implied.

      “I want a hamburger. McDonald’s is three miles from me and Wendy’s is five miles. Which is the cheaper place to get a burger from when you consider the distance to each?” is not an exact analogy, but the point is that it should be ABSOLUTELY clear that you do not wish to make your own hamburger. Any response that discusses that as an option is ridiculous, unless maybe it’s one of those options-at-the-end things LLMs love to do - but it has no part in the main answer at all.

      • iopq@lemmy.world
        3 months ago

        Pretty sure if you asked it on Stack Overflow you would get a bunch of responses telling you to make it at home, and then someone would lock your question.

  • pimpampoom@lemmy.zip
    3 months ago

    They didn’t take “thinking mode” into account. Most models pass when thinking is activated.

    • Kyuuketsuki@sh.itjust.works
      3 months ago

      Sure they did. They even had a notation on the results table that Grok passed except when reasoning mode was off.

      ETA: they even posted all the reasoning texts for the models they tested

  • CetaceanNeeded@lemmy.world
    3 months ago

    I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

    Hilariously one of the suggested follow ups in Open Web UI was “What if I don’t have a car - can I still wash it?”

    • WolfLink@sh.itjust.works
      3 months ago

      My locally hosted Qwen3 30b said “Walk” including this awesome line:

      Why you might hesitate (and why it’s wrong):

      • X “But it’s a car wash!” -> No, the car doesn’t need to drive there—you do.

      Note that I just asked the Ollama app, I didn’t alter or remove the default system prompt nor did I force it to answer in a specific format like in the article.

  • melfie@lemy.lol
    3 months ago

    Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

    So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆

    Edit:

    Oh, I see the linked article is part of a marketing campaign to promote this company’s paid cloud service that has source available SDKs as a solution to the problem being outlined here:

    Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

    I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)?
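
    For anyone wondering what a local version of "find the most relevant examples for each new task" could look like, here is a minimal sketch. It is not Opper's method: it assumes a crude word-overlap (Jaccard) score as a stand-in for whatever retrieval they actually use, and the function names (`select_examples`, `build_prompt`) and the tiny dataset are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_examples(task, dataset, k=2):
    """Return the k dataset entries whose questions overlap most with the task."""
    task_tokens = tokenize(task)
    ranked = sorted(
        dataset,
        key=lambda ex: jaccard(task_tokens, tokenize(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task, examples):
    """Put the selected examples in front of the task as few-shot context."""
    parts = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples]
    parts.append(f"Q: {task}\nA:")
    return "\n\n".join(parts)


# Hypothetical example dataset, invented for this sketch.
DATASET = [
    {"question": "I need to refuel my car. The gas station is 100 meters away. Should I walk or drive?",
     "answer": "Drive; the car has to be at the pump."},
    {"question": "What is the capital of France?",
     "answer": "Paris."},
    {"question": "I want to mail a letter. The mailbox is 50 meters away. Should I walk or drive?",
     "answer": "Walk; only the letter needs to get there."},
]

TASK = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
selected = select_examples(TASK, DATASET, k=2)
prompt = build_prompt(TASK, selected)
```

    A real setup would more likely rank by embedding similarity than by shared words, but the shape is the same: score the stored examples against the new task, keep the top few, and prepend them to the prompt. Nothing about that requires a paid API.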

    • Schadrach@lemmy.sdf.org
      3 months ago

      There are models with open weights, and you can run those locally on your GPU. It can be a bit slower depending on model and GPU. For example, GLM has an open version, both full and pruned, but it’s not the newest version. A bunch of image generation models have local versions too.

  • Greg Fawcett@piefed.social
    3 months ago

    What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

    One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say “must have been the AI” instead of doing the legwork to track down the actual bug.

    I think we’re heading for a period of serious software instability.
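
    A consistency test of the kind mentioned above can be sketched in a few lines. This assumes a hypothetical `ask` callable standing in for a real model API; here it is replaced by a seeded random stub so the harness itself stays repeatable. The 0.7 pass probability is an arbitrary value for the stub, not a measured result.

```python
from collections import Counter
import random


def consistency_report(ask, prompt, runs=10):
    """Call ask(prompt) repeatedly and tally the distinct answers."""
    return Counter(ask(prompt) for _ in range(runs))


def make_flaky_stub(p_drive, seed):
    """Hypothetical stand-in for a model API: says "drive" with probability p_drive."""
    rng = random.Random(seed)

    def ask(prompt):
        return "drive" if rng.random() < p_drive else "walk"

    return ask


PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
report = consistency_report(make_flaky_stub(p_drive=0.7, seed=1), PROMPT, runs=10)
is_consistent = len(report) == 1  # True only if every run gave the same answer
```

    Seeding the stub is the point: the test harness is deterministic even though the thing under test is not, so a failure can at least be attributed to the model call and not to the surrounding code.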

    • bss03@infosec.pub
      3 months ago

      Yeah, software is already not as deterministic as I’d like. I’ve encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have “the wrong” values – not zero values, and not the fences that the debugger might try to use. And, mocking or stubbing remote API calls is another way replicable behavior evades realization.

      Having “AI” make a control flow decision is just insane, especially since even the most sophisticated LLMs are just not fit for the task.

      What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.

      But I’m very biased (I’m sure “AI” has “stolen” my IP, and “AI” is coming for my (programming) job(s)), and quite unimpressed with the “AI” models I’ve interacted with, especially in areas I’m an expert in, but also in areas where I’m not an expert but am very interested and capable of doing any sort of critical verification.

        • bss03@infosec.pub
          3 months ago

          Yes, I’ve written some Lean. It’s not my favorite programming language or proof assistant, but it seems to have “captured the zeitgeist” and has an actively growing ecosystem.

            • bss03@infosec.pub
              3 months ago

              Also, my preference shouldn’t matter to anyone else. If you want to increase your proof assistant skill (even from nothing), I suggest Lean. Probably the same if you want to increase programming skill in a dependently typed language.

              Honestly, I should get more comfortable with it.

            • bss03@infosec.pub
              3 months ago

              Right now, I’m spending more time in Idris. It’s not a great proof assistant, but I think it’s a lot easier to write programs in. Rocq is the real proof assistant I’ve used, but I don’t have a strong opinion on them because all the proofs I’ve wanted/needed to write were small enough to need minimal assistance. (The bare-bones features that are in Agda or Idris were enough.)

    • Fmstrat@lemmy.world
      3 months ago

      This is adjustable via temperature. It is set higher on chatbots, which makes the answers more random. It’s set lower on code assistants to make things more deterministic.
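
      For reference, a minimal sketch of how temperature usually enters sampling: the model's raw scores are divided by the temperature before the softmax, so a lower temperature sharpens the distribution and a higher one flattens it. The logits below are made-up numbers for two candidate answers, not output from any real model.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores: index 0 = "drive", index 1 = "walk".
logits = [2.0, 1.0]
cold = softmax_with_temperature(logits, 0.1)  # close to deterministic
hot = softmax_with_temperature(logits, 2.0)   # noticeably random
pick = sample_token(logits, 1.0, random.Random(0))
```

      With these numbers, a temperature of 0.1 puts almost all of the probability on the first option, while 2.0 leaves it at roughly 62%, so repeated runs at the higher setting will flip between answers far more often.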

    • XLE@piefed.social
      3 months ago

      AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, “temperature” can be controlled), you can change a single letter and get a totally different and wrong result too. It’s an unfixable “feature” of the chatbot system

    • merc@sh.itjust.works
      3 months ago

      It’s also the case that people are mostly consistent.

      Take a question like “how long would it take to drive from here to [nearby city]”. You’d expect that someone’s answer to that question would be pretty consistent day-to-day. If you asked someone else, you might get a different answer, but you’d also expect that answer to be pretty consistent. If you asked someone that same question a week later and got a very different answer, you’d strongly suspect that they were making the answer up on the spot but pretending to know so they didn’t look stupid or something.

      Part of what bothers me about LLMs is that they give that same sense of bullshitting answers while trying to cover that they don’t know. You know that if you ask the question again, or phrase it slightly differently, you might get a completely different answer.

  • jaykrown@lemmy.world
    3 months ago

    Interesting, I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There’s a reason why there’s a shift towards “thinking” models, because it forces the model to build its own context before giving a concrete answer.

    Without DeepThink

    With DeepThink

    • rockSlayer@lemmy.blahaj.zone
      3 months ago

      It’s interesting to see it build the context necessary to answer the question, but this seems to be a lot of text just to come up with a simple answer

      • Buffy@libretechni.ca
        3 months ago

        They’re showing the thinking the model did, the actual response is the sentence at the end.

      • Schadrach@lemmy.sdf.org
        3 months ago

        The whole premise of deep think and similar in other models is to come up with an answer, then ask itself if the answer is right and how it could be wrong until the result is stable.

        The seahorse emoji question is one that trips up a lot of models (it’s a Mandela effect thing where it doesn’t exist but lots of people remember it and as a consequence are firm that it’s real). I asked GLM 4.7 about it with deep think on and it wrote about two dozen paragraphs trying to think of everywhere a seahorse emoji could be hiding, whether it was in a previous or upcoming standard, whether there was another emoji that might be mistaken for a seahorse, etc. It eventually decided that it didn’t exist, double-checked that it wasn’t missing anything, and gave an answer.

        It was startlingly like the flow of consciousness of someone experiencing the Mandela effect trying desperately to find evidence they were right, except it eventually gave up and realized the truth.

        • Pup Biru@aussie.zone
          3 months ago

          yeah i find the thinking fascinating with maths too… like LLMs are horrible at maths but so am i if i have to do it in my head… the way it breaks a problem down into tiny bits that are certainly in its training data, and then combines those bits is an impressive emergent behaviour imo given it’s just doing statistical next-token prediction

          • mirshafie@europe.pub
            3 months ago

            Your verbal faculties are bad at math. Other parts of your brain do calculations.

            LLMs are a computer’s verbal faculties. But guess what, they’re just a really big calculator. So when LLMs realize that they’re doing a math problem and launch a calculator/equation solver, they’re not so bad after all.

            • Pup Biru@aussie.zone
              3 months ago

              that solver would be tool use though… i’m talking about just the “thinking” LLMs. it’s fascinating to read the thinking block, because it breaks the problem down into basic chunks, solves the basic chunks (which would have been in its training data, so easy), and solves them with multiple methods and then compares to check itself

    • NewNewAugustEast@lemmy.zip
      3 months ago

      What is the wrong answer though? It is a stupid question. I would look at you sideways if you asked me this, because the obvious answer is “walk silly, the car is already at the car wash”. Otherwise why would you ask it?

      Which is telling, because when asked to review the answer, the AIs that I have seen said, “You asked me how you were going to get to the car wash.” The assumption was that the car was already there.

      • MBech@feddit.dk
        3 months ago

        Why would the car already be at the car wash if you ask it whether or not you should drive there?

        • humanspiral@lemmy.ca
          3 months ago

          AI tech bros have more than 1 car? Doesn’t everybody? Or do you drive your Ferrari everywhere? Like you woke millennials make me sick. Never mind the avocado toast and rotisserie chicken. Don’t you understand the basic math of maintenance costs of driving your Ferrari everywhere?

    • Hazzard@lemmy.zip
      3 months ago

      They also polled 10,000 people to compare against a human baseline:

      Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher “drive” rate.

      That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.

      • Modern_medicine_isnt@lemmy.world
        3 months ago

        This here is the point most people fail to grasp. The AI was taught by people. And people are wrong a lot of the time. So the AI is more like us than what we think it should be. Right down to it getting the right answer for all the wrong reasons. We should call it human AI. Lol.

        • NewNewAugustEast@lemmy.zip
          3 months ago

          Like I said to the person above, there is no wrong answer. It’s all about assumptions. It is a stupid trick question that no one would ask.

            • NewNewAugustEast@lemmy.zip
              3 months ago

              LOL! That is a great answer.

              I have a Microsoft story. I know someone who was hired to stop them from continuing an open source project. Microsoft gave them a good salary, stock options, and an office with a fully stocked bar. They said do whatever you want; they figured they would get a good developer and kill the open source competition (back in the Ballmer days).

              Sadly, given money, no real ambition to create closed source software, they mostly spent their days in their office and basically drank themselves to death.

              Microsoft just kills everything it touches.

      • architect@thelemmy.club
        3 months ago

        The question is based on assumptions. That takes advanced reading skills. I’m surprised it was 71% passing, to be honest. (The humans, that is)

      • myfunnyaccountname@lemmy.zip
        3 months ago

        Can they do two samplings for that? One in a city with a decent to good education system. The other in the backwoods out in the middle of nowhere…where family trees are sticks.

    • Jax@sh.itjust.works
      3 months ago

      Dirtying the car on the way there?

      The car you’re planning on cleaning at the car wash?

      Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn’t be possible.

      • _g_be@lemmy.world
        3 months ago

        You’re assuming AI “think” “logically”.

        Well, maybe you aren’t, but the AI companies sure hope we do

        • Jax@sh.itjust.works
          3 months ago

          Absolutely not, I’m still just scratching my head at how something like this is allowed to happen.

          Has any human ever said that they’re worried about their car getting dirtied on the way to the carwash? Maybe I could see someone arguing against getting a carwash, citing it getting dirty on the way home — but on the way there?

          Like you would think it wouldn’t have the basis to even put those words together that way — should I see this as a hallucination?

          Granted, I would never ask an AI a question like this — it seems very far outside of potential use cases for it (for me).

          Edit: oh, I guess it could have been said by a person in a sarcastic sense

            • Jax@sh.itjust.works
              3 months ago

              I guess I’ll know to be impressed by AI when it can distinguish things like sarcasm.

          • _g_be@lemmy.world
            3 months ago

            you understand the context, and can implicitly understand ‘the need to drive to the car wash’, but these glorified auto-complete machines will latch on to the “should I walk there” and the small distance quantity. It even seems to parrot words about not wanting to drive after having your car washed. There’s no ‘thinking’ about the whole thought, and apparently no logical linking of two separate ideas

  • WraithGear@lemmy.world
    3 months ago

    and what is going to happen is that some engineer will band aid the issue and all the ai crazy people will shout “see! it’s learnding!” and the ai snake oil sales man will use that as justification of all the waste and demand more from all systems

    just like what they did with the full glass of wine test. and no ai fundamentally did not improve. the issue is fundamental with its design, not an issue of the data set

    • turmacar@lemmy.world
      3 months ago

      Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

      A sample size of 10 is nothing.
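
      To put a rough number on that, here is a sketch using the Wilson score interval for a binomial proportion. It assumes each run is an independent pass/fail trial with a fixed pass rate, which is a simplification, and uses z = 1.96 for an approximately 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


low_10, high_10 = wilson_interval(10, 10)      # a "perfect" 10/10 run
low_100, high_100 = wilson_interval(100, 100)  # the same rate with 10x the samples
```

      Under these assumptions, a 10/10 run is still compatible with a true pass rate of roughly 72%, which is about where the human baseline sits. It takes far more runs before a clean sheet actually rules that out.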

  • Bluewing@lemmy.world
    3 months ago

    I just asked Google Gemini 3 “The car is 50 miles away. Should I walk or drive?”

    In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

    And under reasons to walk, “You are a character in a post-apocalyptic novel.”

    Me thinks I detect notes of sarcasm…

  • ThomasWilliams@lemmy.world
    3 months ago

    <“I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”>

    The model discards the first sentence as it is unrelated to the others.

    Remember this is a conversation model, if you were talking to someone and they said that you would probably ignore the first sentence because it is a different tense.

    • SaltySalamander@fedia.io
      3 months ago

      If I were talking to someone, and said those three sentences, and they chose to ignore the contextual sentence, I would think their social skills were basically nonexistent.

    • Tetragrade@leminal.space
      3 months ago

      Wow you must have done some really extensive probing of the models to say that with such confidence. When can we expect the paper?

  • Rhoeri@lemmy.world
    3 months ago

    I remember years ago getting downvoted into oblivion both here, and on Reddit for saying that AI would be a disaster.

  • TrackinDaKraken@lemmy.world
    3 months ago

    I think it’s worse when they get it right only some of the time. It’s not a matter of opinion, it should not change its “mind”.

    The fucking things are useless for that reason, they’re all just guessing, literally.

    • merc@sh.itjust.works
      3 months ago

      It’s not literally guessing, because guessing implies it understands there’s a question and is trying to answer that question. It’s not even doing that. It’s just generating words that you could expect to find nearby.

      • m0darn@lemmy.ca
        3 months ago

        Isn’t it a probabilistic extrapolation? Isn’t that what a guess is?

        • Iconoclast@feddit.uk
          3 months ago

          It’s a Large Language Model. It doesn’t “know” anything, doesn’t think, and has zero metacognition. It generates language based on patterns and probabilities. Its only goal is to produce linguistically coherent output - not factually correct output.

          It gets things right sometimes purely because it was trained on a massive pile of correct information - not because it understands anything it’s saying.

          So no, it doesn’t “guess.” It doesn’t even know it’s answering a question. It just talks.

          • vii@lemmy.ml
            3 months ago

            It gets things right sometimes purely because it was trained on a massive pile of correct information - not because it understands anything it’s saying.

            I know some humans that applies to

          • SuspciousCarrot78@lemmy.world
            3 months ago

            A fair point but often it overlooks something -

            Language itself encodes meaning. If you can statistically predict the next word, then you are implicitly modeling the structure of ideas, relationships, and concepts carried by that language.

            You don’t get coherence, useful reasoning, or consistently relevant answers from pure noise. The patterns reflect real regularities in the world, distilled through human communication.

            Yes, that doesn’t mean an LLM “understands” in the human sense, or that it’s infallible.

            But reducing it to “just autocomplete” misses the fact that sufficiently rich pattern modeling can approximate aspects of reasoning, abstraction, and knowledge use in ways that are practically meaningful, even if the underlying mechanism is different from human thought.

            TL;DR: it’s a bit more than just a fancy spell check. ICBW and YMMV but I believe I can back this claim with evidence if needed.

            • Iconoclast@feddit.uk
              3 months ago

              No, I completely agree. My personal view is that these systems are more intelligent than the haters give them credit for, but I think this simplistic “it’s just autocomplete” take is a solid heuristic for most people - keeps them from losing sight of what they’re actually dealing with.

              I’d say LLMs are more intelligent than they have any right to be, but not nearly as intelligent as they can sometimes appear.

              The comparison I keep coming back to: an LLM is like cruise control that’s turned out to be a surprisingly decent driver too. Steering and following traffic rules was never the goal of its developers, yet here we are. There’s nothing inherently wrong with letting it take the wheel for a bit, but it needs constant supervision - and people have to remember it’s still just cruise control, not autopilot.

              The second we forget that is when we end up in the ditch. You can’t then climb out shaking your fist at the sky, yelling that the autopilot failed, when you never had autopilot to begin with.

              • SuspciousCarrot78@lemmy.world
                3 months ago

                I think we’re probably on the same page, tbh. OTOH, I think the “fancy auto complete” meme is a disingenuous thought stopper, so I speak against it when I see it.

                I like your cruise control+ analogy. It’s not quite self driving… but it’s not quite just cruise control, either. Something halfway.

                LLMs don’t have human understanding or metacognition, I’m almost certain.

                But next-token prediction suggests a rich semantic model, that can functionally approximate reasoning. That’s weird to think about. It’s something half way.

                With external scaffolding (memory, retrieval, provenance, and fail-closed policies), I think you can turn that into even more reliable behavior.

                And then… I don’t know what happens after that. There’s going to come a time where we cross that point and we just can’t tell any more. Then what? No idea. May we live in interesting times, as the old curse goes.

                • Iconoclast@feddit.uk
                  3 months ago

                  I think the “fancy auto complete” meme is a disingenuous thought stopper, so I speak against it when I see it.

                  I can respect that. I’ve criticized it plenty myself too. I think this is just me knowing my audience and tweaking my language so at least the important part of my message gets through. Too much nuance around here usually means I spend the rest of my day responding to accusations about views I don’t even hold. Saying anything even mildly non-critical about AI is basically a third rail in these parts of the internet.

                  These systems do seem to have some kind of internal world model. I just have no clue how far that scales. Feels like it’s been plateauing pretty hard over the past year or so.

                  I’d be really curious to try the raw versions of these models before all the safety restrictions get slapped on top for public release. I don’t think anyone’s secretly sitting on actual AGI, but I also don’t buy that what we have access to is the absolute best versions in existence.

                • HugeNerd@lemmy.ca
                  3 months ago

                  think the “fancy auto complete” meme is a disingenuous

                  “LLMs don’t have human understanding or metacognition”

                  Then what’s the (auto-completing) fucking problem? It’s just a series of steps on data. You could feed it white noise and it would vomit up more noise. And keep doing it as long as there’s power.

                  Intelligent?

        • vii@lemmy.ml
          3 months ago

          This gets very murky very fast when you start to think how humans learn and process, we’re just meaty pattern matching machines.

    • Iconoclast@feddit.uk
      3 months ago

      Is cruise control useless because it doesn’t drive you to the grocery store? No. It’s not supposed to. It’s designed to maintain a steady speed - not to steer.

      Large Language Models, as the name suggests, are designed to generate natural-sounding language - not to reason. They’re not useless - we’re just using them off-label and then complaining when they fail at something they were never built to do.

      • Urist@leminal.space
        3 months ago

        Language without meaning is garbage. Like, literal garbage, useful for nothing. Language is a tool used to express ideas, if there are no ideas being expressed then it’s just a combination of letters.

        Which is exactly why LLMs are useless.

        • Iconoclast@feddit.uk
          3 months ago

          Which is exactly why LLMs are useless.

          800 million weekly ChatGPT users disagree with that.

          • Urist@leminal.space
            3 months ago

            Those users are being harmed by it, not benefited. That isn’t useful, it’s a social disease.

          • RichardDegenne@lemmy.zip
            3 months ago

            And there are 1.3 billion smokers in the world according to the WHO.

            Does that make cigarettes useful?

            • Iconoclast@feddit.uk
              3 months ago

              Something being useful doesn’t imply it’s good or beneficial. Those terms are not synonymous. Usefulness describes whether a thing achieves a particular goal or serves a specific purpose effectively.

              A torture device is useful for extracting information. A landmine is useful for denying an area to enemy troops.

              • Urist@leminal.space
                3 months ago

                A torture device is useful for extracting information.

                No it fucking isn’t! This is a great analogy, actually, thank you for bringing it up. A person being tortured will tell you literally anything that they believe will stop you from torturing them. They will confess to crimes that never happened, tell you about all their accomplices who don’t exist, and all their daily schedules that were made up on the spot. Torture is useless but morons think it is useful. Just like AI.

                • Womble@piefed.world
                  3 months ago

                  Torture can be a useful way of extracting information if you have a way to instantly verify it, which actually makes it a good analogy to LLMs. If I want to know the password to your laptop, and I torture you until you give me the correct password and I log in, then that works.

      • tigeruppercut@lemmy.zip
        3 months ago

        But natural language in service of what? If they can’t produce answers that are correct, what’s the point of using them? I can get wrong answers anywhere.

        • iopq@lemmy.world
          3 months ago

          Some of them can produce the correct answer. If we do the test next year and they do better than humans then, isn’t that progress?

        • Iconoclast@feddit.uk
          3 months ago

          I’m not here defending the practical value of these models. I’m just explaining what they are and what they’re not.

          • XLE@piefed.social
            3 months ago

            You’re definitely running around Lemmy defending AI, Iconoclast… Might as well be honest about it

            • Iconoclast@feddit.uk
              3 months ago

              I’m not really interested in engaging in discussions about what you or anyone else thinks my underlying motives are. You’re free to point out any factual inaccuracies in my responses, but there’s no need to make it personal and start accusing me of being dishonest.

        • Threeme2189@sh.itjust.works
          3 months ago

          As OP said, LLMs are really good at generating text that is fluid and looks natural to us. So if you want that kind of output, LLMs are the way to go.
          Not all LLM prompts ask factual questions and not all of the generated answers need to be correct.
          Are poems, songs, stories or movie scripts ‘correct’?

          I’m totally against shoving LLMs everywhere, but they do have their uses. They are really good at this one thing.

          • tigeruppercut@lemmy.zip
            3 months ago

            Are poems, songs, stories or movie scripts ‘correct’?

            It’s a valid point that they can produce natural language. The Turing Test has been a thing for a while, after all. But while the language sounds natural, can they create anything meaningful? Are the poems or stories they make worth anything? It’s not like humans don’t create shitty art, so I guess generating random soulless crap is similar to that.

            The value of language produced by something that can’t understand the reason for language is an interesting question I suppose.

            • iopq@lemmy.world
              3 months ago

              There are people out there whose job is to format promotional emails for companies. AIs can replace this kind of soulless work completely. We should applaud that.

            • Threeme2189@sh.itjust.works
              3 months ago

              I’m with you on that. I’ve come to realize that I value a shitty stick figure that was drawn by a human much more than an AI generated ‘Mona Lisa’.

    • Tetragrade@leminal.space
      3 months ago

      Same takeaway as the article (everyone read the article, right?).

      Applying it to yourself, can you recall instances when you were asked the same question at different points in time? How did you respond?

      • CileTheSane@lemmy.ca
        3 months ago

        Having read the article (you read the article, right?), what gave you the impression the AI was asked the question at different points in time?

        • Tetragrade@leminal.space
          3 months ago

          The AI was asked the same question repeatedly and gave different answers, due to its randomised structure.

          People will also often do this (I have, personally), but because our actions seem to be strongly influenced by time-dependent stuff (like sense perception and short-term memory contents), I’d expect you’d need to ask at different times.

    • XLE@piefed.social
      3 months ago

      Even if you retooled the LLM to not randomize the output it generates, it can still create contradictory outputs based on a slightly reworded question. I’m talking about a misspelling, different punctuation, things that simply wouldn’t cause a person to change their answer.

      (And that’s assuming the LLM just got started from scratch. If you had any previous conversation with it, it could have influenced the output as well. It’s such a mess.)
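
      The two behaviours discussed in this subthread (randomised answers to the same question, and deterministic answers that still flip when the input shifts slightly) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-token vocabulary, the made-up scores in `LOGITS`, and the helper names are hypothetical, and a real LLM computes its scores from the whole prompt rather than from a fixed list.

```python
import math
import random

# Hypothetical two-word "vocabulary" and made-up scores (logits).
# A real model would derive these scores from the prompt text.
TOKENS = ["drive", "walk"]
LOGITS = [1.2, 1.0]


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(tokens, logits):
    """Deterministic choice: always return the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]


def pick_sampled(tokens, logits, rng, temperature=1.0):
    """Randomised choice: draw a token in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Same "question" asked ten times with sampling: answers can differ.
    print([pick_sampled(TOKENS, LOGITS, rng) for _ in range(10)])
    # Same "question" asked ten times greedily: the answer never changes.
    print([pick_greedy(TOKENS, LOGITS) for _ in range(10)])
    # A slightly different input (modelled here as slightly shifted scores)
    # can still flip the greedy answer.
    print(pick_greedy(TOKENS, [1.0, 1.05]))
```

      When the scores are nearly tied, as in this toy case, sampling returns both answers across repeated runs, while greedy selection is stable for a fixed input but flips as soon as the scores cross over.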