• kescusay@lemmy.world · 22 hours ago

    Software developer, here. (No, not a “vibe coder.” I actually know how to read and write my own code and what it does.)

    Just had the opportunity to test GPT 5 as a coding assistant in Copilot for VS Code, which in my opinion is the only legitimately useful purpose for LLMs. (No, not to write everything for me, just to do some of the more tedious tasks faster.) The IDE itself can help keep them in line, because it detects when they screw up. Which is all the time, due to their nature. Even recent and relatively “good” models like Sonnet need constant babysitting.

    GPT 5 failed spectacularly. So badly, in fact, that I’m glad I only set it to analysis tasks and not to any write tasks. I will not be using it for anything else any time soon.

    • Pechente@feddit.org · 22 hours ago

      Yeah, right? I tried it yesterday to build a simple form for me. Told it to look at the structure of other forms for reference, which it did, and somehow it used NONE of the UI components and helpers from the other forms. It was bafflingly bad.

      • errer@lemmy.world · 19 hours ago

        Despite the “official” coding score for GPT 5 being higher, Claude Sonnet still seems to blow it out of the water. That suggests they’re training to the test and the test must not be a very good test. Or they’re lying.

        • jj4211@lemmy.world · 6 hours ago

          Problem with the “benchmarks” is Goodhart’s Law: once a measure becomes a target, it ceases to be a good measure.

          The AI companies’ obsession with these tests causes them to maniacally train on them, making them better at those tests, but that doesn’t necessarily map to actual real-world usefulness. Occasionally you’ll see a guy who interviews well but is pretty useless on the job. LLMs are basically that all the time, but at least they’re cheap and fast enough to be worth it for the super easy bits.

            • Elvith Ma'for@feddit.org · 19 hours ago (edited)

              Now that we have vibe coding and all programmers have been sacked, they’re apparently trying out vibe presenting and vibe graphing. Management, watch out, you’re obviously next!

    • ThePowerOfGeek@lemmy.world · 19 hours ago

      I no longer even trust modern LLMs to do stuff like convert a table schema or JSON document into a POCO. I tried this the other day with a field list from a table creation script. All it had to do was reformat the fields into a dumb C# model. Inexplicably, it did fine except for omitting a random field in the middle of the list. Kinda shakes your confidence in LLMs for even the most basic programming tasks.
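      For context, the whole job was roughly this trivial (a sketch with invented field and class names, not the actual schema):

      ```csharp
      // Hypothetical field list from the table creation script:
      //   Id INT NOT NULL,
      //   Email NVARCHAR(255) NULL,
      //   CreatedAt DATETIME2 NOT NULL,
      //   IsActive BIT NOT NULL
      //
      // The "dumb C# model" (POCO) is just one property per field:
      using System;

      public class CustomerRecord
      {
          public int Id { get; set; }
          public string? Email { get; set; }      // nullable column -> nullable property
          public DateTime CreatedAt { get; set; }
          public bool IsActive { get; set; }
      }
      ```

      No logic, no edge cases, and it still silently dropped one of the fields.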

      • kescusay@lemmy.world · 18 hours ago

        More and more, for tasks like that I simply will not use an LLM at all. I’ll use a nice, predictable, deterministic script. Weirdly, LLMs are pretty decent at writing those.
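        Something like this rough sketch, for instance (not my actual script; it assumes a simple one-field-per-line list and a small SQL-to-C# type map):

        ```csharp
        // Rough one-off: read "Name SQLTYPE [NOT NULL]" lines and print a C# POCO.
        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Linq;

        class SchemaToPoco
        {
            static readonly Dictionary<string, string> TypeMap = new(StringComparer.OrdinalIgnoreCase)
            {
                ["INT"] = "int",
                ["BIGINT"] = "long",
                ["BIT"] = "bool",
                ["NVARCHAR"] = "string",
                ["VARCHAR"] = "string",
                ["DATETIME2"] = "DateTime",
                ["DECIMAL"] = "decimal",
            };

            // Usage: SchemaToPoco <fields.txt> <ClassName>
            static void Main(string[] args)
            {
                var fields = File.ReadLines(args[0])
                    .Select(l => l.Trim().TrimEnd(','))
                    .Where(l => l.Length > 0);

                Console.WriteLine($"public class {args[1]}");
                Console.WriteLine("{");
                foreach (var field in fields)
                {
                    var parts = field.Split(' ', StringSplitOptions.RemoveEmptyEntries);
                    var name = parts[0];
                    var sqlType = parts[1].Split('(')[0];                    // NVARCHAR(255) -> NVARCHAR
                    var notNull = field.Contains("NOT NULL", StringComparison.OrdinalIgnoreCase);
                    var csType = TypeMap.GetValueOrDefault(sqlType, "object") + (notNull ? "" : "?");
                    Console.WriteLine($"    public {csType} {name} {{ get; set; }}");
                }
                Console.WriteLine("}");
            }
        }
        ```

        Boring and deterministic, but it never gets distracted halfway through the list.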

    • Passerby6497@lemmy.world · 20 hours ago

      Yeah, LLMs are decent with coding tasks if you know what you’re doing and can properly guide them (and check their work!), but fuck if they don’t take a lot of effort to rein in. I will say they’re pretty damned good at debugging the shit I wrote. I’ve been working on an audit project for a few months, and 4o/5 have helped me a good bit to find persistent errors in my execution logic that I just kept missing on rereads and debug runs.

      But new generation is painful. I had 5 generate a new function for me yesterday to do some issues recon and report generation, and I spent 20 minutes going back and forth with it dropping fields in the output repeatedly. Even 5 still struggles at times not to give you the same wrong answer more than once, or just waffles between wrong answers.

      • webhead@lemmy.world · 17 hours ago

        Dude, forgetting stuff has to be one of the most frustrating parts of the entire process. Like forgetting a column in a database, or just an entire piece of a function you just pasted in… Or trying to change things you never asked it to touch. So freaking annoying. I had standing instructions in its memory to not leave out pieces or modify things I didn’t ask for, and I’ll put that stuff in the prompt too, and it just does not care lol.

        I’ve used it a lot for coding because I’m not a real programmer (more a code hacker) and need to get things done for a website, but I know just enough to know it’s really stupid sometimes lol.

        • Passerby6497@lemmy.world · 17 hours ago (edited)

          Dude, forgetting stuff has to be one of the most frustrating parts of the entire process. Like forgetting a column in a database, or just an entire piece of a function you just pasted in

          It was actually worse. I was pulling data out of local logs and processing events. I asked it to assess a couple of columns that I was struggling to parse properly, and it got those ones in but dropped some of my existing columns. I pointed out the error, it acknowledged the issue, then spat out code that reverted to the first output!

          Though that wasn’t nearly as bad as it telling me that a variable a couple hundred lines and multiple transformations in wasn’t being populated by an earlier variable, and I literally went in and just copied each declaration line and sent it back like I was smacking an intern on the nose or something…

          For a bot designed to read and analyze text, it is surprisingly bad at the whole ‘reading’ aspect. But maybe that’s just how human-like the intelligence is /s

          Or trying to change things you never asked it to touch. So freaking annoying. I had standing instructions in its memory to not leave out pieces or modify things I didn’t ask for, and I’ll put that stuff in the prompt too, and it just does not care lol

          OMFG this. I’ve had decent luck recently after setting up a project and explicitly laying out a number of global directives, because yeah, it was awful trying to figure out exactly what changed when I diffed the input and output, and fucking everything is red because even the goddamned comments are changed. But even just trying to make it understand basic style requirements was a solid half hour of arguing with it (only partially because I forgot the proper names of the casings) so it wouldn’t make me lint the whole goddamned script when I’d just told it to analyze and fix one item.

          • webhead@lemmy.world · 15 hours ago

            Yessir, I’ve basically run into all of that. It’s fucking infuriating. It really is like talking to a toddler at times. There seems to be a limit to the complexity of what it can process before it just starts messing everything up. Like once you hit its limit, it will not process the entire thing no matter how many times you piece it back together, like in your example. You fix one problem and then it just forgets a different piece. FFFFFFFFFF.

      • kescusay@lemmy.world · 18 hours ago

        Not yet. I’ll give them a shot if they promise never to say “you’re absolutely correct” or give me un-requested summaries about how awesome they are in the middle of an unfinished task.

        Actually, I have to give GPT 5 credit on one thing: it does sort of pay attention to the copilot-instructions.md file, because I put this snippet in it: “You don’t celebrate half-finished features, and your summaries of what you’ve accomplished are not only rare, they’re never more than five sentences long. You just get straight to the point.” And, surprise, surprise, it has strictly followed that instruction.

        Fucks up everything else, though.