• earthworm@sh.itjust.works
    13 days ago

    This seems like a dumb benchmark.

    ClockBench evaluates whether models can read analog clocks - a task that is trivial for humans, but current frontier models struggle with.

    What do you mean trivial? Most humans I know can’t read the most basic white-background-big-black-numbers clocks.

    Someone rigged the jury to get 90% on this: