• Artwork@lemmy.world · 6 days ago

    Thank you very much! The red dot is likely smaller…
    Though I neither appreciate nor agree with the bomb part! ^^
    This work reminded me of the following paper:

    Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model’s weights during training, and whether those memorized data can be extracted in the model’s outputs.

    While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models…

    We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book.

    We evaluate our procedure on four production LLMs: Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3, and we measure extraction success with a score computed from a block-based approximation of longest common substring…

    Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs…

    Source 🕊
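
    For a sense of what that pipeline could look like in practice, here is a minimal Python sketch of the two ideas the abstract names: iterative continuation prompting after an initial probe, and a block-based approximation of longest common substring for scoring. Everything in it (the generate callable, the prompts, the block size, the stopping rule) is a hypothetical illustration rather than the paper’s implementation, and the Best-of-N jailbreak step is omitted.

        def block_lcs_score(reference: str, candidate: str, block_size: int = 50) -> float:
            """Score extraction as the longest run of matching fixed-size blocks,
            normalized by the number of reference blocks: a cheap, block-level
            approximation of longest common substring."""
            ref = [reference[i:i + block_size] for i in range(0, len(reference), block_size)]
            cand = [candidate[i:i + block_size] for i in range(0, len(candidate), block_size)]
            if not ref or not cand:
                return 0.0
            best = 0
            prev = [0] * (len(cand) + 1)  # DP row: matching-run length ending at each block pair
            for rb in ref:
                curr = [0] * (len(cand) + 1)
                for j, cb in enumerate(cand, start=1):
                    if rb == cb:
                        curr[j] = prev[j - 1] + 1
                        best = max(best, curr[j])
                prev = curr
            return best / len(ref)

        def extract(generate, opening: str, max_rounds: int = 20) -> str:
            """Phase 1: probe with the book's opening to test feasibility.
            Phase 2: repeatedly feed back the tail of the transcript and ask
            for the next verbatim continuation."""
            probe = generate("Continue this passage verbatim:\n" + opening)
            if not probe.strip():
                return ""  # refusal or empty output: extraction looks infeasible
            text = opening + probe
            for _ in range(max_rounds):
                nxt = generate("Continue this passage verbatim:\n" + text[-500:])
                if not nxt.strip():
                    break
                text += nxt
            return text

    Note that exact block matching makes the score sensitive to alignment (a one-character offset breaks every block), which is one reason it is only an approximation of the true longest common substring.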

    • Deceptichum@quokk.au (OP) · 6 days ago

      I’m anti-copyright and anti-corporation.

      These ridiculous datacentres use large volumes of resources purely to benefit the companies, which are closing off human-made content for their profit.

      • marcos@lemmy.world · 6 days ago

        As long as copyrights exist to restrict me, I’m adamant that they restrict billionaires too.

        If they want to abolish copyright entirely, I’m listening. Otherwise, they should pay statutory damages for every work they are pirating with those LLMs.

    • morto@piefed.social · 6 days ago

      While many believe that LLMs do not memorize much of their training data

      It’s sad that even researchers are using language that personifies LLMs…

      • chicken@lemmy.dbzer0.com · 6 days ago

        What’s a better way to word it? I can’t think of another way to say it that’s as concise and clearly communicates the idea. It seems like it would be harder in general to describe machines meant to emulate human thought without anthropomorphic analogies.

        • morto@piefed.social · 5 days ago

          One possibility:

          While many believe that LLMs can’t output the training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models…

          Note that this neutral language makes it more apparent that it’s possible that LLMs are able to output their training data, since that data is what the model’s network is built on. By using personifying language, we bias people into thinking about LLMs as if they were human, and this will affect, for example, court decisions, like the ones related to copyright.