Who needs humans when you have AI :p

  • FaceDeer@kbin.social
    1 year ago

    “LLMs need updated training data to stay relevant.”

    Yes. So add relevant new data along with the older stuff. The problem is not that AI-generated content is somehow magically “poison”. Model collapse happens when rare data gets lost over repeated generations of training on AI-generated data.

    A simple way to imagine it is training an AI by showing it random coloured marbles out of a bucket and then asking it to fill the next AI’s bucket with new marbles to train on. If there’s just one single blue marble in the first bucket then it’s easily possible that the AI will fail to put a blue marble in the second bucket, after which there will never be a blue marble again if that’s all that subsequent AIs have to train off of. But if each time you train a new AI you reuse half the marbles from the first bucket again, you can have that blue marble show back up again in future AIs.
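    The marble analogy above can be turned into a small simulation. This is only a sketch under stated assumptions: "the AI fills the next bucket" is modelled as sampling the previous bucket with replacement, and the names `next_bucket`, `blue_history` and `reuse_fraction` are made up for illustration.

```python
import random


def next_bucket(prev_bucket, original_bucket, reuse_fraction, rng):
    """Build the next generation's training bucket.

    Assumption: a share of the marbles is taken at random (without
    replacement) from the original bucket, and the rest are "generated"
    by sampling the previous bucket with replacement, which stands in
    for a model imitating its training data.
    """
    size = len(original_bucket)
    n_reused = int(size * reuse_fraction)
    reused = rng.sample(original_bucket, n_reused)
    generated = rng.choices(prev_bucket, k=size - n_reused)
    return reused + generated


def blue_history(reuse_fraction, n_generations=50, size=100, seed=0):
    """Return one boolean per generation: is a blue marble in the bucket?"""
    rng = random.Random(seed)
    # One rare blue marble among otherwise identical red ones.
    original = ["blue"] + ["red"] * (size - 1)
    bucket = original
    history = []
    for _ in range(n_generations):
        bucket = next_bucket(bucket, original, reuse_fraction, rng)
        history.append("blue" in bucket)
    return history


if __name__ == "__main__":
    for fraction in (0.0, 0.5):
        seen = sum(blue_history(fraction))
        print(f"reuse_fraction={fraction}: blue present in {seen} of 50 generations")
```

    With `reuse_fraction=0.0`, a generation that misses the blue marble can never get it back, because every later bucket is drawn only from buckets that lack it. With any reuse of the original bucket, the blue marble can reappear in later generations.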

    • Dr. JenkemA
      1 year ago

      If LLMs are as revolutionary as the zealots believe, then there will exist fewer and fewer blue marbles in the universe with each iteration. So either the bucket gets smaller or the ratio of blue marbles gets smaller.

      • FaceDeer@kbin.social
        1 year ago

        I said:

        But if each time you train a new AI you reuse half the marbles from the first bucket again, you can have that blue marble show back up again in future AIs.

        The original bucket containing the blue marble isn’t going anywhere. It still exists. The blue marble will always be available to mix into future AIs. All you have to do is make sure you’re using some historical data (or otherwise guaranteed “human-generated”) along with whatever new unvetted stuff you’re using.

        • Dr. JenkemA
          1 year ago

          So then you’re back to locking LLMs to the year 2023. Their usefulness is severely limited if you can’t train them on new data.

          • FaceDeer@kbin.social
            1 year ago

            All you have to do is make sure you’re using some historical data (or otherwise guaranteed “human-generated”) along with whatever new unvetted stuff you’re using.

            Emphasis added. Please read more carefully; this is getting repetitive. You keep assuming that the AI will be trained either entirely with old data or entirely with new data, and that’s just not the case.

            • Dr. JenkemA
              1 year ago

              And what happens when “whatever new unvetted stuff” is primarily composed of AI-generated content?

              • FaceDeer@kbin.social
                1 year ago

                Then the missing diversity comes from the non-AI-generated stuff that’s included in the mix.

                I’m not sure what the problem is here. The cause of model collapse when AIs are fed on the output of previous generations is that the rare “fringes” of the data are lost over time. The training data becomes increasingly monotonous. Adding that fringe data back in should cure that.
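                To put a rough number on how easily a fringe item drops out, here is a back-of-the-envelope calculation. It assumes (this is not stated in the thread) that each generation's training set is n independent draws from the previous generation's output, and that the item has frequency p there.

```latex
\[
P(\text{item missing from the next generation}) = (1 - p)^n,
\qquad
p = \tfrac{1}{n} \;\Rightarrow\; \left(1 - \tfrac{1}{n}\right)^n \approx e^{-1} \approx 0.37 .
\]
```

                Under those assumptions, an item that shows up about once per training set has roughly a one-in-three chance of vanishing at each step, and with pure self-training a loss is permanent. Mixing retained original data back in is what lets a lost item return in a later generation.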