• trevor@lemmy.blahaj.zone
    14 days ago

    “The new definition requires Open Source models to provide enough information about their training data so that a ‘skilled person can recreate a substantially equivalent system using the same or similar data,’ which goes further than what many proprietary or ostensibly Open Source models do today,” said Ayah Bdeir, who leads AI strategy at Mozilla.

    Garbage. What this says to me is that companies that train models on data that would be illegal for you and me to scrape and regurgitate will be allowed to keep that data to themselves, so long as they “provide enough information” for someone else (someone who lacks the resources and legal impunity those companies enjoy) to theoretically re-steal it. Which, you know, means the models won’t be reproducible by any reasonable standard, and can’t actually be called open source.

    But the OSI is just a handful of companies in a trenchcoat, so I’m not surprised by what they would call “open”.

    • starshipwinepineapple@programming.dev
      14 days ago

      The actual license text being questioned:

      Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

      In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.

      (The rest of the license goes on to talk about weights, etc).

      I agree with you somewhat. I’m glad that each source needs to be listed and described. I’m less thrilled to see “unshareable” data and paid data in there, since I think those have the potential to make a model effectively impossible for a “skilled person” to retrain.

      It’s a cheap way to make an AI license without making all the training data open source (and dodging the legalities of that).

      • trevor@lemmy.blahaj.zone
        14 days ago

        Thanks for sharing the actual license text.

        To me, this stinks of companies knowing that if they’re actually required to reproduce the data, they’ll get hit with copyright infringement or other IP-related litigation. Whereas if they can just be trusted to very honestly list their sources, they can omit the ones they weren’t authorized to steal and reproduce content from, and get away with it.

        I think that, in practice, this means that the industry standard will be to lie and omit the incriminating data sources, and when someone tries to reproduce the model they won’t actually be able to, but they also won’t be able to easily prove one way or another if data was withheld.

        Really, what should (but won’t) happen is that we fix our broken IP laws and hold companies to account when they engage in behavior that would be prosecuted as piracy, or under the Computer Fraud and Abuse Act, if you or I did it.

        AI is pretty much the epitome of companies getting to act with impunity in the eyes of the law and exerting that power over everyone else, and it’s annoying to see it get a blessing from an “open source” organization.

        • starshipwinepineapple@programming.dev
          14 days ago

          Right. The other thing I considered is that you could just create a company and “buy” the data from it for a ridiculous amount of money, and then you have less of a requirement to detail the data. Similarly, you could deem the data unshareable and fudge the provenance.

          Like locks, it will only keep honest people honest.