Pay-per-output? AI firms blindsided by beefed up robots.txt instructions.

ccunning@lemmy.world · 3 months ago


underline960@sh.itjust.works · 3 months ago

Leeds told Ars that the RSL standard doesn’t just benefit publishers, though. It also solves a problem for AI companies, which have complained in litigation over AI scraping that there is no effective way to license content across the web.

“If they’re using it, they pay for it, and if they’re not using it, they don’t pay for it.”

…

But AI companies know that they need a constant stream of fresh content to keep their tools relevant and to continually innovate, Leeds suggested. In that way, the RSL standard “supports what supports them,” Leeds said, “and it creates the appropriate incentive system” to create sustainable royalty streams for creators and ensure that human creativity doesn’t wane as AI evolves.

This article tries to slip in the idea that creators will benefit from this arrangement. Just like with Spotify and Getty Images, it’s the publisher that’s getting paid.

Then they decide how much they’ll let trickle down to creators.

I Cast Fist@programming.dev · 3 months ago

Cue an even greater influx of AI slop pages in hopes of getting crawled for that juicy trickled-down money.

ccunning@lemmy.world · 3 months ago

I would assume creators and publishers would agree to those terms in advance (moving forward, of course).

Kissaki@feddit.org · 3 months ago

evolves robots.txt instructions by adding an automated licensing layer that’s designed to block bots that don’t fairly compensate creators for content

robots.txt - the well known technology to block bad-intention bots /s

What’s automated about the licensing layer? At some point, I started skimming the article. They didn’t seem clear about it. The AI can “automatically” parse it?

# NOTICE: all crawlers and bots are strictly prohibited from using this 
# content for AI training without complying with the terms of the RSL 
# Collective AI royalty license. Any use of this content for AI training 
# without a license is a violation of our intellectual property rights.

License: https://rslcollective.org/royalty.xml

Yeah, this is as useless as I thought it would be. Nothing here is actively blocking.

I love that the XML then points to a website served as text/html. I guess that’s nothing for machine parsing, but maybe for AI parsing.
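To make the "nothing here is actively blocking" point concrete, here is a minimal sketch of what reading that directive amounts to for a crawler. It assumes the `License:` line is a plain `field: value` pair like any other robots.txt line; the function name `find_license_urls` and the `User-agent`/`Allow` lines in the sample are made up for illustration and are not taken from the RSL spec.

```python
def find_license_urls(robots_txt: str) -> list[str]:
    """Collect the value of every `License:` line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace. (Simplification: a URL
        # containing a '#' fragment would be cut short by this.)
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # Split on the first colon only, since the URL itself contains colons.
        field, _, value = line.partition(":")
        if field.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


# Sample body: the comment and License lines are from the quoted file,
# the User-agent/Allow lines are assumed for illustration.
SAMPLE = """\
# NOTICE: all crawlers and bots are strictly prohibited from using this
License: https://rslcollective.org/royalty.xml
User-agent: *
Allow: /
"""

print(find_license_urls(SAMPLE))
# → ['https://rslcollective.org/royalty.xml']
```

All the crawler gets back is a URL. Whether it fetches that URL, reads the terms, or pays anything is entirely up to the crawler, which is the whole problem.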

I don’t remember which AI company it was, but they argued they’re not crawlers but agents acting on the user’s behalf for a specific request or action, and so they ignore robots.txt. Who knows how they will react. But their incentives and their history both point toward ignoring robots.txt.

Why ~~am I~~ is this comment so negative. Oh well.

FaceDeer@fedia.io · 3 months ago

And suddenly the Internet is gung-ho in favor of EULAs being enforceable simply by reading the content the website has already provided.

Recent major court cases have held that the training of an AI model is fair use and doesn’t involve copyright violation, so I don’t think licensing actually matters in this case. They’d have to put the content behind a paywall to stop the trainer from seeing it in the first place.

ccunning@lemmy.world · 3 months ago

I guess that’s a different court case than the one where Anthropic offered to pay $1.5 billion?

FaceDeer@fedia.io · 3 months ago

Nope, this was one of them. The case had two parts: one about the training and one about the downloading of pirated books. The judge issued a summary judgment on the training part, declaring it fair use without any further need to address it at trial. The downloading was what was proceeding to trial, and that is what the settlement offer was about.

tabular@lemmy.world · edit-2 3 months ago

Is it hypocrisy to be for EULA enforcement on reading when it’s machines, but not when it’s humans? Crawlers “read” on a massive scale that doesn’t compare to humans.

WhyJiffie@sh.itjust.works · 3 months ago

I don’t think so, or not always. Humans need to find the EULA on the website by first loading the main page or another page they found a link to. But if the path of that document were standardized, it could be enforced that way for robots.

GissaMittJobb@lemmy.ml · 3 months ago

I have no idea what they think this will accomplish, to be honest. It has the legal value of posting on Facebook that you don’t allow them to use your photos.

ccunning@lemmy.world · 3 months ago

I think the idea is that all parties would find it beneficial:

Leeds told Ars that the RSL standard doesn’t just benefit publishers, though. It also solves a problem for AI companies, which have complained in litigation over AI scraping that there is no effective way to license content across the web.

ricecake@sh.itjust.works · 3 months ago

The thing is a robots.txt file doesn’t work as licensing. There’s no legal requirement to fetch the file, and no mechanism to consent or track consent.

This is putting up a sign that says everyone must pay, and then giving it to anyone who asks for free.

ccunning@lemmy.world · 3 months ago

The thing is if all parties find the terms agreeable it doesn’t matter if it’s legally binding.

It’s more like putting a price on the shelf at the grocery store. Not everyone will agree the price is reasonable, and you might still get shoplifters, but that doesn’t mean it’s a waste of time to list the price.

ricecake@sh.itjust.works · 3 months ago

It really does matter if it’s legally binding if you’re talking about content licensing. That’s the whole thing with a licensing agreement: it’s a legal agreement.

The store analogy isn’t quite right. Leaving a store with something you haven’t purchased, without the store’s consent, is explicitly illegal.
With a website, it’s more like if the “shoplifter” walked in, didn’t request a price sheet, picked up what they wanted and went to the cashier who explicitly gave it to them without payment.

The crux of the issue is that the website is still providing the information even if the requester never agreed or was even presented with the terms.
If your site wants to make access to something conditional then it needs to actually enforce that restriction.
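As a rough sketch of what "actually enforce that restriction" could look like on the server side, the function below refuses declared AI crawlers unless they present a token. Everything in it is an assumption for illustration: the `X-License-Token` header, the token set, and the user-agent marker list are hypothetical and are not part of RSL or any real licensing service.

```python
from http import HTTPStatus

# Hypothetical names for illustration only; nothing here comes from RSL.
LICENSE_HEADER = "X-License-Token"
VALID_TOKENS = {"demo-token-123"}
# Example substrings of self-declared AI crawler user agents (assumed list).
AI_AGENT_MARKERS = ("gptbot", "claudebot", "ccbot")


def decide_status(headers: dict[str, str]) -> HTTPStatus:
    """Return the status a server would send for a content request."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()
    token = lowered.get(LICENSE_HEADER.lower(), "")

    is_declared_ai_bot = any(marker in user_agent for marker in AI_AGENT_MARKERS)
    if not is_declared_ai_bot:
        return HTTPStatus.OK  # ordinary visitors are served as usual
    if token in VALID_TOKENS:
        return HTTPStatus.OK  # the bot presented a licence token we recognise
    return HTTPStatus.PAYMENT_REQUIRED  # 402: the content is withheld


print(decide_status({"User-Agent": "GPTBot/1.0"}).value)
# → 402
```

The difference from a robots.txt notice is that the content is not sent unless the condition is met. The obvious gap is that this only catches bots that identify themselves honestly; a crawler sending a browser user agent gets a 200 like everyone else.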

It’s why the current AI training situation is unlikely to be resolved without laws to address it explicitly.

billwashere@lemmy.world · 3 months ago

The issue is the line that says “compensate creators”. Reddit still thinks it’s the creator, not the individual users.

trailee@sh.itjust.works · edit-2 3 months ago

Neither the article nor the RSL website makes clear how pricing or payment works, which seems like a huge miss. It’s not obvious if a publisher can price-differentiate among content, or even choose their own prices at all.

RSL makes an analogy:

Collective licensing organizations like ASCAP and BMI have long helped musicians get paid fairly by working together and pooling rights into a single, indispensable offering.

I’d like to get excited about this because AI companies suck, but if the best example they have is that ASCAP helps “musicians get paid fairly” I’m afraid this isn’t a solution that most content creators will celebrate.

BrianTheeBiscuiteer@lemmy.world · 3 months ago

Not a bad idea but the biggest challenge will probably be determining who needs to be sued for non-compliance. Google might not be hiding the origin of its bots now but that could easily change.