NY Court Dismisses Raw Story’s Copyright Lawsuit, OpenAI’s Data Scraping Triumphs


The U.S. District Court for the Southern District of New York has dismissed a copyright lawsuit filed against OpenAI by Raw Story Media, Inc. and AlterNet Media, Inc., two left-leaning alternative online news outlets. For now, the ruling quashes their allegations that the AI firm violated copyright law by using scraped news content in its training data.

The dismissal marks a significant milestone in the ongoing debate over copyright and AI tools, particularly under Section 1202(b) of the Digital Millennium Copyright Act (DMCA). It is not the first of its kind: plaintiffs in other cases have also failed to establish claims under this provision.

Let’s delve into the details of the case, understand why the judge dismissed it, and explore what this means for the future of AI, copyright, and the legality of tech companies scraping content from the web without explicit permission or compensation from the creators.

Decoding the DMCA’s Section 1202(b)

The lawsuit hinged on Section 1202(b) of the DMCA, a provision designed to safeguard “copyright management information” (CMI).

This includes details like author names, titles, and other metadata that identify copyrighted works. Section 1202(b) prohibits the unauthorized removal or alteration of such information, especially if it facilitates copyright infringement.

In this case, Raw Story and AlterNet accused OpenAI of using articles from their websites to train ChatGPT and other models without preserving CMI, thereby violating Section 1202(b).

OpenAI is not the only AI company likely to have scraped such material from the web. While AI model providers tend to closely guard their training datasets, the industry at large has undoubtedly scraped large portions of the web to train its various models. The practice is akin to the way Google crawls and indexes web pages for its main search engine product. In this context, some creators view data scraping as AI’s “original sin.”

Raw Story and AlterNet further alleged that OpenAI’s outputs (the responses generated by its models) were sometimes based on their articles, and that the company knowingly violated copyright once the CMI had been removed.

The Reason Behind the Court’s Dismissal of Raw Story’s Claims

Judge Colleen McMahon granted OpenAI’s motion to dismiss the case on the grounds of lack of standing. Specifically, the judge ruled that the plaintiffs failed to demonstrate that they suffered a concrete, actual injury from OpenAI’s actions—an essential requirement under Article III of the U.S. Constitution for any lawsuit to proceed.

Judge McMahon also took into account the evolving landscape of large language model (LLM) interfaces, noting that updates to these systems further complicate attribution and traceability. She emphasized that generative AI’s iterative improvements make it less likely that content will be reproduced verbatim, making the plaintiffs’ claims even more speculative.

The judge pointed out that “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.” This reflects a key challenge in these types of cases: generative AI is designed to synthesize information rather than replicate it verbatim. The plaintiffs failed to present convincing evidence that their specific works were directly infringed in a way that led to identifiable harm.

The ruling aligns with similar cases where courts have struggled to apply traditional copyright law to generative AI. For example, the Doe 1 v. GitHub case involving Microsoft’s Copilot also dealt with claims under Section 1202(b). There, the court found that the code generated by Copilot wasn’t an “identical copy” of the original, but rather snippets that were reconfigured, making it difficult to prove the violation of CMI requirements.

A Widening Gap on Section 1202(b)

The Raw Story decision underscores the broader uncertainties courts are grappling with regarding Section 1202(b), especially with generative AI.

Currently, there is no firm consensus on how Section 1202(b) applies to a wide range of online content. Some courts have imposed an “identicality” requirement—meaning plaintiffs must prove that the infringing works are an exact copy of the original content, minus CMI. However, others have allowed for more flexible interpretations.

For instance, a court in the Southern District of Texas recently rejected the identicality requirement, stating that even partial reproductions could qualify as violations if CMI is deliberately removed.

Meanwhile, in the lawsuit brought by Sarah Silverman and a group of authors, the court held that the plaintiffs failed to show sufficient evidence that OpenAI had actively removed CMI from their content. That ruling, much like the one in Raw Story’s case, underscores the evidentiary burden plaintiffs face.

As Maria Crusey explained in a piece for the Authors Alliance, “The uptick in §1202(b) claims raises challenging questions, namely: How does §1202(b) apply to the use of a copyrighted work as part of a dataset that must be cleaned, restructured, and processed in ways that separate copyright management information from the content itself?”

The Significance of this Ruling for AI and Content Creators

The dismissal of Raw Story’s lawsuit is not just a victory for OpenAI; it is also a signpost of how courts may handle similar copyright claims in the rapidly evolving landscape of generative AI. With OpenAI and its investor Microsoft currently defending against a similar lawsuit filed by The New York Times, the ruling may help establish precedent for dismissing that claim and future ones.

Indeed, the ruling suggests that without clear, demonstrable harm or exact reproduction, plaintiffs may find it challenging to get their day in court.

Judge McMahon’s ruling also touches on a broader point about how AI synthesizes data versus directly replicating it. OpenAI’s ChatGPT doesn’t directly recall articles from Raw Story—it instead uses training data to produce novel outputs that resemble human writing. This makes proving violations under current copyright laws inherently difficult.

For content creators, this raises a significant challenge: how to ensure proper credit and prevent unauthorized use of their work in training datasets. Licensing agreements like the ones OpenAI has struck with large news publishers such as Vogue and Wired owner Condé Nast could become a new standard, giving companies a way to legally use copyrighted content while compensating its creators.

Caught Between a Bot and a Hard Place

Courts are still figuring out how to handle generative AI, and recent rulings suggest they’re reluctant to extend Section 1202(b) protections unless plaintiffs show real, specific harm. AI-generated content synthesizes rather than replicates, making it tough to prove copyright violations.

For plaintiffs, this means proving harm is an uphill battle. Courts are signaling that vague claims aren’t enough—plaintiffs need hard evidence of damage. For developers and tech companies, even if the odds seem favorable, no one wants a lawsuit. Transparency, data records, and compliance are essential to avoid legal trouble.

Judge McMahon noted the case could be refiled (“together with an explanation of why the proposed amendment would not be futile,” she wrote), but significant obstacles remain.
