Getty Images Releases ‘Cleanest’ Visual Dataset for Training Base Models

September 8, 2024

Getty Images, a creative company known for its global platform of visual content, is positioning itself as a reliable data partner. Today, the company announced the release of a sample open dataset from its extensive library on Hugging Face.

Although the Hugging Face hub already hosts many visual datasets, Getty Images asserts that its offering stands apart because the data is reliable and commercially safe. Enterprise developers can incorporate it into their AI training pipelines without worrying about later quality or legal complications.

“Envision enhancing your AI/ML capabilities with data that is not only diverse and high quality, but also responsibly sourced. That’s the value we’re bringing to the table,” says Andrea Gagliano, the head of data science and AI/ML at Getty Images, in a conversation with VentureBeat.

The company’s ultimate goal is to foster an ecosystem where AI companies prefer to use officially licensed content from its platform to train their AI models.

What’s in the Getty Images dataset?

Developers often have to work with poorly sourced, low-quality data when training AI/ML models. To fix this, they must clean and enrich the entire repository in several passes: removing duplicates and damaged files, and filtering out unwanted elements such as celebrity images, trademarks, NSFW content, low-resolution images, and images with incomplete or missing metadata.

This work is time-consuming and resource-intensive, costing the engineering team time it could spend elsewhere. Even after all that effort, some harmful or copyrighted material may still slip through and lead to legal disputes.
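To make the cleanup work described above concrete, here is a minimal Python sketch of such a filtering pass. It is an illustration only, not Getty Images' pipeline: the `ImageRecord` type, the `flagged` field (standing in for an upstream NSFW, celebrity, or trademark check), the required metadata keys, and the 512-pixel threshold are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical record for one image; field names are illustrative."""
    name: str
    data: bytes            # raw file bytes; empty bytes stand in for a damaged file
    width: int
    height: int
    metadata: dict = field(default_factory=dict)
    flagged: bool = False  # assumed output of an upstream NSFW/celebrity/trademark check


REQUIRED_KEYS = ("title", "category")  # assumed minimum metadata
MIN_SIDE = 512                         # assumed resolution threshold, in pixels


def clean(records):
    """Keep only intact, unique, unflagged, high-resolution, well-described images."""
    seen = set()
    kept = []
    for rec in records:
        if not rec.data:                      # damaged or empty file
            continue
        digest = hashlib.sha256(rec.data).hexdigest()
        if digest in seen:                    # exact byte-for-byte duplicate
            continue
        seen.add(digest)
        if rec.flagged:                       # unwanted content
            continue
        if min(rec.width, rec.height) < MIN_SIDE:
            continue
        if any(not rec.metadata.get(key) for key in REQUIRED_KEYS):
            continue                          # incomplete or missing metadata
        kept.append(rec)
    return kept
```

The hash check only catches exact duplicates; resized or re-encoded copies would need a perceptual comparison, which is part of why this kind of cleanup tends to be expensive in practice.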

Getty Images aims to address these issues with its open dataset on Hugging Face, providing developers with a ready-to-use repository of high-quality images spanning 15 categories.

“Our sample Dataset includes 3,750 images from 15 categories, including abstracts, built environments, business, concepts, education, healthcare, icons, industry, nature, illustrations, and travel,” Gagliano shares with VentureBeat.

A glimpse of the Getty Images sample dataset

According to Gagliano, the repository is sourced from Getty’s wholly-owned creative library, ensuring the images are commercially safe and free from unexpected legal issues. The dataset is already cleaned and enriched, and it is curated specifically for machine learning (ML) training: the images are high-resolution, carry rich structured metadata, and contain no unwanted elements such as NSFW content.

She describes it as the “cleanest, highest quality dataset” available for training ML models.

Usage conditions to note

While the sample dataset is open for use, it’s important to note that certain conditions apply to ensure the licensed content is used responsibly for commercial applications and academic research.

“Some of the restrictions include redistribution of the dataset, development of models/software to recreate or reproduce digital reproductions of the content, creation of products/services in direct competition with Getty Images, use of biometric identifiers derived from the dataset, and any use that violates applicable laws or regulations,” Gagliano explains.

Getty Images hopes this initiative will engage the developer community, showcase the depth and breadth of content it can offer, and position the company as a “trusted partner” for licensed, high-quality data for responsible AI training.

“Our goal is to demonstrate that it’s possible to accommodate licensing for all the content required to train functional AI models – developing business models that respect creator IP while enabling the creation of high-quality AI models,” Gagliano adds. She notes that developers requiring more data can contact the company with their use cases to source a larger licensed repository.

This arrangement also ensures that the original content creators/providers receive compensation on an annual recurring basis. Interestingly, Getty Images used the same approach for its AI image generation tool developed in partnership with Nvidia.


Jared Cohen

Jared studied Psychology at UCLA, focusing on the effects of fandom culture on mental health. His intriguing takes on fandom psychology and his reviews on self-help books designed for geeks make him a unique contributor to Hypernova.
