Discovery of Child Abuse Material in the Largest AI Image Generation Dataset

Researchers at the Stanford Internet Observatory found at least 1,008 instances of child sexual abuse material (CSAM) in a dataset used to train AI image generation tools. The presence of CSAM in the training data could enable AI models trained on it to generate new, realistic instances of CSAM. LAION, the non-profit that maintains the dataset, has temporarily taken its datasets offline in response, stating that it has a zero-tolerance policy for illegal content and is verifying the safety of the datasets before republishing them. However, LAION's leaders have known since 2021 that their collection pipeline, which scrapes billions of images from the internet, could pick up CSAM.

The LAION-5B dataset contains billions of images spanning a wide range of content, including CSAM, pornography, violence, and more. The Stanford researchers used several techniques to identify potential CSAM in the dataset and found 3,226 suspected entries, with matches confirmed through tools such as PhotoDNA and by third parties including the Canadian Centre for Child Protection.

Some AI systems, such as Stability AI's models, were trained on filtered subsets of the LAION-5B data, with safeguards intended to prevent the generation of illegal content. However, older image generation models such as Stable Diffusion 1.5 may lack equivalent protections against explicit material. The Stanford researchers recommend discontinuing distribution of models based on Stable Diffusion 1.5 that lack safety measures.
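Detection pipelines of this kind typically compare image hashes against databases of known material maintained by child-safety organizations. As a minimal sketch of the general idea only (the actual systems, such as PhotoDNA, use perceptual hashes that tolerate resizing and re-encoding, and the hash values below are placeholders, not real entries), exact cryptographic-hash matching against a blocklist looks like this:

```python
import hashlib

# Hypothetical blocklist of known-bad hashes (placeholder values for
# illustration only; real lists are distributed by child-safety bodies).
KNOWN_BAD_HASHES = {
    # SHA-256 of the empty byte string, used here purely as a stand-in.
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of raw image bytes."""
    return hashlib.sha256(data).hexdigest()

def is_flagged(image_bytes: bytes, blocklist: set) -> bool:
    """Flag an image whose hash appears on the blocklist."""
    return sha256_hex(image_bytes) in blocklist

# The empty byte string hashes to the placeholder entry above, so it is flagged.
print(is_flagged(b"", KNOWN_BAD_HASHES))   # True
print(is_flagged(b"cat", KNOWN_BAD_HASHES))  # False
```

Exact hashing misses any re-encoded or cropped copy, which is why production systems rely on perceptual hashing instead; this sketch only illustrates the lookup step.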