Recently unsealed documents in the ongoing legal battle between the Authors Guild and OpenAI have revealed that the AI startup deleted two significant datasets, “Books1” and “Books2.” These datasets, comprising over 100,000 published books, were crucial in training OpenAI’s GPT-3 artificial-intelligence model. Their deletion has sparked controversy and raised questions about the use of copyrighted materials in AI model training.
For months, the Authors Guild has been pressing OpenAI for information about the two datasets. The quality of training data is paramount in developing powerful AI models, and the use of copyrighted materials complicates how that data can be sourced. OpenAI, like many other tech companies, relied on data from the internet, including books, to train its models. In a 2020 white paper, OpenAI disclosed that the “Books1” and “Books2” datasets constituted a significant portion of the training data for GPT-3.
The unsealed documents revealed that OpenAI stopped using “Books1” and “Books2” for model training in late 2021 and deleted the datasets in mid-2022 due to nonuse. The startup maintains that none of the other data used to train GPT-3 has been deleted, and it has offered the Authors Guild access to those remaining datasets. The documents also disclosed that the two researchers who created “Books1” and “Books2” are no longer employed by OpenAI.
OpenAI’s reluctance to disclose the identities of the former employees who created the datasets has further fueled speculation. The startup has sought to keep the employees’ names and information about the datasets under seal, citing confidentiality concerns. In a statement, OpenAI said that the models currently powering ChatGPT and its API were not developed using the deleted datasets. According to the statement, the datasets were created by former employees, were last used in 2021, and were deleted in 2022.
The deletion of the “Books1” and “Books2” datasets by OpenAI marks a significant development in the ongoing legal saga with the Authors Guild. The incident underscores the challenges and ethical considerations surrounding the use of training data, particularly when it involves copyrighted materials. As the tech industry continues to push the boundaries of AI development, the handling of training data and intellectual property rights will remain a hotly debated topic.