According to Engadget, the Wikimedia Foundation (which runs Wikipedia) is creating a special collection of information for AI companies to use. It is working with Kaggle (owned by Google) to offer this information in English and French, and a test (beta) version is already available as of 2025. The main reason for doing this is to stop AI programs from automatically pulling huge amounts of data straight from Wikipedia’s website, which puts a heavy load on its servers.
Wikipedia is facing a big problem. Many AI companies are using computer programs (called “bots”) to automatically collect huge amounts of information from Wikipedia pages without permission. This is called “web scraping.” Think of it like people making thousands of copies from a library without asking, causing the photocopier to break down. This scraping makes Wikipedia’s website load slowly for regular users and costs the foundation more money to run its servers.
The new collection of information (called a “dataset”) will include:
- Text from Wikipedia articles
- Article summaries and descriptions
- Different sections from articles
- Information that is free to use under Creative Commons licenses
The dataset will not include references, pictures, videos, or other media files. AI developers can use this information legally without overloading Wikipedia’s servers. It’s like Wikipedia saying, “Instead of taking books off our shelves and making our library crowded, here’s a complete set of books you can take home.”
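To give a feel for how developers might use such a dataset, here is a minimal Python sketch that reads structured article records from a local JSON Lines file. The file name and the field names (“name”, “abstract”) are assumptions for illustration only; check the dataset’s documentation on Kaggle for the actual files and schema.

```python
import json

# A minimal sketch of reading article records from one file of the
# Kaggle dataset. The file name and the field names ("name", "abstract")
# are assumptions for illustration; see the dataset page on Kaggle
# for the real schema.
def read_articles(path):
    """Yield one structured article record per line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

if __name__ == "__main__":
    for article in read_articles("enwiki_structured_sample.jsonl"):
        print(article.get("name", "?"), "-", article.get("abstract", "")[:80])
        break  # just peek at the first record
```

Because the data arrives as ready-made files, developers can read it locally as many times as they like without sending a single request to Wikipedia’s servers.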
What exactly is unauthorized web scraping? It happens when computer programs automatically visit websites and copy information without permission, often breaking the website’s rules. They collect text, prices, or other data very quickly. This can slow websites down, increase costs for the site owners, and lead to the collected information being misused. When companies scrape Wikipedia without asking, it puts too much pressure on the computers that run the website.
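To make the contrast concrete, here is a small Python sketch of what “polite” automated access looks like: it uses a public Wikipedia API endpoint, identifies itself with a descriptive User-Agent, and pauses between requests. Unauthorized scrapers typically do the opposite, firing many unthrottled requests at rendered pages with no identification. The endpoint, headers, and pacing here are illustrative assumptions, not Wikimedia’s official guidance; real bots should follow the Wikimedia robot and API policies.

```python
import time
import requests

# Illustrative "polite" automated access: use a public API endpoint,
# send a descriptive User-Agent, and pause between requests.
HEADERS = {"User-Agent": "ExampleResearchBot/0.1 (contact@example.org)"}

def fetch_summary(title):
    """Fetch a short plain-text summary of one article via the REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json().get("extract", "")

for title in ["Web_scraping", "Wikimedia_Commons"]:
    print(fetch_summary(title)[:100])
    time.sleep(1)  # throttle so we do not hammer the servers
```

Scrapers that skip the pause, hide their identity, and fetch millions of pages or media files in parallel are what drive up Wikipedia’s bandwidth and server costs.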
AI companies need large amounts of information to train their AI systems. Wikipedia is perfect for this because:
- It has over 6.8 million articles covering many topics
- The information is usually high-quality and checked by many people
- It helps AI learn how language works and gain general knowledge
- Large datasets help AI become better at answering questions and summarizing information
The problems caused by unauthorized scraping have become serious. Since January 2024, Wikipedia has seen a 50% increase in bandwidth usage because of AI bots. This is like your home internet suddenly being used by many neighbors without your permission. The scraping has caused:
| Problem | Impact |
|---|---|
| Increased bandwidth usage | 50% surge since January 2024 |
| Server overload | Terabytes of data consumed at data centers |
| Higher costs | More money needed to run Wikipedia |
| Slower website | Pages take longer to load for users |
| Risk during high traffic | Problems during events like the 2.8 million views of former US President Jimmy Carter’s page in December 2024 |
Statistics show how bad the problem has become. About 65% of Wikipedia’s most resource-intensive traffic comes from these AI bots, and scrapers have been pulling media from Wikimedia Commons (where Wikipedia stores images and videos), which hosts roughly 144 million files. This is why Wikipedia needed to find a better solution.
This new approach should help make Wikipedia faster and more reliable for everyone. By providing a controlled way for AI companies to get the information they need, Wikipedia can protect its servers while still allowing its knowledge to be used widely. If you use Wikipedia regularly, you might notice pages loading faster in the future because of this change.