Wikipedia Creates Special Dataset for AI to Stop Server Overload

Wikipedia is creating a special dataset to counter a 50% surge in bandwidth usage caused by AI bots and ease the strain on its servers in 2025.

By Chandra Mouli, Founder
Wikipedia partners with Kaggle to offer a structured dataset, improving AI training and preventing server overload in 2025.
Highlights
  • Wikipedia offers a structured dataset for AI training through a partnership with Kaggle.
  • Effort aims to prevent server overload and high operational costs.
  • New dataset controls access, replacing unauthorized scraping by AI bots.

According to Engadget, the Wikimedia Foundation (which runs Wikipedia) is creating a special collection of information for AI companies to use. It is working with Kaggle (owned by Google) to offer this information in English and French, and a test version is already available in 2025. The main reason for doing this is to stop AI programs from automatically pulling too much information from Wikipedia’s website, which causes problems for the servers.
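For AI developers, the practical upshot is that the data can be pulled from Kaggle instead of scraped. Below is a minimal Python sketch of that workflow using the kagglehub client; the dataset handle shown is an assumption for illustration, so check Kaggle for the exact listing published by the Wikimedia Foundation.

```python
# Minimal sketch: download the Wikipedia dataset from Kaggle and list its files.
# The dataset handle below is an assumption for illustration; check Kaggle for
# the exact name of the Wikimedia Foundation's listing.
import os

import kagglehub  # pip install kagglehub; requires Kaggle API credentials

# Downloads the latest version to a local cache and returns the local path
path = kagglehub.dataset_download("wikimedia-foundation/wikipedia-structured-contents")

for name in sorted(os.listdir(path)):
    print(name)
```

Once downloaded, the files sit in a local cache, so repeated training runs never touch Wikipedia’s servers at all, which is the whole point of offering the dataset this way.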

Wikipedia is facing a big problem. Many AI companies use computer programs (called “bots”) to automatically collect huge amounts of information from Wikipedia pages without permission. This is called “web scraping.” Think of it like people making thousands of copies from a library without asking, causing the photocopier to break down. This scraping makes Wikipedia’s pages load slowly for normal users and costs the Wikimedia Foundation more money to run its servers.

The new collection of information (called a “dataset”) will include:

  • Text from Wikipedia articles
  • Article summaries and descriptions
  • Different sections from articles
  • Information that is free to use under Creative Commons licenses

The dataset will not include references, pictures, videos, or other media files. AI developers can use this information legally without overloading Wikipedia’s servers. It’s like Wikipedia saying, “Instead of taking books off our shelves and making our library crowded, here’s a complete set of books you can take home.”
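Structured dumps like this are commonly distributed as JSON Lines files, with one article per line. The sketch below shows how such records might be read and inspected; the file name and the field names (name, abstract, sections) are assumptions for illustration, since the article does not describe the exact schema.

```python
# Minimal sketch: read article records from a JSON Lines file.
# The file name and field names are assumptions for illustration only;
# the actual schema is defined by the dataset published on Kaggle.
import json

def iter_articles(path):
    """Yield one article dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for article in iter_articles("enwiki_structured_contents.jsonl"):
    print(article.get("name"))                 # article title
    print(article.get("abstract", "")[:200])   # short summary/description
    for section in article.get("sections", []):
        print("-", section.get("name"))        # section headings
    break  # just inspect the first record
```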

What exactly is unauthorized web scraping? It happens when computer programs automatically visit websites and copy information without permission. These programs often break the website’s rules. They collect text, prices, or other data very quickly. This can make websites slow, increase costs for the website owners, and sometimes the collected information is misused. When companies scrape Wikipedia without asking, it puts too much pressure on the computers that run the website.
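For contrast, here is a minimal Python sketch of what “polite” automated access looks like: checking robots.txt and pausing between requests. The bots described above skip exactly these steps. The user-agent string and delay value are illustrative assumptions, not rules taken from Wikipedia.

```python
# Minimal sketch of polite automated access: honor robots.txt and rate-limit.
# Scrapers that skip these checks and hammer pages in parallel are what
# drive up Wikipedia's bandwidth and server costs.
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "ExampleResearchBot/0.1 (contact@example.org)"  # illustrative
CRAWL_DELAY_SECONDS = 5  # illustrative pause between requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://en.wikipedia.org/robots.txt")
robots.read()

urls = [
    "https://en.wikipedia.org/wiki/Jimmy_Carter",
    "https://en.wikipedia.org/wiki/Kaggle",
]

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        print("Disallowed by robots.txt:", url)
        continue
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        html = response.read()
    print(url, len(html), "bytes")
    time.sleep(CRAWL_DELAY_SECONDS)
```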

AI companies need large amounts of information to train their AI systems. Wikipedia is perfect for this because:

  • It has over 6.8 million articles covering many topics
  • The information is usually high-quality and checked by many people
  • It helps AI learn how language works and gain general knowledge
  • Large datasets help AI become better at answering questions and summarizing information

The problems caused by unauthorized scraping have become serious. Since January 2024, Wikipedia has seen a 50% increase in bandwidth usage because of AI bots. This is like your home internet suddenly being used by many neighbors without your permission. The scraping has caused:

| Problem | Impact |
| --- | --- |
| Increased bandwidth usage | 50% surge since January 2024 |
| Server overload | Terabytes of data consumed at data centers |
| Higher costs | More money needed to run Wikipedia |
| Slower website | Pages take longer to load for users |
| Risk during high traffic | Problems during events like the 2.8 million views of former US President Jimmy Carter’s page in December 2024 |

Statistics show how bad the problem has become. About 65% of the most resource-intensive traffic on Wikipedia comes from these AI bots. They have targeted 144 million files on Wikimedia Commons (where Wikipedia stores images and videos). This is why Wikipedia needed to find a better solution.

This new approach should help make Wikipedia faster and more reliable for everyone. By providing a controlled way for AI companies to get the information they need, Wikipedia can protect its servers while still allowing its knowledge to be used widely. If you use Wikipedia regularly, you might notice pages loading faster in the future because of this change.
