According to Engadget, the Wikimedia Foundation (which runs Wikipedia) is creating a special collection of information for AI companies to use. It is working with Kaggle (owned by Google) to offer this information in English and French, and a test (beta) version is already available as of 2025. The main reason for doing this is to stop AI programs from automatically pulling huge amounts of data straight from Wikipedia’s website, which puts a heavy load on its servers.
Wikipedia is facing a big problem. Many AI companies are using computer programs (called “bots”) to automatically collect huge amounts of information from Wikipedia pages without permission. This is called “web scraping.” Think of it like people making thousands of copies from a library without asking, causing the photocopier to break down. This scraping makes Wikipedia’s website load slowly for regular users and costs the foundation more money to run its servers.
The new collection of information (called a “dataset”) will include:
- Text from Wikipedia articles
- Article summaries and descriptions
- Different sections from articles
- Information that is free to use under Creative Commons licenses
The dataset will not include references, pictures, videos, or other media files. AI developers can use this information legally without overloading Wikipedia’s servers. It’s like Wikipedia saying, “Instead of taking books off our shelves and making our library crowded, here’s a complete set of books you can take home.”
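To give a feel for how developers might use such a dataset, here is a minimal Python sketch that reads structured article records from a local JSON Lines file. The file name and the field names (“name”, “abstract”) are assumptions for illustration only; check the dataset’s documentation on Kaggle for the actual files and schema.

```python
import json

# A minimal sketch of reading article records from one file of the
# Kaggle dataset. The file name and the field names ("name", "abstract")
# are assumptions for illustration; see the dataset page on Kaggle
# for the real schema.
def read_articles(path):
    """Yield one structured article record per line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

if __name__ == "__main__":
    for article in read_articles("enwiki_structured_sample.jsonl"):
        print(article.get("name", "?"), "-", article.get("abstract", "")[:80])
        break  # just peek at the first record
```

Because the data arrives as ready-made files, developers can read it locally as many times as they like without sending a single request to Wikipedia’s servers.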
What exactly is unauthorized web scraping? It happens when computer programs automatically visit websites and copy information without permission, often breaking the website’s rules. They collect text, prices, or other data very quickly. This can slow websites down, increase costs for the site owners, and lead to the collected information being misused. When companies scrape Wikipedia without asking, it puts too much pressure on the computers that run the website.
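To make the contrast concrete, here is a small Python sketch of what “polite” automated access looks like: it uses a public Wikipedia API endpoint, identifies itself with a descriptive User-Agent, and pauses between requests. Unauthorized scrapers typically do the opposite, firing many unthrottled requests at rendered pages with no identification. The endpoint, headers, and pacing here are illustrative assumptions, not Wikimedia’s official guidance; real bots should follow the Wikimedia robot and API policies.

```python
import time
import requests

# Illustrative "polite" automated access: use a public API endpoint,
# send a descriptive User-Agent, and pause between requests.
HEADERS = {"User-Agent": "ExampleResearchBot/0.1 (contact@example.org)"}

def fetch_summary(title):
    """Fetch a short plain-text summary of one article via the REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json().get("extract", "")

for title in ["Web_scraping", "Wikimedia_Commons"]:
    print(fetch_summary(title)[:100])
    time.sleep(1)  # throttle so we do not hammer the servers
```

Scrapers that skip the pause, hide their identity, and fetch millions of pages or media files in parallel are what drive up Wikipedia’s bandwidth and server costs.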
AI companies need large amounts of information to train their AI systems. Wikipedia is perfect for this because:
- It has over 6.8 million articles covering many topics
- The information is usually high-quality and checked by many people
- It helps AI learn how language works and gain general knowledge
- Large datasets help AI become better at answering questions and summarizing information
The problems caused by unauthorized scraping have become serious. Since January 2024, Wikipedia has seen a 50% increase in bandwidth usage because of AI bots. This is like your home internet suddenly being used by many neighbors without your permission. The scraping has caused:
| Problem | Impact |
|---|---|
| Increased bandwidth usage | 50% surge since January 2024 |
| Server overload | Terabytes of data consumed at data centers |
| Higher costs | More money needed to run Wikipedia |
| Slower website | Pages take longer to load for users |
| Risk during high traffic | Problems during events like the 2.8 million views of former US President Jimmy Carter’s page in December 2024 |
Statistics show how bad the problem has become. About 65% of Wikipedia’s most resource-intensive traffic comes from these AI bots, and scrapers have been pulling media from Wikimedia Commons (where Wikipedia stores images and videos), which hosts roughly 144 million files. This is why Wikipedia needed to find a better solution.
This new approach should help make Wikipedia faster and more reliable for everyone. By providing a controlled way for AI companies to get the information they need, Wikipedia can protect its servers while still allowing its knowledge to be used widely. If you use Wikipedia regularly, you might notice pages loading faster in the future because of this change.