Wikipedia makes investigations simple. Unlike most platforms, Wikipedia goes out of its way to be open and transparent: revision histories, user actions, traffic analytics, and even some user IP addresses are all public information. Because anyone can edit Wikipedia, they rely on openness to protect against unwanted behavior — be it petty vandalism or agenda-driven reframing.
In the course of researching an edit war, we developed a toolkit of approaches that anyone can use to conduct their own investigations into inauthentic editing on Wikipedia. These approaches range from manual signals — “what to look for” style advice — to fully automated scrapers that can help surface suspicious pages. As is the case for many domains of open source investigation, we found that the most effective approach is to supplement manual investigation with automated tools. The tools are helpful for surfacing interesting pages, but cannot piece together the details of a coordinated “attack” — there, human intuition is key.
In this post we 1) identify attributes of pages that may indicate inauthentic editing, 2) provide an overview of free resources to analyze Wikipedia pages, and 3) introduce an open source notebook for automatically surfacing Wikipedia pages worthy of deeper scrutiny, with proof of concept evidence.
In general, there are two main targets when investigating Wikipedia: pages (also called articles) and users (also called editors). In general, we found that investigations tended to start with pages that looked suspicious, which then led us to look into the histories of the users who contributed to that page.
When looking at pages, we found that the followings signs were the most reliable indicators of inauthentic activity (signals with an asterisk are assessed by our automated script):
Too much content relative to notability.* Writing a Wikipedia article takes time, so the effort invested in a page is generally proportional to that page’s importance. When a page is extensive but the subject is barely notable, find out who did the bulk of the writing (see tips on users below), then investigate those users. A page’s traffic statistics are a good proxy for notability.
High levels of edit activity relative to notability.* There are 1.9 edits per second on Wikipedia; that works out to a little less than one edit per page per month. If a page has a significantly higher level of activity, it’s worth looking into why. It’s possible you’ve found an edit war (but it might be lame).
Controversy moved from introduction to body. The first paragraph of a Wikipedia article is the most important; it’s what Google shows in its knowledge panels, and might be the only content an impatient reader will see. When a short page mentions controversies but there is no reference to them in the opening paragraph, check if they were moved.
Pages that look like resumes. This is a real problem on Wikipedia. Biographical Wikipedia pages and resumes are very different, but sometimes inexperienced editors will model a page on a resume—bullet points and all. We found that new Wikipedia editors often have ulterior motivations when creating pages—that motivation may be the reason why they joined Wikipedia in the first place—so always investigate further when you encounter a novice-looking page (especially if there is a long “awards and honors” section).
Pages with many edits from one or a few users.* If the vast majority of a page’s edits come from a small number of users over an extended period of time, something may be amiss. While it’s common for editors to have subject matter expertise (and therefore be highly active in some networks of pages) and/or take “responsibility” for pages they created, over-active editors are cause for further investigation.
Overdependence on a few sources. Sometimes, when few independent sources are available about a subject, motivated editors will try to stretch them as far as possible to create the illusion of notability or consensus. Be wary when you see a long page that cites the same source dozens of times, especially when that source is a campaign website. (Wikipedia’s guidelines require that everything be cited to a source independent from the subject, but things can slip through the cracks.)
We found that some other seemingly-useful techniques were often dead ends and didn’t yield much useful information. For example, trying to find text copy-pasted from elsewhere on the internet usually didn’t reveal anything useful: the Wikipedia community is quite good at detecting plagiarism, but it is also inevitable that some text will slip through. (Plagiarism may simply be the sign of a lazy editor, not a malicious one.)
When looking at users, we found the most useful information was available on users’ talk pages, which serve as a forum for public correspondence between the Wikipedia community and a user. There, you can see whether a user is involved in any controversy. Sometimes, a user’s main page (where they introduce themselves ) will be empty but their talk page (where other users talk to them ) will be highly revealing. If a suspicious user has an innocuous-looking talk page, check archived versions of that page; the user may have blanked their page to hide earlier warnings. We found that it was usually difficult to tie Wikipedia editors to real-world identities (or even other online identities), but looking at editors’ IP addresses can still yield useful information.
There are dozens of free online tools available that are helpful when investigating Wikipedia, but sometimes Wikipedia itself is the most useful tool. Here is an overview of a few basic techniques you can use to evaluate pages and users:
WMF Article Info Tool. This tool (run by the Wikimedia Foundation, Wikipedia’s parent) provides detailed information on any Wikipedia page, from articles to user pages to user talk pages. It shows the breakdown of editors by both number of edits and content added, the page’s traffic analytics (which is otherwise only easily accessible via Wikipedia’s API), and other useful statistical breakdowns.
WMF Global User Contributions. This site provides a history of all the edits made by a particular user across all Wikimedia platforms — be they English Wikipedia (“enwiki”), Wikimedia Commons (where images are stored), or any other WMF-run Wiki.
WFM Blame. Given a page and some text on that page, this tool shows you which user added or edited that text. This tool can be very helpful when assessing how controversies appear on a page.
Other WFM tools. The tools listed above were the ones we found most useful, but several other analysis tools are also available on this page.
Pageview comparer. This tool allows you to compare the pageview history of multiple pages on the same graph. It can be useful for detecting spikes of activity (which may be correlated with major events).
WikiSHARK. This tool allows you to visualize page views for the years 2008-2020. It can be useful for identifying trends in traffic.
Basic edit histories. To view the edit history of a page, click the “View history” button on the top right of any Wikipedia page (including user pages and talk pages). From there, you will see a chronological list of edits, showing the size, user, and associated message of each edit. To see the changes made in any edit, click the “prev” button. You can also compare across multiple edits using the select buttons on the left.
Talk pages. Talk pages are where editors talk about a page. Articles’ talk pages usually contain discussion on whether to include certain information in a page, whether the subject of a page is notable enough to warrant an article on Wikipedia, and how a page should be written. When investigating an article, always read its talk page. (Here is Kamala Harris’ talk page.)
With more than six million pages, Wikipedia is vast. Unless your investigation is focused only on a single subject, there are likely far more pages than you can hope to look through and assess manually. To solve this problem of scale, we built two systems to help seed investigations.
The first tool, our open-source Wikipedia Scanner, is a Python notebook that intakes a search query and analyzes each page that matches the query. It then outputs a number of charts about the collection of pages it found, as well as anomalous pages it thinks might warrant further manual investigation.
When looking for anomalous pages, it considers the edits-to-pageview ratio, the length-to-pageview ratio, and the editors-per-edit ratio, among other metrics. While not every page it finds is ultimately suspicious, it can be very helpful for seeding deeper investigations.
No programming is required to use the tool — just type your search query into the box at the top right, then click “Runtime” then “Run all” (or “Restart and run all”). Include “incategory:living_people” to filter your results to biographical pages. (Unlike many other categories on Wikipedia, the “living people” category has almost perfect coverage; if a page corresponds to a living person, it almost certainly is in the category, and vice versa.)
As a proof-of-concept for how the Wikipedia Scanner can be used to seed investigations, we searched for the exact phrase “serving as the U.S. Representative” (to select for pages using the standard format for members of the US House of Representatives) within the category of living people ("incategory:living_people"). We then filtered by highest edits to pageview ratio to identify controversial or vulnerable topics.
The top five results were Chris Stewart, R-Utah, Steve Womack, R-Ark., Steve Watkins, R-Kan., Yvette Clarke, D-N.Y., and Richard Neal, D-Mass. Of the five, at least four had unusual edits in their history:
Our second tool was a script that collected bulk data from Wikipedia articles that used the “officeholder” template (i.e. most politicians). It iterated over the first 10,000 pages and exported the metrics described above to a .csv file. By analyzing the exported data in Excel, we were able to identify pages with low traffic but suspiciously-high numbers of edits. These pages are often worthy of deeper scrutiny. Although it would not be suitable for smaller area-specific investigations, the bulk data was very useful in identifying leads. We discovered several suspicious pages using this script.
Note: The Wikipedia API imposes a one-per-second limit on collecting page data, so iterating through all 10,000 pages (the maximum amount returned by the API) takes hours to complete.
Wikipedia is a joy to investigate: because there is no adversarial relationship between the researcher and the platform, there are dozens of tools available to help along the way. Access to the internals of Wikipedia is unparalleled relative to other platforms. Despite this openness, inauthentic behavior still slips through the cracks. The tools described in this guide, combined with investigatory creativity and intuition, will take you as far as you need to go. As a platform, Wikipedia stands beside you — not in your way.