Exploring Retrosheet’s November 2025 Negro Leagues Data Release, part 1

Introduction

Retrosheet’s second semi-annual release of 2025, posted on November 27, 2025, expanded its coverage of Negro League baseball statistics, especially with regard to the 1935 season. This is the first in a series of posts in which I’ll explore the data, with an eye toward what it tells us about the regular-season statistical record of the Negro Leagues that are now designated as major leagues: both the data we currently have and the data we know to still be missing.

For this project, I downloaded the current versions of all seven of the .csv files in which Retrosheet stores its Negro League data and explored them in a Positron workspace, primarily using Python. A repository of the data and my code files can be viewed on my GitHub page.

Before I jump in, I should note that Retrosheet’s data is NOT the officially recognized data source for MLB, which instead bases its official Negro Leagues statistics on the Seamheads Negro Leagues Database. Seamheads appears to have more extensive records for several of the seasons and leagues that are now designated as major leagues, and it says that its data is built entirely from game-level accounts (“our database… is built from the game level up, using box scores, newspaper articles, and scoresheets”). It doesn’t expose that game-level data, however; it offers seasonal summaries instead, which are more difficult for an outsider to analyze and verify. At a later time I may explore how the Seamheads statistics compare to Retrosheet’s data, but for now I’ll be looking at Retrosheet’s data in isolation.

Part 1: Seasons and Game Types Overview

The central game-level repository of Retrosheet’s Negro Leagues data is the file gameinfo.csv, which contains entries for 7,681 games ranging from 1903 to 1962. In addition to regular season and postseason games, the dataset includes many exhibition games: throughout this period, Negro League teams regularly engaged in barnstorming tours, which were at times a steadier source of revenue than league games.
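A minimal sketch of this kind of first look, using only Python's standard library. The column names `gameid`, `season`, and `gametype` are assumptions about the file layout, and the four rows below are made-up stand-ins rather than real Retrosheet records; with the real file, the rows would come from `open("gameinfo.csv", newline="")` instead of the in-memory sample.

```python
import csv
import io

# Made-up stand-in for gameinfo.csv. The column names are assumptions,
# and the game IDs are placeholders, not real Retrosheet identifiers.
SAMPLE = (
    "gameid,season,gametype\n"
    "ZZZ190305010,1903,exhibition\n"
    "ZZZ193506010,1935,regular\n"
    "ZZZ194208150,1942,regular\n"
    "ZZZ196207040,1962,exhibition\n"
)


def summarize(rows):
    """Return the game count and the earliest and latest season."""
    seasons = [int(row["season"]) for row in rows]
    return {
        "games": len(seasons),
        "first_season": min(seasons),
        "last_season": max(seasons),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize(rows)
print(summary)
```

Run against the full file, the same three numbers should line up with the totals quoted above.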

Although the dataset contains entries spanning nearly 60 years, the bulk of the data, including nearly every game that’s identified as “regular season,” covers the 1935-1949 seasons. Here’s a visualization of the distribution of all game types (including exhibitions) by season, illustrating the long ‘tails’ of the dataset:

Bar chart of the total games by season in Retrosheet's November 2025 Negro Leagues data release

When we focus on the games marked as ‘regular season’ only, we see that the distribution is even more tightly centered around the high-coverage years of 1935-1949, with very few games falling outside of those seasons.

Bar chart of the total number of regular-season games by season in Retrosheet's November 2025 Negro Leagues data release

In my analysis, I want to primarily focus on a subset of this data: regular season league games from the particular leagues and seasons that MLB has now officially designated as ‘major league baseball.’ These statistics are of particular interest because they are now considered to be a part of MLB’s official record books. Although the exhibition games and barnstorming tours are an important part of the history of these teams, they are NOT a part of that official statistical record. This reduces our dataset by quite a bit and narrows our focus to a core subset of leagues and seasons. To visualize this, I consolidated the ‘gametype’ column of gameinfo.csv into four categories: regular season games, exhibition games, all-star games, and postseason games.[1] Over a third of the 7,681 games in the set are NOT regular season games.
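The consolidation step can be sketched roughly as follows. The four postseason labels and the ‘Regular’/‘regular’ miscoding come from the footnote; the labels `exhibition` and `allstar` are assumptions about how the remaining categories are spelled in the file. Note also that this sketch lowercases every value as a blanket fix, which is a different approach from hand-correcting the one miscoded row.

```python
from collections import Counter

# Raw labels folded into a single 'postseason' category (per the footnote).
POSTSEASON_LABELS = {"championship", "lcs", "divisionseries", "playoff"}


def consolidate(gametype):
    """Map a raw gametype value onto one of the four broad categories."""
    value = gametype.strip().lower()  # also folds 'Regular' into 'regular'
    return "postseason" if value in POSTSEASON_LABELS else value


# Toy input; 'exhibition' and 'allstar' are assumed spellings.
raw_values = [
    "regular", "Regular", "exhibition", "allstar",
    "championship", "lcs", "divisionseries", "playoff",
]

counts = Counter(consolidate(value) for value in raw_values)
shares = {category: n / len(raw_values) for category, n in counts.items()}
print(counts)
print(shares)
```

The `shares` dictionary holds the proportions that a pie chart of game types would plot.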

Pie chart of game type percentages from Retrosheet's November 2025 Negro Leagues data release

To visualize the spread of game types by season, I first consolidated the tails of the dataset, aggregating the pre-1935 seasons into a pre-1920 bin and a 1920-1934 bin and aggregating all games after 1949 into a single 1950-and-later bin. I used those aggregated bins, alongside the individual 1935-1949 seasons, to construct a stacked bar chart.
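One way to express that binning is sketched below. The bin labels are my own shorthand (the final bin covers every season after 1949), and the six (season, category) pairs are toy values, not real counts.

```python
from collections import Counter, defaultdict


def season_bin(season):
    """Collapse the sparse early and late seasons into aggregate bins."""
    if season < 1920:
        return "pre-1920"
    if season < 1935:
        return "1920-1934"
    if season <= 1949:
        return str(season)  # high-coverage seasons keep their own bar
    return "1950+"


# Toy (season, consolidated game type) pairs for illustration only.
games = [
    (1903, "exhibition"),
    (1921, "regular"),
    (1935, "regular"),
    (1935, "exhibition"),
    (1949, "regular"),
    (1962, "exhibition"),
]

# One Counter per bin: each Counter holds the segment heights of one stacked bar.
stacks = defaultdict(Counter)
for season, category in games:
    stacks[season_bin(season)][category] += 1

for label, segments in stacks.items():
    print(label, dict(segments))
```

Each key of `stacks` corresponds to one bar, and each entry within it to one colored segment.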

Total games in Retrosheet's Negro League data by gametype and season

Although there is a smattering of regular season games from pre-1935 seasons, they number fewer than 100 total, compared to 200+ for nearly every season in the 1935-1949 span. Therefore, in subsequent posts I will narrow my focus to games from the 1935 season or later. I will also exclude the 1949 season: although its regular season games are relatively well documented, MLB did NOT include that season in its reclassification. Starting in my next post, I’ll begin working through this data season by season, examining each season’s regular-season games in greater detail.
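Put as a filter, that scope amounts to a simple predicate. This is a sketch that assumes the game type has already been normalized to `regular`; it only narrows by season and game type, and does not check which league a game belongs to.

```python
FIRST_SEASON = 1935  # first season kept in the narrowed dataset
LAST_SEASON = 1948   # 1949 is dropped because MLB's reclassification excludes it


def in_scope(season, category):
    """True for regular-season games in the 1935-1948 window."""
    return category == "regular" and FIRST_SEASON <= season <= LAST_SEASON


# Toy (season, category) pairs for illustration only.
games = [
    (1934, "regular"),
    (1935, "regular"),
    (1940, "exhibition"),
    (1948, "regular"),
    (1949, "regular"),
]

kept = [game for game in games if in_scope(*game)]
print(kept)
```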

  1. To be specific, the only consolidated category is ‘postseason,’ into which I folded the categories ‘championship’, ‘lcs’, ‘divisionseries’, and ‘playoff.’ In so doing, I discovered that one game was miscoded as ‘Regular’ rather than ‘regular’ (gameid HOM193508100). I corrected that error in my copy of the data. I’ve noticed a few errors like this in my work; I’ll be contacting Retrosheet with a list once I complete this project.