Exploring Retrosheet’s November 2025 Negro Leagues Data Release, part 2: Known Unknowns and Unknown Unknowns
Or: How Cocaína García’s Nickname Immediately Derailed Me
When I downloaded Retrosheet’s November 2025 Negro Leagues data, my intention was to keep this project narrowly focused on exploring the data within that release, avoiding detours and rabbit holes.
But my scope widened almost immediately as I was reading the box score of the May 11th, 1935 game between the Pittsburgh Crawfords and the New York Cubans. According to Retrosheet’s data, this was the Crawfords’ third regular season game of 1935, but the first for which Retrosheet has a full box score. The Crawfords would go on to win the league that year, finishing with the best regular season record and defeating the Cubans in the league’s championship series.
That Pittsburgh team is of particular interest to me not because it was a champion, but because it employed Josh Gibson. When MLB incorporated the Seamheads Negro Leagues Database into the official major league statistics in May of 2024, Gibson made headlines by supplanting Ty Cobb as the all-time major league leader in career batting average, edging out Cobb’s average of .367 with a career average of .372.1 (Is this the first time that a major sports record has changed hands more than 60 years after both principals have died?) The short length of Negro League seasons effectively locks out players like Gibson from ‘counting stats’ leaderboards (career home runs, pitcher wins, etc.), but they’re on a more even playing field in ‘rate stats’ like batting average or ERA, and Gibson’s career batting average title was one of the more notable consequences of MLB’s decision to classify certain Negro League seasons as part of the official major league record.
We know that our Negro Leagues data is fragmentary. When the first tranche of Negro Leagues statistics was added to the official MLB record in May 2024, MLB’s press release estimated that the stats were about 75% complete. A February 2025 update added more data, but it’s clear that some portion remains unrecovered. For counting stats, this means that current career totals are likely to increase if and when more games are recovered (though they sometimes decrease as erroneous data is discovered and corrected). Rate stats, though, can move in either direction with every new game discovered, meaning that these records will remain unstable. If new Josh Gibson games are recovered, and those games happen to include a disproportionate number of bad performances, it’s possible that Cobb could regain his place atop the career batting average leaderboard–like a record chase between a pair of zombies.
For this reason, I’m interested in estimating how many “missing” Gibson at bats are still out there. What’s the current number of at bats in the denominator of his batting average, and how much could his batting average fluctuate if new at bats are discovered and added? As I’m writing this post in December of 2025, the official MLB.com stat page for Josh Gibson lists 843 hits over 2,272 at bats for a .371 career batting average. The Seamheads batting leaderboard shows Gibson with 1,088 hits over 3,004 at bats for a .362 batting average in Negro League play. This large discrepancy is surprising, as the official MLB statistics are supposedly based upon the Seamheads data. My best guess is that the Seamheads leaderboard includes exhibition and postseason games while the MLB stats are limited to in-league, regular season matchups. If there’s a way to filter that Seamheads leaderboard to only include regular season games, or if there’s something else obvious I’m missing here, feel free to contact me and I’ll update this post accordingly. Unfortunately, as neither MLB.com nor Seamheads provides access to individual game logs, it’s difficult for an independent researcher to assess how complete this data coverage is and how much Gibson’s average might change as new data comes to light.
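To make the stakes concrete, here is a rough sketch of how sensitive that .371 figure is to newly recovered at bats. It uses the MLB.com totals quoted above and treats Cobb’s .367 as an exact threshold, which is an assumption (the published figure is rounded); the function names and the 500-at-bat scenario are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Figures quoted above from MLB.com (December 2025).
GIBSON_HITS = 843
GIBSON_AB = 2272
# Cobb's average as quoted in this post; assumed exact for this sketch.
COBB_AVG = Fraction(367, 1000)


def updated_average(hits, at_bats, new_hits, new_at_bats):
    """Batting average after adding newly recovered at bats."""
    return Fraction(hits + new_hits, at_bats + new_at_bats)


def hitless_at_bats_to_fall_below(hits, at_bats, threshold):
    """Smallest number of extra hitless at bats that drops the average below threshold."""
    extra = 0
    while Fraction(hits, at_bats + extra) >= threshold:
        extra += 1
    return extra


current = Fraction(GIBSON_HITS, GIBSON_AB)
needed = hitless_at_bats_to_fall_below(GIBSON_HITS, GIBSON_AB, COBB_AVG)
print(f"current average: {float(current):.3f}")
print(f"hitless at bats needed to fall below Cobb: {needed}")

# Hypothetical scenario: 500 recovered at bats in which Gibson hit .340 (170 hits).
scenario = updated_average(GIBSON_HITS, GIBSON_AB, 170, 500)
print(f"after 500 more at bats at .340: {float(scenario):.3f}")
```

Under these assumptions, 26 hitless at bats (roughly a bad week) would be enough to drop Gibson below .367, which is why the size of the unrecovered record matters so much for this particular leaderboard.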
Which brings me back to the Retrosheet Negro Leagues data, the primary focus of this project. It’s not the data deemed “official” by MLB, but it’s much more granular and transparent, allowing me to perform a wider range of analyses. Crucially, Retrosheet provides game-level data, including information about how complete each game’s statistics are. In the best-case scenarios, Retrosheet’s researchers have recovered complete box scores; for other games, they’ve collected partial information from newspaper accounts, indicating, for instance, the final score, the winning and losing pitchers, perhaps a key home run, but lacking specific details such as each hitter’s number of at bats, times on base, strikeouts, etc. In the case of the 1935 season, the box score coverage is fairly solid: over three-quarters of the 1935 regular-season games in the Retrosheet dataset include box scores, compared to only about 56% of the regular-season games across all of the seasons represented in the data.
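A coverage figure like the ones above can be computed with a simple grouped count. The sketch below runs on toy records rather than the real files: the `season` and `has_box` field names are hypothetical stand-ins, not Retrosheet’s actual column names, and the counts are invented so the arithmetic is easy to follow.

```python
from collections import defaultdict

# Toy records standing in for rows of a game-level file. The field names
# `season` and `has_box` are hypothetical, not Retrosheet's actual columns.
games = [
    {"season": 1935, "has_box": True},
    {"season": 1935, "has_box": True},
    {"season": 1935, "has_box": True},
    {"season": 1935, "has_box": False},
    {"season": 1936, "has_box": True},
    {"season": 1936, "has_box": False},
]


def box_score_coverage(rows):
    """Return {season: share of that season's games with a full box score}."""
    totals = defaultdict(int)
    with_box = defaultdict(int)
    for row in rows:
        totals[row["season"]] += 1
        with_box[row["season"]] += int(row["has_box"])
    return {season: with_box[season] / totals[season] for season in totals}


coverage = box_score_coverage(games)
overall = sum(g["has_box"] for g in games) / len(games)

for season, share in sorted(coverage.items()):
    print(season, f"{share:.0%}")
print("all seasons", f"{overall:.0%}")
```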


According to Retrosheet’s data, the Pittsburgh Crawfords opened their 1935 Negro National League season with a May 7th double-header on the road against the Nashville Elite Giants, taking game one by a score of 8-5 and winning game two 3-1. Neither game has a box score, and there’s very little in the way of individual statistics. We do learn that Gibson sat out the second game; tantalizingly, in the first game, Gibson’s name appears both as a catcher and as a pitcher, but there’s no specific information about his performance that day. Fortunately, the next Pittsburgh game in the dataset–the previously referenced May 11th game against the New York Cubans–does have a complete box score, which preserves a tough day at the plate for Gibson, who went 0-4 with a strikeout. Gibson batted 4th for the Crawfords that day. In the first inning, all three of the hitters who preceded Gibson (a stacked top of the lineup including fellow Hall of Famers Oscar Charleston and Cool Papa Bell) reached against the Cubans’ starter. New York then replaced their starter with Johnny Taylor, who retired Gibson every time they faced off over eight innings of relief work in which Taylor kept the Cubans close in their eventual 6-5 loss.
It was my interest in Gibson that led me to that May 11th box score, but it was that ineffective Cubans starter that diverted me to a side quest: a name like “Cocaína García” is going to get me to open a new tab 100% of the time. I first checked his SABR bio, which unfortunately is only a placeholder at the moment. But near the top of García’s Google results is a 2014 publication from the Center for Negro League Baseball Research profiling García, credited to Dr. Layton Revel and Luis Munoz, which offers this explanation for his attention-grabbing name: “When players batted against him it was said they seemed drugged by his pitches and unable to concentrate or focus on the baseball, hence the nickname ‘Cocaína.’”2 After answering my most pressing question, I continued to skim the article, pausing at this line in the article’s account of the New York Cubans’ 1935 season: “On September 2nd the Afro American reported that the New York Cubans ended the second half of the season with an unbelievable record of 20-7 (.741). Combined with their first half record, the Cubans finished the regular season with a record of 29-24 (.547).”3
Why, exactly, did that catch my eye? Well, according to Seamheads, these were the 1935 Negro National League final standings:

Retrosheet, by contrast, has the following final table:

The attentive reader will note that none of these eight records agree. My initial thought was that one must be more complete than the other, but not every discrepancy runs in the same direction: for instance, for Pittsburgh, Seamheads lists an additional loss, while for the Eagles, Retrosheet has an additional loss.
I double-checked that the contents of Retrosheet’s .csv files were in fact consistent with the above standings, which are posted on Retrosheet’s webpage for the 1935 Negro National League season. That may seem like a redundant step, but I have found inconsistencies in these sources in other seasons. In this case, though, the standings on Retrosheet’s website are consistent with the data that I extracted from their .csv files. This output refers to the teams by Retrosheet’s three-character team codes, in which “NY6” corresponds to the New York Cubans:
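A consistency check of this kind can be sketched as follows: tally wins, losses, and ties from per-game scores, then compare the result against a published table. Everything here is illustrative. The field names and scores are invented, and apart from “NY6” the team codes (“AAA”, “BBB”) are placeholders rather than real Retrosheet codes.

```python
from collections import defaultdict

# Toy game results. Field names are hypothetical and the scores are invented;
# only the code "NY6" comes from Retrosheet, "AAA" and "BBB" are placeholders.
games = [
    {"home": "AAA", "visitor": "NY6", "home_runs": 6, "visitor_runs": 5},
    {"home": "NY6", "visitor": "BBB", "home_runs": 4, "visitor_runs": 4},
    {"home": "BBB", "visitor": "AAA", "home_runs": 2, "visitor_runs": 7},
]


def tally_standings(rows):
    """Return {team: [wins, losses, ties]} computed from per-game scores."""
    table = defaultdict(lambda: [0, 0, 0])
    for g in rows:
        if g["home_runs"] == g["visitor_runs"]:
            table[g["home"]][2] += 1
            table[g["visitor"]][2] += 1
        else:
            home_won = g["home_runs"] > g["visitor_runs"]
            winner = g["home"] if home_won else g["visitor"]
            loser = g["visitor"] if home_won else g["home"]
            table[winner][0] += 1
            table[loser][1] += 1
    return dict(table)


def mismatches(computed, published):
    """Teams whose computed record differs from the published (W, L, T) record."""
    return {
        team: (computed.get(team), record)
        for team, record in published.items()
        if computed.get(team) != list(record)
    }


standings = tally_standings(games)
published = {"AAA": (2, 0, 0), "NY6": (0, 1, 1), "BBB": (0, 1, 1)}
print(standings)
print("mismatches:", mismatches(standings, published))
```

An empty mismatch dictionary means the game-level file and the published table agree; any entry it does return pairs the computed record with the published one for that team.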

If the waters aren’t muddy enough for you yet, here’s another set of standings, from Baseball Reference. Baseball Reference credits the work of Gary Ashwill, the driving force behind the Seamheads Negro Leagues database, as one of the primary sources of their Negro Leagues data, so you might expect that Baseball Reference’s standings would accord with those on Seamheads. You would be incorrect:

Three credible sources, three different sets of standings–and in Revel and Munoz’s profile of Cocaína García, we have a final record for the New York Cubans that seems to be sourced from a contemporary newspaper account but which doesn’t accord with any of these three sources.4
Ultimately, this isn’t so surprising, as every source on Negro League data emphasizes its own incompleteness. Given the state of the historical record, it’s clear that many Negro Leagues games are still missing, or only incompletely recorded; sadly, it seems unlikely that we’ll ever recover a complete statistical record of these leagues and players, despite the laudable work of the researchers who are laboring to compile it. If Seamheads were to release the game-level records underlying their statistics, we could compare them to Retrosheet to produce something at least a bit more definitive than what we have now. For example, Retrosheet has the 1935 Newark Dodgers with a record of 16-46-1 (rough year!); Seamheads pegs them at 19-45-1. Does this mean that Retrosheet has access to exactly one game that Seamheads doesn’t (the 46th loss), and Seamheads has exactly three games that Retrosheet is missing (wins 17 through 19)? Or is it possible that the 16 victories and 45 losses shared in common between the two sources are in fact partially non-overlapping? Or that a game that Seamheads has recorded as a Newark victory is in the Retrosheet data as a Newark loss? Unfortunately, without the Seamheads game-level data, I am unable to make those sorts of more granular comparisons.
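If game-level records were available from both sources, the comparison described above would reduce to a few set operations. The sketch below assumes each source can be keyed by date, opponent, and game number; that key layout, the opponent labels, and every entry are hypothetical, since the Seamheads game-level data is not public.

```python
# Each source maps a game key to Newark's result in that game. The key
# layout (date, opponent, game number) and all entries are hypothetical.
source_a = {
    ("1935-06-01", "OPP1", 1): "W",
    ("1935-06-01", "OPP1", 2): "L",
    ("1935-06-08", "OPP2", 1): "L",
}
source_b = {
    ("1935-06-01", "OPP1", 1): "W",
    ("1935-06-01", "OPP1", 2): "W",  # same game, opposite result
    ("1935-06-15", "OPP3", 1): "W",
}


def compare_sources(a, b):
    """Split games into: only in a, only in b, and shared but with conflicting results."""
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    conflicts = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return only_a, only_b, conflicts


only_a, only_b, conflicts = compare_sources(source_a, source_b)
print("only in source A:", only_a)
print("only in source B:", only_b)
print("conflicting results:", conflicts)
```

Each of the three lists corresponds to one of the possibilities raised above: games one source has and the other lacks, in either direction, and games both sources have but score differently.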
The former humanities student in me realizes that one obvious next step here would be for me to engage directly with the primary sources: start digging through the newspaper archives and join my energy to the other researchers who are working on the front lines of these efforts. But this is a data science project, not an archival project. I want to use the data we have–primarily from Retrosheet, whose data is more open and, thus, more useful–to assess the statistical record as it’s currently understood and to make informed estimates of what may yet be missing.
So, with this prologue out of the way, I am setting out on this project with the following objectives, ordered from least to most ambitious:
- Explore the Retrosheet data to flag as many inconsistencies and errors as I can, which I will share with Retrosheet so that they can clean them in future releases.
- For each league and season from 1935-1948, separate the box score-informed data from incomplete game accounts with only partial coverage.
- Use the “known unknowns”–games that have an entry in the Retrosheet data but lack a full box score–to create informed estimates of the portion of the statistics that are as yet unrecovered.
For now, I’ll be setting aside the “unknown unknowns”: the games that exist outside of the Retrosheet data but may perhaps be represented within the Seamheads data, as well as the games that haven’t yet made it into any of these datasets but may still be lurking in some undigitized collection, waiting to be found.
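As a preview of what a “known unknowns” estimate might involve, here is one deliberately naive sketch: take a player’s at-bats-per-game rate from the games that do have box scores and scale it across the games that don’t. This is an illustration rather than the method later posts will necessarily use, and the function name and input numbers are hypothetical.

```python
def estimate_missing_at_bats(box_at_bats, box_games, non_box_games):
    """Naively scale the at-bats-per-game rate from box score games
    to the games that lack a box score."""
    if box_games == 0:
        raise ValueError("need at least one box score game to estimate a rate")
    rate = box_at_bats / box_games
    return rate * non_box_games


# Hypothetical inputs: 300 at bats across 75 box score games,
# plus 25 games that have no box score.
estimate = estimate_missing_at_bats(300, 75, 25)
print(estimate)  # 100.0
```

The obvious weakness is the assumption that the player appeared in every game without a box score; as the May 7th double-header shows, Gibson sat out at least one such game, so a more careful estimate would need to account for appearances.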
In my next post, I’ll start with the 1935 Negro National League season and begin working my way toward the 1948 Negro National League and Negro American League seasons.
1. As of February of 2025, Gibson’s official career batting average is now .371.
2. Upon reading that explanation, I remembered that I had actually heard of Cocaína García when he was namedropped on Episode 2293 of the “Effectively Wild” podcast. In my defense, if my memory retained everything I’ve learned from “Effectively Wild,” there would be room for little else.
3. Revel and Munoz included a scan of that September 2nd article, which does indeed report a 20-7 finish to the season for the Cubans, but I’m not clear on which source provided them with the first half record or the full season record that informed their 29-24 overall figure, as that information does not appear in the scanned article.
4. Further adding to the fun, as I write this post on January 2nd, 2026, Wikipedia has a completely different set of 1935 Negro National League standings, which are attributed to Seamheads despite not according with the Seamheads data.


