Background: The American League and National League, which together form modern Major League Baseball, were segregated institutions until Jackie Robinson broke baseball’s “color barrier” by joining the Brooklyn Dodgers in 1947. In response to this discrimination, a series of alternative professional baseball leagues formed for African-American players; these were known collectively as the “Negro Leagues.” After Robinson’s National League debut, other top Black players began to enter the American League and National League, which eventually led to the decline and dissolution of the Negro Leagues.

I am using a dataset I downloaded from the baseball data website Retrosheet (https://www.retrosheet.org/). This dataset contains all of Retrosheet’s current data on baseball games from the Negro Leagues. My research question is: how did Negro League baseball games change in 1948 and 1949, after Jackie Robinson broke the MLB color barrier in 1947? To answer the question, I’ll investigate the scoring environment (total runs scored per game) and attendance in the seasons immediately before and immediately after the integration of Major League Baseball.

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
game_info = pd.read_csv('gameinfo.csv')

In [3]:

game_info.head

Out[3]:

<bound method NDFrame.head of                gid visteam hometeam   site      date  number starttime  \
0     PHG190309120     CUX      PHG  PHI10  19030912       0    0:00PM   
1     CUX190309131     PHG      CUX  NYC18  19030913       1       NaN   
2     PHG190309132     CUX      PHG  NYC18  19030913       2       NaN   
3     CUX190309140     PHG      CUX  TRE02  19030914       0    0:00PM   
4     PHG190309150     CUX      PHG  CAM02  19030915       0    0:00PM   
...            ...     ...      ...    ...       ...     ...       ...   
6316  ASW195808310     ASE      ASW  NYC16  19580831       0       NaN   
6317  ASW195908090     ASE      ASW  CHI10  19590809       0       NaN   
6318  ASW196008210     ASE      ASW  CHI10  19600821       0       NaN   
6319  ASE196108200     ASW      ASE  NYC16  19610820       0    0:00PM   
6320  ASW196208260     ASE      ASW  KAN05  19620826       0       NaN   

     daynight  innings  tiebreaker  ... vruns hruns  wteam lteam line  \
0         day      NaN         NaN  ...     4     2    CUX   PHG    y   
1         day      NaN         NaN  ...     1     8    CUX   PHG    y   
2         day      NaN         NaN  ...     2     5    PHG   CUX    y   
3         day      NaN         NaN  ...     1     3    CUX   PHG    y   
4         day      NaN         NaN  ...     0     3    PHG   CUX    y   
...       ...      ...         ...  ...   ...   ...    ...   ...  ...   
6316      day      NaN         NaN  ...     6     5    ASE   ASW    y   
6317      day      NaN         NaN  ...     7     8    ASW   ASE    y   
6318      day      NaN         NaN  ...     4     8    ASW   ASE    y   
6319      day      NaN         NaN  ...     7     1    ASW   ASE    y   
6320      day      NaN         NaN  ...     2     5    ASW   ASE    y   

     batteries lineups  box  pbp  season  
0         both       y    y    d    1903  
1         both       y    y  NaN    1903  
2         both       y    y  NaN    1903  
3         both       y    y    d    1903  
4         both       y    y    d    1903  
...        ...     ...  ...  ...     ...  
6316      both     NaN  NaN  NaN    1958  
6317      both     NaN  NaN  NaN    1959  
6318      both     NaN  NaN  NaN    1960  
6319      both       y    y    y    1961  
6320      both     NaN  NaN  NaN    1962  

[6321 rows x 42 columns]>
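
A note on the output above: `game_info.head` without parentheses refers to the method instead of calling it, which is why the output begins with `<bound method NDFrame.head of` and then prints the truncated repr of the whole frame. The sketch below shows the difference on a small made-up frame (`demo` and its values are hypothetical, not Retrosheet data).

```python
import pandas as pd

# Hypothetical stand-in for game_info; the values are made up for illustration.
demo = pd.DataFrame({
    "season": [1946, 1946, 1947, 1947, 1948, 1948, 1949],
    "vruns": [9, 7, 2, 4, 3, 2, 1],
    "hruns": [8, 8, 0, 9, 8, 11, 3],
})

method = demo.head      # no parentheses: a reference to the method itself
preview = demo.head()   # with parentheses: a DataFrame holding the first 5 rows

print(callable(method))  # True
print(preview.shape)     # (5, 3)
print(demo.shape)        # (7, 3): row and column counts without printing the frame
```

Calling `.head()` keeps the output short, while `.shape` still reports the row count that the full repr shows in its last line.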

In [4]:

# Add a totalruns column to this dataset:
game_info['totalruns'] = game_info['vruns'] + game_info['hruns']
game_info.head

Out[4]:

<bound method NDFrame.head of                gid visteam hometeam   site      date  number starttime  \
0     PHG190309120     CUX      PHG  PHI10  19030912       0    0:00PM   
1     CUX190309131     PHG      CUX  NYC18  19030913       1       NaN   
2     PHG190309132     CUX      PHG  NYC18  19030913       2       NaN   
3     CUX190309140     PHG      CUX  TRE02  19030914       0    0:00PM   
4     PHG190309150     CUX      PHG  CAM02  19030915       0    0:00PM   
...            ...     ...      ...    ...       ...     ...       ...   
6316  ASW195808310     ASE      ASW  NYC16  19580831       0       NaN   
6317  ASW195908090     ASE      ASW  CHI10  19590809       0       NaN   
6318  ASW196008210     ASE      ASW  CHI10  19600821       0       NaN   
6319  ASE196108200     ASW      ASE  NYC16  19610820       0    0:00PM   
6320  ASW196208260     ASE      ASW  KAN05  19620826       0       NaN   

     daynight  innings  tiebreaker  ... hruns wteam  lteam line batteries  \
0         day      NaN         NaN  ...     2   CUX    PHG    y      both   
1         day      NaN         NaN  ...     8   CUX    PHG    y      both   
2         day      NaN         NaN  ...     5   PHG    CUX    y      both   
3         day      NaN         NaN  ...     3   CUX    PHG    y      both   
4         day      NaN         NaN  ...     3   PHG    CUX    y      both   
...       ...      ...         ...  ...   ...   ...    ...  ...       ...   
6316      day      NaN         NaN  ...     5   ASE    ASW    y      both   
6317      day      NaN         NaN  ...     8   ASW    ASE    y      both   
6318      day      NaN         NaN  ...     8   ASW    ASE    y      both   
6319      day      NaN         NaN  ...     1   ASW    ASE    y      both   
6320      day      NaN         NaN  ...     5   ASW    ASE    y      both   

     lineups  box  pbp season  totalruns  
0          y    y    d   1903          6  
1          y    y  NaN   1903          9  
2          y    y  NaN   1903          7  
3          y    y    d   1903          4  
4          y    y    d   1903          3  
...      ...  ...  ...    ...        ...  
6316     NaN  NaN  NaN   1958         11  
6317     NaN  NaN  NaN   1959         15  
6318     NaN  NaN  NaN   1960         12  
6319       y    y    y   1961          8  
6320     NaN  NaN  NaN   1962          7  

[6321 rows x 43 columns]>

In [5]:

# Keep only the columns we care about in order to make the data set easier to work with
game_info_sliced = game_info[['gid', 'visteam', 'hometeam', 'attendance', 'gametype', 'vruns', 'hruns', 'season', 'totalruns']]
game_info_sliced.head

Out[5]:

<bound method NDFrame.head of                gid visteam hometeam attendance      gametype  vruns  hruns  \
0     PHG190309120     CUX      PHG       3887  championship      4      2   
1     CUX190309131     PHG      CUX       3000  championship      1      8   
2     PHG190309132     CUX      PHG       8000  championship      2      5   
3     CUX190309140     PHG      CUX       2500  championship      1      3   
4     PHG190309150     CUX      PHG       3000  championship      0      3   
...            ...     ...      ...        ...           ...    ...    ...   
6316  ASW195808310     ASE      ASW        NaN       allstar      6      5   
6317  ASW195908090     ASE      ASW       8923       allstar      7      8   
6318  ASW196008210     ASE      ASW       5000       allstar      4      8   
6319  ASE196108200     ASW      ASE       7245       allstar      7      1   
6320  ASW196208260     ASE      ASW        NaN       allstar      2      5   

      season  totalruns  
0       1903          6  
1       1903          9  
2       1903          7  
3       1903          4  
4       1903          3  
...      ...        ...  
6316    1958         11  
6317    1959         15  
6318    1960         12  
6319    1961          8  
6320    1962          7  

[6321 rows x 9 columns]>

In [6]:

# Create a filter that selects only the regular season games from the 1949 season. Save as a new DataFrame, forty_nine.
forty_nine_filter = game_info_sliced['season'] == 1949
regular_season_filter = game_info_sliced['gametype'].str.contains('regular')
forty_nine = game_info_sliced[forty_nine_filter & regular_season_filter]
forty_nine.head

Out[6]:

<bound method NDFrame.head of                gid visteam hometeam attendance gametype  vruns  hruns  season  \
5891  BIR194904300     HOE      BIR       4636  regular      1      3    1949   
5892  BLG194905011     KCM      BLG       5588  regular      5      3    1949   
5893  BLG194905012     KCM      BLG       5588  regular      1      5    1949   
5894  LCB194905010     PH5      LCB       3350  regular      1      2    1949   
5895  BIR194905020     HOE      BIR          0  regular      3      2    1949   
...            ...     ...      ...        ...      ...    ...    ...     ...   
6265  CAG194909052     IN9      CAG        NaN  regular      0      2    1949   
6266  KCM194909051     NY6      KCM       2500  regular      6      5    1949   
6267  KCM194909052     NY6      KCM       2500  regular      2      0    1949   
6268  MEM194909051     BIR      MEM        NaN  regular      0      1    1949   
6269  MEM194909052     BIR      MEM        NaN  regular      1      8    1949   

      totalruns  
5891          4  
5892          8  
5893          6  
5894          3  
5895          5  
...         ...  
6265          2  
6266         11  
6267          2  
6268          1  
6269          9  

[370 rows x 9 columns]>

In [7]:

# Now do the same for '48:
forty_eight_filter = game_info_sliced['season'] == 1948
forty_eight = game_info_sliced[forty_eight_filter & regular_season_filter]
forty_eight.head

Out[7]:

<bound method NDFrame.head of                gid visteam hometeam attendance gametype  vruns  hruns  season  \
5323  HOM194804290     BLG      HOM        NaN  regular      3      8    1948   
5326  BIR194805010     CVB      BIR       8000  regular      2     11    1948   
5327  PH5194805010     NY6      PH5          0  regular     17      9    1948   
5328  BIR194805021     CVB      BIR       6117  regular      9      7    1948   
5329  BIR194805022     CVB      BIR       6117  regular      6      7    1948   
...            ...     ...      ...        ...      ...    ...    ...     ...   
5806  CVB194809090     CAG      CVB        400  regular      0      3    1948   
5807  IN9194809090     KCM      IN9       1500  regular      9      4    1948   
5811  BLG194809121     NW2      BLG          0  regular      5      1    1948   
5812  BLG194809122     NW2      BLG        NaN  regular      8      2    1948   
5817  NY6194809120     PH5      NY6       3200  regular     10      8    1948   

      totalruns  
5323         11  
5326         13  
5327         26  
5328         16  
5329         13  
...         ...  
5806          3  
5807         13  
5811          6  
5812         10  
5817         18  

[422 rows x 9 columns]>

In [8]:

# Now do the same for '47:
forty_seven_filter = game_info_sliced['season'] == 1947
forty_seven = game_info_sliced[forty_seven_filter & regular_season_filter]
forty_seven.head

Out[8]:

<bound method NDFrame.head of                gid visteam hometeam attendance gametype  vruns  hruns  season  \
4742  HOM194705031     NY6      HOM        NaN  regular      2      0    1947   
4743  HOM194705032     NY6      HOM        NaN  regular      4      9    1947   
4746  BLG194705041     PH5      BLG       6800  regular      4      1    1947   
4747  BLG194705042     PH5      BLG       6800  regular      2      7    1947   
4748  CVB194705040     BIR      CVB       6623  regular      4      9    1947   
...            ...     ...      ...        ...      ...    ...    ...     ...   
5266  NY6194709190     CVB      NY6       5500  regular      5      5    1947   
5269  NY6194709210     CVB      NY6       9000  regular     10      7    1947   
5271  CVB194709230     NY6      CVB       6000  regular      6      0    1947   
5274  NY6194709240     CVB      NY6       1739  regular      4      9    1947   
5281  CVB194709270     NY6      CVB       4500  regular      6      5    1947   

      totalruns  
4742          2  
4743         13  
4746          5  
4747          9  
4748         13  
...         ...  
5266         10  
5269         17  
5271          6  
5274         13  
5281         11  

[438 rows x 9 columns]>

In [9]:

# Now the last year I'll look at, '46:
forty_six_filter = game_info_sliced['season'] == 1946
forty_six = game_info_sliced[forty_six_filter & regular_season_filter]
forty_six.head

Out[9]:

<bound method NDFrame.head of                gid visteam hometeam attendance gametype  vruns  hruns  season  \
4163  BLG194605051     HOM      BLG       6729  regular      9      8    1946   
4164  BLG194605052     HOM      BLG       6729  regular      7      8    1946   
4165  CAG194605051     KCM      CAG      12000  regular      2      9    1946   
4166  CAG194605052     KCM      CAG      12000  regular      4      3    1946   
4167  CVB194605051     BIR      CVB       8364  regular      1      2    1946   
...            ...     ...      ...        ...      ...    ...    ...     ...   
4651  BLG194609151     NW2      BLG       3500  regular     12     13    1946   
4652  BLG194609152     NW2      BLG       3500  regular      3      6    1946   
4653  HOM194609151     NY6      HOM        NaN  regular      3      8    1946   
4654  HOM194609152     NY6      HOM        NaN  regular      1      2    1946   
4655  MEM194609150     CAG      MEM        NaN  regular     12      9    1946   

      totalruns  
4163         17  
4164         15  
4165         11  
4166          7  
4167          3  
...         ...  
4651         25  
4652          9  
4653         11  
4654          3  
4655         21  

[407 rows x 9 columns]>
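
The four per-season cells above repeat the same filter with a different year each time. One way to avoid the repetition is a small helper, sketched here under some assumptions: the function name `regular_season_games` is hypothetical, the demo rows are made up, and the sketch uses an exact match on `gametype` where the notebook uses `str.contains('regular')`, which only gives the same result if the label is always exactly `regular`.

```python
import pandas as pd

def regular_season_games(games: pd.DataFrame, season: int) -> pd.DataFrame:
    """Return a copy of the regular-season rows for one season."""
    is_season = games["season"] == season
    is_regular = games["gametype"] == "regular"
    # .copy() gives an independent frame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning.
    return games[is_season & is_regular].copy()

# Hypothetical rows standing in for game_info_sliced.
demo = pd.DataFrame({
    "season": [1946, 1946, 1947, 1948, 1949, 1949],
    "gametype": ["regular", "allstar", "regular", "regular", "regular", "championship"],
    "totalruns": [17, 11, 5, 13, 4, 9],
})

# One frame per season, keyed by year.
by_season = {year: regular_season_games(demo, year) for year in range(1946, 1950)}
for year, frame in by_season.items():
    print(year, len(frame))
```

With the real data, `by_season[1949]` would play the role of `forty_nine`, and so on for the other years.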

In [10]:

# Let's start with a simple plot of mean totalruns per game for each season
x = [1946, 1947, 1948, 1949]
y = [forty_six['totalruns'].mean(), forty_seven['totalruns'].mean(), forty_eight['totalruns'].mean(), forty_nine['totalruns'].mean()]
plt.axis([1946, 1949, 6, 12])
plt.xticks(np.arange(1946, 1950, step=1))
# Stop at 12.5 so that the top tick (12) is included
plt.yticks(np.arange(6, 12.5, step=0.5))
plt.xlabel('Season')
plt.ylabel('Mean total runs per game')
plt.plot(x, y)

Out[10]:

[<matplotlib.lines.Line2D at 0x278f5866790>]

[Figure: line plot of mean total runs per game by season, 1946 to 1949]
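
The per-season means plotted above can also be computed in one step with `groupby`, which additionally makes it easy to report the number of games behind each mean. This is a minimal sketch on made-up rows (`demo` is hypothetical); the column names match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for game_info_sliced.
demo = pd.DataFrame({
    "season": [1946, 1946, 1947, 1947, 1948, 1949],
    "gametype": ["regular"] * 6,
    "totalruns": [17, 15, 2, 13, 11, 4],
})

# Restrict to regular-season games in the 1946-1949 window (between is inclusive).
window = demo[demo["season"].between(1946, 1949) & (demo["gametype"] == "regular")]

# Games played, mean and median total runs for each season.
summary = window.groupby("season")["totalruns"].agg(["count", "mean", "median"])
print(summary)
```

Showing the count next to the mean helps when the seasons differ in how many games survive the filters.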

In [11]:

# Now, make a box plot of total runs by year:

data = [forty_six['totalruns'], forty_seven['totalruns'], forty_eight['totalruns'], forty_nine['totalruns']]

plt.boxplot(data)
plt.xticks([1, 2, 3, 4], ['1946', '1947', '1948', '1949'])
plt.show()

Now, let’s check whether attendance decreased over these seasons as some of the Negro Leagues’ biggest stars transitioned to the American and National Leagues.

In [12]:

# First, drop the NaN attendance values from all of our dataframes:
forty_six = forty_six.dropna(subset=['attendance'])
forty_seven = forty_seven.dropna(subset=['attendance'])
forty_eight = forty_eight.dropna(subset=['attendance'])
forty_nine = forty_nine.dropna(subset=['attendance'])

In [19]:

# Convert the attendance strings to integers ('48 needs extra cleaning and is handled separately)
forty_six = forty_six.assign(attendance=forty_six['attendance'].astype(str).astype(int))
forty_seven = forty_seven.assign(attendance=forty_seven['attendance'].astype(str).astype(int))
forty_nine = forty_nine.assign(attendance=forty_nine['attendance'].astype(str).astype(int))

In [20]:

# forty_eight needs a bit more attention, because some of its attendance values include non-digits
forty_eight.loc[:, ('attendance')] = forty_eight[['attendance']].astype(str)
forty_eight.dtypes

Out[20]:

gid           object
visteam       object
hometeam      object
attendance    object
gametype      object
vruns          int64
hruns          int64
season         int64
totalruns      int64
dtype: object

In [25]:

# Now, let's find and remove the non-digit characters (">", "<", "?") inside the forty_eight attendance numbers.
# regex=False treats each pattern as a literal string, so "?" is not parsed as a regex quantifier.
forty_eight.loc[:, 'attendance'] = forty_eight['attendance'].str.replace(">", "", regex=False)
forty_eight.loc[:, 'attendance'] = forty_eight['attendance'].str.replace("<", "", regex=False)
forty_eight.loc[:, 'attendance'] = forty_eight['attendance'].str.replace("?", "", regex=False)

In [29]:

# With the stray characters removed, convert the '48 attendance values to integers
forty_eight = forty_eight.assign(attendance=forty_eight['attendance'].astype(int))
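
The attendance cleanup above could be collected into one reusable function, which would let all four seasons go through the same steps. The sketch below is one possible version, not the notebook's actual code: `clean_attendance` is a hypothetical name, the sample values are made up, and it assumes `>`, `<` and `?` are the only stray characters, as in the 1948 data.

```python
import pandas as pd

def clean_attendance(raw: pd.Series) -> pd.Series:
    """Drop missing values, strip '>', '<' and '?', and convert to integers."""
    text = raw.dropna().astype(str)
    # The character class matches any one of the three stray symbols.
    digits = text.str.replace(r"[<>?]", "", regex=True)
    # Anything that still is not a number becomes NaN and is dropped.
    numbers = pd.to_numeric(digits, errors="coerce")
    return numbers.dropna().astype(int)

# Hypothetical attendance strings, including a missing value.
raw = pd.Series(["8000", ">6117", "<400", "1500?", None, "0"])
cleaned = clean_attendance(raw)
print(cleaned.tolist())  # [8000, 6117, 400, 1500, 0]
```

Stripping the symbols treats a value such as `>6117` as exactly 6117, so any information that the figure was a bound or an estimate is lost; that trade-off is worth keeping in mind when reading the attendance plots.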

In [32]:

# Finally, we need to drop all rows from each data frame in which attendance is listed as zero.
forty_six = forty_six[forty_six['attendance'] != 0]
forty_seven = forty_seven[forty_seven['attendance'] != 0]
forty_eight = forty_eight[forty_eight['attendance'] != 0]
forty_nine = forty_nine[forty_nine['attendance'] != 0]
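
Because rows with missing or zero attendance are removed before averaging, it can help to know how many games remain in each season. The helper below is a hypothetical sketch (the name `attendance_coverage` and the sample values are made up) that would be applied to a season's frame before any rows are dropped.

```python
import pandas as pd

def attendance_coverage(frame: pd.DataFrame) -> pd.Series:
    """Count missing, zero and usable attendance values in one frame."""
    # Non-numeric entries (including missing values) become NaN.
    attendance = pd.to_numeric(frame["attendance"], errors="coerce")
    return pd.Series({
        "games": len(frame),
        "missing": int(attendance.isna().sum()),
        "zero": int((attendance == 0).sum()),
        "usable": int((attendance > 0).sum()),
    })

# Hypothetical season with two missing values and one zero.
demo = pd.DataFrame({"attendance": ["4636", "5588", None, "0", "2500", None]})
coverage = attendance_coverage(demo)
print(coverage)
```

If the usable share differs a lot between seasons, a change in mean attendance may partly reflect which games had a recorded figure.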

In [33]:

# Plot mean attendance per game for each season
x = [1946, 1947, 1948, 1949]
y = [forty_six['attendance'].mean(), forty_seven['attendance'].mean(), forty_eight['attendance'].mean(), forty_nine['attendance'].mean()]
plt.xticks(np.arange(1946, 1950, step=1))
plt.xlabel('Season')
plt.ylabel('Mean attendance per game')
plt.plot(x, y)

Out[33]:

[<matplotlib.lines.Line2D at 0x278f58670d0>]

In [34]:

# And let's round things out with a boxplot:

data = [forty_six['attendance'], forty_seven['attendance'], forty_eight['attendance'], forty_nine['attendance']]

plt.boxplot(data)
plt.xticks([1, 2, 3, 4], ['1946', '1947', '1948', '1949'])
plt.show()
