What is your web scraper actually scraping? Do you have a minute? Can you come help?


Because I had previously taken on many small projects involving web scraping and LINE Bot applications, I always assumed that the technique, or at least the process, of web scraping was widely understood. And with the rise of AI tools like ChatGPT and Bard, nobody had asked me about scraping projects for a while. Then today a friend asked me about scraping data from the MLB website. I simply opened the Network tab in the browser's developer tools on the page they sent me and told them which API returned the data they wanted. Seeing their confused expression, I realized that many people have no idea how to start building a web scraper, or mistakenly believe that learning Python alone is enough to scrape data. That is how this article was born.

Myths and Misconceptions#

  1. As long as I learn Python, I can easily write a web scraper.
  2. As long as I can write code, I can effortlessly scrape the data I want.
  3. As long as I learn one set of web scraping techniques, I can scrape anything I want.

Myth-Busting Section#

  1. A web scraper is a program, and programs are written by people; it will not magically run correctly if you yourself do not understand the process it automates.
  2. Web scraping is simply a way to automate data retrieval, provided that you understand that retrieval process first.
  3. Before writing a web scraper, you must be able to retrieve the data manually and understand the entire flow.

Demonstration of Writing a Web Scraping Program - Using the MLB Website as an Example#

Target Data: 10 years of game data
Target Source: https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box
Data to be Scraped:

WP:Alvarado.
HBP:Harris II, M (by Walker, T); Riley, A (by Walker, T).
Pitches-strikes:Morton, C 104-64; Lee, D 8-4; Jiménez, J 11-8; Minter 13-9; Iglesias, R 15-10; Yates 14-9; Walker, T 103-53; Bellatti 21-13; Covey 18-14; Alvarado 16-13.
Groundouts-flyouts:Morton, C 2-4; Lee, D 1-0; Jiménez, J 0-0; Minter 1-0; Iglesias, R 0-2; Yates 1-0; Walker, T 3-2; Bellatti 0-2; Covey 5-0; Alvarado 2-0.
Batters faced:Morton, C 27; Lee, D 3; Jiménez, J 2; Minter 3; Iglesias, R 5; Yates 3; Walker, T 26; Bellatti 8; Covey 7; Alvarado 5.
Inherited runners-scored:Bellatti 1-1.
Umpires:HP: Larry Vanover. 1B: Jacob Metz. 2B: Edwin Moscoso. 3B: D.J. Reyburn.
Weather:78 degrees, Sunny.
Wind:4 mph, Out To CF.
First pitch:1:08 PM.
T:3:08.
Att:30,572.
Venue:Citizens Bank Park.
September 11, 2023

Step 1: Identify the Data Source#

Since we want to scrape data from the page, we first need to know where the page gets it. Data on a webpage generally comes from one of two places:

  • SSR (server-side rendering) - the backend returns the complete HTML page.
  • CSR (client-side rendering) - the frontend calls a backend API and renders the returned data on the page.

Here I open the browser's developer tools > Network, set the filter to Fetch/XHR, refresh the page, and check each request's response one by one. After going through them, we find that this request should be the API we are looking for, because its response contains the data displayed on the webpage.
Clicking into the request's Headers tab shows the API URL:
https://ws.statsapi.mlb.com/api/v1.1/game/717664/feed/live?language=en
We reasonably suspect that 717664 is the Game ID. To confirm, we can look at the webpage's URL.
https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box
So 717664 does appear to be the game ID, and the same pattern holds for other games:
https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/716590/final/box
https://ws.statsapi.mlb.com/api/v1.1/game/716590/feed/live?language=en
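If you want to sanity-check this mapping outside the browser, here is a minimal sketch (assuming the URL patterns above stay stable) that pulls the game ID out of a gameday URL and builds the corresponding live-feed URL:

import re

def gameday_url_to_feed_url(gameday_url: str) -> str:
  # The game ID is the numeric segment right before "/final/box" in the gameday URL
  match = re.search(r"/(\d+)/final/box", gameday_url)
  if not match:
    raise ValueError("no game ID found in URL")
  game_id = match.group(1)
  # Assemble the live-feed API URL observed in the Network tab
  return f"https://ws.statsapi.mlb.com/api/v1.1/game/{game_id}/feed/live?language=en"

print(gameday_url_to_feed_url(
  "https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box"
))
# -> https://ws.statsapi.mlb.com/api/v1.1/game/717664/feed/live?language=en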

Step 2: Confirm that the Data Source Contains the Required Data#

In the Network tab we can right-click the response and choose Copy Object, or open the Response tab, select all, and copy.
A simple way to check it is to paste it into an online JSON viewer (jsoneditoronline, json.parser) and use Ctrl + F to search for keywords. We can see that the info section of the API response JSON contains the data we want to scrape. Bingo!
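If you prefer not to paste the response into an online viewer, the same keyword check can be done locally. A minimal sketch, assuming the copied response has been saved to a file named response.json (a hypothetical filename):

import json

# Load the response that was copied out of the browser and saved locally
with open("response.json", "r", encoding="utf-8") as f:
  data = json.load(f)

# Serialize it back to text and search for values we expect to scrape
text = json.dumps(data)
for keyword in ["Citizens Bank Park", "Larry Vanover"]:
  print(keyword, "->", "found" if keyword in text else "not found")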

Step 3: Use External Tools to Verify API Feasibility#

Here we need to confirm whether the API requires special authentication or has other restrictions that only allow the webpage itself to access it. We can use Postman to test this; if you are not familiar with the tool, there are plenty of tutorials online.

This confirms that anyone who can call the API can obtain the data, so we can proceed to the next step.
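The same check can also be scripted instead of using Postman. A minimal sketch with requests: call the API with no cookies, tokens, or special headers and see whether it still answers:

import requests

url = "https://ws.statsapi.mlb.com/api/v1.1/game/717664/feed/live"
# A bare GET with only the language query parameter - nothing copied from the browser session
res = requests.get(url, params={"language": "en"}, timeout=10)

print(res.status_code)                  # 200 means the API is open to us
print(res.headers.get("Content-Type"))  # expect a JSON content type
print(list(res.json().keys()))          # top-level keys of the feed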

Step 4: How to Continuously Retrieve Information from Different Games#

Programs are designed by people, so to write a web scraper you must understand the entire process yourself; it is not something you get just from reading a book or tutorial. In this example, the overall logic of the scraper should be:

  1. Retrieve all game IDs for 10 years.
  2. Use the above API to get all game information for the 10 years.
  3. Store the game information in a variable and then write this information to a CSV file.

How do we grab all the game IDs? We can trim the game URL back one level to find its parent page:
https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box
Visiting
https://www.mlb.com/gameday/
redirects us to
https://www.mlb.com/scores
This is where we need to be. Next, we will use the method mentioned earlier to find which request retrieves this data.

Found the API:
https://bdfed.stitch.mlbinfra.com/bdfed/transform-mlb-scoreboard?stitch_env=prod&sortTemplate=4&sportId=1&&sportId=51&startDate=2023-09-11&endDate=2023-09-11&gameType=E&&gameType=S&&gameType=R&&gameType=F&&gameType=D&&gameType=L&&gameType=W&&gameType=A&&gameType=C&language=en&leagueId=104&&leagueId=103&&leagueId=160&contextTeamId=
Pasting this API URL into Postman shows the query parameters that accompany the call. The purpose of some parameters may not be obvious; to be safe, do not change them arbitrarily.
We can see, however, that startDate and endDate are among them. We can change these to test whether the API returns data for multiple days at once; if it does, that will speed up our data retrieval.
Bingo! We can retrieve all game data from 2023-09-01 to 2023-09-11 in a single call, but the request takes up to 8 seconds and the response is large. We should probably fetch at most one month at a time; requesting too much may cause a timeout.
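Before settling on a window size, it is worth measuring how a one-month request behaves. A minimal sketch follows; the query parameters are copied from the URL captured above, the one-month range is just a test value, and it assumes each entry of dates carries a games list:

import time
import requests

url = "https://bdfed.stitch.mlbinfra.com/bdfed/transform-mlb-scoreboard"
params = {
  "stitch_env": "prod",
  "sortTemplate": "4",
  "sportId": ["1", "51"],
  "startDate": "2023-08-01",  # assumed one-month test window
  "endDate": "2023-08-31",
  "gameType": ["E", "S", "R", "F", "D", "L", "W", "A", "C"],
  "language": "en",
  "leagueId": ["104", "103", "160"],
  "contextTeamId": "",
}

start = time.perf_counter()
res = requests.get(url, params=params, timeout=60)
elapsed = time.perf_counter() - start

# Count how many games came back and how big the payload is
games = sum(len(d.get("games", [])) for d in res.json().get("dates", []))
print(f"status={res.status_code} games={games} "
      f"size={len(res.content) / 1024:.0f} KiB time={elapsed:.1f}s")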

Step 5: Transform These Processes into Program Logic#

Don't rush to write code here. First turn the steps above into program flow and logic, and only then start coding. Below is a simple demonstration of how this process translates into a scraping program:

# Import required libraries
import requests
import calendar

# Declare global variables
# Declare scores_api_url
scores_api_url = "https://bdfed.stitch.mlbinfra.com/bdfed/transform-mlb-scoreboard"
# Declare game_api_url, with the part that varies by game replaced by the placeholder game_id
game_api_url = "https://ws.statsapi.mlb.com/api/v1.1/game/game_id/feed/live?language=en"
# Declare start year for scraping (as an int so range() works below)
start_year = 2012
# Declare end year for scraping
end_year = 2022
# Store all game data
game_data = []

# Main program block
def main():
  # game_data is a module-level variable, so declare it global before reassigning it
  global game_data
  day_list = get_month()
  # Loop through all months
  for seDay in day_list:
    # Use get_scores_data to retrieve all game IDs from the first to the last day of that month
    gameId_list = get_scores_data(seDay[0], seDay[1])

    # Use get_game_data to retrieve the data for every gameId and add it to game_data
    game_data = game_data + get_game_data(gameId_list)
  # Save game_data to CSV
  ...

# Get the first and last day of each month from start year to end year
def get_month() -> list:
  result = []
  for year in range(start_year, end_year + 1):
    for month in range(1, 13):
        # monthrange gives the weekday of the first day and the number of days in the month
        _, days_in_month = calendar.monthrange(year, month)
        
        # First day
        first_day = f"{year}-{month:02}-01"
        # Last day
        last_day = f"{year}-{month:02}-{days_in_month:02}"
        
        result.append((first_day, last_day))
  return result

# Function to scrape scores data
def get_scores_data(sDay: str, eDay: str) -> list:
  gameId_list = []
  # Replace with the correct URL
  url = scores_api_url
  # Set payload; repeated query parameters (sportId, gameType, leagueId) are passed as lists,
  # with values taken from the URL captured above
  payload = {
    "stitch_env": "prod",
    "sortTemplate": "4",
    "sportId": ["1", "51"],
    "startDate": sDay,
    "endDate": eDay,
    "gameType": ["E", "S", "R", "F", "D", "L", "W", "A", "C"],
    "language": "en",
    "leagueId": ["104", "103", "160"],
    "contextTeamId": "",
  }
  res = get_api(url, payload)
  if res != {}:
    # Loop through each day in the response
    for day in res.get("dates", []):
      # Loop through that day's games
      for game in day.get("games", []):
        gameId_list.append(game.get("gamePk"))
  return gameId_list

# Function to scrape game data
def get_game_data(gameId_list: list) -> list:
  result = []
  # Loop through gameId_list and retrieve the data for every gameId
  for gameId in gameId_list:
    # Substitute the actual gameId into the URL template
    url = game_api_url.replace("game_id", str(gameId))
    # Call the API and receive the response
    res = get_api(url, {})
    if res != {}:
      # Implement logic here to build gameData from the required fields of res
      ...
      result.append(gameData)
  return result

# Function to call API
def get_api(url: str, payload: dict) -> dict:
  res = requests.get(url, params=payload)
  if res.status_code == 200:
    return res.json()
  else:
    return {}
    
# Program entry point
if __name__ == '__main__':
    main()

That is the overall shape of the program for scraping 10 years of game data. Some functions are deliberately not implemented in detail, leaving them for you to try. Once it is complete, there are still many issues you may encounter and areas to optimize. Here are some potential problems and optimization directions:

  • Requests may time out when the returned data is too large.
  • Sudden bursts of API calls may cause the server to refuse access.
  • Holding too much data in memory at once may crash the program.
  • Saving the data to CSV is not yet implemented.
  • Already-fetched data could be saved so the scraper does not start over from scratch on the next run.
  • Multiprocessing could be used to speed things up (a minimal sketch follows the list).
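As one example of the last point, here is a minimal multiprocessing sketch; the worker count and helper function are illustrative and not part of the program above. It fetches several game feeds in parallel:

from multiprocessing import Pool

import requests

def fetch_game(game_id: int) -> dict:
  # Worker: fetch one game's live feed; return {} on failure so the pool keeps going
  url = f"https://ws.statsapi.mlb.com/api/v1.1/game/{game_id}/feed/live"
  try:
    res = requests.get(url, params={"language": "en"}, timeout=30)
    return res.json() if res.status_code == 200 else {}
  except requests.RequestException:
    return {}

if __name__ == "__main__":
  game_ids = [717664, 716590]  # in the real scraper these come from get_scores_data
  # Keep the pool small so we do not hammer the API and get blocked
  with Pool(processes=4) as pool:
    feeds = pool.map(fetch_game, game_ids)
  print(f"fetched {sum(1 for f in feeds if f)} of {len(game_ids)} games")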