Classifying U.S. Congress with AI

Author

Joseph Welsh

Summary

Are you the kind of person who skips to the bottom of a Reddit post to read the TL;DR to see if you are really interested in the story before reading the full text? Are you often frustrated when a friend includes unnecessary details that cause their stories to drag on and on? If so, reading a congressional hearing transcript may be the trigger you absolutely don’t need right now. Fortunately, data analysis tools have progressed so that computers can read and generate text, providing you with a way out of reading pages of text just to get a summary of key points.

This analysis prompts a pretrained LLM with the text of each hearing and requests structured summary data, so that the significance of a hearing can be understood without reading or listening to the full transcript. I download PDFs of congressional hearing transcripts through the GovInfo API (api.govinfo.gov, which uses a data.gov API key) and feed the text to OpenAI’s gpt-4o model, using explicit response type definitions and prompts to elicit the desired summary information from the LLM.

Large Language Models (LLMs) are very good at predicting the next “token” in a sequence. Tokens are to a sentence what bricks are to a house. A token by itself may just be “Fly”, but an LLM that has been trained on popular Frank Sinatra songs may predict that “me”, “to”, “the”, and “moon” follow it. This analysis takes advantage of that ability to create insights from U.S. congressional hearing transcripts.
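To make the next-token idea concrete, here is a toy sketch in base R. It is not how an LLM works internally (real models use neural networks over sub-word tokens); it only counts which word follows which in a training line and then predicts the most frequent follower. The function names `train_bigrams`, `predict_next`, and `generate` are made up for this illustration.

```r
# Toy next-word predictor built from bigram counts (illustration only).

# Count how often each word is followed by each other word.
train_bigrams <- function(lines) {
  pairs <- do.call(rbind, lapply(strsplit(tolower(lines), "\\s+"), function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(current = head(w, -1), nxt = tail(w, -1))
  }))
  table(pairs$current, pairs$nxt)
}

# Return the most frequent follower of `token`, or NA if it was never seen
# with a follower during training.
predict_next <- function(counts, token) {
  token <- tolower(token)
  if (!token %in% rownames(counts)) return(NA_character_)
  colnames(counts)[which.max(counts[token, ])]
}

# Repeatedly predict the next word, starting from `start`.
generate <- function(counts, start, n = 4) {
  out <- tolower(start)
  for (i in seq_len(n)) {
    nxt <- predict_next(counts, out[length(out)])
    if (is.na(nxt)) break
    out <- c(out, nxt)
  }
  paste(out, collapse = " ")
}

counts <- train_bigrams("Fly me to the moon")
generate(counts, "Fly")
```

After "training" on the single line, starting from "Fly" walks through the remaining four words one prediction at a time.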

Reading in congressional hearings

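To read in PDFs of the hearings, we first need a data.gov API key. With the key in hand, the function below lists hearing packages (the CHRG collection) for a given congress number, restricted to packages last modified on or after January 1, 2018. Each row includes a packageId and a packageLink that identify the files to download for that hearing.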

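# List hearing (CHRG) packages from the GovInfo API.
# key: data.gov API key; pageSize: results per page; congress: congress number.
pull_chrg = 
  function(key=NULL, pageSize=10, congress=118){
    # Fail early instead of sending a request without a key
    stopifnot(is.character(key))
    url =
      paste0("https://api.govinfo.gov/collections/",
           "CHRG/",
           # only packages last modified on or after this timestamp
           "2018-01-01T20%3A18%3A10Z?",
           "pageSize=",
           pageSize,
           "&",
           "congress=",
           congress,
           "&",
           # "*" (URL-encoded) requests the first page of results
           "offsetMark=%2A&",
           "api_key=",
           key)
    # Keep only the columns needed to identify and download each hearing
    jsonlite::fromJSON(url)$packages |> 
      dplyr::select(packageId, title, 
                    packageLink, congress, 
                    dateIssued)
  }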

With the package list in hand, each hearing's zip file can be downloaded and unzipped. The code below then reads the extracted PDFs from the working directory.
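The download step itself is not shown in this post. A minimal sketch of it might look like the following. The helper names `zip_url` and `download_hearings` and the `GOVINFO_KEY` environment variable are hypothetical, and the `/packages/{packageId}/zip` endpoint and the assumption that each zip unpacks into a folder named after its packageId should be checked against the GovInfo API documentation before use.

```r
# Sketch only: download and unzip each hearing package.

# Build the (assumed) GovInfo zip endpoint for one package.
zip_url <- function(package_id, key) {
  paste0("https://api.govinfo.gov/packages/", package_id,
         "/zip?api_key=", key)
}

# Download each package zip into `dir` and unzip it there.
# Returns the paths of the zip files.
download_hearings <- function(package_ids, key, dir = ".") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  vapply(package_ids, function(id) {
    dest <- file.path(dir, paste0(id, ".zip"))
    # Skip packages that were already downloaded
    if (!file.exists(dest)) {
      download.file(zip_url(id, key), destfile = dest, mode = "wb")
    }
    unzip(dest, exdir = dir)
    dest
  }, character(1))
}

# Usage (needs a real API key and network access):
# key <- Sys.getenv("GOVINFO_KEY")
# hearings <- pull_chrg(key = key, pageSize = 10, congress = 118)
# download_hearings(hearings$packageId, key = key)
```

Unzipping into the working directory keeps the packageId as the first path component, which is what the next step relies on.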

library(tidyverse)

pdf_df = 
  # Read the already-downloaded, unzipped files from the working directory
  tibble(text = list.files(recursive = TRUE)) |> 
  # Keep only PDFs (the dot is escaped so it is matched literally)
  filter(str_detect(text, "\\.pdf$")) |> 
  # Extract the text of each PDF, one element per page
  mutate(pdfText = map(text, pdftools::pdf_text)) |> 
  unnest_longer(col = pdfText) |> 
  # The first path component is the packageId, e.g. CHRG-118hhrg55911
  transmute(pdfText = str_squish(pdfText),
            packageId = str_extract(text, "^[^/]+")) |> 
  # Join the pages of each hearing into one string
  aggregate(pdfText ~ packageId, FUN = paste, collapse = " ") |> 
  # Keep the first 15,000 words: max tokens (30,000) divided by 2 tokens per word,
  # i.e. the average tokens per word in English (1.5) plus .5 since congress likes the big words.
  # word() returns NA when a hearing has fewer words than the cutoff,
  # so the final filter drops those hearings.
  mutate(text=map(pdfText,\(hearing){word(hearing, 1, (30000/2) )})) |> 
  select(name=packageId, text) |> 
  filter(!is.na(text))

| name | text |
| --- | --- |
| CHRG-118hhrg55911 | EXAMINING THE PRESIDENT’S FY 2025 BUDGET REQUEST FOR THE U.S. FOREST SERVICE OVERSIGHT HEARING BEFORE THE SUBCOMMITTEE ON FEDERAL LANDS OF THE COMMITTEE ON NATURAL RESOURCES U.S. HOUSE OF REPRESENTATIVES ONE HUNDRED EIGHTEENTH CONGRESS SECOND SESSION Tuesday, June 4, 2024 Serial No. 118–127 Printed for the use of the Committee on Natural Resources ( Available via the World Wide Web: http://www.govinfo.gov or Committee address: http://naturalre... |
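The truncation step caps each hearing at 30000/2 = 15,000 words. As far as I can tell, stringr's `word()` returns NA when the requested end position is past the last word, which is why hearings shorter than the cap fall out at the final filter. If keeping short hearings is preferred, a base R helper along these lines could be swapped in. `truncate_words` is a hypothetical name and not part of the original pipeline.

```r
# Sketch: cap a text at `max_words` words without dropping short texts.
truncate_words <- function(text, max_words) {
  # Split on single spaces (the text was already squished upstream)
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  # head() returns all words when there are fewer than max_words
  paste(head(words, max_words), collapse = " ")
}

# Word budget used in the pipeline: token limit / assumed tokens per word
max_tokens <- 30000
tokens_per_word <- 2
word_budget <- max_tokens / tokens_per_word

truncate_words("one two three four", 2)    # long text is cut
truncate_words("one two", word_budget)     # short text is kept whole
```

In the pipeline this would replace the `word(hearing, 1, (30000/2))` call, at the cost of sending some very short documents to the model.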

AI analysis

Now that the PDFs are loaded and their text is processed, the text can be sent to an LLM. We will use OpenAI’s gpt-4o model. The ellmer package connects to many different AI APIs and also allows precise specification of the data type the model should return. The more precisely the return type is specified, the more useful the model’s response.

I run the analysis below. At the time of writing, it costs about 40 cents to run at the OpenAI API's token rates for gpt-4o.
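The cost figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below counts input tokens only and ignores the short responses. The price per million tokens is an assumption I have plugged in for illustration, not a number from this post, so check the current OpenAI pricing page before relying on it. `estimate_cost` is a hypothetical helper.

```r
# Rough input-token cost estimate (sketch; the price is an ASSUMPTION).
estimate_cost <- function(n_hearings, words_per_hearing,
                          tokens_per_word, usd_per_million_tokens) {
  input_tokens <- n_hearings * words_per_hearing * tokens_per_word
  input_tokens / 1e6 * usd_per_million_tokens
}

cost <- estimate_cost(
  n_hearings = 7,                 # hearings in the results table
  words_per_hearing = 15000,      # the 30000/2 word cap applied earlier
  tokens_per_word = 1.5,          # English average cited in the code comment
  usd_per_million_tokens = 2.50   # ASSUMED gpt-4o input price, not from the source
)
round(cost, 2)
```

Under these assumptions the estimate lands close to the 40 cents quoted above. Longer hearings, more hearings, or a different price would shift it proportionally.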

library(ellmer)

get_gpt_summary = 
  function(text){
    
    # Structure of the summary returned for one hearing
    type_summary <- type_object(
      "Summary of hearing.",
      name = type_string("Title of the hearing."),
      num_attended = type_integer("Number of congress people in attendance."),
      num_reps = type_integer("Number of republican congress people in attendance."),
      num_dems = type_integer("Number of democrat congress people in attendance."),
      # Restrict the topic to a fixed set of categories
      topics = type_enum(
        'Topic of the hearing',
        values = c(
          "Environment",
          "Immigration",
          "Law Enforcement",
          "Emergency Response",
          "Agriculture",
          "Housing",
          "Technology",
          "Business")
      ),
      summary = type_string("Summary of the hearing. 100 words max.")
    )
    
    # The model returns an array of summaries
    type_hearings_summary <- type_array(items = type_summary)
    
    # Name the model explicitly instead of relying on the package default
    chat <- chat_openai(
      model = "gpt-4o",
      system_prompt = "You are a chatbot that summarizes congressional hearings."
    )
    
    # Request structured data matching the type defined above.
    # Newer ellmer releases rename this method to chat_structured().
    response = chat$extract_data(text[1], type = type_hearings_summary)
    
    return(response)
  }


# Summarize each hearing; this makes one API call per row
out_df = 
  pdf_df |> 
  transmute(responses = 
              map(text, get_gpt_summary,
                  .progress = TRUE))

Results

Now that the chatbot has read the text of each PDF and returned an answer for each function call, the results just need to be unnested into a data frame.
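A minimal sketch of that unnesting step, assuming each element of `out_df$responses` comes back as a one-row data frame with the fields defined in the type (the exact shape depends on the ellmer version, so inspect one response first). The mock values below are copied from two rows of the table that follows, and the summaries are placeholders.

```r
# Mock of the list column in out_df (assumed shape: one data frame per hearing).
mock_responses <- list(
  data.frame(name = "President's FY 2025 Budget Request for the U.S. Forest Service",
             num_attended = 25L, num_reps = 14L, num_dems = 11L,
             topics = "Environment", summary = "(summary text)"),
  data.frame(name = "SAFEGUARDING WORKERS' RIGHTS AND LIBERTIES",
             num_attended = 21L, num_reps = 12L, num_dems = 9L,
             topics = "Business", summary = "(summary text)")
)

# Stack the per-hearing data frames into one table.
# With the tidyverse, tidyr::unnest(out_df, responses) does the same job.
results <- do.call(rbind, mock_responses)
results[, c("name", "num_attended", "topics")]
```

The stacked table can then be passed to a table package for display.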

U.S. Congressional Hearings
AI summary of hearings based on pdf text extracted from the U.S. Government API

| name | attendance: total | attendance: republicans | attendance: democrats | topic | summary |
| --- | --- | --- | --- | --- | --- |
| President’s FY 2025 Budget Request for the U.S. Forest Service | 25 | 14 | 11 | Environment | The hearing examined the President's budget request for the U.S. Forest Service, focusing on funding and management strategies to address wildfire crises, forest health, and firefighter support. The budget proposes $8.9 billion, including funds for hazardous fuels reduction and permanent pay increases for firefighters. Concerns were raised regarding the Forest Service’s ability to reduce wildfire risks and meet timber harvest goals. The need for regulatory reforms, funding continuity for long-term forest management, and enhancing forest products infrastructure were also discussed. |
| SAFEGUARDING WORKERS’ RIGHTS AND LIBERTIES | 21 | 12 | 9 | Business | The hearing focused on the debate over the National Right to Work Act, aiming to eliminate compulsory union membership. Proponents argued that workers should not be forced to pay union dues, highlighting personal testimonies of pressure and threats faced by employees. Opponents contended that right-to-work laws weaken unions, reduce workers' wages, and limit resources for collective bargaining and training. The discussion underscored differing views on union influence, workers' rights, and economic impacts, with no consensus reached among the committee members. |
| American Indian and Alaska Native Public Witness Day 1—Morning Session | 15 | 9 | 6 | Environment | The hearing focused on the needs of American Indian and Alaska Native communities, emphasizing the importance of honoring treaty obligations. Key issues discussed included the environmental cleanup of the Gay Mine Site, inadequate law enforcement funding, and poor road maintenance. Witnesses, including Lee Juan Tyler, highlighted the long-standing neglect in these areas and the necessity for increased funding and strategic planning to improve the living conditions and safety of their communities. |
| Oversight of the Department of Transportation's Policies and Programs and Fiscal Year 2025 Budget Request | 54 | 28 | 26 | Business | The hearing assessed the Department of Transportation's policies and programs with a focus on the FY 2025 budget request. Secretary Pete Buttigieg discussed infrastructure projects backed by the Bipartisan Infrastructure Law, emphasizing improvements in transportation infrastructure like bridges, airports, and rail systems. The FAA Reauthorization Act of 2024 and its implementation also featured prominently. Additionally, attention was given to rail safety regulations following recent incidents, such as the Norfolk Southern derailment in East Palestine. The hearing underscored ongoing legislative efforts essential for furthering rail and general transportation safety. |
| SECURITY RISK: THE UNPRECEDENTED SURGE IN CHINESE ILLEGAL IMMIGRATION | 8 | 3 | 3 | Immigration | The hearing examined the surge of Chinese nationals illegally entering the U.S. via the southern border, raising national security concerns. The discussion highlighted economic instability and political repression in China as push factors, with Chinese nationals using social media to organize journeys. There were concerns over inadequate vetting processes at the U.S. border, which might allow individuals with malicious intent to enter. Recommendations included strengthening border security and enhancing interagency collaboration to counter Chinese influence and ensure national safety. Historical perspectives on Chinese immigration and its challenges were also considered. |
| The Costs of Inaction: Economic Risks from Housing Unaffordability | 20 | 6 | 14 | Housing | The hearing examined the economic risks associated with housing unaffordability in the U.S. Witnesses, including Rhode Island's House Speaker, experts from housing-related organizations, and economists, discussed various aspects of the housing crisis. They highlighted the shortage of affordable housing and the rising costs, partly due to regulatory barriers and insufficient supply. Proposals discussed included increasing housing production, reforming zoning laws, and revisiting federal housing policies. The debate also touched on the impact of federal fiscal policies on housing markets. Various models and solutions were proposed to boost housing supply, affordability, and ownership. |
| FIGHTING FRAUD: HOW SCAMMERS ARE STEALING FROM OLDER ADULTS | 8 | 3 | 5 | Law Enforcement | The hearing focused on scams targeting older adults, leading to significant losses, estimated by the FTC to be $137 billion due to underreporting. Witnesses discussed various scams, including tech support and cryptocurrencies, and emphasized the importance of reporting these crimes to aid law enforcement efforts. Emotional and financial impacts on victims were significant, and policy recommendations included reinstating the casualty loss deduction and enhancing regulation of cryptocurrency ATMs. Calls were made for a coordinated national strategy to combat these scams, involving education, technology improvements, and stronger law enforcement collaboration. |

Now the reader has a succinct summary of each hearing and can analyze attendance by topic to gauge which topics may matter most to Congress.
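As a sketch of that attendance-by-topic analysis, the snippet below re-enters the attendance numbers from the table above by hand and averages total attendance per topic in base R. With only seven hearings this is an illustration of the method, not evidence of what Congress prioritizes.

```r
# Attendance figures copied from the results table above.
hearings <- data.frame(
  topic = c("Environment", "Business", "Environment", "Business",
            "Immigration", "Housing", "Law Enforcement"),
  total = c(25, 21, 15, 54, 8, 20, 8),
  republicans = c(14, 12, 9, 28, 3, 6, 3),
  democrats = c(11, 9, 6, 26, 3, 14, 5)
)

# Mean total attendance per topic, highest first.
by_topic <- aggregate(total ~ topic, data = hearings, FUN = mean)
by_topic <- by_topic[order(-by_topic$total), ]
by_topic
```

Business comes out on top here, driven largely by the well-attended Department of Transportation hearing.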

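Refining the return types further would give a clearer picture of which topics Congress discusses. For example, the Department of Transportation hearing is labeled Business and the American Indian and Alaska Native hearing is labeled Environment, likely because the list of topics offers no closer category. If the programmer carefully limits the options the model picks from, the responses are more consistent and easier to aggregate into insights.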