Automated Speech Recognition Aids in Transcription and Captioning for 62,000 Hours of Archived Television News

Jim Duran, director of the Vanderbilt Television News Archive (VTNA), is pleased to announce the recent completion of transcription and captioning for 62,000 hours of recorded television news from a collection that began in 1968 and continues today. Captioning improves the archive’s accessibility by supporting users with hearing impairments and opens the archive to new methods of research using text and data mining. Transcripts and captions for Vanderbilt’s entire collection of recorded television news were completed using Automated Speech Recognition (ASR) through a partnership with the College of Arts and Science. The VTNA has long recognized the importance of captions and transcripts, but until current technology became available, this massive undertaking was not feasible. Working with Associate University Librarian Cliff Anderson, Duran designed a workflow using four Python scripts running simultaneously on three different machines to automate the transcription process, resulting in 89,000 transcripts covering 62,000 hours of content.

Custom Language Models

The Amazon Web Services (AWS) Transcribe service used for this project relies on ASR to generate transcripts from the audio track of a digital video file. By default, the service uses a language model built on existing data. Duran and Anderson increased transcription accuracy by applying a custom language model trained on examples of text that closely resemble the output of a television news transcription job. This text sample, known as a training set, needed to contain 250,000 to 500,000 words of near-perfect text. VTNA partnered with 3Play Media to complete this task.
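As a rough illustration of the size requirement described above, the following Python sketch checks whether a collection of transcript texts falls within the 250,000 to 500,000 word range. The function names and the whitespace-based word count are assumptions made for this example; they are not taken from the archive's actual scripts.

```python
# Hypothetical helper: checks a training set against the word-count
# range mentioned in the article. Words are counted by splitting on
# whitespace, which is an assumption, not the service's own definition.
MIN_WORDS = 250_000
MAX_WORDS = 500_000


def count_words(texts):
    """Return the total number of whitespace-delimited words in an iterable of strings."""
    return sum(len(text.split()) for text in texts)


def training_set_status(texts, min_words=MIN_WORDS, max_words=MAX_WORDS):
    """Return (total_words, status) where status describes the fit to the target range."""
    total = count_words(texts)
    if total < min_words:
        return total, "too small"
    if total > max_words:
        return total, "above target"
    return total, "within target"
```

A caller could pass the text of each returned transcript and add more transcripts to the batch until the status reads "within target".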

This past spring, Vanderbilt University licensed 3Play Media, a professional transcription service. With support from library staff member Susan Grider, approximately 75 transcripts were ordered for each presidential administration from Richard Nixon to Donald Trump. Splitting the collection by presidential administration made it more likely that the names appearing most often in the news for those years would be transcribed correctly. Duran and VTNA staff member Dana Currier then identified the proper spelling of names and places featured in the archived news records in VTNA’s existing database. Once a strategy was determined, Duran wrote a Python script to find the most commonly referenced names in the news for each presidential administration. Six hundred names were provided to 3Play Media for improved accuracy.
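The name-finding step might look something like the following Python sketch, which tallies name mentions per administration and keeps the most frequent ones. The input shape (pairs of an administration label and a list of names drawn from a database record) and the function name are assumptions for illustration; the article does not describe the actual script or the database schema.

```python
from collections import Counter, defaultdict


def top_names_by_administration(records, n=3):
    """Return the n most frequently mentioned names for each administration.

    `records` is an iterable of (administration, names) pairs, where
    `names` lists the names found in one news record. This input shape
    is hypothetical.
    """
    counts = defaultdict(Counter)
    for administration, names in records:
        counts[administration].update(names)
    return {
        administration: [name for name, _ in counter.most_common(n)]
        for administration, counter in counts.items()
    }


# Example with made-up records.
sample = [
    ("Nixon", ["Henry Kissinger", "Spiro Agnew"]),
    ("Nixon", ["Henry Kissinger"]),
    ("Carter", ["Cyrus Vance"]),
]
top = top_names_by_administration(sample, n=1)
```

The resulting per-administration lists could then be handed to the transcription service as a spelling reference.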

When the near-perfect transcripts for each batch were returned to the archive, Duran compiled them into a custom language model training set for Amazon Web Services. That model was then used on all news recordings for the matching date range.
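One way to match a recording to the model for its date range is sketched below in Python. The model names are invented for this example, and sending dates outside the Nixon-to-Trump range to the nearest model is an assumption. The dictionary mirrors the keyword arguments that boto3's Transcribe client accepts in `start_transcription_job`, but nothing here contacts AWS, and the archive's real job settings are not described in the article.

```python
from bisect import bisect_right
from datetime import date

# (first day in office, hypothetical custom language model name)
MODELS = [
    (date(1969, 1, 20), "vtna-nixon"),
    (date(1974, 8, 9), "vtna-ford"),
    (date(1977, 1, 20), "vtna-carter"),
    (date(1981, 1, 20), "vtna-reagan"),
    (date(1989, 1, 20), "vtna-bush41"),
    (date(1993, 1, 20), "vtna-clinton"),
    (date(2001, 1, 20), "vtna-bush43"),
    (date(2009, 1, 20), "vtna-obama"),
    (date(2017, 1, 20), "vtna-trump"),
]
_STARTS = [start for start, _ in MODELS]


def model_for(broadcast_date):
    """Pick the model whose administration covers the broadcast date.

    Assumption: dates before the first entry use the earliest model and
    dates after the last entry use the latest one.
    """
    index = bisect_right(_STARTS, broadcast_date) - 1
    return MODELS[max(index, 0)][1]


def transcription_request(job_name, media_uri, broadcast_date):
    """Build keyword arguments for a transcription job (not sent anywhere)."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": "en-US",
        "ModelSettings": {"LanguageModelName": model_for(broadcast_date)},
    }
```

Keeping the date boundaries in one sorted list means that changing the mapping is a single edit in the table.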

“We receive many requests for computational access to the Television News Archive, but before now we lacked the infrastructure for most machine learning projects. I am so excited about the research potential of the newly created transcripts, especially when combined with the existing database of timecoded, titled and abstracted news stories and commercials,” said Duran. “By merging the two datasets, users will have a truly unique and powerful source for studying news media across nearly six decades.”

Closed Captions

The VTNA will be using the time-coded transcripts to embed closed captions in all access copies of the collection. Video captioning is an essential element of accessible collections. The recorded videos did not originally include captions. Using the ASR transcripts, we will add the text to the video streams with FFmpeg, managed with Python. The process will take some time to complete, starting with the oldest files and moving to the present. We hope to finish by October 2022.
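A minimal Python sketch of this kind of captioning step follows. It converts time-coded segments to the SubRip (SRT) format and builds an FFmpeg command that adds them as a subtitle track. The segment shape, the function names, the choice of SRT, and the use of MP4 output with the `mov_text` subtitle codec are all assumptions for illustration; the archive's actual file formats and FFmpeg options are not stated.

```python
def srt_timestamp(seconds):
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_seconds, end_seconds, text) segments as SRT text."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


def ffmpeg_caption_command(video_in, srt_in, video_out):
    """Build an FFmpeg argument list that copies the audio and video streams
    and adds the SRT file as a subtitle track (assumes an MP4 output)."""
    return [
        "ffmpeg", "-i", video_in, "-i", srt_in,
        "-c", "copy", "-c:s", "mov_text", video_out,
    ]
```

The argument list could be passed to `subprocess.run` for each file, which lets a Python script work through the collection from the oldest recordings forward.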

New Research Methods

With the newly created transcripts, VTNA will be participating in the creation of a data lakehouse at Vanderbilt University. This data solution will be a new resource for scientists interested in researching multiple big data collections at once. Duran excitedly shared, “We hope to have capacity in the data lakehouse to accommodate the study of visual and audio elements of the digital video. For example, a researcher could use machine learning to identify the usage of specific imagery such as wildfires, airline accidents or law enforcement; or studies on color schemes or sound effects.” Focusing on collections of historical news, the data lakehouse will make machine learning projects possible through secure and scalable data management.

If you would like updates on the progress of the news media research center, please contact the Vanderbilt Television News Archive.