Fine-Tuning Whisper for Air Traffic Control: 84% Improvement in Transcription Accuracy

Fine-tuning OpenAI’s Whisper medium.en on Air Traffic Control (ATC) data achieved a 15.08% Word Error Rate (WER), an 84% relative reduction in errors compared with the pretrained model, offering a promising way to improve aviation communication and safety.
ai
project
nlp
aviation
fine-tuning
Author

Jack Tol

Published

October 9, 2024

Section 1 | Introduction

Aviation is one of the most important and consequential industries in the world. It is what allows us to be as connected and technologically advanced as we are. Not many people talk about it this way, but without the aviation industry, the world as we know it quite literally couldn’t exist. Effective communication is critical for aviation safety, yet communication errors remain one of the leading causes of aviation incidents. This project aims to address that problem by fine-tuning OpenAI’s Whisper model on a dataset of pilot-Air Traffic Control (ATC) communications, providing a way to transcribe ATC audio into accurate text logs, with the ultimate goal of reducing communication-based aviation incidents. The fine-tuning process led to significant improvements, achieving a Word Error Rate (WER) of 15.08%, a relative improvement of 84.06% over the pretrained model.

According to ATAG.org:

“In 2019, 4.5 billion passengers were carried by the world’s airlines,” with “8.68 trillion Revenue Passenger Kilometers (RPK),” and “nearly 61 million tonnes of cargo were carried by air in 2019” (Air Transport Action Group 2023).

These figures cover just one year, and notably the last year before the COVID-19 pandemic disrupted air travel, so there’s reason to believe that as traffic recovers and grows in 2024 and beyond, the numbers will be even higher. Aviation is also one of the most safety-critical industries in the world. Commercial aircraft can cost hundreds of millions of dollars to produce, may carry millions of dollars worth of cargo, and can hold hundreds of lives.

With so much at stake, aviation is a particularly fascinating field of study, not only because of the consequences of failure, but also due to the complexity of the pilot’s role and other peripheral positions. During flight training, one of the aspects that stood out most to me was the emphasis on human factors. Human factors is the study of how humans interact with various elements of the aviation system, with a particular focus on optimizing safety, efficiency, situational awareness, and decision-making. Effective communication and Crew Resource Management (CRM) play crucial roles in building strong teamwork and reducing errors.

Section 2 | The Problem

The importance of communication cannot be overstated, especially when considering its critical role in aviation safety. In fact, communication issues have been identified as a leading cause of aviation incidents. In 1981, an analysis of 28,000 incident reports from NASA’s Aviation Safety Reporting System (ASRS) revealed that over 70% of aviation incidents resulted from problems with information transfer, primarily related to voice communications. These issues included ambiguous phrasing, inaccurate or incomplete messages, phonetic confusion, and distorted transmissions (Wilson 2016).

In 2011, the International Air Transport Association (IATA) conducted a study involving 2,070 airline pilots. The study identified the use of non-standard and ambiguous phraseology as the top communication issue, with 44% of pilots encountering non-standard language on every flight. Misinterpretation of numbers, such as confusing “descend two four zero zero” with “descend to four zero zero,” has led to significant incidents, including the 1989 Controlled Flight Into Terrain (CFIT) crash in Kuala Lumpur (Wilson 2016).

To address these errors, the International Civil Aviation Organization (ICAO) has introduced standardized aviation phraseology, which all pilots must use across all aviation jurisdictions worldwide. While English is the international language of aviation, challenges still arise due to accent differences among international pilots and varying levels of English proficiency. Tragically, in many cases, it is not engineering failures or a lack of understanding that cause accidents, but rather human error, as is the case in so many other areas of life.

Section 3 | The Solution

As we all know, in the last few years, we’ve made tremendous progress with AI systems. Generative AI in text, audio, image, and video generation, complex multi-step reasoning agents, and breakthroughs in AI-powered robotics, such as the Figure line of robots, are all rapidly changing the world for the better. Another area that is seeing significant improvement is Automatic Speech Recognition (ASR). OpenAI’s Whisper models, which are open-source, high-performance, and highly accurate transcription models, represent a major advancement in this field. These models have the potential to improve safety and clarity in industries that rely on clear and precise communication.

An accurate transcription model could provide a valuable tool and redundancy for both pilots and Air Traffic Controllers by continuously transcribing all voice communications in real-time. This system would create a log of previous exchanges, allowing quick review of past interactions without replacing standard communication protocols, such as using “Say again” to request a repeat. The transcription log would serve as a useful backup, especially in high-pressure or emergency situations, helping to mitigate issues like miscommunication, mishearing, or misremembering critical clearances and frequency changes. By incorporating ASR transcription systems, such as Whisper, in the cockpit, the aviation industry could reduce the risks associated with communication errors, enhancing overall safety and efficiency in coordination between pilots and Air Traffic Control.

Section 4 | Key Insights About Whisper

OpenAI Whisper models are built on a transformer architecture, which relies on a crucial mechanism known as multi-head attention. This attention mechanism is essential during both training and inference, allowing the model to focus not just on individual moments in the audio but also on the surrounding context. By analyzing how different parts of the audio relate to one another, Whisper can make more accurate predictions about what’s being said. As a result, the model tends to perform better on longer audio clips, which provide more contextual information, compared to shorter clips that lack such depth.
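To make the attention idea concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is not Whisper’s implementation: the real model uses many heads, learned projection matrices, and tensor libraries, and the three toy “frames” below are made-up numbers used purely for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this frame's query and every frame's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "audio frames", each a 2-dimensional vector (hypothetical values).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(frames, frames, frames)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one frame draws on every other frame. The first frame leans most on itself and on the similar second frame, which is the sense in which surrounding context informs each prediction.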

This becomes especially relevant when considering ATC communication. If you’ve ever listened to a clip of ATC communication on YouTube, what you’re really hearing is a recording from an external receiver tuned to pick up VHF signals. When you download such a clip from the internet, it’s easy to think of the “clip” as the duration of the recording itself, typically 4-5 minutes, with multiple transmissions. However, if we consider installing a transcription model in the cockpit, instead of viewing the frequency as an “infinite length recording” with no clear start or stop points, we could think of each transmission as its own distinct recording.

What I’ve found is that if you take a Whisper model fine-tuned on ATC data and compare it against a default Whisper model of the same size, the two perform similarly, with no statistically significant difference, provided the ATC clip is in English, relatively clear, and long, like the multi-transmission clips commonly downloaded from YouTube. Clip length plays a major role, however. When dealing with an individual transmission, the performance of the default model plummets due to the way the attention mechanism favors longer audio sequences and penalizes shorter clips.

Of course, the fine-tuned Whisper model doesn’t escape this issue entirely, but it can be mitigated by the fine-tuning process, which specifically attunes the model to shorter, individual transmissions of aviation communication.

Section 5 | Fine-tuning Whisper medium.en on ATC Data

Section 5.1 | Creating the Custom Dataset

To fine-tune the Whisper medium.en model, I needed to create a custom dataset from available resources. After a bit of searching, I found a large but paywalled dataset called ATCO2 (ATCO2 Project 2023). This dataset is split into three main parts: a training set, a 4-hour test set, and a 1-hour test subset. Only the 1-hour test subset is freely available for download, while access to the rest of the data would cost approximately $6,600. Fortunately, a Hugging Face user, Juan Pablo Zuluaga, had already downloaded the test subset and made it available on the Hugging Face dataset hub (Zuluaga 2023a). This subset includes 871 audio-transcription pairs, which formed part of my dataset.

Additionally, the same user uploaded another helpful resource: the UWB-ATCC corpus, created by the University of West Bohemia’s Department of Cybernetics. This corpus contains manually transcribed air traffic communications, with a training set of 11.3k rows and a test set of 2.82k rows (Zuluaga 2023b).

Each row of the dataset contains two pieces of information: an audio sample and its corresponding ground truth transcription. The independent variable is the audio sample (stored in the audio column), and the dependent variable is the transcription (stored in the text column). For example, three rows in the dataset look like this:

| audio | text |
| --- | --- |
| (audio clip) | good morning lufthansa two kilo victor we are leaving two six descending level |
| (audio clip) | qantas six forty two sydney tower good day |
| (audio clip) | descend four thousand feet qnh one zero one eight cleared for ils approach runway three one |

To prepare my dataset, I loaded both the ATCO2 test subset and the UWB-ATCC corpus. Since the ATCO2 Hugging Face dataset only included the test subset, I combined both datasets into one, shuffled the rows randomly, removed unnecessary columns, and manually removed some samples (more on this later). After finalizing the combined dataset, I applied an 80-20 split to create the training and testing sets. I then uploaded it to the Hugging Face datasets hub under the name “atc-dataset,” which consists of 11.9k rows in the training set and 2.93k rows in the test set. This combined dataset became the foundation for my model fine-tuning.
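The preparation steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the project’s actual code: the real pipeline works with Hugging Face datasets, and the row ids, file names, `bad_ids` set, and `build_splits` helper here are all hypothetical stand-ins.

```python
import random

def build_splits(atco2_rows, uwb_rows, bad_ids, test_fraction=0.2, seed=42):
    """Combine two corpora, drop flagged samples, shuffle, and split 80/20."""
    combined = [
        {"audio": row["audio"], "text": row["text"]}  # keep only the two needed columns
        for row in atco2_rows + uwb_rows
        if row["id"] not in bad_ids  # skip samples flagged during manual review
    ]
    random.Random(seed).shuffle(combined)  # seeded so the split is reproducible
    n_test = round(len(combined) * test_fraction)
    return combined[n_test:], combined[:n_test]  # (train, test)

# Toy stand-ins for the two corpora (hypothetical ids, paths, and text).
atco2 = [{"id": f"atco2-{i}", "audio": f"a{i}.wav", "text": "roger", "extra": 0} for i in range(4)]
uwb = [{"id": f"uwb-{i}", "audio": f"u{i}.wav", "text": "roger", "extra": 0} for i in range(7)]

train, test = build_splits(atco2, uwb, bad_ids={"uwb-3"})
print(len(train), len(test))  # 8 2
```

Fixing the shuffle seed means the same rows land in the same split on every run, which matters when samples are later removed and the evaluation is repeated.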

Section 5.2 | The Fine-Tuning Process

For the fine-tuning process, I began by loading my custom ATC dataset using the Hugging Face datasets library. As mentioned earlier, this dataset had already been split into an 80/20 ratio for the training and test sets, ensuring I had sufficient data to fine-tune the pretrained Whisper model while still being able to effectively evaluate its performance. I set the fine-tuning to run for 10 epochs, which I determined would provide enough training cycles to optimize performance without overfitting.

Next, I initialized the Whisper tokenizer, feature extractor, and processor from the pretrained model. Since Whisper models are designed to process audio input at 16 kHz, I resampled all the audio files in the dataset to match this sampling rate. To make the model more robust, particularly given the variability of ATC communications, I implemented a dynamic data augmentation strategy. This included techniques such as Gaussian noise, pitch shifting, time stretching, and clipping distortion. The augmentation intensity started relatively high and gradually decreased over the 10 epochs, following an exponential decay schedule. This approach ensured that the model was exposed to challenging conditions early on but was fine-tuned on more natural data as training progressed.
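As a rough illustration of the decaying augmentation schedule, the sketch below scales Gaussian noise by an intensity that shrinks exponentially with each epoch. The starting intensity, decay rate, and maximum noise level are assumed values chosen for the example, not the ones used in the project, and only one of the four augmentation types is shown.

```python
import math
import random

def augmentation_intensity(epoch, initial=1.0, decay_rate=0.5):
    """Exponentially decaying intensity: initial * exp(-decay_rate * epoch)."""
    return initial * math.exp(-decay_rate * epoch)

def add_gaussian_noise(samples, intensity, max_std=0.05, rng=None):
    """Add zero-mean Gaussian noise whose standard deviation scales with intensity."""
    rng = rng or random.Random()
    std = max_std * intensity
    return [s + rng.gauss(0.0, std) for s in samples]

# A 440 Hz tone sampled at 16 kHz stands in for a short audio clip.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]

intensities = []
for epoch in range(10):
    intensity = augmentation_intensity(epoch)
    noisy = add_gaussian_noise(clean, intensity, rng=random.Random(epoch))
    intensities.append(intensity)
    print(f"epoch {epoch}: intensity {intensity:.3f}")
```

Because the augmentation is applied on the fly at each epoch, the same clip is heard with heavy distortion early in training and in nearly clean form by the end.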

The primary evaluation metric I focused on was Word Error Rate (WER), a commonly used measure for speech recognition systems. WER is calculated by comparing the predicted transcription to the actual transcription and is expressed as the percentage of errors relative to the total number of words. The formula for WER is as follows:

$$\mathrm{WER} = \frac{\text{Substitutions} + \text{Deletions} + \text{Insertions}}{\text{Total Words in Reference}}$$

Here’s a breakdown of the types of errors WER accounts for:

  1. Substitution: A word in the transcription is incorrectly replaced by another word.
  2. Deletion: A word from the reference transcription is missing in the predicted transcription.
  3. Insertion: A word is added in the predicted transcription that was not in the reference transcription.

Table of Word Error Rate Examples

| Ground Truth Transcript | Predicted Transcript | Type of Error | Explanation | WER |
| --- | --- | --- | --- | --- |
| “cleared for takeoff” | “cleared for landing” | Substitution | The word “takeoff” was incorrectly substituted with “landing.” | 33.33% |
| “proceed to runway one one left” | “proceed runway one one left” | Deletion | The word “to” was deleted from the predicted transcription. | 16.67% |
| “maintain flight level two one zero” | “maintain the flight level two one zero” | Insertion | The word “the” was inserted into the predicted transcription unnecessarily. | 16.67% |

These examples illustrate how the WER metric works in practice. It captures not just outright incorrect words but also when words are omitted or added, which can affect the accuracy of the transcription.
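For readers who want to see the calculation itself, here is a minimal sketch that computes WER as a word-level edit distance divided by the number of reference words. It is not necessarily the code used in the project’s evaluation scripts, and it assumes whitespace-separated words and a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

examples = [
    ("cleared for takeoff", "cleared for landing"),
    ("proceed to runway one one left", "proceed runway one one left"),
    ("maintain flight level two one zero", "maintain the flight level two one zero"),
]
for reference, hypothesis in examples:
    print(f"{word_error_rate(reference, hypothesis):.2%}")
# 33.33%
# 16.67%
# 16.67%
```

Note that WER can exceed 100% when the prediction contains many more words than the reference, which is why hallucinated outputs inflate the average so heavily. In practice a maintained package such as `jiwer` is a common choice for this calculation.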

In terms of hardware and training hyperparameters, I ran the fine-tuning on two A100 GPUs with 80GB of memory. I set the batch size to 16, and with gradient accumulation set to 2, this created an effective batch size of 32. The learning rate started at 1e-5, with 500 warm-up steps to ensure a smooth training curve. At the end of each epoch, the model evaluated its performance on a random subset of approximately 70-100 samples from the test set. To prevent overfitting, I included an early stopping mechanism that halted training if WER did not improve after three epochs.
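The hyperparameters above can be illustrated with a small, framework-free sketch. The helper names are hypothetical, the learning rate is assumed to hold at its peak after warm-up (the schedule after warm-up is not stated here), and the per-epoch WER values are invented to show the early stopping logic.

```python
def effective_batch_size(batch_size, accumulation_steps):
    """Number of samples contributing to each optimizer update."""
    return batch_size * accumulation_steps

def warmup_learning_rate(step, peak_lr=1e-5, warmup_steps=500):
    """Linear ramp from 0 to peak_lr over warmup_steps, then hold (assumed)."""
    return peak_lr * min(1.0, step / warmup_steps)

class EarlyStopping:
    """Signal a stop once WER has failed to improve for `patience` evaluations in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_wer = float("inf")
        self.stale_epochs = 0

    def should_stop(self, wer):
        if wer < self.best_wer:
            self.best_wer = wer
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

print(effective_batch_size(16, 2))  # 32

stopper = EarlyStopping(patience=3)
epoch_wers = [0.30, 0.20, 0.15, 0.16, 0.15, 0.17, 0.14]  # made-up values
stopped_at = None
for epoch, wer in enumerate(epoch_wers, start=1):
    if stopper.should_stop(wer):
        stopped_at = epoch
        break
print(stopped_at)  # 6
```

In this toy run the best WER is reached at epoch 3, the next three evaluations fail to beat it, and training halts at epoch 6 before the later improvement is ever seen. That is the trade-off a patience of three accepts.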

Section 5.3 | Fine-Tuning Results

During training, the model achieved a final average evaluation WER of 11.56%, measured at periodic intervals. These periodic evaluations, conducted during training, transcribed a subset of 70-100 audio samples. To gain a more comprehensive understanding of the model’s performance, I later exported the model locally and evaluated it on the full set of 2,900 test samples. The model produced an average WER of 15.08%. I repeated this evaluation twice more to confirm the consistency of the results.

For comparison, I tested the pretrained Whisper medium.en model (without fine-tuning) on the same set of test audio samples. The pretrained model performed significantly worse, with an average WER of 94.59%. This represents a relative improvement of 84.06% by the fine-tuned model compared to the pretrained model. In plain terms, the fine-tuned model makes roughly 84% fewer word-level errors per reference word than the pretrained Whisper medium.en model on this test set.
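The relative improvement quoted above follows directly from the two WER figures, treating it as the reduction in WER as a fraction of the pretrained model’s WER:

```latex
\text{Relative improvement}
  = \frac{\mathrm{WER}_{\text{pretrained}} - \mathrm{WER}_{\text{fine-tuned}}}{\mathrm{WER}_{\text{pretrained}}}
  = \frac{94.59 - 15.08}{94.59}
  = \frac{79.51}{94.59}
  \approx 0.8406 = 84.06\%
```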

Fine-tuning the model attuned it more effectively to the nuances of ATC communications, including differentiating between human speech and background noise/static, and recognizing aviation-specific jargon and vocabulary. As a result, the fine-tuned model is much better at handling the distinct characteristics of ATC communication: it makes far fewer transcription errors and is far less prone to hallucinating, whether by mistakenly transcribing background noise as words or by getting stuck in a loop producing the same token repeatedly.

These results highlight the significant impact of domain-specific fine-tuning, demonstrating how it greatly enhances transcription performance in specialized areas like ATC communication.

The CSV files containing the evaluation data, along with the evaluation scripts, for both the fine-tuned and pretrained models, and the pretrained model’s prediction normalization code can be found in the project’s repository.

Section 5.4 | Evaluation Limitations

When training machine learning models, proper evaluation is crucial for understanding your model’s performance. For transcription models, Word Error Rate (WER) is typically used. As mentioned previously, WER measures errors such as insertions, deletions, and substitutions in transcriptions. While it is almost universally used, WER has inherent limitations, particularly because it doesn’t consider the meaning of words and treats all errors equally. This can lead to overly harsh assessments, especially in domains like Air Traffic Control, where minor transcription mistakes may not significantly impact communication.

WER provides a straightforward method for evaluating errors by comparing predicted and actual words. However, it doesn’t account for subtleties like meaning or word proximity. For instance, predicting “approve” instead of “approved” counts as a full error, even though the intended meaning is essentially the same. Moreover, WER penalizes minor misspellings just as harshly as completely missing a word, which can skew the true picture of the model’s performance. These limitations become especially noticeable in real-world applications, where precision is important but capturing the intent behind the words may be even more crucial, particularly in fields like aviation and Air Traffic Control.

When evaluating the fine-tuned model against the pretrained Whisper model, I encountered additional challenges. My dataset used expanded, lowercase transcriptions, while the pretrained Whisper model produced transcriptions in a more formal ATC syntax. This mismatch in style and format skewed the evaluation, as the models generated outputs in different styles. To address this, I created a prediction normalization function to align both models’ outputs and ran the pretrained model’s evaluation on the normalized predictions. The normalization process lowercased the text, expanded numbers from their digit form to their word form, and expanded frequencies, runways, and altitudes, among other things.
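A simplified sketch of this kind of normalization is shown below. It is not the project’s normalization code, which lives in the repository and handles more cases. This version only lowercases text, reads a dot between digits as “decimal,” spells out digits one at a time, and strips punctuation. It does not expand altitudes such as “4000” into “four thousand” or callsign letters into the phonetic alphabet.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize_prediction(text):
    """Lowercase, spell out digits one by one, and strip punctuation."""
    text = text.lower()
    # A dot between two digits is a frequency separator, spoken as "decimal".
    text = re.sub(r"(?<=[0-9])\.(?=[0-9])", " decimal ", text)
    # Spell out every digit individually, e.g. "31" -> "three one".
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Drop any remaining punctuation, then collapse repeated whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())

print(normalize_prediction("Cleared ILS approach runway 31, contact Tower 118.3."))
# cleared ils approach runway three one contact tower one one eight decimal three
```

Applying the same normalization to every prediction before scoring keeps formatting differences, rather than genuine recognition errors, from dominating the WER.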

Despite these adjustments, WER’s inherent limitations, such as case sensitivity and strict syntax matching, remained a challenge, highlighting the need for more content-focused evaluation methods that better capture the intent and meaning of transcriptions.

Section 5.5 | Post-Evaluation Review and Dataset Adjustments

As mentioned earlier, after I had fine-tuned the Whisper medium.en model, I ran it through an initial evaluation on the full dataset of 2,900 audio samples. As each transcription was generated, I logged important details, including the sample number, the ground truth, the model’s prediction, and the WER, into a standardized text document. This allowed me to easily load the data into a Pandas DataFrame for analysis. Upon sorting the transcriptions by WER in descending order, I quickly noticed a few outliers with extremely high error rates, skewing the overall results. The average WER across all samples was 19.86% for the fine-tuned model, largely due to these outliers.
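The sorting step can be sketched without any dependencies. The project loaded its log into a Pandas DataFrame; this stand-in uses plain lists and dictionaries instead, and the four rows below are made-up examples rather than real evaluation results.

```python
from statistics import mean

# Made-up log rows: sample number, ground truth, prediction, per-sample WER.
results = [
    {"sample": 1, "ground_truth": "cleared for ils approach runway three one",
     "prediction": "cleared for ils approach runway three one", "wer": 0.0},
    {"sample": 2, "ground_truth": "qantas six forty two sydney tower good day",
     "prediction": "qantas six forty two sydney tower good morning", "wer": 0.125},
    {"sample": 3, "ground_truth": "descend four thousand feet",
     "prediction": "descend five thousand feet", "wer": 0.25},
    {"sample": 4, "ground_truth": "roger",
     "prediction": "praha nine six one you are unreadable", "wer": 7.0},
]

# Sort with the worst transcriptions first so outliers surface immediately.
by_wer = sorted(results, key=lambda row: row["wer"], reverse=True)

# A WER above 1.0 means more errors than reference words: worth a manual listen.
outliers = [row for row in by_wer if row["wer"] > 1.0]

average_wer = mean(row["wer"] for row in results)
average_without_outliers = mean(row["wer"] for row in results if row["wer"] <= 1.0)

print(f"average WER: {average_wer:.2%}")
print(f"average WER without outliers: {average_without_outliers:.2%}")
```

A single sample with a one-word reference and a long prediction is enough to drag the average far above what the other rows suggest, which is the pattern that prompted the manual review.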

I decided to investigate these outliers manually. To do so, I created a Jupyter Notebook called Interactive Model Comparison: Fine-Tuned vs. Pretrained Whisper on ATC Data, which allowed me to listen to the audio alongside its ground truth and predicted transcription. After manually reviewing several of these cases, I found that in some instances, the ground truth was incorrect. My model had produced correct transcriptions, but the dataset’s ground truth either omitted words or included errors, resulting in an artificially inflated WER.

After identifying the erroneous samples, I removed them entirely from the dataset. This required adjustments to my custom dataset creation and uploading code to ensure that these incorrect ground truths were excluded from both the training and testing sets.

Below are some of the worst examples where the ground truth was incorrect for the corresponding audio, and the prediction from the fine-tuned model:

Audio Ground Truth Prediction
roger praha nine six one you are unreadable we are returning back to frequency one three two decimal eight zero five roger
thomson one zero alfa thomson one zero alfa do you have any wind on turbulence
and hold two two continue to rapet and continue heading csa four nine two seven

You can find the complete list of the removed audio samples in the project’s repository.

Additionally, this folder includes a CSV file of evaluations from running inference with the fine-tuned Whisper model on these audio samples. The CSV contains columns for Audio ID, Ground Truth, Prediction, and WER. If you’d like to listen to a sample yourself, simply find the folder named after the corresponding Audio ID; within that folder, you’ll find the .wav file for the audio, along with the text file containing the ground truth.

To ensure a fair evaluation, after removing these erroneous samples from the dataset, I re-ran the full evaluations for both the fine-tuned and pretrained Whisper models. I found that the WER for the fine-tuned model dropped to 15.08%, while the WER for the pretrained model was 94.59%, representing a relative improvement of 84.06% with the fine-tuned model compared to the pretrained model. This manual inspection process helped refine the dataset and ensure consistency between the evaluations of both the fine-tuned model and the pretrained Whisper model. The up-to-date dataset can be found on my Hugging Face repository.

After further analysis, I found that the pretrained Whisper model was especially susceptible to a particular type of error where it would latch onto background noise or non-speech components of the audio, producing hallucinated predictions. This issue was most pronounced in short, unclear audio samples where the pretrained model failed to distinguish between meaningful speech and irrelevant sounds, leading to wildly incorrect transcriptions. A notable example can be seen in the following clip:

Audio Ground Truth Prediction
kilo keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto keto…

In cases like these, the pretrained model repeatedly generated the word “keto” even though the audio simply contained the word “kilo.” This behavior is largely due to the model’s sensitivity to non-verbal elements in the audio and its tendency to over-rely on patterns learned from its pretraining data, which leads to severe hallucinations. By contrast, the fine-tuning process substantially improves the model’s ability to differentiate between actual speech and background noise, attuning it to the specific characteristics of ATC communication. As a result, the fine-tuned model not only makes fewer incorrect predictions but also better filters out irrelevant audio, contributing to its significantly lower WER of 15.08% compared to the pretrained model’s 94.59%.

Section 5.6 | Interactive Model Comparison: Fine-Tuned vs. Pretrained Whisper on ATC Data

In the project repository, within the evaluation_scripts folder, you’ll find the Interactive Model Comparison: Fine-Tuned vs. Pretrained Whisper on ATC Data notebook. This tool allows you to explore how the fine-tuned Whisper medium.en model compares to the pretrained version on the ATC data. You can listen to audio samples, view transcriptions, and observe the WER for both models.

The notebook provides an interactive experience where you can examine challenging audio samples, compare model predictions side-by-side, and explore random samples to see how well the fine-tuned model generalizes. The idea is that engaging with the data directly allows you to gain a clearer understanding of where the fine-tuned model excels and where it still struggles compared to the pretrained one.

This notebook is a useful resource for those interested in evaluating the model’s performance in a more dynamic and hands-on manner.

Section 6 | ATC Transcription Assistant

To make the results of the fine-tuning work more accessible, I created the ATC Transcription Assistant. This application is hosted on Hugging Face Spaces and allows users to upload MP3 or WAV files containing ATC communications. The app will then generate a transcription of the audio, offering a simple and effective way to convert ATC interactions into text.

This application is designed for researchers and aviation enthusiasts looking to analyze or review ATC communications in a clear, readable format. While the model provides high accuracy, the ATC Transcription Assistant is a public tool, so users should be aware that transcription accuracy cannot be guaranteed in all cases. The app uses a language model to reformat the transcription into standard ATC syntax and then displays it on the screen.

If you’d like to explore the app, you can find it on Hugging Face Spaces.

Section 7 | How This Could Work in Practice

With the fine-tuned Whisper model showing promising results in transcribing ATC communications, the next step is to think about how such a system could be practically implemented in real-world aviation settings. One key aspect of integrating Automatic Speech Recognition (ASR) systems like Whisper into cockpit environments is ensuring they are seamless, intuitive, and useful during flight operations.

One approach would combine Voice Activity Detection (VAD) with a sound-threshold-based recording system. The VAD system would detect when a transmission starts and ends, triggered by the sound level rising from near silence to a detectable level and falling back again once the transmission finishes. Given that ATC frequencies have periods of near-complete silence when no one is transmitting, this should make the detection process reliable.
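The detection step described above could look something like the following energy-threshold sketch. It is a simplified stand-in for a real VAD, not a tested cockpit design: the frame size, threshold, and hangover length are assumed values that would need tuning against a real radio’s noise floor, and the test signal is a synthetic tone rather than speech.

```python
import math

def frame_rms(samples, frame_size):
    """Root-mean-square level of each consecutive, non-overlapping frame."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def detect_transmissions(samples, frame_size=160, threshold=0.02, hangover_frames=3):
    """Return (start_sample, end_sample) pairs where the level stays above threshold.

    A transmission ends only after `hangover_frames` consecutive quiet frames,
    so brief pauses between words do not split one transmission in two.
    """
    segments, start, quiet = [], None, 0
    levels = frame_rms(samples, frame_size)
    for index, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = index  # first loud frame opens a transmission
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover_frames:
                end = index - quiet + 1  # first quiet frame after the speech
                segments.append((start * frame_size, end * frame_size))
                start, quiet = None, 0
    if start is not None:  # audio ended while a transmission was still open
        segments.append((start * frame_size, (len(levels) - quiet) * frame_size))
    return segments

def tone(n_samples):
    """Synthetic 1 kHz tone at 16 kHz, standing in for a voice transmission."""
    return [0.5 * math.sin(2 * math.pi * 1000 * t / 16000) for t in range(n_samples)]

# Silence, a 0.1 s "transmission", silence, a 0.05 s "transmission", silence.
signal = [0.0] * 800 + tone(1600) + [0.0] * 800 + tone(800) + [0.0] * 800
segments = detect_transmissions(signal)
print(segments)  # [(800, 2400), (3200, 4000)]
```

Each returned pair marks one transmission, which could then be cut out of the audio stream and passed to the transcription model as its own short clip.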

Once a transmission is detected and recorded, it would be passed through the fine-tuned Whisper model, which would automatically transcribe the communication. The transcription could then be displayed on the screen in real time for the pilot or air traffic controller to review. Additionally, the interface could include a button that allows the user to manually replay the transmission audio if needed, providing an efficient way to listen to the message alongside the text for better clarity and confirmation. This combination would ensure both automatic transcription and the option for manual verification through audio playback, enhancing usability in critical communication environments.

Section 8 | Conclusion

Communication errors have long been a significant challenge in aviation, with studies revealing that miscommunication accounts for a large percentage of incidents. Despite standardized phraseology introduced by the ICAO, human factors such as accent differences and language proficiency continue to pose risks. The proposed solution, integrating advanced ASR systems like Whisper into the cockpit, offers a valuable safety net by providing real-time transcription and improving communication clarity.

Fine-tuning Whisper medium.en on domain-specific ATC datasets led to substantial improvements, with a WER of 15.08% compared to 94.59% from the pretrained model. This 84.06% relative improvement demonstrates the immense potential of ASR in reducing errors and enhancing safety in high-stakes aviation environments. As AI systems like Whisper continue to evolve, the aviation industry stands to benefit greatly from this type of technology, making the skies safer for everyone.

References

Air Transport Action Group. 2023. “Facts & Figures.” https://atag.org/facts-figures/.
ATCO2 Project. 2023. “Data Available from the Project.” https://www.atco2.org/data.
SYSTRAN Team. 2023. “Faster-Whisper: A Fast and Efficient Whisper ASR Model Implementation.” https://github.com/SYSTRAN/faster-whisper.
Wilson, Dale. 2016. “Failure to Communicate.” AeroSafety World, October. https://flightsafety.org/asw-article/failure-to-communicate/.
Zuluaga, Juan Pablo. 2023a. “ATCO2 Corpus 1h.” https://huggingface.co/datasets/Jzuluaga/atco2_corpus_1h.
———. 2023b. “UWB-ATCC Corpus.” https://huggingface.co/datasets/Jzuluaga/uwb_atcc.