Fearless Steps Challenge


The FEARLESS STEPS Challenge: Massive Naturalistic Audio
(Phase FS-1)

UPDATE 1: The evaluation set is now released! Please check your email, or register if you haven't already.
UPDATE 2: The complete Apollo 11 Corpus (11,000 hours) is now available for download! Please check your email, or register (Download tab) if you haven't already.
UPDATE 3: The download link for the Fearless Steps Challenge (100 hours) corpus is now closed; check the Updates tab for more information.
UPDATE 4: Evaluation Plan 1.2 is now released; check the Updates tab for more information.

The NASA Apollo program relied on a massive team of dedicated scientists, engineers, and specialists working together seamlessly to accomplish what is arguably one of humankind's greatest technological achievements.

The Fearless Steps Initiative by UTDallas-CRSS has led to the digitization of 19,000 hours of analog audio data and to the development of algorithms to extract meaningful information from this multichannel naturalistic data. Further exploring the intricate communication characteristics of problem solving on a scale as complex as going to the Moon can lead to the development of novel algorithms beneficial for speech processing and conversational understanding in challenging environments. As an initial step to motivate a streamlined and collaborative effort from the speech and language community, UTDallas-CRSS is overseeing this ISCA INTERSPEECH-2019 special session entitled: "The FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-1)".

Fearless Steps Corpus Poster

Traditionally, most speech and language technology concentrates on analysis of a single audio channel with one or more speakers involved. The Apollo audio data represents 30 individual analog communications channels with multiple speakers in different locations working in real time to accomplish NASA's Apollo missions. For Apollo-11, this means each channel reflects a communications loop that can contain anywhere from 3 to 33 speakers over extended time periods. The entire Apollo-11 mission lasted 195 hours. While each channel has a primary function with a specific NASA Mission Specialist responsible, each of these channels is a "loop" that contains core speakers working together plus background speech at times, some reflecting Air-to-Ground (CAPCOM - Capsule Communicator) communications from the astronauts. The vast majority of the original Apollo mission analog audio is unlabeled, making application of speech technology a challenge. This special session (FS-1) will therefore emphasize the need to address various speech tasks using unsupervised and/or semi-supervised speech algorithms. The Challenge Tasks for this session encourage the development of such solutions for core speech and language tasks on data with limited ground-truth/low resource availability, and serve as the "First Step" towards extracting high-level information from such massive unlabeled corpora.

This (FS-1) edition of the FEARLESS STEPS Challenge includes the following five tasks:

  • TASK#1: Speech Activity Detection: SAD

  • TASK#2: Speaker Diarization: SD

  • TASK#3: Speaker Identification: SID

  • TASK#4: Automatic Speech Recognition: ASR

  • TASK#5: Sentiment Detection: SENTIMENT

Participants may take part in one or more of these challenge tasks. Participants may also choose to employ the FEARLESS STEPS corpus for other related speech applications. All participants are encouraged to submit their solutions and results for consideration to this ISCA INTERSPEECH-2019 special session.

While there is an extensive amount of audio (19,000 hours), the core data for this FEARLESS STEPS challenge is drawn from 5 channels (out of a possible 30) and consists of 100 hours. Three phases of the Apollo-11 mission were selected: (i) lift off, (ii) lunar landing, (iii) lunar walk. 80 hours of audio are provided for task system development. For 20 of these 80 hours, human-verified ground truth labels and transcripts are provided (SAD, SD, SID, ASR, SENTIMENT). For the remaining 60 hours of audio, baseline system generated output labels are provided (ASR, SENTIMENT). An additional set of 20 hours will be released for open test evaluation (see Challenge Timeline).


Corpus Development

To be announced.


The Evaluation Plan version 1.2 is now available. Please download it from the Download tab or click the link given here.

The download link for the Fearless Steps Challenge is now closed, as the deadline has passed. However, access to the full corpus (over 11,000 hours) is still available; click the Download tab to access it!

The Development Set has been updated with the following:

SENTIMENT-Dev: A folder containing JSON format development data files for Sentiment Detection
Misc-SD: A folder containing UEM format development data files for Speaker Diarization scoring
Misc-ASR: A folder containing TXT format development data files with segments to ignore at training and decoding stages for Automatic Speech Recognition

The complete Apollo 11 Corpus is now released. Please register (Download tab) or check your email to access it!

Evaluation Set is now available!


Organization of Data

Stages of the Mission

Figure shows the Stages of the Mission

Stages 1, 5 and 6 were high-impact, mission-critical events, which makes them ideal for the development of the 100-hour Challenge Corpus. With the quality of speech data varying between 0 and 20 dB SNR in this challenge corpus, the channel variability and complex interactions across all five channels of interest are mostly encapsulated in these 100 hours. The multichannel data are chosen from the major events given below:

  • Lift Off (25 hours)

  • Lunar Landing (50 hours)

  • Lunar Walking (25 hours)

These landmark events have been found to possess rich information from the speech and language perspective. Five channels out of the 29 channels were picked since they had the most activity over the selected events:

  • Flight Director (FD)

  • Mission Operations Control Room (MOCR)

  • Guidance Navigation and Control (GNC)

  • Network Controller (NTWK)

  • Electrical, Environmental and Consumables Manager (EECOM)

The personnel operating these five channels (channel owners/primary speakers) were in command of the most critical aspects of the mission, with additional backroom staff looping in for interactions with the primary owners of these channels.

Table 1 shows the distribution of total speech content

The distribution of total speech content in each of the channels for every event has been given in the table above. Total speech content in the challenge corpus amounts to approximately 36 hours.

To ensure an equitable distribution of data into training, evaluation, and development sets for the challenge tasks, we have categorized the data based on noise levels, amount of speech content, and amount of silence. Because of the long silence durations and the varying importance of mission events, the speech activity density of the corpus varies throughout the mission.

Table 2 shows the Signal to Noise Ratio Statistics per channel for Dev Data

The table above is a general analysis of the Challenge data. Even though researchers are not provided with the channel information of the Train, Test and Dev Sets, they may be able to make inferences by computing SNRs for each file.
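As a rough illustration of the per-file SNR computation mentioned above, the following is a minimal sketch and not the procedure used to produce Table 2. It assumes that a per-sample speech/non-speech mask is available (for example from SAD labels or a SAD system output) and that the noise power in non-speech regions is representative of the noise under speech. The function name `estimate_snr_db` and the toy signal are hypothetical.

```python
import math


def estimate_snr_db(samples, is_speech):
    """Estimate SNR in dB from samples and a per-sample speech mask.

    Assumption: non-speech samples contain only noise, and speech
    samples contain speech plus noise of the same average power.
    """
    speech = [x for x, s in zip(samples, is_speech) if s]
    noise = [x for x, s in zip(samples, is_speech) if not s]
    if not speech or not noise:
        raise ValueError("need both speech and non-speech samples")

    noisy_speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        return math.inf

    # Speech regions hold speech plus noise, so subtract the noise power.
    # The floor avoids taking the log of zero or a negative number.
    speech_power = max(noisy_speech_power - noise_power, 1e-12)
    return 10.0 * math.log10(speech_power / noise_power)


# Toy signal: noise power 0.01, speech-region power 0.11,
# so the speech-only power is about 0.10 and the SNR is close to 10 dB.
toy_noise = [0.1, -0.1] * 50
toy_speech = [math.sqrt(0.11), -math.sqrt(0.11)] * 50
toy_snr = estimate_snr_db(toy_noise + toy_speech, [False] * 100 + [True] * 100)
print(round(toy_snr, 2))
```

Real recordings would need framing and a more robust noise estimate; the sketch only shows the arithmetic behind a labelled-segment SNR.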

1.1 Development Set

For all tasks except SID, the development set has a total duration of 20 hours and 10 minutes, of which about 60% is audio from clean channels and the other 40% is from degraded channels. For the SID task, a separate Dev set is provided.

1.2 Training Set

Around 60 hours of audio data will be provided, along with baseline system generated sentiment labels and transcripts. Detailed information regarding the Baseline Systems and Results will be released in a separate document. Researchers may use this data as they see fit.

NOTE: The training labels provided are not ground truth; they are system outputs generated by our Baseline Systems.

1.3 Evaluation Set

Only the audio files will be provided for the evaluation. The evaluation set will consist of audio data from every channel (in equal amounts), adding up to 20 hours in total.

Challenge Overview


  • The entire Fearless Steps Corpus (consisting of over 11,000 hours of audio from the Apollo-11 mission), including the 100-hour Challenge Corpus, is publicly available and requires no additional license to use.
  • There is no cost to participate in the Fearless Steps evaluation. Development data and evaluation data will be freely made available to registered participants.
  • At least one participant from each team must register for the Fearless Steps Challenge 2019.
  • System output submissions should be sent to the official Fearless Steps correspondence email address.
  • Participants can submit at most 2 system submissions per day.
  • Results of submitted systems will be mailed to the registered email-id within a week of the submission.
  • It is required that participants agree to process the data in accordance with the following rules.

Rules for the Challenge

  • Site registration will be required in order to participate
  • Researchers who register but do not submit a system to the Challenge are considered withdrawn from the Challenge
  • Researchers may use any audio and transcriptions to build their systems
  • Only the audio for the blind eval set (20 hours) will be released. Researchers are expected to run their systems on the blind eval set.
  • Investigation of the evaluation data prior to submission of all systems outputs is not allowed. Human probing is prohibited.

While participants are encouraged to submit papers to the special session at Interspeech 2019, this is not a requirement for participation.

Speech Activity Detection

The goal of the SAD task is to automatically detect the presence of speech segments in audio recordings of variable duration. A system output is scored by comparing the system-produced start and end times of speech and non-speech segments in audio recordings to human-annotated start and end times. The DCF (Detection Cost Function) is a function of the false-positive (false alarm) and false-negative (missed detection) rates, calculated by comparison with the human annotation, which serves as the reference.

Performance Metrics

Four system output possibilities are considered:

  • True Positive (TP) - system correctly identifies start-stop times of speech segments compared to the reference (manual annotation)

  • True Negative (TN) - system correctly identifies start-stop times of non-speech segments compared to reference,

  • False Positive (FP) - system incorrectly identifies speech in a segment where the reference identifies the segment as non-speech

  • False Negative (FN) - system missed identification of speech in a segment where the reference identifies a segment as speech.

SAD error rates represent a measure of the amount of time that is misclassified by the system's segmentation of the test audio files. Missing, or failing to detect, actual speech is considered a more serious error than misidentifying its start and end times.

The four system output possibilities mentioned above determine the probability of a false positive (PFP) and the probability of a false negative (PFN). Developers are responsible for determining a hypothetical optimum setting (θ) for their system that minimizes the DCF value.

PFP is the probability of detecting speech where there is no speech, also called a false alarm. PFN is the probability of a missed detection of speech, i.e., not detecting speech where there is speech, also called a miss.

DCF(θ) = 0.75 × PFN(θ) + 0.25 × PFP(θ)

DCF(θ) is the detection cost function value for a system at a given system decision-threshold setting θ. PFN and PFP are weighted 0.75 and 0.25, respectively.
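The weighting described above can be sketched in a few lines. This is a minimal illustration rather than the official scoring tool. It assumes that PFN is the missed speech time divided by the reference speech time, that PFP is the false-alarm time divided by the reference non-speech time, that segments within each list do not overlap, and that no scoring collar is applied. The function names and the example segments are hypothetical.

```python
def total_duration(segments):
    """Sum of (end - start) over a list of (start, end) segments in seconds."""
    return sum(end - start for start, end in segments)


def overlap_duration(segments_a, segments_b):
    """Total time covered by both segment lists (no overlap within a list)."""
    return sum(
        max(0.0, min(end_a, end_b) - max(start_a, start_b))
        for start_a, end_a in segments_a
        for start_b, end_b in segments_b
    )


def detection_cost(ref_speech, sys_speech, file_duration,
                   miss_weight=0.75, fa_weight=0.25):
    """DCF = 0.75 * P_FN + 0.25 * P_FP for one file (no collar applied)."""
    ref_speech_time = total_duration(ref_speech)
    ref_nonspeech_time = file_duration - ref_speech_time
    if ref_speech_time <= 0 or ref_nonspeech_time <= 0:
        raise ValueError("reference needs both speech and non-speech time")

    true_positive_time = overlap_duration(ref_speech, sys_speech)
    false_negative_time = ref_speech_time - true_positive_time
    false_positive_time = total_duration(sys_speech) - true_positive_time

    p_fn = false_negative_time / ref_speech_time
    p_fp = false_positive_time / ref_nonspeech_time
    return miss_weight * p_fn + fa_weight * p_fp


# Hypothetical 10-second file: reference speech at 0-4 s, system says 1-5 s.
# One second of speech is missed and one second is a false alarm.
example_dcf = detection_cost([(0.0, 4.0)], [(1.0, 5.0)], file_duration=10.0)
print(round(example_dcf, 4))
```

Because a miss is weighted three times as heavily as a false alarm, a threshold that leans towards detecting speech will often give a lower DCF than a balanced one.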

Speaker Diarization

Speaker diarization has received much attention from the speech community. While many state-of-the-art systems have been developed for telephone speech, broadcast news and meetings, their performance does not translate to naturalistic speech in highly degraded noise environments. This inaugural challenge will focus on evaluating systems on per-file Diarization Error Rate (DER).

Performance Metrics

Diarization error rate (DER), introduced for the NIST Rich Transcription Spring 2003 Evaluation (RT-03S), is the total percentage of reference speaker time that is not correctly attributed to a speaker, where correctly attributed is defined in terms of an optimal one-to-one mapping between the reference and system speakers. More concretely, DER is defined as:

DER = (FA + MISS + ERROR) / TOTAL

where:

  • TOTAL is the total reference speaker time; that is, the sum of the durations of all reference speaker segments

  • FA is the total system speaker time not attributed to a reference speaker

  • MISS is the total reference speaker time not attributed to a system speaker

  • ERROR is the total reference speaker time attributed to the wrong speaker

The per-file results for DER will be considered for evaluation. For additional details about scoring and tool usage, please consult the documentation.
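To make the DER terms concrete, here is a minimal frame-level sketch. It is not the official scoring tool. It assumes the standard form DER = (FA + MISS + ERROR) / TOTAL, one speaker label per frame (no overlapping speech), no scoring collar, and a brute-force search for the optimal one-to-one speaker mapping, which is only practical for a handful of speakers. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def diarization_error_rate(ref_frames, sys_frames):
    """Frame-level DER; each frame holds a speaker label or None (no speech)."""
    if len(ref_frames) != len(sys_frames):
        raise ValueError("reference and system must have the same length")

    pairs = list(zip(ref_frames, sys_frames))
    total = sum(ref is not None for ref, _ in pairs)
    if total == 0:
        raise ValueError("reference contains no speaker time")

    miss = sum(ref is not None and hyp is None for ref, hyp in pairs)
    false_alarm = sum(ref is None and hyp is not None for ref, hyp in pairs)

    # Co-occurrence counts for frames where both sides have a speaker.
    both = Counter((hyp, ref) for ref, hyp in pairs
                   if ref is not None and hyp is not None)
    sys_ids = sorted({hyp for hyp, _ in both})
    ref_ids = sorted({ref for _, ref in both})

    # Try every one-to-one mapping of system speakers to reference speakers.
    # Extra system speakers are mapped to None (left unmatched).
    padded = ref_ids + [None] * max(0, len(sys_ids) - len(ref_ids))
    best_matched = 0
    for mapping in permutations(padded, len(sys_ids)):
        matched = sum(both[(hyp, ref)]
                      for hyp, ref in zip(sys_ids, mapping) if ref is not None)
        best_matched = max(best_matched, matched)

    error = sum(both.values()) - best_matched
    return (false_alarm + miss + error) / total


# Toy example with 10 frames: one confused frame, one miss, one false alarm.
toy_ref = ["A", "A", "A", "A", "B", "B", "B", "B", None, None]
toy_sys = ["x", "x", "x", "y", "y", "y", "y", None, "y", None]
toy_der = diarization_error_rate(toy_ref, toy_sys)
print(toy_der)
```

For files with many speakers, the brute-force search would normally be replaced by an assignment algorithm such as the Hungarian method.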

Speaker Recognition

Speaker Recognition is the identification of a person from the characteristics of their voice. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific voices, or it can be used to authenticate or verify the identity of a speaker as part of a security process. With over 450 known speakers contributing varying amounts of content, the sample set is narrowed down to 200 speakers with at least 6 spoken utterances each, distributed across the Dev and Eval Sets. The primary focus of this challenge will be to identify speakers with drastically varying durations of speech.

Performance Metrics

The SID task will be evaluated on the accuracy of the top-5 system predictions for a given input file:

Top-5 Accuracy = (|S| / M) × 100%

where S = {k ∈ [1, M] : Nref(k) ⊆ Nsys(k)} and M is the total number of input segments.

For additional details about labeling, please consult the documentation.
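The top-5 metric can be illustrated with a short sketch. It assumes that each input segment has exactly one reference speaker, that the system returns a ranked list of candidate speakers per segment, and that the score is the percentage of segments whose reference speaker appears among the first five candidates. The function name and the speaker IDs are hypothetical.

```python
def top5_accuracy(reference_speakers, ranked_predictions):
    """Percentage of segments whose reference speaker is in the top 5 guesses."""
    if len(reference_speakers) != len(ranked_predictions):
        raise ValueError("one prediction list is required per segment")
    if not reference_speakers:
        raise ValueError("at least one segment is required")

    # A segment counts as correct when its reference label is among
    # the first five entries of the system's ranked list.
    hits = sum(ref in preds[:5]
               for ref, preds in zip(reference_speakers, ranked_predictions))
    return 100.0 * hits / len(reference_speakers)


# Hypothetical example with M = 4 segments; the third reference speaker
# is only ranked sixth, so three of the four segments count as correct.
toy_refs = ["spk01", "spk02", "spk03", "spk04"]
toy_preds = [
    ["spk03", "spk01", "spk07", "spk09", "spk02"],
    ["spk02", "spk05", "spk06", "spk08", "spk01"],
    ["spk01", "spk02", "spk04", "spk05", "spk06", "spk03"],
    ["spk04"],
]
toy_top5 = top5_accuracy(toy_refs, toy_preds)
print(toy_top5)
```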

Automatic Speech Recognition

The goal of the ASR task is to automatically produce a verbatim, case-insensitive transcript of all words spoken in an audio recording. ASR performance is measured by the word error rate (WER), calculated as the sum of errors (deletions, insertions and substitutions) divided by the total number of words from the reference.

Performance Metrics

Manual transcription for the Dev and Eval sets has been carried out using hand annotated SAD labels as a starting point.

An overall Word Error Rate (WER) will be computed as the fraction of token recognition errors per maximum number of reference tokens (scorable and optionally deletable tokens):

WER = (NDel + NIns + NSubst) / NRef

where:

  • NDel : number of unmapped reference tokens (tokens missed, not detected, by the system)

  • NIns : number of unmapped system outputs tokens (tokens that are not in the reference)

  • NSubst : number of system output tokens mapped to reference tokens but non-matching to the reference spelling

  • NRef : the maximum number of reference tokens (includes scorable and optionally deletable reference tokens)
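The WER described above can be sketched with a standard token-level edit distance. This is a minimal illustration, not the official scoring tool. It assumes whitespace tokenization, case-insensitive matching, and no special handling of optionally deletable tokens. The function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (deletions + insertions + substitutions) / reference tokens."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")

    # previous[j] holds the minimum number of edits needed to turn the
    # reference tokens seen so far into the first j hypothesis tokens.
    previous = list(range(len(hyp_tokens) + 1))
    for i, ref_token in enumerate(ref_tokens, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp_tokens, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_token != hyp_token)  # match or substitution
            ))
        previous = current

    return previous[-1] / len(ref_tokens)


# One deleted word out of four reference words.
toy_wer = word_error_rate("the eagle has landed", "The Eagle landed")
print(toy_wer)
```

Since insertions are counted as errors but do not add to the reference length, WER can exceed 100% when the system outputs many extra tokens.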

Sentiment Detection

To be announced.


Important Dates                              Date
Registration Period                          February 11, 2019 to April 6, 2019
Evaluation Plan Release                      February 11, 2019
Training and Development set release         February 11, 2019
Baseline Results (Dev set)                   February 16, 2019
Evaluation set release                       March 1, 2019
Baseline Description and Results (Eval set)  March 10  20, 2019
System Submission Opens*                     March 10  14, 2019
Interspeech Paper Registration Deadline      March 29, 2019
Interspeech Paper Submission Deadline        April 5, 2019
Final System Submission Deadline             June 28, 2019
Final Results Released for all Tasks         July 20, 2019 at 9:56pm CST (50th Anniversary of the First Moon Walk!)
Interspeech 2019 Special Session             September 15-19, 2019

* : System Submissions starting March 10  14 will be sent to FearlessSteps@utdallas.edu (please check the Evaluation Plan for details). System Submission results will be privately mailed to each participant within 5 business days of their submission. This format of System Submission and declaring results privately will continue until the final system submission deadline.

The Challenge will be ongoing until June 28. Researchers who wish to submit papers to Interspeech may provide their System Submissions throughout the evaluation period (March 10 to June 28). Researchers submitting a paper to Interspeech and sending their System Submission after April 5 will have the opportunity to include their results and ranking in the camera-ready paper.


Neil Armstrong steps onto the Moon, 10:56:15 p.m. EDT, 20 July 1969. (NASA)

NOTE: During Registration, participants are advised to provide an email address they are most active on. Any other updates and changes will be displayed on the website and sent to the email address of the participants who have registered on this site.

Baseline Evaluation Results

Development Set
Task       Metric          Baseline Results
SAD        DCF             0.086 (8.6%)
SD         DER             65.22%
SID        Top-5 Accuracy  58.17%
ASR        WER             71.24%
Sentiment  Accuracy        46.2%

Evaluation Set
Task       Metric          Baseline Results
SAD        DCF             0.117 (11.7%)
SD         DER             68.23%
SID        Top-5 Accuracy  47%
ASR        WER             88.42%
Sentiment  Accuracy        49.75%

Contact Us

For further questions, please do not hesitate to contact us.

Fearless Steps