Tutorial: Creating a detector

Note

You can download materials to follow along this tutorial from the links below:

create_a_narw_detector

North Atlantic Right Whale detector - part 2¶

This is the second of a two-parts tutorial illustrating how to build a deep learning acoustic detector with Ketos.

We will be using the deep learning classifier that we trained in Part 1 of this tutorial. As you may recall, this classifier was trained to determine whether a given 3-second audio clip contains a North Atlantic right whale (NARW) upcall. Our goal is now to transform this classifier into a detector that can analyze an audio file (.wav) of an arbitrary duration (e.g. 30 min) and tell us where within that file upcalls occur.

Intuitively, the solution to this problem is straightforward: Simply slide a 3-s wide window across the audio file and use the classifier to determine if an upcall is present within the window at any given instant. However, as you try to implement this solution, you immediately run into a number of practical questions. For example, how large steps should we take when sliding the window?

In this tutorial, we will outline a few different strategies to implemeting the sliding-window approach. Ultimately, however, there is no best strategy, and the preferred strategy will depend on the objectives of your study, e.g., do you need to know the precise location of every upcall, or do you only need to know the call rate per hour, etc.

Contents:¶

1. Importing the packages
2. Loading the classifier
3. Inspecting the test data
4. Choosing a step size
5. Loading the test data, frame by frame
6. Feeding the frames to the classifier
7. Putting it all together - the process method
8. Performance metrics
9. Notes on execution time
10. Conclusion

1. Importing the packages¶

We start by importing the modules we will use throughout the tutorial.

In [1]:

import pandas as pd
from ketos.audio.spectrogram import MagSpectrogram
from ketos.audio.audio_loader import AudioFrameLoader
from ketos.audio import load_audio_representation_from_file
from ketos.neural_networks.resnet import ResNetInterface
from ketos.neural_networks.dev_utils.detection import batch_load_audio_file_data, add_detection_buffer, compute_score_running_avg, merge_overlapping_detections, filter_by_threshold, filter_by_label
import matplotlib.pyplot as plt
%matplotlib inline

2023-07-07 14:16:12.246110: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.251222: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.251383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.252286: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-07 14:16:12.252682: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.252840: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.252977: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.701080: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.701266: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.701414: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-07 14:16:12.701531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6603 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1
/home/bruno/.pyenv/versions/3.10.8/envs/ketos-tutorial/lib/python3.10/site-packages/keras/optimizer_v2/adam.py:105: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(Adam, self).__init__(name, **kwargs)

2. Loading the classifier¶

Next, we load the deep learning classifier that we trained in Part 1 of this tutorial along with the spectrogram parameters used in the training phase. It is important that we use precisely the same set of parameters for computing the spectrograms now, in the inference phase. Otherwise the classifier will get confused!

We load the trained classifier using the ResNetInterface.load method, and the spectrogram parameters with the load_audio_representation_from_file method, as follows

In [2]:

model = ResNetInterface.load(model_file='narw.kt', new_model_folder='./narw_tmp_folder')
audio_repr = load_audio_representation_from_file(model_file='narw.kt')

The first argument, model_file, specifies the path to the saved classifier; the second argument, new_model_folder, is a folder where temporary files needed by the classifier will be saved during execution of the code; finally, with the third argument, load_audio_repr, we specify that we want to load not only the classifier, but also the spectrogram parameters.

Let us briefly inspect the spectrogram parameters,

In [3]:

spec_config = audio_repr[0]
spec_config

Out[3]:

{'rate': 1000,
 'window': 0.256,
 'step': 0.032,
 'freq_min': 0,
 'freq_max': 500,
 'window_func': 'hamming',
 'type': ketos.audio.spectrogram.MagSpectrogram,
 'duration': 3.0}

This tells us the type of the spectrogram (Magnitude Spectrogram), the sampling rate of the audio signal (1000 samples/s), the window size (0.256 s), the step size (0.032 s), the minimum and maximum frequencies (0 and 500 Hz), the window function (Hamming), and the duration of each clip (3.0 s).

3. Inspecting the test data¶

This tutorial includes three audio files, which were recorded in the Gulf of St. Lawrence in the summer of 2016. The files, which are named sample_1.wav, sample_2.wav, and sample_3.wav, contain a total of 26 NARW upcalls. Each file is 30 minute long. We will be using these data to test the performance of our detector.

The time of occurrence of the upcalls is listed in the file annotations.csv. Let's take a look at them,

In [4]:

annot = pd.read_csv('annotations.csv', sep=';')
print(annot)

      sound_file  call_time
0   sample_1.wav   1128.840
1   sample_1.wav   1153.526
2   sample_1.wav   1196.778
3   sample_1.wav   1227.642
4   sample_1.wav   1358.181
5   sample_1.wav   1437.482
6   sample_1.wav   1489.288
7   sample_1.wav   1511.670
8   sample_1.wav   1530.595
9   sample_1.wav   1536.580
10  sample_1.wav   1714.372
11  sample_1.wav   1768.251
12  sample_1.wav   1777.835
13  sample_2.wav     68.149
14  sample_2.wav    688.507
15  sample_2.wav    755.940
16  sample_2.wav    770.440
17  sample_3.wav     68.853
18  sample_3.wav    105.927
19  sample_3.wav   1057.015
20  sample_3.wav   1067.282
21  sample_3.wav   1290.563
22  sample_3.wav   1378.955
23  sample_3.wav   1428.648
24  sample_3.wav   1663.622
25  sample_3.wav   1676.682

We see that the file contains two columns, one for the filename, and one for the time of occurrence of the upcall, measured in seconds from the beginning of the file. This time corresponds roughly to the mid point of the upcall.

We can visualize the temporal distribution of the upcalls, like this, using methods from the matplotlib package,

In [5]:

fig, axes = plt.subplots(ncols=3, figsize=(15,3), sharey=True)

for i in range(3):
    filename = f'sample_{i+1}.wav' #filename
    values = annot[annot['sound_file']==filename]['call_time'].values / 60  #select the occurrence times for this file, and convert from seconds to minutes
    axes[i].hist(values, bins=30, range=(0,30)) #plot the data in a histogram
    axes[i].set_xlabel('Time (minutes)') #set axes labels and title
    if i==0: axes[i].set_ylabel('Upcalls per minute')
    axes[i].set_title(filename)

We see that all three files contain upcalls. Moreover, we note that the upcalls have a tendency to cluster.

To inspect the spectrogram representation of individual upcalls, we can use the MagSpectrogram.from_wav method of the ketos package, like this

In [6]:

# compute the spectrogram of the 1st upcall, using the spectrogram parameters loaded from the saved model
spec = MagSpectrogram.from_wav(path='audio/sample_1.wav', 
                               offset=1128.840 - 0.5*spec_config['duration'],
                               **spec_config)

spec.plot() #create the figure
plt.show()  #display it!

4. Choosing a step size¶

The firs step in creating the detector is choosing a suitable step size. Our window is 3 seconds wide, so our step size should certainly not be greater than 3 seconds. Otherwise, we will miss parts of the data. If we make the step size smaller than 3 seconds, consecutive windows will overlap. The smaller the step size, the greater the overlap.

It is usually desirable to have some overlap between consecutive windows to ensure that all signals are fully captured in at least one window. However, the step size should not be made unnecessarily small as this will increase the computational cost (since more spectrograms will have to be computed and examined by the classifier) without further gain in performance.

For the NARW upcall, which has a duration ranging from 1.0 to 1.5 seconds or so, an overlap of 1.5 seconds would appear to provide the optimal choice if we want to maximize the chances of detecting all the upcalls while not incurring any unnecessary computional costs.

5. Loading the test data, frame by frame¶

Ketos provides a handy class called AudioFrameLoader that helps us split the audio files into (overlapping) frames of equal length and load them one at the time. Moreover, this class can also convert the waveform into a spectrogram representation for us.

To initialize an instance of the AudioFrameLoader, we have to specify the path to the folder where the audio files are stored (path), the step size in seconds (step), and the spectrogram representation we want as output (repres). Thus,

In [7]:

audio_loader = AudioFrameLoader(path='./audio/', duration=spec_config['duration'], 
                                step=1.5, stop=False, representation=spec_config['type'], representation_params=spec_config)

The AudioFrameLoader class as an iterator protocol implemented to load the frames, one by one. It works like this,

In [8]:

first_spec = next(audio_loader) #load the first 3.0-s frame
second_spec = next(audio_loader) #load the second 3.0-s frame
# etc.

RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right

Let's take a quick peak at the two spectrograms we have just loaded,

In [9]:

first_spec.plot()
second_spec.plot()
plt.show()

We note that the first 1.5 seconds of the second spectrogram are identical to the last 1.5 seconds of the first spectrogram. (This can also be seen from the top time axis of the second spectrogram, which shows the offset relative to the start of the file.)

Using the next() method repeatedly, we can scan through the entire audio recordings. In order to know when to stop, we can use the num() method to determine how many frames there are in total,

In [10]:

print(audio_loader.num())

Another useful method is reset(), which resets the audio loader to the first frame.

In [11]:

audio_loader.reset()

6. Feeding the frames to the classifier¶

Now that we know how to load the frames, the next step is to feed them to the classifier. We can feed the frames one at the time using the run_on_instance method, like this

In [12]:

spec = next(audio_loader) #load a spectrogram
data = spec.get_data()    #extract the pixel values as a 2d array
output = model.run_on_instance(data) #pass the pixel values to the classifier
print(output) #print the classifier's output

2023-07-07 14:16:16.886041: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101

(array([0]), array([0.6073958], dtype=float32))

Let's take a moment to inspect the output returned by the classifier. We see that the classifier has returned a tuple with two elements, array([0]) and array([0.64735734]); these are both arrays with length 1 and 32 floating point precision.

The first element, array([0]), tells us the label predicted by the classifier, in this case a 0, i.e., no upcall. The second element, array([0.64735734]), is the score assigned to the prediction. We can think of this as a measure of the classifier's confidence in its prediction; the higher the score, the larger the confidence.

We can print the label and the score in a more human-readable style, like this

In [13]:

label = output[0]
score = output[1]
print(f'label: {label}')
print(f'score: {score}')

label: [0]
score: [0.6073958]

When processing entire audio files, it is usually advantageous to feed multiple frames to the classifier at the time, as this will allow for faster processing. This approach is called batch processing, and can be accomplished using the method run_on_batch in place of run_on_instance. We will use it in the next step.

In [14]:

audio_loader.reset()

7. Putting it all together¶

At this point, we have all the necessary components to construct our detector. We've learned how to load spectrograms frame by frame and how to feed these spectrograms to our classifier. Additionally, we understand how to process the classifier's output. The final step is to put all these operations together within a for loop that will run through audio files.

To achieve this, we will use the batch_load_audio_file_data method from the detection module. This method allows us to load our spectrograms in batches, which is more memory-efficient and typically faster than loading them one at a time. The function takes two parameters: loader, which is an AudioFrameLoader object responsible for loading audio data and converting them into spectrograms, and batch_size, which specifies the number of samples to include in each batch. This function yields batches of audio data, where each batch consists of spectrogram data, filename, start time, and end time of each audio segment.

All that remains is to place this inside a for loop to iterate over the batches, feeding each batch into our model. Let's try it out! (Note: this will take a few moments)

In [15]:

# Let's store our detections into a list of dictionaries
detections = []

batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator

# looping through 
for batch_data in batch_generator: 
        # Run the model on the spectrogram data from the current batch
        batch_predictions = model.run_on_batch(batch_data['data']) 
        
        # As we did before, unpack batch_predictions into labels and scores
        labels, scores = batch_predictions

        # For each item in the batch, store the filename, start, end, and prediction score
        for filename, start, end, label, score in zip(batch_data['filename'], batch_data['start'], batch_data['end'], labels, scores):
                
                # We want only detections of right whale, and not background so lets select only those
                if label == 1:
                        detections.append({'filename': filename, 'start': start, 'end': end, 'score': score})

 32%|███▏      | 18/57 [00:13<00:28,  1.37it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right
RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 1597 samples were padded on the right
100%|██████████| 57/57 [00:41<00:00,  1.38it/s]

The code above return a list of detections of the form (filename, offset, duration, score). For example, the first detection looks like this:

In [16]:

print(detections[0])

{'filename': 'sample_1.wav', 'start': 4.5, 'end': 7.508, 'score': 0.5466379}

Next, let us see how many upcalls our detector found,

In [17]:

print(len(detections))

Ketos provides a nubmer of post-processing function to organize and filter our detections into a more human readable format. Lets improve the loop above by using these functions. First we will use the detection module filter_by_threshold to only select detection scores above 0.5. In addition, this function will already convert out raw output into a dataframe where each row is a detection with the respective highest score and label.

In [18]:

# Let's store our detections into a list of dictionaries
detections = pd.DataFrame()

batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator

# looping through 
for batch_data in batch_generator: 
        # Run the model on the spectrogram data from the current batch. Here we will return the raw output of the network without unpaking the label and score
        # The filter_by_threshold function does that for us
        batch_predictions = model.run_on_batch(batch_data['data'], return_raw_output=True) 
        
        # Lets store our data in a dictionary
        raw_output = {'filename': batch_data['filename'], 'start': batch_data['start'], 'end': batch_data['end'], 'score': batch_predictions}

        # Batch detection is already a DF
        batch_detections = filter_by_threshold(raw_output, threshold=0.5)

        detections = pd.concat([detections, batch_detections], ignore_index=True)

  0%|          | 0/57 [00:00<?, ?it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right
 32%|███▏      | 18/57 [00:13<00:29,  1.31it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right
RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 1597 samples were padded on the right
100%|██████████| 57/57 [00:41<00:00,  1.37it/s]

Lets see the result:

In [19]:

print(detections)

          filename   start       end  label     score
0     sample_1.wav     0.0     3.008      0  0.607396
1     sample_1.wav     1.5     4.508      0  0.571148
2     sample_1.wav     3.0     6.008      0  0.695276
3     sample_1.wav     4.5     7.508      1  0.546638
4     sample_1.wav     6.0     9.008      1  0.633900
...            ...     ...       ...    ...       ...
3595  sample_3.wav  1792.5  1795.508      0  0.571708
3596  sample_3.wav  1794.0  1797.008      0  0.753081
3597  sample_3.wav  1795.5  1798.508      0  0.964293
3598  sample_3.wav  1797.0  1800.008      0  0.815634
3599  sample_3.wav  1798.5  1801.508      0  0.998141

[3600 rows x 5 columns]

Much better! All segments created by our audio laoder are now displayed in our dataframe with a label and score associated with it. Now, since we are only concerned with our right whale detections, lets filter our detections to only include those. We will do that with the filter_by_label function:

In [20]:

detections_filtered = filter_by_label(detections, labels=1).reset_index(drop=True)
print(detections_filtered)

         filename   start       end  label     score
0    sample_1.wav     4.5     7.508      1  0.546638
1    sample_1.wav     6.0     9.008      1  0.633900
2    sample_1.wav     9.0    12.008      1  0.518522
3    sample_1.wav    18.0    21.008      1  0.570077
4    sample_1.wav    22.5    25.508      1  0.545963
..            ...     ...       ...    ...       ...
559  sample_3.wav  1674.0  1677.008      1  0.684691
560  sample_3.wav  1675.5  1678.508      1  0.718384
561  sample_3.wav  1678.5  1681.508      1  0.506235
562  sample_3.wav  1686.0  1689.008      1  0.539553
563  sample_3.wav  1710.0  1713.008      1  0.519255

[564 rows x 5 columns]

We now have the same result as we previously had with 564 detections. However, this is a great deal higher than the 26 upcalls we expected to find! Of course, since we have 50% overlap between consecutive frames, it is possible that the same upcall is detected two or even three times.

We can see that this is the case for some of the detections above, with very close starting and end times. They are in fact adjacent drames with 50% overlap. To avoid counting such detections multiple times, we can merge overlapping detections with the merge_overlapping_detections. The effect of this is to merge overlapping detections into a single detection with the combined duration.

In [21]:

detections_grp = merge_overlapping_detections(detections_filtered)

This greatly reduces the number of detections, but still leaves us with many more detections than there should be.

In [22]:

print(len(detections_grp))

There are other options for adjusting the sensitivity of the detector. One way to do this is to increase the value of the threshold for the detection. The value should be between 0 and 1, and as default has the value 0.5. If we lower the threshold, our detector becomes more sensitive; if we increase the threshold, it becomes less sensitive. Let's try to filter our detections to only pass detections with a score value above 0.7. Note that we can already set a higher threshold lue when using the function filter_by_threshold.

In [23]:

threshold = 0.6
detections_grp2 = detections_grp[detections_grp['score'] > threshold]

In [24]:

print(len(detections_grp2))

Now, we only have 48 detections. That's much better!

Another method to reduce the sensitivity of the detector is apply a running average to the detection scores. This is done with the win_len argument, which specifies the length of the averaging window. Since the running average is applied before the threshold, we will likely need to lower the threshold, because the detection scores will dilluted by the surrounding frames. This is a more conservative approach that is less prone to false positives. In adition, as we are using the compute_score_running_avg before the threshold, we need to feed it the raw output from running the network. Let's try with a window size of 5 frames and a threshold of 0.55.

In [25]:

# Let's store our detections into a list of dictionaries
detections_avg = pd.DataFrame()

batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator

# looping through 
for batch_data in batch_generator: 
        # Run the model on the spectrogram data from the current batch. Here we will return the raw output of the network without unpaking the label and score
        # The filter_by_threshold function does that for us
        batch_predictions = model.run_on_batch(batch_data['data'], return_raw_output=True) 
        
        scores = compute_score_running_avg(batch_predictions, 5) # Lets average the scores

        # Lets store our data in a dictionary
        raw_output = {'filename': batch_data['filename'], 'start': batch_data['start'], 'end': batch_data['end'], 'score': scores}

        # Batch detection is already a DF
        batch_detections = filter_by_threshold(raw_output, threshold=0.55)

        # Lets filter by label already
        batch_detections = filter_by_label(batch_detections, labels=1).reset_index(drop=True)

        detections_avg = pd.concat([detections_avg, batch_detections], ignore_index=True)

  0%|          | 0/57 [00:00<?, ?it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right
 32%|███▏      | 18/57 [00:13<00:28,  1.35it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right
RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 1597 samples were padded on the right
100%|██████████| 57/57 [00:41<00:00,  1.38it/s]

In [26]:

print(len(detections_avg))

This is lower than the 564 we had previously at this point before.

Let us stick with the detections_grp2 (without averaging) for now. Later we'll revisit and compare a few options.

We can visualize the temporal distribution of the detections in a similar manner to what we did with the annotated upcalls in Section 3.

In [27]:

fig, axes = plt.subplots(ncols=3, figsize=(15,3), sharey=True)

for i in range(3):
    filename = f'sample_{i+1}.wav' #filename
    subset = detections_grp2[detections_grp2['filename'] == filename] 
    values = [(row['start'] + 0.5 * row['end']) / 60. for index, row in subset.iterrows()] #mid-point of detection in minutes
    axes[i].hist(values, bins=30, range=(0,30)) #plot the data in a histogram
    axes[i].set_xlabel('Time (minutes)') #set axes labels and title
    if i==0: axes[i].set_ylabel('Upcalls per minute')
    axes[i].set_title(filename)

A quick visual comparison between the temporal distribution of the detections and the annotated upcalls from Section 3, indicates a good match between the two. In the next section, we will demonstrate how to determine the level of agreement in a more quantitative manner.

We end this section by demonstrating how to save the detections to a .csv file using pandas own built in function,

In [28]:

detections_grp2.to_csv('detections.csv', index=False)

8. Performance metrics¶

In the following, we will evaluate how good our detector was at finding individual calls. We will consider a call as detected if the time of occurrence (as reported by the human analyst) is within one of the time intervals flagged by the detector.

The most straightforward way to count the number of detected calls (albeit not the most elegant or fastest) is to construct a nested for-loop, like this,

In [29]:

#Define a function that compares the upcalls found by the model (detections)
#with the upcalls identified by the human expert (annotations).
#The function returns the annotation DataFrame with an extra boolean 
#column indicating if a given annotated upcall was detected by the model.
def compare(annotations, detections):

    detected_list = []

    for idx,row in annotations.iterrows(): #loop over annotations
        filename_annot = row['sound_file']
        time_annot = row['call_time']
        detected = False
        for _, d in detections.iterrows(): #loop over detections
            filename_det = d['filename']
            start_det    = d['start']
            end_det      = start_det + d['end']
            # if the filenames match and the annotated time falls with the start and 
            # end time of the detection interval, consider the call detected
            if filename_annot==filename_det and time_annot >= start_det and time_annot <= end_det:
                detected = True
                break

        detected_list.append(detected)       

    annotations['detected'] = detected_list  #add column to the annotations table
    
    return annotations

#call the function
annotation = compare(annot, detections_grp2)

print(annotation)            

      sound_file  call_time  detected
0   sample_1.wav   1128.840      True
1   sample_1.wav   1153.526      True
2   sample_1.wav   1196.778      True
3   sample_1.wav   1227.642      True
4   sample_1.wav   1358.181      True
5   sample_1.wav   1437.482      True
6   sample_1.wav   1489.288      True
7   sample_1.wav   1511.670      True
8   sample_1.wav   1530.595      True
9   sample_1.wav   1536.580      True
10  sample_1.wav   1714.372      True
11  sample_1.wav   1768.251      True
12  sample_1.wav   1777.835      True
13  sample_2.wav     68.149      True
14  sample_2.wav    688.507      True
15  sample_2.wav    755.940      True
16  sample_2.wav    770.440      True
17  sample_3.wav     68.853     False
18  sample_3.wav    105.927     False
19  sample_3.wav   1057.015      True
20  sample_3.wav   1067.282      True
21  sample_3.wav   1290.563      True
22  sample_3.wav   1378.955      True
23  sample_3.wav   1428.648      True
24  sample_3.wav   1663.622      True
25  sample_3.wav   1676.682      True

We see that many of the detections do indeed match up with annotated upcalls. We can summarize the performance as follows:

The detector found 24 out of 26 upcalls (Recall of 92%)

We could decrease the number of FP by increasing the threshold. Another usefull method is add_detection_buffer which adds a buffer to the start and end times of each detection and can make the model more robust agains slight temporal misalignments.

In [30]:

add_detection_buffer(detections_grp2, 1.5)

Out[30]:

	filename	start	end	label	score
14	sample_1.wav	175.5	181.508	1	0.624963
19	sample_1.wav	217.5	229.508	1	0.686901
45	sample_1.wav	517.5	523.508	1	0.631580
54	sample_1.wav	634.5	640.508	1	0.604014
62	sample_1.wav	699.0	705.008	1	0.624835
75	sample_1.wav	840.0	846.008	1	0.626274
86	sample_1.wav	942.0	957.008	1	0.634163
87	sample_1.wav	958.5	967.508	1	0.619642
90	sample_1.wav	987.0	993.008	1	0.685292
101	sample_1.wav	1104.0	1114.508	1	0.607236
102	sample_1.wav	1120.5	1132.508	1	0.895328
105	sample_1.wav	1150.5	1158.008	1	0.756210
106	sample_1.wav	1158.0	1164.008	1	0.624756
109	sample_1.wav	1191.0	1215.008	1	0.690886
111	sample_1.wav	1224.0	1230.008	1	0.881223
128	sample_1.wav	1431.0	1444.508	1	0.692235
130	sample_1.wav	1449.0	1455.008	1	0.645343
133	sample_1.wav	1485.0	1492.508	1	0.946453
135	sample_1.wav	1507.5	1515.008	1	0.962210
136	sample_1.wav	1513.5	1521.008	1	0.604832
137	sample_1.wav	1522.5	1534.508	1	0.801157
138	sample_1.wav	1548.0	1555.508	1	0.727302
140	sample_1.wav	1566.0	1572.008	1	0.624956
143	sample_1.wav	1600.5	1606.508	1	0.602406
154	sample_1.wav	1710.0	1717.508	1	0.991572
158	sample_1.wav	1762.5	1773.008	1	0.688623
159	sample_1.wav	1774.5	1783.508	1	0.708584
162	sample_1.wav	1795.5	1801.508	1	0.813489
163	sample_2.wav	64.5	72.008	1	0.991744
176	sample_2.wav	684.0	691.508	1	0.888154
177	sample_2.wav	751.5	760.508	1	0.827067
178	sample_2.wav	766.5	774.008	1	0.927823
184	sample_2.wav	1014.0	1020.008	1	0.710808
185	sample_2.wav	1024.5	1030.508	1	0.631745
188	sample_2.wav	1087.5	1093.508	1	0.609067
200	sample_2.wav	1281.0	1287.008	1	0.637306
207	sample_2.wav	1339.5	1351.508	1	0.623313
231	sample_3.wav	645.0	651.008	1	0.690382
233	sample_3.wav	763.5	769.508	1	0.626259
235	sample_3.wav	906.0	912.008	1	0.625516
237	sample_3.wav	1002.0	1008.008	1	0.608265
238	sample_3.wav	1053.0	1060.508	1	0.662479
239	sample_3.wav	1063.5	1071.008	1	0.688362
245	sample_3.wav	1287.0	1294.508	1	0.776136
246	sample_3.wav	1375.5	1383.008	1	0.698425
247	sample_3.wav	1425.0	1432.508	1	0.815898
253	sample_3.wav	1660.5	1666.508	1	0.677343
254	sample_3.wav	1672.5	1683.008	1	0.603886

Finally, let's see what happens if we increase the step size from 1.5 s to 3.0 s. In this case, there will be no ovelap between consecutive spectrograms, so it runs a little faster but it could result in some upcalls being missed.

In [31]:

audio_loader = AudioFrameLoader(path='./audio/', duration=spec_config['duration'], 
                                step=3.0, stop=True, pad=False, representation=spec_config['type'], representation_params=spec_config)

In [32]:

# Let's store our detections into a list of dictionaries
detections_large_step_size = pd.DataFrame()

batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator

# looping through 
for batch_data in batch_generator: 
        # Run the model on the spectrogram data from the current batch. Here we will return the raw output of the network without unpaking the label and score
        # The filter_by_threshold function does that for us
        batch_predictions = model.run_on_batch(batch_data['data'], return_raw_output=True) 
        
        # Lets store our data in a dictionary
        raw_output = {'filename': batch_data['filename'], 'start': batch_data['start'], 'end': batch_data['end'], 'score': batch_predictions}

        # Batch detection is already a DF
        batch_detections = filter_by_threshold(raw_output, threshold=0.55)

        detections_large_step_size = pd.concat([detections_large_step_size, batch_detections], ignore_index=True)

  0%|          | 0/29 [00:00<?, ?it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right
 31%|███       | 9/29 [00:06<00:14,  1.35it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right
100%|██████████| 29/29 [00:20<00:00,  1.45it/s]

Let's only select detections with label = 1 again.

In [33]:

detections_large_step_size_filter = filter_by_label(detections_large_step_size, labels=1).reset_index(drop=True) 

In [34]:

annotation = compare(annot, detections_large_step_size_filter)

print(len(detections_large_step_size_filter))
print(annotation) 

131
      sound_file  call_time  detected
0   sample_1.wav   1128.840      True
1   sample_1.wav   1153.526      True
2   sample_1.wav   1196.778      True
3   sample_1.wav   1227.642      True
4   sample_1.wav   1358.181      True
5   sample_1.wav   1437.482      True
6   sample_1.wav   1489.288      True
7   sample_1.wav   1511.670      True
8   sample_1.wav   1530.595      True
9   sample_1.wav   1536.580      True
10  sample_1.wav   1714.372      True
11  sample_1.wav   1768.251      True
12  sample_1.wav   1777.835      True
13  sample_2.wav     68.149      True
14  sample_2.wav    688.507      True
15  sample_2.wav    755.940      True
16  sample_2.wav    770.440      True
17  sample_3.wav     68.853     False
18  sample_3.wav    105.927     False
19  sample_3.wav   1057.015      True
20  sample_3.wav   1067.282      True
21  sample_3.wav   1290.563      True
22  sample_3.wav   1378.955      True
23  sample_3.wav   1428.648      True
24  sample_3.wav   1663.622      True
25  sample_3.wav   1676.682      True

We see that the processing took about half as long with the same detections.

9. Conclusion¶

In this tutorial we used a pre-trained binary classifier that works on short snapshots to detect North Atlantic right whales in longer recordings. We explored several options and observed how the exact same pre-trained neural network can result in very different performances, depending on how we apply it. Although the quality of model is very important, we highlight that it's worth thinking about how the model will be applied to the task at hand and test the alternatives available.

Once you are happy with your model and have chosen the best way to use it for your workflow, chances are that you will benefit from encapsulate it in some sort of application. Alongside this tutorial, we provided a Command Line Interface that includes all the options we explored here. After going through this tutorial, you'll be familiar with the functions used in the CLI, which can be used as is for this model, other models produced with Ketos or even custom architectures and audio representations. It also serves as a template not only for command line tools, but more generally as a way of integrating ketos models into your applications. A CLI is great to be used by other programs. For example, if you need to run a detector on large amounts of archived data and you have acess to a cluster, a CLI is the way to go. If instead you want to develop a web-app that allows your collaborators to use your detector, or even a desktop app with a friendly graphical interface, the same components can be used.