Tutorial: Creating a detector
Note
You can download materials to follow along this tutorial from the links below:
North Atlantic Right Whale detector - part 2¶
This is the second of a two-parts tutorial illustrating how to build a deep learning acoustic detector with Ketos.
We will be using the deep learning classifier that we trained in Part 1 of this tutorial. As you may recall, this classifier was trained to determine whether a given 3-second audio clip contains a North Atlantic right whale (NARW) upcall. Our goal is now to transform this classifier into a detector that can analyze an audio file (.wav) of an arbitrary duration (e.g. 30 min) and tell us where within that file upcalls occur.
Intuitively, the solution to this problem is straightforward: Simply slide a 3-s wide window across the audio file and use the classifier to determine if an upcall is present within the window at any given instant. However, as you try to implement this solution, you immediately run into a number of practical questions. For example, how large steps should we take when sliding the window?
In this tutorial, we will outline a few different strategies to implemeting the sliding-window approach. Ultimately, however, there is no best strategy, and the preferred strategy will depend on the objectives of your study, e.g., do you need to know the precise location of every upcall, or do you only need to know the call rate per hour, etc.
Contents:¶
1. Importing the packages
2. Loading the classifier
3. Inspecting the test data
4. Choosing a step size
5. Loading the test data, frame by frame
6. Feeding the frames to the classifier
7. Putting it all together - the process method
8. Performance metrics
9. Notes on execution time
10. Conclusion
1. Importing the packages¶
We start by importing the modules we will use throughout the tutorial.
import pandas as pd
from ketos.audio.spectrogram import MagSpectrogram
from ketos.audio.audio_loader import AudioFrameLoader
from ketos.audio import load_audio_representation_from_file
from ketos.neural_networks.resnet import ResNetInterface
from ketos.neural_networks.dev_utils.detection import batch_load_audio_file_data, add_detection_buffer, compute_score_running_avg, merge_overlapping_detections, filter_by_threshold, filter_by_label
import matplotlib.pyplot as plt
%matplotlib inline
2023-07-07 14:16:12.246110: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.251222: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.251383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.252286: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-07-07 14:16:12.252682: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.252840: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.252977: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.701080: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.701266: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.701414: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-07-07 14:16:12.701531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6603 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1070 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1 /home/bruno/.pyenv/versions/3.10.8/envs/ketos-tutorial/lib/python3.10/site-packages/keras/optimizer_v2/adam.py:105: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead. super(Adam, self).__init__(name, **kwargs)
2. Loading the classifier¶
Next, we load the deep learning classifier that we trained in Part 1 of this tutorial along with the spectrogram parameters used in the training phase. It is important that we use precisely the same set of parameters for computing the spectrograms now, in the inference phase. Otherwise the classifier will get confused!
We load the trained classifier using the ResNetInterface.load
method, and the spectrogram parameters with the load_audio_representation_from_file
method, as follows
model = ResNetInterface.load(model_file='narw.kt', new_model_folder='./narw_tmp_folder')
audio_repr = load_audio_representation_from_file(model_file='narw.kt')
The first argument, model_file
, specifies the path to the saved classifier; the second argument, new_model_folder
, is a folder where temporary files needed by the classifier will be saved during execution of the code; finally, with the third argument, load_audio_repr
, we specify that we want to load not only the classifier, but also the spectrogram parameters.
Let us briefly inspect the spectrogram parameters,
spec_config = audio_repr[0]
spec_config
{'rate': 1000, 'window': 0.256, 'step': 0.032, 'freq_min': 0, 'freq_max': 500, 'window_func': 'hamming', 'type': ketos.audio.spectrogram.MagSpectrogram, 'duration': 3.0}
This tells us the type of the spectrogram (Magnitude Spectrogram), the sampling rate of the audio signal (1000 samples/s), the window size (0.256 s), the step size (0.032 s), the minimum and maximum frequencies (0 and 500 Hz), the window function (Hamming), and the duration of each clip (3.0 s).
3. Inspecting the test data¶
This tutorial includes three audio files, which were recorded in the Gulf of St. Lawrence in the summer of 2016. The files, which are named sample_1.wav
, sample_2.wav
, and sample_3.wav
, contain a total of 26 NARW upcalls. Each file is 30 minute long. We will be using these data to test the performance of our detector.
The time of occurrence of the upcalls is listed in the file annotations.csv
. Let's take a look at them,
annot = pd.read_csv('annotations.csv', sep=';')
print(annot)
sound_file call_time 0 sample_1.wav 1128.840 1 sample_1.wav 1153.526 2 sample_1.wav 1196.778 3 sample_1.wav 1227.642 4 sample_1.wav 1358.181 5 sample_1.wav 1437.482 6 sample_1.wav 1489.288 7 sample_1.wav 1511.670 8 sample_1.wav 1530.595 9 sample_1.wav 1536.580 10 sample_1.wav 1714.372 11 sample_1.wav 1768.251 12 sample_1.wav 1777.835 13 sample_2.wav 68.149 14 sample_2.wav 688.507 15 sample_2.wav 755.940 16 sample_2.wav 770.440 17 sample_3.wav 68.853 18 sample_3.wav 105.927 19 sample_3.wav 1057.015 20 sample_3.wav 1067.282 21 sample_3.wav 1290.563 22 sample_3.wav 1378.955 23 sample_3.wav 1428.648 24 sample_3.wav 1663.622 25 sample_3.wav 1676.682
We see that the file contains two columns, one for the filename, and one for the time of occurrence of the upcall, measured in seconds from the beginning of the file. This time corresponds roughly to the mid point of the upcall.
We can visualize the temporal distribution of the upcalls, like this, using methods from the matplotlib
package,
fig, axes = plt.subplots(ncols=3, figsize=(15,3), sharey=True)
for i in range(3):
filename = f'sample_{i+1}.wav' #filename
values = annot[annot['sound_file']==filename]['call_time'].values / 60 #select the occurrence times for this file, and convert from seconds to minutes
axes[i].hist(values, bins=30, range=(0,30)) #plot the data in a histogram
axes[i].set_xlabel('Time (minutes)') #set axes labels and title
if i==0: axes[i].set_ylabel('Upcalls per minute')
axes[i].set_title(filename)
We see that all three files contain upcalls. Moreover, we note that the upcalls have a tendency to cluster.
To inspect the spectrogram representation of individual upcalls, we can use the MagSpectrogram.from_wav
method of the ketos
package, like this
# compute the spectrogram of the 1st upcall, using the spectrogram parameters loaded from the saved model
spec = MagSpectrogram.from_wav(path='audio/sample_1.wav',
offset=1128.840 - 0.5*spec_config['duration'],
**spec_config)
spec.plot() #create the figure
plt.show() #display it!
4. Choosing a step size¶
The firs step in creating the detector is choosing a suitable step size. Our window is 3 seconds wide, so our step size should certainly not be greater than 3 seconds. Otherwise, we will miss parts of the data. If we make the step size smaller than 3 seconds, consecutive windows will overlap. The smaller the step size, the greater the overlap.
It is usually desirable to have some overlap between consecutive windows to ensure that all signals are fully captured in at least one window. However, the step size should not be made unnecessarily small as this will increase the computational cost (since more spectrograms will have to be computed and examined by the classifier) without further gain in performance.
For the NARW upcall, which has a duration ranging from 1.0 to 1.5 seconds or so, an overlap of 1.5 seconds would appear to provide the optimal choice if we want to maximize the chances of detecting all the upcalls while not incurring any unnecessary computional costs.
5. Loading the test data, frame by frame¶
Ketos provides a handy class called AudioFrameLoader
that helps us split the audio files into (overlapping) frames of equal length and load them one at the time. Moreover, this class can also convert the waveform into a spectrogram representation for us.
To initialize an instance of the AudioFrameLoader, we have to specify the path to the folder where the audio files are stored (path
), the step size in seconds (step
), and the spectrogram representation we want as output (repres
). Thus,
audio_loader = AudioFrameLoader(path='./audio/', duration=spec_config['duration'],
step=1.5, stop=False, representation=spec_config['type'], representation_params=spec_config)
The AudioFrameLoader
class as an iterator protocol implemented to load the frames, one by one. It works like this,
first_spec = next(audio_loader) #load the first 3.0-s frame
second_spec = next(audio_loader) #load the second 3.0-s frame
# etc.
RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right
Let's take a quick peak at the two spectrograms we have just loaded,
first_spec.plot()
second_spec.plot()
plt.show()
We note that the first 1.5 seconds of the second spectrogram are identical to the last 1.5 seconds of the first spectrogram. (This can also be seen from the top time axis of the second spectrogram, which shows the offset relative to the start of the file.)
Using the next()
method repeatedly, we can scan through the entire audio recordings. In order to know when to stop, we can use the num()
method to determine how many frames there are in total,
print(audio_loader.num())
3600
Another useful method is reset()
, which resets the audio loader to the first frame.
audio_loader.reset()
6. Feeding the frames to the classifier¶
Now that we know how to load the frames, the next step is to feed them to the classifier. We can feed the frames one at the time using the run_on_instance
method, like this
spec = next(audio_loader) #load a spectrogram
data = spec.get_data() #extract the pixel values as a 2d array
output = model.run_on_instance(data) #pass the pixel values to the classifier
print(output) #print the classifier's output
2023-07-07 14:16:16.886041: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8101
(array([0]), array([0.6073958], dtype=float32))
Let's take a moment to inspect the output returned by the classifier. We see that the classifier has returned a tuple with two elements, array([0])
and array([0.64735734])
; these are both arrays with length 1 and 32 floating point precision.
The first element, array([0])
, tells us the label predicted by the classifier, in this case a 0, i.e., no upcall. The second element, array([0.64735734])
, is the score
assigned to the prediction. We can think of this as a measure of the classifier's confidence in its prediction; the higher the score, the larger the confidence.
We can print the label and the score in a more human-readable style, like this
label = output[0]
score = output[1]
print(f'label: {label}')
print(f'score: {score}')
label: [0] score: [0.6073958]
When processing entire audio files, it is usually advantageous to feed multiple frames to the classifier at the time, as this will allow for faster processing. This approach is called batch processing, and can be accomplished using the method run_on_batch
in place of run_on_instance
. We will use it in the next step.
audio_loader.reset()
7. Putting it all together¶
At this point, we have all the necessary components to construct our detector. We've learned how to load spectrograms frame by frame and how to feed these spectrograms to our classifier. Additionally, we understand how to process the classifier's output. The final step is to put all these operations together within a for loop that will run through audio files.
To achieve this, we will use the batch_load_audio_file_data
method from the detection module. This method allows us to load our spectrograms in batches, which is more memory-efficient and typically faster than loading them one at a time. The function takes two parameters: loader, which is an AudioFrameLoader object responsible for loading audio data and converting them into spectrograms, and batch_size, which specifies the number of samples to include in each batch. This function yields batches of audio data, where each batch consists of spectrogram data, filename, start time, and end time of each audio segment.
All that remains is to place this inside a for loop to iterate over the batches, feeding each batch into our model. Let's try it out! (Note: this will take a few moments)
# Let's store our detections into a list of dictionaries
detections = []
batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator
# looping through
for batch_data in batch_generator:
# Run the model on the spectrogram data from the current batch
batch_predictions = model.run_on_batch(batch_data['data'])
# As we did before, unpack batch_predictions into labels and scores
labels, scores = batch_predictions
# For each item in the batch, store the filename, start, end, and prediction score
for filename, start, end, label, score in zip(batch_data['filename'], batch_data['start'], batch_data['end'], labels, scores):
# We want only detections of right whale, and not background so lets select only those
if label == 1:
detections.append({'filename': filename, 'start': start, 'end': end, 'score': score})
32%|███▏ | 18/57 [00:13<00:28, 1.37it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 1597 samples were padded on the right 100%|██████████| 57/57 [00:41<00:00, 1.38it/s]
The code above return a list of detections of the form (filename, offset, duration, score)
. For example, the first detection looks like this:
print(detections[0])
{'filename': 'sample_1.wav', 'start': 4.5, 'end': 7.508, 'score': 0.5466379}
Next, let us see how many upcalls our detector found,
print(len(detections))
564
Ketos provides a nubmer of post-processing function to organize and filter our detections into a more human readable format. Lets improve the loop above by using these functions. First we will use the detection
module filter_by_threshold
to only select detection scores above 0.5. In addition, this function will already convert out raw output into a dataframe where each row is a detection with the respective highest score and label.
# Let's store our detections into a list of dictionaries
detections = pd.DataFrame()
batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator
# looping through
for batch_data in batch_generator:
# Run the model on the spectrogram data from the current batch. Here we will return the raw output of the network without unpaking the label and score
# The filter_by_threshold function does that for us
batch_predictions = model.run_on_batch(batch_data['data'], return_raw_output=True)
# Lets store our data in a dictionary
raw_output = {'filename': batch_data['filename'], 'start': batch_data['start'], 'end': batch_data['end'], 'score': batch_predictions}
# Batch detection is already a DF
batch_detections = filter_by_threshold(raw_output, threshold=0.5)
detections = pd.concat([detections, batch_detections], ignore_index=True)
0%| | 0/57 [00:00<?, ?it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right 32%|███▏ | 18/57 [00:13<00:29, 1.31it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 1597 samples were padded on the right 100%|██████████| 57/57 [00:41<00:00, 1.37it/s]
Lets see the result:
print(detections)
filename start end label score 0 sample_1.wav 0.0 3.008 0 0.607396 1 sample_1.wav 1.5 4.508 0 0.571148 2 sample_1.wav 3.0 6.008 0 0.695276 3 sample_1.wav 4.5 7.508 1 0.546638 4 sample_1.wav 6.0 9.008 1 0.633900 ... ... ... ... ... ... 3595 sample_3.wav 1792.5 1795.508 0 0.571708 3596 sample_3.wav 1794.0 1797.008 0 0.753081 3597 sample_3.wav 1795.5 1798.508 0 0.964293 3598 sample_3.wav 1797.0 1800.008 0 0.815634 3599 sample_3.wav 1798.5 1801.508 0 0.998141 [3600 rows x 5 columns]
Much better! All segments created by our audio laoder are now displayed in our dataframe with a label and score associated with it. Now, since we are only concerned with our right whale detections, lets filter our detections to only include those. We will do that with the filter_by_label
function:
detections_filtered = filter_by_label(detections, labels=1).reset_index(drop=True)
print(detections_filtered)
filename start end label score 0 sample_1.wav 4.5 7.508 1 0.546638 1 sample_1.wav 6.0 9.008 1 0.633900 2 sample_1.wav 9.0 12.008 1 0.518522 3 sample_1.wav 18.0 21.008 1 0.570077 4 sample_1.wav 22.5 25.508 1 0.545963 .. ... ... ... ... ... 559 sample_3.wav 1674.0 1677.008 1 0.684691 560 sample_3.wav 1675.5 1678.508 1 0.718384 561 sample_3.wav 1678.5 1681.508 1 0.506235 562 sample_3.wav 1686.0 1689.008 1 0.539553 563 sample_3.wav 1710.0 1713.008 1 0.519255 [564 rows x 5 columns]
We now have the same result as we previously had with 564 detections. However, this is a great deal higher than the 26 upcalls we expected to find! Of course, since we have 50% overlap between consecutive frames, it is possible that the same upcall is detected two or even three times.
We can see that this is the case for some of the detections above, with very close starting and end times. They are in fact adjacent drames with 50% overlap. To avoid counting such detections multiple times, we can merge overlapping detections with the merge_overlapping_detections
. The effect of this is to merge overlapping detections into a single detection with the combined duration.
detections_grp = merge_overlapping_detections(detections_filtered)
This greatly reduces the number of detections, but still leaves us with many more detections than there should be.
print(len(detections_grp))
257
There are other options for adjusting the sensitivity of the detector. One way to do this is to increase the value of the threshold for the detection. The value should be between 0 and 1, and as default has the value 0.5. If we lower the threshold, our detector becomes more sensitive; if we increase the threshold, it becomes less sensitive. Let's try to filter our detections to only pass detections with a score value above 0.7. Note that we can already set a higher threshold lue when using the function filter_by_threshold.
threshold = 0.6
detections_grp2 = detections_grp[detections_grp['score'] > threshold]
print(len(detections_grp2))
48
Now, we only have 48 detections. That's much better!
Another method to reduce the sensitivity of the detector is apply a running average to the detection scores. This is done with the win_len
argument, which specifies the length of the averaging window. Since the running average is applied before the threshold, we will likely need to lower the threshold, because the detection scores will dilluted by the surrounding frames. This is a more conservative approach that is less prone to false positives. In adition, as we are using the compute_score_running_avg
before the threshold, we need to feed it the raw output from running the network. Let's try with a window size of 5 frames and a threshold of 0.55.
# Let's store our detections into a list of dictionaries
detections_avg = pd.DataFrame()
batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator
# looping through
for batch_data in batch_generator:
# Run the model on the spectrogram data from the current batch. Here we will return the raw output of the network without unpaking the label and score
# The filter_by_threshold function does that for us
batch_predictions = model.run_on_batch(batch_data['data'], return_raw_output=True)
scores = compute_score_running_avg(batch_predictions, 5) # Lets average the scores
# Lets store our data in a dictionary
raw_output = {'filename': batch_data['filename'], 'start': batch_data['start'], 'end': batch_data['end'], 'score': scores}
# Batch detection is already a DF
batch_detections = filter_by_threshold(raw_output, threshold=0.55)
# Lets filter by label already
batch_detections = filter_by_label(batch_detections, labels=1).reset_index(drop=True)
detections_avg = pd.concat([detections_avg, batch_detections], ignore_index=True)
0%| | 0/57 [00:00<?, ?it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right 32%|███▏ | 18/57 [00:13<00:28, 1.35it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 1597 samples were padded on the right 100%|██████████| 57/57 [00:41<00:00, 1.38it/s]
print(len(detections_avg))
104
This is lower than the 564 we had previously at this point before.
Let us stick with the detections_grp2
(without averaging) for now. Later we'll revisit and compare a few options.
We can visualize the temporal distribution of the detections in a similar manner to what we did with the annotated upcalls in Section 3.
fig, axes = plt.subplots(ncols=3, figsize=(15,3), sharey=True)
for i in range(3):
filename = f'sample_{i+1}.wav' #filename
subset = detections_grp2[detections_grp2['filename'] == filename]
values = [(row['start'] + 0.5 * row['end']) / 60. for index, row in subset.iterrows()] #mid-point of detection in minutes
axes[i].hist(values, bins=30, range=(0,30)) #plot the data in a histogram
axes[i].set_xlabel('Time (minutes)') #set axes labels and title
if i==0: axes[i].set_ylabel('Upcalls per minute')
axes[i].set_title(filename)
A quick visual comparison between the temporal distribution of the detections and the annotated upcalls from Section 3, indicates a good match between the two. In the next section, we will demonstrate how to determine the level of agreement in a more quantitative manner.
We end this section by demonstrating how to save the detections to a .csv file using pandas own built in function,
detections_grp2.to_csv('detections.csv', index=False)
8. Performance metrics¶
In the following, we will evaluate how good our detector was at finding individual calls. We will consider a call as detected if the time of occurrence (as reported by the human analyst) is within one of the time intervals flagged by the detector.
The most straightforward way to count the number of detected calls (albeit not the most elegant or fastest) is to construct a nested for-loop, like this,
#Define a function that compares the upcalls found by the model (detections)
#with the upcalls identified by the human expert (annotations).
#The function returns the annotation DataFrame with an extra boolean
#column indicating if a given annotated upcall was detected by the model.
def compare(annotations, detections):
detected_list = []
for idx,row in annotations.iterrows(): #loop over annotations
filename_annot = row['sound_file']
time_annot = row['call_time']
detected = False
for _, d in detections.iterrows(): #loop over detections
filename_det = d['filename']
start_det = d['start']
end_det = start_det + d['end']
# if the filenames match and the annotated time falls with the start and
# end time of the detection interval, consider the call detected
if filename_annot==filename_det and time_annot >= start_det and time_annot <= end_det:
detected = True
break
detected_list.append(detected)
annotations['detected'] = detected_list #add column to the annotations table
return annotations
#call the function
annotation = compare(annot, detections_grp2)
print(annotation)
sound_file call_time detected 0 sample_1.wav 1128.840 True 1 sample_1.wav 1153.526 True 2 sample_1.wav 1196.778 True 3 sample_1.wav 1227.642 True 4 sample_1.wav 1358.181 True 5 sample_1.wav 1437.482 True 6 sample_1.wav 1489.288 True 7 sample_1.wav 1511.670 True 8 sample_1.wav 1530.595 True 9 sample_1.wav 1536.580 True 10 sample_1.wav 1714.372 True 11 sample_1.wav 1768.251 True 12 sample_1.wav 1777.835 True 13 sample_2.wav 68.149 True 14 sample_2.wav 688.507 True 15 sample_2.wav 755.940 True 16 sample_2.wav 770.440 True 17 sample_3.wav 68.853 False 18 sample_3.wav 105.927 False 19 sample_3.wav 1057.015 True 20 sample_3.wav 1067.282 True 21 sample_3.wav 1290.563 True 22 sample_3.wav 1378.955 True 23 sample_3.wav 1428.648 True 24 sample_3.wav 1663.622 True 25 sample_3.wav 1676.682 True
We see that many of the detections do indeed match up with annotated upcalls. We can summarize the performance as follows:
The detector found 24 out of 26 upcalls (Recall of 92%)
We could decrease the number of FP by increasing the threshold. Another usefull method is add_detection_buffer
which adds a buffer to the start and end times of each detection and can make the model more robust agains slight temporal misalignments.
add_detection_buffer(detections_grp2, 1.5)
filename | start | end | label | score | |
---|---|---|---|---|---|
14 | sample_1.wav | 175.5 | 181.508 | 1 | 0.624963 |
19 | sample_1.wav | 217.5 | 229.508 | 1 | 0.686901 |
45 | sample_1.wav | 517.5 | 523.508 | 1 | 0.631580 |
54 | sample_1.wav | 634.5 | 640.508 | 1 | 0.604014 |
62 | sample_1.wav | 699.0 | 705.008 | 1 | 0.624835 |
75 | sample_1.wav | 840.0 | 846.008 | 1 | 0.626274 |
86 | sample_1.wav | 942.0 | 957.008 | 1 | 0.634163 |
87 | sample_1.wav | 958.5 | 967.508 | 1 | 0.619642 |
90 | sample_1.wav | 987.0 | 993.008 | 1 | 0.685292 |
101 | sample_1.wav | 1104.0 | 1114.508 | 1 | 0.607236 |
102 | sample_1.wav | 1120.5 | 1132.508 | 1 | 0.895328 |
105 | sample_1.wav | 1150.5 | 1158.008 | 1 | 0.756210 |
106 | sample_1.wav | 1158.0 | 1164.008 | 1 | 0.624756 |
109 | sample_1.wav | 1191.0 | 1215.008 | 1 | 0.690886 |
111 | sample_1.wav | 1224.0 | 1230.008 | 1 | 0.881223 |
128 | sample_1.wav | 1431.0 | 1444.508 | 1 | 0.692235 |
130 | sample_1.wav | 1449.0 | 1455.008 | 1 | 0.645343 |
133 | sample_1.wav | 1485.0 | 1492.508 | 1 | 0.946453 |
135 | sample_1.wav | 1507.5 | 1515.008 | 1 | 0.962210 |
136 | sample_1.wav | 1513.5 | 1521.008 | 1 | 0.604832 |
137 | sample_1.wav | 1522.5 | 1534.508 | 1 | 0.801157 |
138 | sample_1.wav | 1548.0 | 1555.508 | 1 | 0.727302 |
140 | sample_1.wav | 1566.0 | 1572.008 | 1 | 0.624956 |
143 | sample_1.wav | 1600.5 | 1606.508 | 1 | 0.602406 |
154 | sample_1.wav | 1710.0 | 1717.508 | 1 | 0.991572 |
158 | sample_1.wav | 1762.5 | 1773.008 | 1 | 0.688623 |
159 | sample_1.wav | 1774.5 | 1783.508 | 1 | 0.708584 |
162 | sample_1.wav | 1795.5 | 1801.508 | 1 | 0.813489 |
163 | sample_2.wav | 64.5 | 72.008 | 1 | 0.991744 |
176 | sample_2.wav | 684.0 | 691.508 | 1 | 0.888154 |
177 | sample_2.wav | 751.5 | 760.508 | 1 | 0.827067 |
178 | sample_2.wav | 766.5 | 774.008 | 1 | 0.927823 |
184 | sample_2.wav | 1014.0 | 1020.008 | 1 | 0.710808 |
185 | sample_2.wav | 1024.5 | 1030.508 | 1 | 0.631745 |
188 | sample_2.wav | 1087.5 | 1093.508 | 1 | 0.609067 |
200 | sample_2.wav | 1281.0 | 1287.008 | 1 | 0.637306 |
207 | sample_2.wav | 1339.5 | 1351.508 | 1 | 0.623313 |
231 | sample_3.wav | 645.0 | 651.008 | 1 | 0.690382 |
233 | sample_3.wav | 763.5 | 769.508 | 1 | 0.626259 |
235 | sample_3.wav | 906.0 | 912.008 | 1 | 0.625516 |
237 | sample_3.wav | 1002.0 | 1008.008 | 1 | 0.608265 |
238 | sample_3.wav | 1053.0 | 1060.508 | 1 | 0.662479 |
239 | sample_3.wav | 1063.5 | 1071.008 | 1 | 0.688362 |
245 | sample_3.wav | 1287.0 | 1294.508 | 1 | 0.776136 |
246 | sample_3.wav | 1375.5 | 1383.008 | 1 | 0.698425 |
247 | sample_3.wav | 1425.0 | 1432.508 | 1 | 0.815898 |
253 | sample_3.wav | 1660.5 | 1666.508 | 1 | 0.677343 |
254 | sample_3.wav | 1672.5 | 1683.008 | 1 | 0.603886 |
Finally, let's see what happens if we increase the step size from 1.5 s to 3.0 s. In this case, there will be no ovelap between consecutive spectrograms, so it runs a little faster but it could result in some upcalls being missed.
audio_loader = AudioFrameLoader(path='./audio/', duration=spec_config['duration'],
step=3.0, stop=True, pad=False, representation=spec_config['type'], representation_params=spec_config)
# Let's store our detections into a list of dictionaries
detections_large_step_size = pd.DataFrame()
batch_generator = batch_load_audio_file_data(loader=audio_loader, batch_size=64) # creating out batch generator
# looping through
for batch_data in batch_generator:
# Run the model on the spectrogram data from the current batch. Here we will return the raw output of the network without unpaking the label and score
# The filter_by_threshold function does that for us
batch_predictions = model.run_on_batch(batch_data['data'], return_raw_output=True)
# Lets store our data in a dictionary
raw_output = {'filename': batch_data['filename'], 'start': batch_data['start'], 'end': batch_data['end'], 'score': batch_predictions}
# Batch detection is already a DF
batch_detections = filter_by_threshold(raw_output, threshold=0.55)
detections_large_step_size = pd.concat([detections_large_step_size, batch_detections], ignore_index=True)
0%| | 0/29 [00:00<?, ?it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 112 samples were padded on the left and 0 samples were padded on the right 31%|███ | 9/29 [00:06<00:14, 1.35it/s]RuntimeWarning: Waveform padded with its own reflection to achieve required length to compute the stft. 0 samples were padded on the left and 97 samples were padded on the right 100%|██████████| 29/29 [00:20<00:00, 1.45it/s]
Let's only select detections with label = 1 again.
detections_large_step_size_filter = filter_by_label(detections_large_step_size, labels=1).reset_index(drop=True)
annotation = compare(annot, detections_large_step_size_filter)
print(len(detections_large_step_size_filter))
print(annotation)
131 sound_file call_time detected 0 sample_1.wav 1128.840 True 1 sample_1.wav 1153.526 True 2 sample_1.wav 1196.778 True 3 sample_1.wav 1227.642 True 4 sample_1.wav 1358.181 True 5 sample_1.wav 1437.482 True 6 sample_1.wav 1489.288 True 7 sample_1.wav 1511.670 True 8 sample_1.wav 1530.595 True 9 sample_1.wav 1536.580 True 10 sample_1.wav 1714.372 True 11 sample_1.wav 1768.251 True 12 sample_1.wav 1777.835 True 13 sample_2.wav 68.149 True 14 sample_2.wav 688.507 True 15 sample_2.wav 755.940 True 16 sample_2.wav 770.440 True 17 sample_3.wav 68.853 False 18 sample_3.wav 105.927 False 19 sample_3.wav 1057.015 True 20 sample_3.wav 1067.282 True 21 sample_3.wav 1290.563 True 22 sample_3.wav 1378.955 True 23 sample_3.wav 1428.648 True 24 sample_3.wav 1663.622 True 25 sample_3.wav 1676.682 True
We see that the processing took about half as long with the same detections.
9. Conclusion¶
In this tutorial we used a pre-trained binary classifier that works on short snapshots to detect North Atlantic right whales in longer recordings. We explored several options and observed how the exact same pre-trained neural network can result in very different performances, depending on how we apply it. Although the quality of model is very important, we highlight that it's worth thinking about how the model will be applied to the task at hand and test the alternatives available.
Once you are happy with your model and have chosen the best way to use it for your workflow, chances are that you will benefit from encapsulate it in some sort of application. Alongside this tutorial, we provided a Command Line Interface that includes all the options we explored here. After going through this tutorial, you'll be familiar with the functions used in the CLI, which can be used as is for this model, other models produced with Ketos or even custom architectures and audio representations. It also serves as a template not only for command line tools, but more generally as a way of integrating ketos models into your applications. A CLI is great to be used by other programs. For example, if you need to run a detector on large amounts of archived data and you have acess to a cluster, a CLI is the way to go. If instead you want to develop a web-app that allows your collaborators to use your detector, or even a desktop app with a friendly graphical interface, the same components can be used.