Creating a database
In this tutorial we will use Ketos to build a database of North Atlantic Right Whale upcalls. The audio will be represented as spectrograms and can later be used to train a deep learning based classfier and build a right whale detector.
Note
You can download an executable version (Jupyter Notebook) of this tutorial, the data needed to follow along
here
.
Creating a training database¶
In this tutorial, we will use Ketos to create a database that can be used to train a deep learning classifier.
The data is a subset of the recordings used in the 2013 DCLDE challenge, in which participants had to detect calls from the North Atlantic Right Whale. To keep the tutorial simple, we will only use recording from a couple of days containing the characteristic upcall. The size of the database will also be modest. In practice you'll probably want to have more data available to train your classifiers/detectors. But the steps included here should give you a good understanding of the data preparation steps. We will even use a simple data augmentation techinique to increase the size of the training dataset.
Starting with the raw wavfiles and annotations, we will build a database of spectrograms that will be used to train a deep neural network capable of distinguishing upcalls from the background sounds. For our purposes, 'background' includes all sounds that are not upcalls, from other animal vocalizations to ambient noises produced by waves.
You can find the audio files and the annotations in the data
folder within .zip file at the top of this page.
This is a common scenario: you have audio recordings and annotations indicating where in these recording the signals of interest are. However, if your data is not in the exact same format, we encourage you to explore the documentation, since Ketos has a variety of tools including some that are not covered in this tutorial.
Contents:¶
1. Importing the packages
2. Loading the annotations
3. Putting the annotations in the Ketos format
4. Creating segments of uniform length
5. Augmenting the data
6. Including background noise
7. Choosing the spectrogram settings
8. Creating the database
1. Importing the packages¶
For this tutorial we will use several modules within ketos and also the pandas package
import pandas as pd
from ketos.data_handling import selection_table as sl
import ketos.data_handling.database_interface as dbi
from ketos.data_handling.parsing import load_audio_representation
from ketos.audio.spectrogram import MagSpectrogram
from ketos.data_handling.parsing import load_audio_representation
2. Loading the annotations¶
Our annotations are saved in two .csv
files (with values separated by ;
): annotations_train.csv
and annotations_val.csv
, which we will use to create the training and validation datasets respectively. These files can also be found within the .zip
file at the top of the page.
annot_train = pd.read_csv("annotations_train.csv")
annot_val = pd.read_csv("annotations_val.csv")
Let's inspect our annotations
annot_train
Unnamed: 0 | start | end | label | sound_file | datetime | |
---|---|---|---|---|---|---|
0 | 2957 | 188.8115 | 190.5858 | upcall | NOPP6_EST_20090328_000000.wav | 2009-03-28 00:00:00 |
1 | 2958 | 235.7556 | 237.1603 | upcall | NOPP6_EST_20090328_000000.wav | 2009-03-28 00:00:00 |
2 | 2959 | 398.6924 | 400.1710 | upcall | NOPP6_EST_20090328_000000.wav | 2009-03-28 00:00:00 |
3 | 2960 | 438.9091 | 440.3138 | upcall | NOPP6_EST_20090328_000000.wav | 2009-03-28 00:00:00 |
4 | 2961 | 451.0518 | 452.2716 | upcall | NOPP6_EST_20090328_000000.wav | 2009-03-28 00:00:00 |
... | ... | ... | ... | ... | ... | ... |
995 | 3952 | 52.0791 | 53.6686 | upcall | NOPP6_EST_20090329_031500.wav | 2009-03-29 03:15:00 |
996 | 3953 | 76.1057 | 77.2146 | upcall | NOPP6_EST_20090329_031500.wav | 2009-03-29 03:15:00 |
997 | 3954 | 99.9104 | 101.3520 | upcall | NOPP6_EST_20090329_031500.wav | 2009-03-29 03:15:00 |
998 | 3955 | 120.9983 | 121.9224 | upcall | NOPP6_EST_20090329_031500.wav | 2009-03-29 03:15:00 |
999 | 3956 | 104.6603 | 105.8431 | upcall | NOPP6_EST_20090329_031500.wav | 2009-03-29 03:15:00 |
1000 rows × 6 columns
annot_val
Unnamed: 0 | start | end | label | sound_file | datetime | file_duration | |
---|---|---|---|---|---|---|---|
0 | 4157 | 891.4625 | 892.5714 | upcall | NOPP6_EST_20090329_084500.wav | 2009-03-29 08:45:00 | 900 |
1 | 4158 | 52.7486 | 53.8945 | upcall | NOPP6_EST_20090329_090000.wav | 2009-03-29 09:00:00 | 900 |
2 | 4159 | 42.1030 | 43.5076 | upcall | NOPP6_EST_20090329_090000.wav | 2009-03-29 09:00:00 | 900 |
3 | 4160 | 98.0663 | 98.9165 | upcall | NOPP6_EST_20090329_090000.wav | 2009-03-29 09:00:00 | 900 |
4 | 4161 | 116.4928 | 117.8605 | upcall | NOPP6_EST_20090329_090000.wav | 2009-03-29 09:00:00 | 900 |
... | ... | ... | ... | ... | ... | ... | ... |
495 | 4652 | 201.8252 | 203.4886 | upcall | NOPP6_EST_20090329_130000.wav | 2009-03-29 13:00:00 | 900 |
496 | 4653 | 235.1851 | 236.1831 | upcall | NOPP6_EST_20090329_130000.wav | 2009-03-29 13:00:00 | 900 |
497 | 4654 | 236.7006 | 237.7726 | upcall | NOPP6_EST_20090329_130000.wav | 2009-03-29 13:00:00 | 900 |
498 | 4655 | 246.4406 | 247.6974 | upcall | NOPP6_EST_20090329_130000.wav | 2009-03-29 13:00:00 | 900 |
499 | 4656 | 264.1833 | 265.8097 | upcall | NOPP6_EST_20090329_130000.wav | 2009-03-29 13:00:00 | 900 |
500 rows × 7 columns
The annot_train dataframe contains 1000 rows and the annot_val 500. The columns indicate:
start: start time for the annotation, in seconds from the beginning of the file
end: end time for the annotation, in seconds from the beginning of the file
label: label for the annotation (in our case, all annotated signals are 'upcalls', but the origincal DCLDE2013 dataset also had 'gunshots')
sound_file: name of the audio file
datetime: a timestamp for the beginning of the file (UTC)
3. Putting the annotations in the Ketos format¶
Let's check if our annotations follow the Ketos standard.
If that's the case, the function sl.is_standardized
will return True
.
sl.is_standardized(annot_train)
Your table is not in the Ketos format. It should have two levels of indices: filename and annot_id. It should also contain at least the 'label' column. If your annotations have time information, these should appear in the 'start' and 'end' columns extra columns are allowed. Here is a minimum example: label filename annot_id file1.wav 0 2 1 1 2 2 file2.wav 0 2 1 2 2 1 And here is a table with time information and a few extra columns ('min_freq', 'max_freq' and 'file_time_stamp') start end label min_freq max_freq file_time_stamp filename annot_id file1.wav 0 7.0 8.1 2 180.6 294.3 2019-02-24 13:15:00 1 8.5 12.5 1 174.2 258.7 2019-02-24 13:15:00 2 13.1 14.0 2 183.4 292.3 2019-02-24 13:15:00 file2.wav 0 2.2 3.1 2 148.8 286.6 2019-02-24 13:30:00 1 5.8 6.8 2 156.6 278.3 2019-02-24 13:30:00 2 9.0 13.0 1 178.2 304.5 2019-02-24 13:30:00
False
Setting the verbose argument to False
will not show the example above:
sl.is_standardized(annot_val, verbose=False)
False
Neither of our annotations are in the format ketos expects. But we can use the sl.standardize
function to convert to the specified format.
The annot_id column is created automatically by the sl.standardize
function. From the remaining required columns indicated in the example above, we already have start, end and label. Our sound_file column needs to be renamed to filename, so we will need to provide a dictionary to specify that.
We have one extra column, datetime, that we don't really need to keep, so we'll set trim_table=True
, which will discard any columns that are not required by the standardized.
If we wanted to keep the datetime (or any other columns), we would just set trim_table=False
. One situation in which you might want to do that is if you need this information to split a dataset into train/test or train/validation/test, because then you can sort all your annotations by time and make sure the training set does not overlap with the validation/test. But in our case, the annotations are already split.
map_to_ketos_annot_std ={'filename': 'sound_file'}
std_annot_train = sl.standardize(table=annot_train, labels=["upcall"], start_labels_at_1=True,
mapper=map_to_ketos_annot_std, trim_table=True)
std_annot_val = sl.standardize(table=annot_val, labels=["upcall"], start_labels_at_1=True,
mapper=map_to_ketos_annot_std, trim_table=True)
Let's have a look at our standardized tables
std_annot_train
start | end | label | ||
---|---|---|---|---|
filename | annot_id | |||
NOPP6_EST_20090328_000000.wav | 0 | 188.8115 | 190.5858 | 1 |
1 | 235.7556 | 237.1603 | 1 | |
2 | 398.6924 | 400.1710 | 1 | |
3 | 438.9091 | 440.3138 | 1 | |
4 | 451.0518 | 452.2716 | 1 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_031500.wav | 1 | 52.0791 | 53.6686 | 1 |
2 | 76.1057 | 77.2146 | 1 | |
3 | 99.9104 | 101.3520 | 1 | |
4 | 104.6603 | 105.8431 | 1 | |
5 | 120.9983 | 121.9224 | 1 |
1000 rows × 3 columns
std_annot_val
start | end | label | ||
---|---|---|---|---|
filename | annot_id | |||
NOPP6_EST_20090329_084500.wav | 0 | 891.4625 | 892.5714 | 1 |
NOPP6_EST_20090329_090000.wav | 0 | 42.1030 | 43.5076 | 1 |
1 | 52.7486 | 53.8945 | 1 | |
2 | 98.0663 | 98.9165 | 1 | |
3 | 116.4928 | 117.8605 | 1 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_130000.wav | 10 | 201.8252 | 203.4886 | 1 |
11 | 235.1851 | 236.1831 | 1 | |
12 | 236.7006 | 237.7726 | 1 | |
13 | 246.4406 | 247.6974 | 1 | |
14 | 264.1833 | 265.8097 | 1 |
500 rows × 3 columns
Notice that the 'label' column now encodes 'upcall' as ones (1), as the ketos format uses integers to represent labels.
4. Creating segments of uniform length¶
If you look back at our std_annot_train
and std_annot_val
you'll notice that annotations have a variety of lengths, since they mark the beginning and end of an upcall and these have variable durations. For our purposes, we want each signal in the database to be represented as spectrograms, all of same length. Each spectrogram will be labelled as containing an upcall or not.
The sl.select
function in ketos can help us to do just that: for each annotated upcall, it will select a portion of the recording surrounding it. It takes a standardized annotatoin table as input and lets you specify the length of the output segments. We'll use 3 seconds, as it is enough to encompass most upcalls.
Our standardized tables only contain annotated upcalls. Later we will also want some examples of segments that only contain background noise, but for now we'll just create the uniform upcall segments, which we'll call 'positives'
positives_train = sl.select(annotations=std_annot_train, length=3.0)
positives_val = sl.select(annotations=std_annot_val, length=3.0, step=0.0, center=False)
Have a look at the results and notice how each entry is now 3.0 seconds long.
positives_train
label | start | end | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090328_000000.wav | 0 | 1 | 187.866100 | 190.866100 |
1 | 1 | 235.269881 | 238.269881 | |
2 | 1 | 397.243420 | 400.243420 | |
3 | 1 | 438.786305 | 441.786305 | |
4 | 1 | 450.322412 | 453.322412 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_031500.wav | 1 | 1 | 51.489415 | 54.489415 |
2 | 1 | 75.586501 | 78.586501 | |
3 | 1 | 98.384298 | 101.384298 | |
4 | 1 | 103.495285 | 106.495285 | |
5 | 1 | 119.217043 | 122.217043 |
1000 rows × 3 columns
positives_val
label | start | end | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090329_084500.wav | 0 | 1 | 890.762506 | 893.762506 |
NOPP6_EST_20090329_090000.wav | 0 | 1 | 40.845871 | 43.845871 |
1 | 1 | 52.712487 | 55.712487 | |
2 | 1 | 97.185874 | 100.185874 | |
3 | 1 | 115.260259 | 118.260259 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_130000.wav | 10 | 1 | 201.792396 | 204.792396 |
11 | 1 | 233.465505 | 236.465505 | |
12 | 1 | 236.568577 | 239.568577 | |
13 | 1 | 245.285266 | 248.285266 | |
14 | 1 | 264.018936 | 267.018936 |
500 rows × 3 columns
5. Augmenting the data¶
Data augmentation is a set of tecnhiques used in machine learning to increase the data available to train models. There are many different techniques that can be used. The sl.select
function we just used offers a simple way to augment the data while you are creating the uniform selections. It creates segments that are longer than the annotated signals and then shifts the start and end of those segments, resulting in multiple segments with the same annotated signal (our upcalls) positioned at different times. This is a very safe technique, as it is not altering the original signal, but it can already help to increase the amount of data available. It also helps to present a larger variety of contexts in which the upcall can appear.
We'll augment the training portion of our annotations by using two additional arguments. The step
specifies how much the signal will be shifted (in seconds). Smaller values will produce more augmented selections, but they will be more similar to the previous selection. The min_overlap
argument specifies the fraction of the augmented signal that needs to overlap the original annotation in order for it to be included in the augmented selections table. A value of 1.0 means 100%, this is, the new annotation will only be included if the entire upcall falls within the stablished interval. Lower values will result in segments that only contain part of the original upcall. We'll set this value to 0.5, meaning that some of our augmented segments might have as little as half of the original call.
positives_train = sl.select(annotations=std_annot_train, length=3.0, step=0.5, min_overlap=0.5, center=False)
positives_train
label | start | end | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090328_000000.wav | 0 | 1 | 188.146257 | 191.146257 |
1 | 1 | 234.591781 | 237.591781 | |
2 | 1 | 398.622640 | 401.622640 | |
3 | 1 | 438.158151 | 441.158151 | |
4 | 1 | 449.476886 | 452.476886 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_031500.wav | 1 | 1 | 51.043483 | 54.043483 |
2 | 1 | 75.518300 | 78.518300 | |
3 | 1 | 99.327243 | 102.327243 | |
4 | 1 | 103.425799 | 106.425799 | |
5 | 1 | 119.186872 | 122.186872 |
1000 rows × 3 columns
Notice that now our positives_train
tables has almost 3x more rows than before.
6. Including background noise¶
Now that we have the positive instances that we need to create our database, we need to include some examples of the negative class, or instances without upcalls.
The sl.create_rndm_backgr_selections
is ideal for our situation. It takes a standardized ketos table describing all sections of the recordings that contain annotations and takes samples from the non-annotaded portions of the files, assuming everything that is not annotated can be used as a 'background' category.
Note:
You might find yourself in a different scenario. For example, your annotations might already include a 'background' class or you might have annoted different classes of sounds and you only want to use a few of them. In any case, ketos provides a variety of other functions that are helpful in different scenarios. Have a look at the documentation for more details. Specially the selection_table
module.
The sl.create_rndm_backgr_selections
also needs the duration of each file, which we can generate using the sl.file_duration
function.
file_durations_train = sl.file_duration_table('data/train')
file_durations_val = sl.file_duration_table('data/val')
file_durations_train
filename | duration | |
---|---|---|
0 | NOPP6_EST_20090328_000000.wav | 900.0 |
1 | NOPP6_EST_20090328_001500.wav | 900.0 |
2 | NOPP6_EST_20090328_003000.wav | 900.0 |
3 | NOPP6_EST_20090328_004500.wav | 900.0 |
4 | NOPP6_EST_20090328_010000.wav | 900.0 |
... | ... | ... |
79 | NOPP6_EST_20090329_021500.wav | 900.0 |
80 | NOPP6_EST_20090329_023000.wav | 900.0 |
81 | NOPP6_EST_20090329_024500.wav | 900.0 |
82 | NOPP6_EST_20090329_030000.wav | 900.0 |
83 | NOPP6_EST_20090329_031500.wav | 900.0 |
84 rows × 2 columns
Now that we have the file durations, we can generate our table of negative segments. We'll specify the same length (3.0 seconds). The num
argument specifies the number of background segments we would like to generate. Let's make this number equal to the number of positive examples in each dataset (len(positive_train)
and len(positive_val)
)
negatives_train=sl.create_rndm_selections(annotations=std_annot_train, files=file_durations_train,
length=3.0, num=len(positives_train), trim_table=True)
negatives_train
start | end | label | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090328_000000.wav | 0 | 212.665947 | 215.665947 | 0 |
1 | 242.099422 | 245.099422 | 0 | |
2 | 289.042875 | 292.042875 | 0 | |
3 | 435.757319 | 438.757319 | 0 | |
4 | 696.446347 | 699.446347 | 0 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_031500.wav | 6 | 499.160905 | 502.160905 | 0 |
7 | 514.513207 | 517.513207 | 0 | |
8 | 632.399909 | 635.399909 | 0 | |
9 | 722.623274 | 725.623274 | 0 | |
10 | 736.540330 | 739.540330 | 0 |
1000 rows × 3 columns
negatives_val=sl.create_rndm_selections(annotations=std_annot_val, files=file_durations_val,
length=3.0, num=len(positives_val), trim_table=True)
negatives_val
start | end | label | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090329_084500.wav | 0 | 61.824235 | 64.824235 | 0 |
1 | 99.654401 | 102.654401 | 0 | |
2 | 111.862976 | 114.862976 | 0 | |
3 | 118.638613 | 121.638613 | 0 | |
4 | 128.725767 | 131.725767 | 0 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_130000.wav | 17 | 576.100760 | 579.100760 | 0 |
18 | 628.530203 | 631.530203 | 0 | |
19 | 825.323492 | 828.323492 | 0 | |
20 | 858.437114 | 861.437114 | 0 | |
21 | 858.588675 | 861.588675 | 0 |
500 rows × 3 columns
There we have it! Now we'll just put the positives_train
and negatives_train
together and do the same to the validation tables.
selections_train = pd.concat([positives_train,negatives_train], sort=False)
selections_val = pd.concat([positives_val,negatives_val], sort=False)
selections_train
label | start | end | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090328_000000.wav | 0 | 1 | 188.146257 | 191.146257 |
1 | 1 | 234.591781 | 237.591781 | |
2 | 1 | 398.622640 | 401.622640 | |
3 | 1 | 438.158151 | 441.158151 | |
4 | 1 | 449.476886 | 452.476886 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_031500.wav | 6 | 0 | 499.160905 | 502.160905 |
7 | 0 | 514.513207 | 517.513207 | |
8 | 0 | 632.399909 | 635.399909 | |
9 | 0 | 722.623274 | 725.623274 | |
10 | 0 | 736.540330 | 739.540330 |
2000 rows × 3 columns
selections_val
label | start | end | ||
---|---|---|---|---|
filename | sel_id | |||
NOPP6_EST_20090329_084500.wav | 0 | 1 | 890.762506 | 893.762506 |
NOPP6_EST_20090329_090000.wav | 0 | 1 | 40.845871 | 43.845871 |
1 | 1 | 52.712487 | 55.712487 | |
2 | 1 | 97.185874 | 100.185874 | |
3 | 1 | 115.260259 | 118.260259 | |
... | ... | ... | ... | ... |
NOPP6_EST_20090329_130000.wav | 14 | 0 | 796.780911 | 799.780911 |
15 | 0 | 814.049033 | 817.049033 | |
16 | 0 | 837.493193 | 840.493193 | |
17 | 0 | 861.608971 | 864.608971 | |
18 | 0 | 883.387020 | 886.387020 |
1000 rows × 3 columns
At this point, we have defined which audio segments we want in our database: a little over 5500 in the training dataset, 50% with upcalls and 50% without, and 1000 for the validation set, maintaining the same ratio.
Now we need to decide how these segments will be represented.
7. Choosing the spectrogram settings¶
As mentioned earlier, we'll represent the segments as spectrograms.
In the .zip file where you found the data, there's also a spectrogram configuration file (spec_config.json
) which contains the settings we want to use.
This configuration file is simply a text file in the .json
format, so you could make a copy of it, change a few parameters and save several settings to use later or to share the with someone else.
spec_cfg = load_audio_representation('spec_config.json', name="spectrogram")
spec_cfg
{'rate': 1000, 'window': 0.256, 'step': 0.032, 'freq_min': 0, 'freq_max': 500, 'window_func': 'hamming', 'type': 'MagSpectrogram'}
The result is a python dictionary. We could change some value, like the step size:
#spec_cfg['step'] = 0.064
But we will stick to the original here.
8. Creating the database¶
Now we have to compute the spectrograms following the settings above for each selection in our selection tables and then save them in a database.
All of this can be done with the dbi.create_database
function in Ketos.
We will start with the training dataset. We need to indicate the name for the database we want to create, where the audio files are, a name for the dataset, the selections table and, finally the audio representation. As specified in our spec_cfg
, this is a Magnitude spectrogram, but ketos can also create databases with Power, Mel and CQT spectrograms, as well as time-domain data (waveforms).
dbi.create_database(output_file='database.h5', data_dir='data/train',
dataset_name='train',selections=selections_train,
audio_repres=spec_cfg)
100%|██████████████████████████████████████| 2000/2000 [00:14<00:00, 139.70it/s]
2000 items saved to database.h5
And we do the same thing for the validation set. Note that, by specifying the same database name, we are telling ketos that we want to add the validation set to the existing database.
dbi.create_database(output_file='database.h5', data_dir='data/val',
dataset_name='validation',selections=selections_val,
audio_repres=spec_cfg)
100%|██████████████████████████████████████| 1000/1000 [00:07<00:00, 138.17it/s]
1000 items saved to database.h5
Now we have our database with spectrograms representing audio segments with and without the North Atlantic Right Whale upcall. The data is divided into 'train' and 'validation'.
db = dbi.open_file("database.h5", 'r')
db
File(filename=database.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None)) / (RootGroup) '' /train (Group) '' /train/data (Table(2000,)fletcher32, shuffle, zlib(1)) '' description := { "data": Float32Col(shape=(94, 129), dflt=0.0, pos=0), "filename": StringCol(itemsize=100, shape=(), dflt=b'', pos=1), "id": UInt32Col(shape=(), dflt=0, pos=2), "label": UInt8Col(shape=(), dflt=0, pos=3), "offset": Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := 'little' chunkshape := (5,) /validation (Group) '' /validation/data (Table(1000,)fletcher32, shuffle, zlib(1)) '' description := { "data": Float32Col(shape=(94, 129), dflt=0.0, pos=0), "filename": StringCol(itemsize=100, shape=(), dflt=b'', pos=1), "id": UInt32Col(shape=(), dflt=0, pos=2), "label": UInt8Col(shape=(), dflt=0, pos=3), "offset": Float64Col(shape=(), dflt=0.0, pos=4)} byteorder := 'little' chunkshape := (5,)
Here we can see the data divided into 'train' and 'validation' These are called 'groups' in HDF5 terms. Within each of them there's a dataset called 'data', which contains the spectrograms and respective labels.
db.close() #close the database connection
You will likely not need to directly interact with the database. In a following tutorial, we will use Ketos to build a deep neural network and train it to recognize upcalls. Ketos handles the database interactions, so we won't really have to go into the details of it, but if you would like to learn more about how to get data from this database, take a look at the database_interface module in ketos and the pyTables documentation.