Creating a database

In this tutorial we will use Ketos to build a database of North Atlantic Right Whale upcalls. The audio will be represented as spectrograms and can later be used to train a deep learning based classfier and build a right whale detector.

Note

You can download an executable version (Jupyter Notebook) of this tutorial, the data needed to follow along here.

create_database

Creating a training database¶

In this tutorial, we will use Ketos to create a database that can be used to train a deep learning classifier.

The data is a subset of the recordings used in the 2013 DCLDE challenge, in which participants had to detect calls from the North Atlantic Right Whale. To keep the tutorial simple, we will only use recording from a couple of days containing the characteristic upcall. The size of the database will also be modest. In practice you'll probably want to have more data available to train your classifiers/detectors. But the steps included here should give you a good understanding of the data preparation steps. We will even use a simple data augmentation techinique to increase the size of the training dataset.

Starting with the raw wavfiles and annotations, we will build a database of spectrograms that will be used to train a deep neural network capable of distinguishing upcalls from the background sounds. For our purposes, 'background' includes all sounds that are not upcalls, from other animal vocalizations to ambient noises produced by waves.

You can find the audio files and the annotations in the data folder within .zip file at the top of this page.

This is a common scenario: you have audio recordings and annotations indicating where in these recording the signals of interest are. However, if your data is not in the exact same format, we encourage you to explore the documentation, since Ketos has a variety of tools including some that are not covered in this tutorial.

Contents:¶

1. Importing the packages
2. Loading the annotations
3. Putting the annotations in the Ketos format
4. Creating segments of uniform length
5. Augmenting the data
6. Including background noise
7. Choosing the spectrogram settings
8. Creating the database

1. Importing the packages¶

For this tutorial we will use several modules within ketos and also the pandas package

In [1]:

import pandas as pd
from ketos.data_handling import selection_table as sl
import ketos.data_handling.database_interface as dbi
from ketos.data_handling.parsing import load_audio_representation
from ketos.audio.spectrogram import MagSpectrogram
from ketos.data_handling.parsing import load_audio_representation

2. Loading the annotations¶

Our annotations are saved in two .csv files (with values separated by ;): annotations_train.csv and annotations_val.csv, which we will use to create the training and validation datasets respectively. These files can also be found within the .zip file at the top of the page.

In [3]:

annot_train = pd.read_csv("annotations_train.csv")
annot_val = pd.read_csv("annotations_val.csv")

Let's inspect our annotations

In [4]:

annot_train

Out[4]:

	Unnamed: 0	start	end	label	sound_file	datetime
0	2957	188.8115	190.5858	upcall	NOPP6_EST_20090328_000000.wav	2009-03-28 00:00:00
1	2958	235.7556	237.1603	upcall	NOPP6_EST_20090328_000000.wav	2009-03-28 00:00:00
2	2959	398.6924	400.1710	upcall	NOPP6_EST_20090328_000000.wav	2009-03-28 00:00:00
3	2960	438.9091	440.3138	upcall	NOPP6_EST_20090328_000000.wav	2009-03-28 00:00:00
4	2961	451.0518	452.2716	upcall	NOPP6_EST_20090328_000000.wav	2009-03-28 00:00:00
...	...	...	...	...	...	...
995	3952	52.0791	53.6686	upcall	NOPP6_EST_20090329_031500.wav	2009-03-29 03:15:00
996	3953	76.1057	77.2146	upcall	NOPP6_EST_20090329_031500.wav	2009-03-29 03:15:00
997	3954	99.9104	101.3520	upcall	NOPP6_EST_20090329_031500.wav	2009-03-29 03:15:00
998	3955	120.9983	121.9224	upcall	NOPP6_EST_20090329_031500.wav	2009-03-29 03:15:00
999	3956	104.6603	105.8431	upcall	NOPP6_EST_20090329_031500.wav	2009-03-29 03:15:00

1000 rows × 6 columns

In [5]:

annot_val

Out[5]:

	Unnamed: 0	start	end	label	sound_file	datetime	file_duration
0	4157	891.4625	892.5714	upcall	NOPP6_EST_20090329_084500.wav	2009-03-29 08:45:00	900
1	4158	52.7486	53.8945	upcall	NOPP6_EST_20090329_090000.wav	2009-03-29 09:00:00	900
2	4159	42.1030	43.5076	upcall	NOPP6_EST_20090329_090000.wav	2009-03-29 09:00:00	900
3	4160	98.0663	98.9165	upcall	NOPP6_EST_20090329_090000.wav	2009-03-29 09:00:00	900
4	4161	116.4928	117.8605	upcall	NOPP6_EST_20090329_090000.wav	2009-03-29 09:00:00	900
...	...	...	...	...	...	...	...
495	4652	201.8252	203.4886	upcall	NOPP6_EST_20090329_130000.wav	2009-03-29 13:00:00	900
496	4653	235.1851	236.1831	upcall	NOPP6_EST_20090329_130000.wav	2009-03-29 13:00:00	900
497	4654	236.7006	237.7726	upcall	NOPP6_EST_20090329_130000.wav	2009-03-29 13:00:00	900
498	4655	246.4406	247.6974	upcall	NOPP6_EST_20090329_130000.wav	2009-03-29 13:00:00	900
499	4656	264.1833	265.8097	upcall	NOPP6_EST_20090329_130000.wav	2009-03-29 13:00:00	900

500 rows × 7 columns

The annot_train dataframe contains 1000 rows and the annot_val 500. The columns indicate:

start: start time for the annotation, in seconds from the beginning of the file
end: end time for the annotation, in seconds from the beginning of the file
label: label for the annotation (in our case, all annotated signals are 'upcalls', but the origincal DCLDE2013 dataset also had 'gunshots')
sound_file: name of the audio file
datetime: a timestamp for the beginning of the file (UTC)

3. Putting the annotations in the Ketos format¶

Let's check if our annotations follow the Ketos standard.

If that's the case, the function sl.is_standardized will return True.

In [6]:

sl.is_standardized(annot_train)

 Your table is not in the Ketos format.

            It should have two levels of indices: filename and annot_id.
            It should also contain at least the 'label' column.
            If your annotations have time information, these should appear in the 'start' and 'end' columns

            extra columns are allowed.

            Here is a minimum example:

                                 label
            filename  annot_id                    
            file1.wav 0          2
                      1          1
                      2          2
            file2.wav 0          2
                      1          2
                      2          1


            And here is a table with time information and a few extra columns ('min_freq', 'max_freq' and 'file_time_stamp')

                                 start   end  label  min_freq  max_freq  file_time_stamp
            filename  annot_id                    
            file1.wav 0           7.0   8.1      2    180.6     294.3    2019-02-24 13:15:00
                      1           8.5  12.5      1    174.2     258.7    2019-02-24 13:15:00
                      2          13.1  14.0      2    183.4     292.3    2019-02-24 13:15:00
            file2.wav 0           2.2   3.1      2    148.8     286.6    2019-02-24 13:30:00
                      1           5.8   6.8      2    156.6     278.3    2019-02-24 13:30:00
                      2           9.0  13.0      1    178.2     304.5    2019-02-24 13:30:00

Out[6]:

False

Setting the verbose argument to False will not show the example above:

In [7]:

sl.is_standardized(annot_val, verbose=False)

Out[7]:

False

Neither of our annotations are in the format ketos expects. But we can use the sl.standardize function to convert to the specified format.

The annot_id column is created automatically by the sl.standardize function. From the remaining required columns indicated in the example above, we already have start, end and label. Our sound_file column needs to be renamed to filename, so we will need to provide a dictionary to specify that.

We have one extra column, datetime, that we don't really need to keep, so we'll set trim_table=True, which will discard any columns that are not required by the standardized.

If we wanted to keep the datetime (or any other columns), we would just set trim_table=False. One situation in which you might want to do that is if you need this information to split a dataset into train/test or train/validation/test, because then you can sort all your annotations by time and make sure the training set does not overlap with the validation/test. But in our case, the annotations are already split.

In [8]:

map_to_ketos_annot_std ={'filename': 'sound_file'} 
std_annot_train = sl.standardize(table=annot_train, labels=["upcall"], start_labels_at_1=True,
                                mapper=map_to_ketos_annot_std, trim_table=True)
std_annot_val = sl.standardize(table=annot_val, labels=["upcall"], start_labels_at_1=True,
                                mapper=map_to_ketos_annot_std, trim_table=True)

Let's have a look at our standardized tables

In [9]:

std_annot_train

Out[9]:

		start	end	label
filename	annot_id
NOPP6_EST_20090328_000000.wav	0	188.8115	190.5858	1
	1	235.7556	237.1603	1
	2	398.6924	400.1710	1
	3	438.9091	440.3138	1
	4	451.0518	452.2716	1
...	...	...	...	...
NOPP6_EST_20090329_031500.wav	1	52.0791	53.6686	1
	2	76.1057	77.2146	1
	3	99.9104	101.3520	1
	4	104.6603	105.8431	1
	5	120.9983	121.9224	1

1000 rows × 3 columns

In [9]:

std_annot_val

Out[9]:

		start	end	label
filename	annot_id
NOPP6_EST_20090329_084500.wav	0	891.4625	892.5714	1
NOPP6_EST_20090329_090000.wav	0	42.1030	43.5076	1
	1	52.7486	53.8945	1
	2	98.0663	98.9165	1
	3	116.4928	117.8605	1
...	...	...	...	...
NOPP6_EST_20090329_130000.wav	10	201.8252	203.4886	1
	11	235.1851	236.1831	1
	12	236.7006	237.7726	1
	13	246.4406	247.6974	1
	14	264.1833	265.8097	1

500 rows × 3 columns

Notice that the 'label' column now encodes 'upcall' as ones (1), as the ketos format uses integers to represent labels.

4. Creating segments of uniform length¶

If you look back at our std_annot_train and std_annot_val you'll notice that annotations have a variety of lengths, since they mark the beginning and end of an upcall and these have variable durations. For our purposes, we want each signal in the database to be represented as spectrograms, all of same length. Each spectrogram will be labelled as containing an upcall or not.

The sl.select function in ketos can help us to do just that: for each annotated upcall, it will select a portion of the recording surrounding it. It takes a standardized annotatoin table as input and lets you specify the length of the output segments. We'll use 3 seconds, as it is enough to encompass most upcalls.

Our standardized tables only contain annotated upcalls. Later we will also want some examples of segments that only contain background noise, but for now we'll just create the uniform upcall segments, which we'll call 'positives'

In [10]:

positives_train = sl.select(annotations=std_annot_train, length=3.0)
positives_val = sl.select(annotations=std_annot_val, length=3.0, step=0.0, center=False)

Have a look at the results and notice how each entry is now 3.0 seconds long.

In [11]:

positives_train

Out[11]:

		label	start	end
filename	sel_id
NOPP6_EST_20090328_000000.wav	0	1	187.866100	190.866100
	1	1	235.269881	238.269881
	2	1	397.243420	400.243420
	3	1	438.786305	441.786305
	4	1	450.322412	453.322412
...	...	...	...	...
NOPP6_EST_20090329_031500.wav	1	1	51.489415	54.489415
	2	1	75.586501	78.586501
	3	1	98.384298	101.384298
	4	1	103.495285	106.495285
	5	1	119.217043	122.217043

1000 rows × 3 columns

In [12]:

positives_val

Out[12]:

		label	start	end
filename	sel_id
NOPP6_EST_20090329_084500.wav	0	1	890.762506	893.762506
NOPP6_EST_20090329_090000.wav	0	1	40.845871	43.845871
	1	1	52.712487	55.712487
	2	1	97.185874	100.185874
	3	1	115.260259	118.260259
...	...	...	...	...
NOPP6_EST_20090329_130000.wav	10	1	201.792396	204.792396
	11	1	233.465505	236.465505
	12	1	236.568577	239.568577
	13	1	245.285266	248.285266
	14	1	264.018936	267.018936

500 rows × 3 columns

5. Augmenting the data¶

Data augmentation is a set of tecnhiques used in machine learning to increase the data available to train models. There are many different techniques that can be used. The sl.select function we just used offers a simple way to augment the data while you are creating the uniform selections. It creates segments that are longer than the annotated signals and then shifts the start and end of those segments, resulting in multiple segments with the same annotated signal (our upcalls) positioned at different times. This is a very safe technique, as it is not altering the original signal, but it can already help to increase the amount of data available. It also helps to present a larger variety of contexts in which the upcall can appear.

We'll augment the training portion of our annotations by using two additional arguments. The step specifies how much the signal will be shifted (in seconds). Smaller values will produce more augmented selections, but they will be more similar to the previous selection. The min_overlap argument specifies the fraction of the augmented signal that needs to overlap the original annotation in order for it to be included in the augmented selections table. A value of 1.0 means 100%, this is, the new annotation will only be included if the entire upcall falls within the stablished interval. Lower values will result in segments that only contain part of the original upcall. We'll set this value to 0.5, meaning that some of our augmented segments might have as little as half of the original call.

In [12]:

positives_train = sl.select(annotations=std_annot_train, length=3.0, step=0.5, min_overlap=0.5, center=False)

In [13]:

positives_train

Out[13]:

		label	start	end
filename	sel_id
NOPP6_EST_20090328_000000.wav	0	1	188.146257	191.146257
	1	1	234.591781	237.591781
	2	1	398.622640	401.622640
	3	1	438.158151	441.158151
	4	1	449.476886	452.476886
...	...	...	...	...
NOPP6_EST_20090329_031500.wav	1	1	51.043483	54.043483
	2	1	75.518300	78.518300
	3	1	99.327243	102.327243
	4	1	103.425799	106.425799
	5	1	119.186872	122.186872

1000 rows × 3 columns

Notice that now our positives_train tables has almost 3x more rows than before.

6. Including background noise¶

Now that we have the positive instances that we need to create our database, we need to include some examples of the negative class, or instances without upcalls.

The sl.create_rndm_backgr_selections is ideal for our situation. It takes a standardized ketos table describing all sections of the recordings that contain annotations and takes samples from the non-annotaded portions of the files, assuming everything that is not annotated can be used as a 'background' category.

Note: You might find yourself in a different scenario. For example, your annotations might already include a 'background' class or you might have annoted different classes of sounds and you only want to use a few of them. In any case, ketos provides a variety of other functions that are helpful in different scenarios. Have a look at the documentation for more details. Specially the selection_table module.

The sl.create_rndm_backgr_selections also needs the duration of each file, which we can generate using the sl.file_duration function.

In [14]:

file_durations_train = sl.file_duration_table('data/train')
file_durations_val = sl.file_duration_table('data/val') 

In [15]:

file_durations_train

Out[15]:

	filename	duration
0	NOPP6_EST_20090328_000000.wav	900.0
1	NOPP6_EST_20090328_001500.wav	900.0
2	NOPP6_EST_20090328_003000.wav	900.0
3	NOPP6_EST_20090328_004500.wav	900.0
4	NOPP6_EST_20090328_010000.wav	900.0
...	...	...
79	NOPP6_EST_20090329_021500.wav	900.0
80	NOPP6_EST_20090329_023000.wav	900.0
81	NOPP6_EST_20090329_024500.wav	900.0
82	NOPP6_EST_20090329_030000.wav	900.0
83	NOPP6_EST_20090329_031500.wav	900.0

84 rows × 2 columns

Now that we have the file durations, we can generate our table of negative segments. We'll specify the same length (3.0 seconds). The num argument specifies the number of background segments we would like to generate. Let's make this number equal to the number of positive examples in each dataset (len(positive_train) and len(positive_val))

In [16]:

negatives_train=sl.create_rndm_selections(annotations=std_annot_train, files=file_durations_train, 
                                          length=3.0, num=len(positives_train), trim_table=True)
negatives_train

Out[16]:

		start	end	label
filename	sel_id
NOPP6_EST_20090328_000000.wav	0	212.665947	215.665947	0
	1	242.099422	245.099422	0
	2	289.042875	292.042875	0
	3	435.757319	438.757319	0
	4	696.446347	699.446347	0
...	...	...	...	...
NOPP6_EST_20090329_031500.wav	6	499.160905	502.160905	0
	7	514.513207	517.513207	0
	8	632.399909	635.399909	0
	9	722.623274	725.623274	0
	10	736.540330	739.540330	0

1000 rows × 3 columns

In [17]:

negatives_val=sl.create_rndm_selections(annotations=std_annot_val, files=file_durations_val, 
                                         length=3.0, num=len(positives_val), trim_table=True)
negatives_val

Out[17]:

		start	end	label
filename	sel_id
NOPP6_EST_20090329_084500.wav	0	61.824235	64.824235	0
	1	99.654401	102.654401	0
	2	111.862976	114.862976	0
	3	118.638613	121.638613	0
	4	128.725767	131.725767	0
...	...	...	...	...
NOPP6_EST_20090329_130000.wav	17	576.100760	579.100760	0
	18	628.530203	631.530203	0
	19	825.323492	828.323492	0
	20	858.437114	861.437114	0
	21	858.588675	861.588675	0

500 rows × 3 columns

There we have it! Now we'll just put the positives_train and negatives_train together and do the same to the validation tables.

In [18]:

selections_train = pd.concat([positives_train,negatives_train], sort=False)
selections_val = pd.concat([positives_val,negatives_val], sort=False)

In [19]:

selections_train

Out[19]:

		label	start	end
filename	sel_id
NOPP6_EST_20090328_000000.wav	0	1	188.146257	191.146257
	1	1	234.591781	237.591781
	2	1	398.622640	401.622640
	3	1	438.158151	441.158151
	4	1	449.476886	452.476886
...	...	...	...	...
NOPP6_EST_20090329_031500.wav	6	0	499.160905	502.160905
	7	0	514.513207	517.513207
	8	0	632.399909	635.399909
	9	0	722.623274	725.623274
	10	0	736.540330	739.540330

2000 rows × 3 columns

In [21]:

selections_val

Out[21]:

		label	start	end
filename	sel_id
NOPP6_EST_20090329_084500.wav	0	1	890.762506	893.762506
NOPP6_EST_20090329_090000.wav	0	1	40.845871	43.845871
	1	1	52.712487	55.712487
	2	1	97.185874	100.185874
	3	1	115.260259	118.260259
...	...	...	...	...
NOPP6_EST_20090329_130000.wav	14	0	796.780911	799.780911
	15	0	814.049033	817.049033
	16	0	837.493193	840.493193
	17	0	861.608971	864.608971
	18	0	883.387020	886.387020

1000 rows × 3 columns

At this point, we have defined which audio segments we want in our database: a little over 5500 in the training dataset, 50% with upcalls and 50% without, and 1000 for the validation set, maintaining the same ratio.

Now we need to decide how these segments will be represented.

7. Choosing the spectrogram settings¶

As mentioned earlier, we'll represent the segments as spectrograms. In the .zip file where you found the data, there's also a spectrogram configuration file (spec_config.json) which contains the settings we want to use.

This configuration file is simply a text file in the .json format, so you could make a copy of it, change a few parameters and save several settings to use later or to share the with someone else.

In [20]:

spec_cfg = load_audio_representation('spec_config.json', name="spectrogram")

In [21]:

spec_cfg

Out[21]:

{'rate': 1000,
 'window': 0.256,
 'step': 0.032,
 'freq_min': 0,
 'freq_max': 500,
 'window_func': 'hamming',
 'type': 'MagSpectrogram'}

The result is a python dictionary. We could change some value, like the step size:

In [22]:

#spec_cfg['step'] = 0.064

But we will stick to the original here.

8. Creating the database¶

Now we have to compute the spectrograms following the settings above for each selection in our selection tables and then save them in a database.

All of this can be done with the dbi.create_database function in Ketos.

We will start with the training dataset. We need to indicate the name for the database we want to create, where the audio files are, a name for the dataset, the selections table and, finally the audio representation. As specified in our spec_cfg, this is a Magnitude spectrogram, but ketos can also create databases with Power, Mel and CQT spectrograms, as well as time-domain data (waveforms).

In [23]:

dbi.create_database(output_file='database.h5', data_dir='data/train',
                               dataset_name='train',selections=selections_train,
                               audio_repres=spec_cfg)
                              

100%|██████████████████████████████████████| 2000/2000 [00:14<00:00, 139.70it/s]

2000 items saved to database.h5

And we do the same thing for the validation set. Note that, by specifying the same database name, we are telling ketos that we want to add the validation set to the existing database.

In [24]:

dbi.create_database(output_file='database.h5', data_dir='data/val',
                               dataset_name='validation',selections=selections_val,
                               audio_repres=spec_cfg)
                              

100%|██████████████████████████████████████| 1000/1000 [00:07<00:00, 138.17it/s]

1000 items saved to database.h5

Now we have our database with spectrograms representing audio segments with and without the North Atlantic Right Whale upcall. The data is divided into 'train' and 'validation'.

In [25]:

db = dbi.open_file("database.h5", 'r')

In [26]:

db

Out[26]:

File(filename=database.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/train (Group) ''
/train/data (Table(2000,)fletcher32, shuffle, zlib(1)) ''
  description := {
  "data": Float32Col(shape=(94, 129), dflt=0.0, pos=0),
  "filename": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "id": UInt32Col(shape=(), dflt=0, pos=2),
  "label": UInt8Col(shape=(), dflt=0, pos=3),
  "offset": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (5,)
/validation (Group) ''
/validation/data (Table(1000,)fletcher32, shuffle, zlib(1)) ''
  description := {
  "data": Float32Col(shape=(94, 129), dflt=0.0, pos=0),
  "filename": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "id": UInt32Col(shape=(), dflt=0, pos=2),
  "label": UInt8Col(shape=(), dflt=0, pos=3),
  "offset": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (5,)

Here we can see the data divided into 'train' and 'validation' These are called 'groups' in HDF5 terms. Within each of them there's a dataset called 'data', which contains the spectrograms and respective labels.

In [27]:

db.close() #close the database connection

You will likely not need to directly interact with the database. In a following tutorial, we will use Ketos to build a deep neural network and train it to recognize upcalls. Ketos handles the database interactions, so we won't really have to go into the details of it, but if you would like to learn more about how to get data from this database, take a look at the database_interface module in ketos and the pyTables documentation.