Selection Table

selection_table module within the ketos library.

This module provides functions for handling annotation tables and creating selection tables. A Ketos annotation table always has the column ‘label’. For call-level annotations, the table also contains the columns ‘start’ and ‘end’, giving the start and end time of the call measured in seconds since the beginning of the file. The table may also contain the columns ‘freq_min’ and ‘freq_max’, giving the minimum and maximum frequencies of the call in Hz, but this is not required. The user may add any number of additional columns. Note that the table uses two levels of indices, the first index being the filename and the second index an annotation identifier.

Here is a minimal example:

                    label
filename  annot_id       
file1.wav 0             2
          1             1
          2             2
file2.wav 0             2
          1             2
          2             1

And here is a table with time information (call-level annotations) and a few extra columns (‘min_freq’, ‘max_freq’ and ‘file_time_stamp’):

                    start   end  label  min_freq  max_freq      file_time_stamp
filename  annot_id                                                             
file1.wav 0           7.0   8.1      2     180.6     294.3  2019-02-24 13:15:00
          1           8.5  12.5      1     174.2     258.7  2019-02-24 13:15:00
          2          13.1  14.0      2     183.4     292.3  2019-02-24 13:15:00
file2.wav 0           2.2   3.1      2     148.8     286.6  2019-02-24 13:30:00
          1           5.8   6.8      2     156.6     278.3  2019-02-24 13:30:00
          2           9.0  13.0      1     178.2     304.5  2019-02-24 13:30:00
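As a rough illustration of the two-level index described above, the following sketch builds a small call-level table with plain pandas, without calling ketos. Numbering the annotations with `groupby(...).cumcount()` is an assumption about how such an index can be constructed, not a description of the library's internals.

```python
import pandas as pd

# Flat annotation table: one row per annotated call.
flat = pd.DataFrame({
    "filename": ["file1.wav"] * 3 + ["file2.wav"] * 3,
    "start": [7.0, 8.5, 13.1, 2.2, 5.8, 9.0],
    "end": [8.1, 12.5, 14.0, 3.1, 6.8, 13.0],
    "label": [2, 1, 2, 2, 2, 1],
})

# Number the annotations within each file (0, 1, 2, ...) and use
# (filename, annot_id) as a two-level index.
flat["annot_id"] = flat.groupby("filename").cumcount()
table = flat.set_index(["filename", "annot_id"])

# Look up one annotation by filename and annotation identifier.
last_label = table.loc[("file2.wav", 2), "label"]
print(table)
```

Indexing by the `(filename, annot_id)` pair makes it possible to address a single annotation directly, which is what the two index levels are for.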

ketos.data_handling.selection_table.cast_to_str(labels, nested=False)

Convert every label to str format.

If nested is set to True, a flattened version of the input list is also returned.

Args:
labels: list

Input labels

nested: bool

Indicate if the input list contains (or may contain) sublists. False by default. If True, a flattened version of the list is also returned.

Returns:
labels_str: list

Labels converted to str format

labels_str_flat: list

Flattened list of labels. Only returned if nested is set to True.
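The behaviour described above can be sketched in plain Python as follows. This is a hypothetical re-implementation for illustration only (the name `cast_to_str_sketch` is not part of ketos), and it assumes at most one level of nesting.

```python
def cast_to_str_sketch(labels, nested=False):
    """Convert labels to str; if nested, also return a flattened copy."""
    if not nested:
        return [str(x) for x in labels]
    labels_str, labels_str_flat = [], []
    for item in labels:
        if isinstance(item, list):
            # Convert the sublist and also add its items to the flat list.
            sub = [str(x) for x in item]
            labels_str.append(sub)
            labels_str_flat.extend(sub)
        else:
            labels_str.append(str(item))
            labels_str_flat.append(str(item))
    return labels_str, labels_str_flat


nested_str, flat_str = cast_to_str_sketch([1, [2, "B"]], nested=True)
print(nested_str, flat_str)
```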

ketos.data_handling.selection_table.create_label_dict(signal_labels, backgr_labels, discard_labels)

Create label dictionary, following the convention:

  • signal_labels are mapped to 1,2,3,…

  • backgr_labels are mapped to 0

  • discard_labels are mapped to -1

Args:
signal_labels: list, or list of lists

Labels of interest. Will be mapped to 1,2,3,… Several labels can be mapped to the same integer by using nested lists. For example, signal_labels=[A,[B,C]] would result in A being mapped to 1 and B and C both being mapped to 2.

backgr_labels: list

Labels will be grouped into a common “background” class (0).

discard_labels: list

Labels will be grouped into a common “discard” class (-1).

Returns:
label_dict: dict

Dict that maps old labels to new labels.
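The mapping convention can be sketched in a few lines of plain Python. The function name `label_dict_sketch` and the example labels are hypothetical; the sketch only mirrors the convention stated above and is not the library's code.

```python
def label_dict_sketch(signal_labels, backgr_labels, discard_labels):
    """Map discard labels to -1, background labels to 0, signal labels to 1,2,3,..."""
    label_dict = {}
    for label in discard_labels:
        label_dict[label] = -1
    for label in backgr_labels:
        label_dict[label] = 0
    for new_label, entry in enumerate(signal_labels, start=1):
        # A nested list groups several old labels under one new integer.
        group = entry if isinstance(entry, list) else [entry]
        for label in group:
            label_dict[label] = new_label
    return label_dict


mapping = label_dict_sketch(["A", ["B", "C"]], ["noise"], ["unknown"])
print(mapping)
```

With `signal_labels=["A", ["B", "C"]]`, A is mapped to 1 and both B and C are mapped to 2, as in the argument description.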

ketos.data_handling.selection_table.create_rndm_backgr_selections(files, length, num, annotations=None, no_overlap=False, trim_table=False)

Create background selections of uniform length, randomly distributed across the data set and not overlapping with any annotations, including those labelled 0.

The random sampling is performed without regard to already created background selections. It is therefore possible in principle for some of the created selections to overlap, although in practice this occurs only with very small probability, unless the number of requested selections (num) is very large and/or the annotation-free part of the data set is small.

To avoid any overlap, set no_overlap to True, but note that this can lead to longer execution times.

Args:
files: pandas DataFrame

Table with file durations in seconds. Should contain columns named ‘filename’ and ‘duration’.

length: float

Selection length in seconds.

num: int

Number of selections to be created.

annotations: pandas DataFrame

Annotation table. Optional.

no_overlap: bool

If True, randomly selected segments will have no overlap.

trim_table: bool

Keep only the columns prescribed by the Ketos annotation format.

Returns:
table_backgr: pandas DataFrame

Output selection table.

Example:
>>> import pandas as pd
>>> import numpy as np
>>> from ketos.data_handling.selection_table import standardize, create_rndm_backgr_selections
>>> 
>>> #Ensure reproducible results by fixing the random number generator seed.
>>> np.random.seed(3)
>>> 
>>> #Load and inspect the annotations.
>>> df = pd.read_csv("ketos/tests/assets/annot_001.csv")
>>> print(df)
    filename  start   end  label
0  file1.wav    7.0   8.1      1
1  file1.wav    8.5  12.5      0
2  file1.wav   13.1  14.0      1
3  file2.wav    2.2   3.1      1
4  file2.wav    5.8   6.8      1
5  file2.wav    9.0  13.0      0
>>>
>>> #Standardize annotation table format
>>> df, label_dict = standardize(df, return_label_dict=True)
>>> print(df)
                    start   end  label
filename  annot_id                    
file1.wav 0           7.0   8.1      2
          1           8.5  12.5      1
          2          13.1  14.0      2
file2.wav 0           2.2   3.1      2
          1           5.8   6.8      2
          2           9.0  13.0      1
>>>
>>> #Enter file durations into a pandas DataFrame
>>> file_dur = pd.DataFrame({'filename':['file1.wav','file2.wav','file3.wav',], 'duration':[18.,20.,15.]})
>>> 
>>> #Create randomly sampled background selection with fixed 3.0-s length.
>>> df_bgr = create_rndm_backgr_selections(annotations=df, files=file_dur, length=3.0, num=12, trim_table=True) 
>>> print(df_bgr.round(2))
                  start    end  label
filename  sel_id                     
file1.wav 0        1.06   4.06      0
          1        1.31   4.31      0
          2        2.26   5.26      0
file2.wav 0       13.56  16.56      0
          1       14.76  17.76      0
          2       15.50  18.50      0
          3       16.16  19.16      0
file3.wav 0        2.33   5.33      0
          1        7.29  10.29      0
          2        7.44  10.44      0
          3        9.20  12.20      0
          4       10.94  13.94      0
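The idea of drawing background windows that avoid all annotations can be sketched with simple rejection sampling. This is a simplified illustration for a single file, assuming annotations are given as (start, end) tuples; the function `sample_background` and its `max_tries` parameter are hypothetical and do not reflect how ketos performs the sampling. Like the default behaviour described above, the sketch does not check for overlap between the windows themselves.

```python
import random


def sample_background(duration, length, annotations, num, rng, max_tries=10000):
    """Draw up to `num` windows of `length` seconds that avoid all annotations."""
    selections = []
    tries = 0
    while len(selections) < num and tries < max_tries:
        tries += 1
        start = rng.uniform(0.0, duration - length)
        end = start + length
        # Reject the candidate if it overlaps any annotated interval.
        if any(start < a_end and end > a_start for a_start, a_end in annotations):
            continue
        selections.append((start, end))
    return sorted(selections)


rng = random.Random(3)  # fixed seed for reproducibility
annots = [(7.0, 8.1), (8.5, 12.5), (13.1, 14.0)]  # file1.wav in the example
windows = sample_background(18.0, 3.0, annots, num=3, rng=rng)
print(windows)
```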
ketos.data_handling.selection_table.empty_annot_table()

Create an empty call-level annotation table

Returns:
df: pandas DataFrame

Empty annotation table

ketos.data_handling.selection_table.empty_selection_table()

Create an empty selection table

Returns:
df: pandas DataFrame

Empty selection table

ketos.data_handling.selection_table.file_duration_table(path, search_subdirs=False)

Create file duration table.

Args:
path: str

Path to folder with audio files (*.wav)

search_subdirs: bool

If True, also search for audio files in subdirectories. Default is False.

Returns:
df: pandas DataFrame

File duration table. Columns: filename, duration
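To illustrate what a file duration table contains, here is a minimal sketch that measures the duration of PCM wav files using only the Python standard library. The helper `wav_durations` is hypothetical and returns a list of (filename, duration) tuples rather than a DataFrame; how ketos reads the files internally is not covered here.

```python
import os
import tempfile
import wave


def wav_durations(path, search_subdirs=False):
    """Return (filename, duration in seconds) for the *.wav files under path."""
    rows = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if name.lower().endswith(".wav"):
                full = os.path.join(root, name)
                with wave.open(full, "rb") as w:
                    duration = w.getnframes() / w.getframerate()
                rows.append((os.path.relpath(full, path), duration))
        if not search_subdirs:
            break  # os.walk visits the top-level folder first
    return rows


# Demo: write 2 seconds of silence at 8 kHz and measure it.
with tempfile.TemporaryDirectory() as folder:
    with wave.open(os.path.join(folder, "tone.wav"), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * 16000)
    table = wav_durations(folder)
print(table)
```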

ketos.data_handling.selection_table.is_standardized(table, has_time=False, verbose=True)

Check if the table has the correct indices and the minimum required columns.

Args:
table: pandas DataFrame

Annotation table.

has_time: bool

Require time information for each annotation, i.e. start and stop times.

verbose: bool

If True and the table is not standardized, print a message with an example table in the standard format.

Returns:
res: bool

True if the table has the standardized Ketos format. False otherwise.

ketos.data_handling.selection_table.label_occurrence(table)

Identify the unique labels occurring in the table and determine how often each label occurs.

The input table must have the standardized Ketos format, see data_handling.selection_table.standardize(). In particular, each annotation should have only a single label value.

Args:
table: pandas DataFrame

Input table.

Returns:
occurrence: dict

Dictionary where the labels are the keys and the values are the occurrences.

ketos.data_handling.selection_table.missing_columns(table, has_time=False)

Check if the table has the minimum required columns.

Args:
table: pandas DataFrame

Annotation table.

has_time: bool

Require time information for each annotation, i.e. start and stop times.

Returns:
mis: list

List of missing columns, if any.
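The check can be sketched as a simple comparison of column names. The set of required columns used here (‘label’, plus ‘start’ and ‘end’ when time information is required) is taken from the module description at the top of this page; the function `missing_columns_sketch` is hypothetical and operates on a list of column names rather than a DataFrame.

```python
def missing_columns_sketch(columns, has_time=False):
    """Return the required column names that are absent from `columns`."""
    required = ["label"]
    if has_time:
        required += ["start", "end"]
    return [name for name in required if name not in columns]


print(missing_columns_sketch(["label", "start"], has_time=True))
print(missing_columns_sketch(["label", "min_freq"]))
```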

ketos.data_handling.selection_table.query(selections, annotations=None, filename=None, label=None, start=None, end=None)

Query selection table for selections from certain audio files and/or with certain labels.

Args:
selections: pandas DataFrame

Selections table

annotations: pandas DataFrame

Annotations table. Optional.

filename: str or list(str)

Filename(s)

label: int or list(int)

Label(s)

start: float

Earliest end time in seconds

end: float

Latest start time in seconds

Returns:

: pandas DataFrame or tuple(pandas DataFrame, pandas DataFrame)

Selection table, accompanied by an annotation table if an input annotation table is provided.

ketos.data_handling.selection_table.query_annotated(selections, annotations, filename=None, label=None, start=None, end=None)

Query selection table for selections from certain audio files and/or with certain labels.

Args:
selections: pandas DataFrame

Selections table.

annotations: pandas DataFrame

Annotations table.

filename: str or list(str)

Filename(s)

label: int or list(int)

Label(s)

start: float

Earliest end time in seconds

end: float

Latest start time in seconds

Returns:
df1,df2: tuple(pandas DataFrame, pandas DataFrame)

Selection table and annotation table

ketos.data_handling.selection_table.query_labeled(table, filename=None, label=None, start=None, end=None)

Query selection table for selections from certain audio files and/or with certain labels.

Args:
table: pandas DataFrame

Selections table, which must have a ‘label’ column.

filename: str or list(str)

Filename(s)

label: int or list(int)

Label(s)

start: float

Earliest end time in seconds

end: float

Latest start time in seconds

Returns:

df: pandas DataFrame

Selection table
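The filtering semantics of the query functions can be sketched on a plain list of dictionaries. Note how `start` is compared against the end time of each selection and `end` against its start time, so that any selection overlapping the interval [start, end] is kept. The function `query_sketch` is a hypothetical stand-in, not the ketos implementation.

```python
def query_sketch(rows, filename=None, label=None, start=None, end=None):
    """Keep rows that match the given filename(s), label(s) and time window."""
    def as_list(value):
        return value if isinstance(value, (list, tuple)) else [value]

    out = []
    for row in rows:
        if filename is not None and row["filename"] not in as_list(filename):
            continue
        if label is not None and row["label"] not in as_list(label):
            continue
        if start is not None and row["end"] < start:
            continue  # selection ends before the window begins
        if end is not None and row["start"] > end:
            continue  # selection begins after the window ends
        out.append(row)
    return out


rows = [
    {"filename": "file1.wav", "start": 7.0, "end": 8.1, "label": 2},
    {"filename": "file1.wav", "start": 8.5, "end": 12.5, "label": 1},
    {"filename": "file1.wav", "start": 13.1, "end": 14.0, "label": 2},
    {"filename": "file2.wav", "start": 9.0, "end": 13.0, "label": 1},
]
by_label = query_sketch(rows, filename="file1.wav", label=2)
by_time = query_sketch(rows, start=9.0, end=13.0)
print(by_label, by_time)
```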

ketos.data_handling.selection_table.rename_columns(table, mapper)

Renames the table headings to conform with the ketos naming convention.

Args:
table: pandas DataFrame

Annotation table.

mapper: dict

Dictionary mapping the headings of the input table to the standard ketos headings.

Returns:
: pandas DataFrame

Table with new headings

ketos.data_handling.selection_table.segment_annotations(table, num, length, step=None)

Generate a segmented annotation table by stepping across the audio files, using a fixed step size (step) and fixed selection window size (length).

Args:
table: pandas DataFrame

Annotation table.

num: int

Number of segments

length: float

Selection length in seconds.

step: float

Selection step size in seconds. If None, the step size is set equal to the selection length.

Returns:
df: pandas DataFrame

Annotations table

ketos.data_handling.selection_table.segment_files(table, length, step=None, pad=True)

Generate a selection table by stepping across the audio files, using a fixed step size (step) and fixed selection window size (length).

Args:
table: pandas DataFrame

File duration table.

length: float

Selection length in seconds.

step: float

Selection step size in seconds. If None, the step size is set equal to the selection length.

pad: bool

If True (default), the last selection window is allowed to extend beyond the endpoint of the audio file.

Returns:
df: pandas DataFrame

Selection table
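The stepping logic can be sketched for a single file as follows. The rule used to count windows is an assumption inferred from the select_by_segmenting example on this page (files of 11.0 s, 19.2 s and 15.1 s yield 2, 3 and 3 windows for length 10.0 s and step 5.0 s); the function `segment_sketch` is hypothetical and may differ from the library in edge cases.

```python
import math


def segment_sketch(duration, length, step=None, pad=True):
    """Return (start, end) windows stepping across a file of given duration."""
    if step is None:
        step = length
    span = max(0.0, duration - length)
    if pad:
        # The last window may extend beyond the end of the file.
        num = math.ceil(span / step) + 1
    else:
        # Only windows that fit entirely within the file are kept.
        num = math.floor(span / step) + 1 if duration >= length else 0
    return [(i * step, i * step + length) for i in range(num)]


padded = segment_sketch(11.0, length=10.0, step=5.0)
unpadded = segment_sketch(11.0, length=10.0, step=5.0, pad=False)
print(padded, unpadded)
```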

ketos.data_handling.selection_table.select(annotations, length, step=0, min_overlap=0, center=False, discard_long=False, keep_id=False)

Generate a selection table by defining intervals of fixed length around every annotated section of the audio data. Each selection created in this way is characterized by a single, integer-valued label.

The input table must have the standardized Ketos format and contain call-level annotations, see data_handling.selection_table.standardize().

The output table uses two levels of indexing, the first level being the filename and the second level being a selection id.

The generated selections have uniform length given by the length argument.

Note that the selections may have negative start times and/or stop times that exceed the file duration.

Annotations longer than the specified selection length will be cropped, unless the step is set to a value larger than 0.

Annotations with label -1 are discarded.

Args:
annotations: pandas DataFrame

Input table with call-level annotations.

length: float

Selection length in seconds.

step: float

Produce multiple selections for each annotation by shifting the selection window in steps of length step (in seconds) both forward and backward in time. The default value is 0.

min_overlap: float

Minimum required overlap between the selection interval and the annotation, expressed as a fraction of the selection length. Only used if step > 0. The requirement is imposed on all annotations (labeled 1,2,3,…) except background annotations (labeled 0) which are always required to have an overlap of 1.0.

center: bool

Center annotations. Default is False.

discard_long: bool

Discard all annotations longer than the output length. Default is False.

keep_id: bool

For each generated selection, include the id of the annotation from which the selection was generated.

Returns:
table_sel: pandas DataFrame

Output selection table.

Example:
>>> import pandas as pd
>>> from ketos.data_handling.selection_table import select, standardize
>>> 
>>> #Load and inspect the annotations.
>>> df = pd.read_csv("ketos/tests/assets/annot_001.csv")
>>>
>>> #Standardize annotation table format
>>> df, label_dict = standardize(df, return_label_dict=True)
>>> print(df)
                    start   end  label
filename  annot_id                    
file1.wav 0           7.0   8.1      2
          1           8.5  12.5      1
          2          13.1  14.0      2
file2.wav 0           2.2   3.1      2
          1           5.8   6.8      2
          2           9.0  13.0      1
>>> 
>>> #Create a selection table by defining intervals of fixed 
>>> #length around every annotation.
>>> #Set the length to 3.0 sec and require a minimum overlap of 
>>> #0.16*3.0=0.48 sec between selection and annotations.
>>> #Also, create multiple time-shifted versions of the same selection
>>> #using a step size of 1.0 sec.     
>>> df_sel = select(df, length=3.0, step=1.0, min_overlap=0.16, center=True, keep_id=True) 
>>> print(df_sel.round(2))
                  label  start    end  annot_id
filename  sel_id                               
file1.wav 0           2   5.05   8.05         0
          1           1   6.00   9.00         1
          2           2   6.05   9.05         0
          3           1   7.00  10.00         1
          4           2   7.05  10.05         0
          5           1   8.00  11.00         1
          6           1   9.00  12.00         1
          7           1  10.00  13.00         1
          8           1  11.00  14.00         1
          9           2  11.05  14.05         2
          10          1  12.00  15.00         1
          11          2  12.05  15.05         2
          12          2  13.05  16.05         2
file2.wav 0           2   0.15   3.15         0
          1           2   1.15   4.15         0
          2           2   2.15   5.15         0
          3           2   3.80   6.80         1
          4           2   4.80   7.80         1
          5           2   5.80   8.80         1
          6           1   6.50   9.50         2
          7           1   7.50  10.50         2
          8           1   8.50  11.50         2
          9           1   9.50  12.50         2
          10          1  10.50  13.50         2
          11          1  11.50  14.50         2
          12          1  12.50  15.50         2
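For the centered case, the placement of a fixed-length window around an annotation reduces to simple arithmetic on the annotation midpoint. The helper `centered_selection` below is a hypothetical sketch of that arithmetic only; it does not cover time shifting, overlap requirements or the non-centered mode.

```python
def centered_selection(start, end, length):
    """Return a window of the given length centered on the annotation midpoint."""
    mid = 0.5 * (start + end)
    return mid - 0.5 * length, mid + 0.5 * length


# Annotation 0 of file1.wav (7.0 s to 8.1 s) with a 3.0-s selection window.
sel_start, sel_end = centered_selection(7.0, 8.1, length=3.0)
print(round(sel_start, 2), round(sel_end, 2))

# An annotation near the beginning of a file gives a negative start time.
early_start, early_end = centered_selection(0.2, 0.6, length=3.0)
print(round(early_start, 2), round(early_end, 2))
```

The first window (6.05 s to 9.05 s) corresponds to one of the rows for annotation 0 of file1.wav in the example output; the second shows how a selection can begin before the start of the file.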
ketos.data_handling.selection_table.select_by_segmenting(files, length, annotations=None, step=None, discard_empty=False, pad=True)

Generate a selection table by stepping across the audio files, using a fixed step size (step) and fixed selection window size (length).

Unlike the data_handling.selection_table.select() method, selections created by this method are not characterized by a single, integer-valued label, but rather a list of annotations (which can have any length, including zero).

Therefore, the method returns not one, but two tables: A selection table indexed by filename and segment id, and an annotation table indexed by filename, segment id, and annotation id.

Args:
files: pandas DataFrame

Table with file durations in seconds. Should contain columns named ‘filename’ and ‘duration’.

length: float

Selection length in seconds.

annotations: pandas DataFrame

Annotation table.

step: float

Selection step size in seconds. If None, the step size is set equal to the selection length.

discard_empty: bool

If True, only selections that contain annotations will be used. If False (default), all selections are used.

pad: bool

If True (default), the last selection window is allowed to extend beyond the endpoint of the audio file.

Returns:
sel: pandas DataFrame

Selection table

annot: pandas DataFrame

Annotations table. Only returned if annotations is specified.

Example:
>>> import pandas as pd
>>> from ketos.data_handling.selection_table import select_by_segmenting, standardize
>>> 
>>> #Load and inspect the annotations.
>>> annot = pd.read_csv("ketos/tests/assets/annot_001.csv")
>>>
>>> #Standardize annotation table format
>>> annot, label_dict = standardize(annot, return_label_dict=True)
>>> print(annot)
                    start   end  label
filename  annot_id                    
file1.wav 0           7.0   8.1      2
          1           8.5  12.5      1
          2          13.1  14.0      2
file2.wav 0           2.2   3.1      2
          1           5.8   6.8      2
          2           9.0  13.0      1
>>>
>>> #Create file table
>>> files = pd.DataFrame({'filename':['file1.wav', 'file2.wav', 'file3.wav'], 'duration':[11.0, 19.2, 15.1]})
>>> print(files)
    filename  duration
0  file1.wav      11.0
1  file2.wav      19.2
2  file3.wav      15.1
>>>
>>> #Create a selection table by splitting the audio data into segments of 
>>> #uniform length. The length is set to 10.0 sec and the step size to 5.0 sec.
>>> sel = select_by_segmenting(files=files, length=10.0, annotations=annot, step=5.0) 
>>> #Inspect the selection table
>>> print(sel[0].round(2))
                  start   end
filename  sel_id             
file1.wav 0         0.0  10.0
          1         5.0  15.0
file2.wav 0         0.0  10.0
          1         5.0  15.0
          2        10.0  20.0
file3.wav 0         0.0  10.0
          1         5.0  15.0
          2        10.0  20.0
>>> #Inspect the annotations
>>> print(sel[1].round(2))
                           start   end  label
filename  sel_id annot_id                    
file1.wav 0      0           7.0   8.1      2
                 1           8.5  10.0      1
          1      0           2.0   3.1      2
                 1           3.5   7.5      1
                 2           8.1   9.0      2
          2      1           0.0   2.5      1
                 2           3.1   4.0      2
file2.wav 0      0           2.2   3.1      2
                 1           5.8   6.8      2
                 2           9.0  10.0      1
          1      1           0.8   1.8      2
                 2           4.0   8.0      1
          2      2           0.0   3.0      1
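The annotation table in the example expresses times relative to the start of each segment and crops annotations at the segment boundaries. A minimal sketch of that step, assuming annotations are given as (start, end, label) tuples, is shown below; `annotations_in_segment` is a hypothetical helper, not a ketos function.

```python
def annotations_in_segment(annotations, seg_start, seg_end):
    """Clip annotations to a segment and shift times to the segment start."""
    out = []
    for start, end, label in annotations:
        if end <= seg_start or start >= seg_end:
            continue  # no overlap with this segment
        clipped_start = max(start, seg_start) - seg_start
        clipped_end = min(end, seg_end) - seg_start
        out.append((clipped_start, clipped_end, label))
    return out


file1 = [(7.0, 8.1, 2), (8.5, 12.5, 1), (13.1, 14.0, 2)]
seg0 = annotations_in_segment(file1, 0.0, 10.0)
seg1 = annotations_in_segment(file1, 5.0, 15.0)
print([(round(s, 2), round(e, 2), lab) for s, e, lab in seg1])
```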
ketos.data_handling.selection_table.standardize(table=None, filename=None, sep=',', mapper=None, signal_labels=None, backgr_labels=[], unfold_labels=False, label_sep=',', trim_table=False, return_label_dict=False)

Standardize the annotation table format.

The input table can be passed as a pandas DataFrame or as the filename of a csv file. The table may have either a single label per row, in which case unfold_labels should be set to False, or multiple labels per row (e.g. as a comma-separated list of values), in which case unfold_labels should be set to True and label_sep should be specified.

The table headings are renamed to conform with the ketos standard naming convention, following the name mapping specified by the user.

Signal labels are mapped to integers 1,2,3,… while background labels are mapped to 0, and any remaining labels are mapped to -1.

Note that the standardized output table has two levels of indices, the first index being the filename and the second index the annotation identifier.

Args:
table: pandas DataFrame

Annotation table.

filename: str

Full path to csv file containing the annotation table.

sep: str

Separator. Only relevant if filename is specified. Default is “,”.

mapper: dict

Dictionary mapping the headings of the input table to the standard ketos headings.

signal_labels: list, or list of lists

Labels of interest. Will be mapped to 1,2,3,… Several labels can be mapped to the same integer by using nested lists. For example, signal_labels=[A,[B,C]] would result in A being mapped to 1 and B and C both being mapped to 2.

backgr_labels: list

Labels will be grouped into a common “background” class (0).

unfold_labels: bool

Should be set to True if any of the rows have multiple labels. Should be set to False otherwise (default).

label_sep: str

Character used to separate multiple labels. Only relevant if unfold_labels is set to True. Default is “,”.

trim_table: bool

Keep only the columns prescribed by the Ketos annotation format.

return_label_dict: bool

Return label dictionary. Default is False.

Returns:
table_std: pandas DataFrame

Standardized annotation table

label_dict: dict

Dictionary mapping new labels to old labels. Only returned if return_label_dict is True.

ketos.data_handling.selection_table.time_shift(annot, time_ref, length, step, min_overlap)

Create multiple instances of the same selection by stepping in time, both forward and backward.

The time-shifted instances are returned in a pandas DataFrame with the same columns as the input annotation, plus a column named ‘start_new’ containing the start times of the shifted instances.

Args:
annot: pandas Series or dict

Reference annotation. Must contain the labels/keys ‘start’ and ‘end’.

time_ref: float

Reference time used as starting point for the stepping.

length: float

Output annotation length in seconds.

step: float

Produce multiple instances of the same selection by shifting the annotation window in steps of length step (in seconds) both forward and backward in time.

min_overlap: float

Minimum required overlap between the selection intervals and the original annotation, expressed as a fraction of the selection length.

Returns:
df: pandas DataFrame

Output annotation table. The start times of the time-shifted annotations are stored in the column ‘start_new’.

Example:
>>> import pandas as pd
>>> from ketos.data_handling.selection_table import time_shift
>>> 
>>> #Create a single 2-s long annotation
>>> annot = {'filename':'file1.wav', 'label':1, 'start':12.0, 'end':14.0}
>>>
>>> #Step across this annotation with a step size of 0.2 s, creating 1-s long annotations that 
>>> #overlap by at least 50% with the original 2-s annotation 
>>> df = time_shift(annot, time_ref=13.0, length=1.0, step=0.2, min_overlap=0.5)
>>> print(df.round(2))
    filename  label  start   end  start_new
0  file1.wav      1   12.0  14.0       11.6
1  file1.wav      1   12.0  14.0       11.8
2  file1.wav      1   12.0  14.0       12.0
3  file1.wav      1   12.0  14.0       12.2
4  file1.wav      1   12.0  14.0       12.4
5  file1.wav      1   12.0  14.0       12.6
6  file1.wav      1   12.0  14.0       12.8
7  file1.wav      1   12.0  14.0       13.0
8  file1.wav      1   12.0  14.0       13.2
9  file1.wav      1   12.0  14.0       13.4
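The stepping procedure can be sketched in plain Python: starting from the reference time, candidate start times are generated on a grid of spacing `step` in both directions, and only those whose window overlaps the annotation by at least `min_overlap` (as a fraction of the window length) are kept. The function `shifted_starts` is a hypothetical illustration that assumes step > 0; it is not the library's implementation.

```python
def shifted_starts(start, end, time_ref, length, step, min_overlap):
    """Return window start times that sufficiently overlap [start, end]."""
    def overlap_fraction(s):
        overlap = min(end, s + length) - max(start, s)
        return max(0.0, overlap) / length

    starts = []
    # Step backward from the reference time (k = 0, -1, -2, ...).
    k = 0
    while time_ref + k * step + length > start:
        s = time_ref + k * step
        if overlap_fraction(s) >= min_overlap:
            starts.append(s)
        k -= 1
    # Step forward from the reference time (k = 1, 2, ...).
    k = 1
    while time_ref + k * step < end:
        s = time_ref + k * step
        if overlap_fraction(s) >= min_overlap:
            starts.append(s)
        k += 1
    return sorted(starts)


starts = shifted_starts(12.0, 14.0, time_ref=13.0, length=1.0, step=0.2, min_overlap=0.5)
print([round(s, 2) for s in starts])
```

With the parameters of the example above, the sketch yields the same ten start times, from 11.6 s to 13.4 s.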
ketos.data_handling.selection_table.trim(table)

Keep only the columns prescribed by the Ketos annotation format.

Args:
table: pandas DataFrame

Annotation table.

Returns:
table: pandas DataFrame

Annotation table, after removal of columns.

ketos.data_handling.selection_table.unfold(table, sep=',')

Unfolds rows containing multiple labels.

Args:
table: pandas DataFrame

Annotation table.

sep: str

Character used to separate multiple labels.

Returns:
: pandas DataFrame

Unfolded table
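Unfolding can be pictured as replacing each row that carries several labels with one row per label. The sketch below works on a list of dictionaries and leaves the labels as strings; `unfold_sketch` and the example rows are hypothetical and only illustrate the idea.

```python
def unfold_sketch(rows, sep=","):
    """Split rows with multiple labels into one row per label."""
    out = []
    for row in rows:
        for label in str(row["label"]).split(sep):
            new_row = dict(row)  # copy so the input rows stay unchanged
            new_row["label"] = label.strip()
            out.append(new_row)
    return out


rows = [
    {"filename": "file1.wav", "label": "1,2"},
    {"filename": "file2.wav", "label": "3"},
]
unfolded = unfold_sketch(rows)
print(unfolded)
```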

ketos.data_handling.selection_table.use_multi_indexing(df, level_1_name)

Change from single-level indexing to double-level indexing.

The first index level is the filename while the second index level is a cumulative integer.

Args:
df: pandas DataFrame

Singly-indexed table. Must contain a column named ‘filename’.

Returns:
table: pandas DataFrame

Multi-indexed table.