create_database

ketos.data_handling.database_interface.create_database(output_file, data_dir, selections, channel=0, audio_repres={'type': <class 'ketos.audio.waveform.Waveform'>}, annotations=None, unique_labels=None, dataset_name=None, table_name='data', max_size=None, verbose=True, progress_bar=True, discard_wrong_shape=False, allow_resizing=1, include_source=True, include_label=True, include_attrs=False, attrs=None, index_cols=None, mode='a', create_dir=True, max_filename_len=100)

Create a database from a selection table.

Note that all selections must have the same duration. This is necessary to ensure that all the objects stored in the database have the same dimension.
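The equal-duration requirement can be checked before the database is built. The following is a minimal sketch in plain Python, not part of Ketos: the helper name durations_match and the (start, end) tuple layout are assumptions made for illustration only.

```python
import math


def durations_match(selections, tol=1e-9):
    """Return True if every (start, end) pair spans the same duration.

    `selections` is a hypothetical list of (start, end) times in seconds;
    it is not the Ketos selection table itself.
    """
    durations = [end - start for start, end in selections]
    if not durations:
        return True  # nothing to compare
    ref = durations[0]
    # Compare with a small tolerance to absorb floating-point rounding.
    return all(math.isclose(d, ref, abs_tol=tol) for d in durations)


# Three 3-second selections: consistent.
same = durations_match([(0.0, 3.0), (10.5, 13.5), (2.0, 5.0)])
# One 3-second and one 3.5-second selection: inconsistent.
mixed = durations_match([(0.0, 3.0), (10.5, 14.0)])
```

Running a check of this kind up front makes it easier to spot a selection table that would produce objects of different dimensions.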

If each selection is characterized by a single integer label, the labels should be included as a column named ‘label’ in the selection table.

In the more general case, where each selection is associated with a set of annotations (as opposed to a single, integer label), the annotation table must be passed using the ‘annotations’ argument. The annotations will be saved to a separate table within the database, with a field named ‘data_index’ linking each annotation to a selection in the data table.

Note that the selection table, and the annotation table, if provided, must both adhere to the Ketos standard, as defined in the Selection Table module.

If ‘dataset_name’ is not specified, the name of the folder containing the audio files (‘data_dir’) will be used.

Warnings will be printed if the method encounters problems loading/writing audio data or if the start/end time of a selection is outside the range of the audio file. The warnings can be suppressed by setting verbose=False.
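A call might look like the sketch below. It uses only arguments that appear in the signature above and relies on the default audio representation. The file names, times, and labels are hypothetical, and the table layout (a filename/sel_id index with start, end, and label columns) is an assumption about the Ketos standard that should be checked against the Selection Table module. The wrapper function build_example_database is likewise illustrative.

```python
def build_example_database(output_file, data_dir):
    """Create a database from a small, hypothetical selection table."""
    # Imports are deferred so the sketch can be loaded without
    # pandas or ketos installed.
    import pandas as pd
    from ketos.data_handling.database_interface import create_database

    # Hypothetical selections: three 3-second clips with integer labels.
    # Assumption: the Ketos standard uses a (filename, sel_id) index.
    selections = pd.DataFrame(
        {
            "filename": ["rec_a.wav", "rec_a.wav", "rec_b.wav"],
            "sel_id": [0, 1, 0],
            "start": [0.0, 10.5, 2.0],
            "end": [3.0, 13.5, 5.0],
            "label": [1, 0, 1],
        }
    ).set_index(["filename", "sel_id"])

    create_database(
        output_file=output_file,
        data_dir=data_dir,
        selections=selections,
        dataset_name="train",
        verbose=False,  # suppress the warnings described above
    )
```

The function is only defined here, not called, because running it requires a folder of audio files that match the selection table.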

Args:

output_file:str

The name of the HDF5 file in which the data will be stored. Can include the path (e.g. ‘/home/user/data/database_abc.h5’). If the file does not exist, it will be created. If the file already exists, new data will be appended to it.

data_dir:str

Path to folder containing .wav files, or .tar archive file.

selections: pandas DataFrame or list

Selection table.

channel: int

For stereo recordings, this can be used to select which channel to read from.

audio_repres: dict or list

A dictionary containing the parameters used to generate the spectrogram or waveform segments. See ketos.audio.audio_loader.AudioLoader for details on the required and optional fields for each type of signal. It is also possible to specify one or several audio representations as a nested dictionary, in which case the dictionary keys are used as column names in the output table.

annotations: pandas DataFrame

Annotation table. Optional. Should be used if each selection is associated with a set of annotations (as opposed to a single, integer label). Must have the standard ketos form.

unique_labels: list(int)

List of labels occurring in the dataset. If not specified, the labels will be inferred from the selections or the annotations.

dataset_name:str

Name of the node (HDF5 group) within the database (e.g. ‘train’). Under this node, two tables will be created, ‘data’ and ‘data_annot’, containing the data samples (spectrograms and/or waveforms) and the annotations associated with each sample, respectively.

table_name: str

Table name. Default is ‘data’.

max_size: int

Maximum size of the output database file in bytes. If the file exceeds this size, it will be split into several files with _000, _001, etc. appended to the filename. If None (the default), no restriction is imposed on the file size (i.e. the file is never split).

verbose: bool

Print relevant information during execution, such as the number of files written to disk.

progress_bar: bool

Show progress bar.

discard_wrong_shape: bool

Discard objects that do not have the same shape as previously saved objects. Default is False.

allow_resizing: int

If the object shape differs from previously saved objects, the object will be resized using the resize method of the scikit-image package, provided the mismatch is no greater than allow_resizing in either dimension.
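The tolerance rule described above can be written as a small predicate. This is a sketch of one reading of that rule in plain Python, not the Ketos implementation. The name within_resize_tolerance is hypothetical.

```python
def within_resize_tolerance(shape, ref_shape, allow_resizing=1):
    """Return True if `shape` differs from `ref_shape` by at most
    `allow_resizing` along every dimension.

    Illustrative only: this mirrors the documented rule, not Ketos code.
    """
    if len(shape) != len(ref_shape):
        return False  # a different number of dimensions cannot be reconciled
    return all(abs(a - b) <= allow_resizing for a, b in zip(shape, ref_shape))


# Off by one time bin: within the default tolerance of 1.
ok = within_resize_tolerance((101, 64), (100, 64))
# Off by three time bins: outside the default tolerance.
too_far = within_resize_tolerance((103, 64), (100, 64))
```

Under this reading, an object that fails the check would either raise a problem or be dropped, depending on discard_wrong_shape.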

include_source: bool

If True, the name of the wav file from which the waveform or spectrogram was generated and the offset within that file are saved to the table. Default is True.

include_label: bool

Include integer label column in data table. Default is True.

include_attrs: bool

If True, load data from attribute columns in the selection table. Default is False.

attrs: list(str)

The names of the attribute columns to load data from. Overrides include_attrs if specified. If None, all columns will be loaded, provided that include_attrs=True.

index_cols: str or list(str)

Create indices for the specified columns in the data table to allow for faster queries. For example, index_cols="filename" or index_cols=["filename", "label"].

mode: str

The mode to open the file. It can be one of the following:

w: Write; a new file is created (an existing file with the same name would be deleted).

a: Append; an existing file is opened for reading and writing, and if the file does not exist it is created. This is the default.

r+: It is similar to a, but the file must already exist.

create_dir: bool

If the output directory does not exist, it will be automatically created. Default is True. Only applies if the mode is w or a.

max_filename_len: int

Maximum allowed length of filename. Only used if include_source is True.