Miscellaneous

The audio.utils.misc module within the ketos library.

This module provides utilities to perform various types of operations on audio data, acting either in the time domain (waveform) or in the frequency domain (spectrogram), or both.

ketos.audio.utils.misc.cqt(x, rate, step, bins_per_oct, freq_min, freq_max=None, window_func='hamming')[source]

Compute the CQT spectrogram of an audio signal.

Uses the librosa implementation.

To compute the CQT spectrogram, the user must specify the step size, the minimum and maximum frequencies, f_{min} and f_{max}, and the number of bins per octave, m. While f_{min} and m are fixed to the input values, the step size and f_{max} are adjusted as detailed below, attempting to match the input values as closely as possible.

The total number of bins is given by n = k \cdot m where k denotes the number of octaves, computed as

k = ceil(log_{2}[f_{max}/f_{min}])

For example, with f_{min}=10, f_{max}=16000, and m = 32 the number of octaves is k = 11 and the total number of bins is n = 352. The frequency of a given bin, i, is given by

f_{i} = 2^{i / m} \cdot f_{min}

This implies that the maximum frequency is given by f_{max} = f_{n} = 2^{n/m} \cdot f_{min}. For the above example, we find f_{max} = 20480 Hz, i.e., somewhat larger than the requested maximum value.

Note that if f_{max} exceeds the Nyquist frequency, f_{nyquist} = 0.5 \cdot s, where s is the sampling rate, the number of octaves, k, is reduced to ensure that f_{max} \leq f_{nyquist}.

The CQT algorithm requires the step size to be an integer multiple of 2^k. To ensure that this is the case, the step size is computed as follows:

h = ceil(s \cdot x / 2^k ) \cdot 2^k

where s is the sampling rate in Hz and x is the step size in seconds, as specified via the argument step. For example, assuming a sampling rate of 32 kHz (s = 32000), a step size of 0.02 seconds (x = 0.02), and the same frequency limits as above (f_{min}=10 and f_{max}=16000), the actual step size is determined to be h = 2^{11} = 2048 samples, corresponding to a time resolution of t_{res} = 2048 / 32000 = 0.064 s, i.e., about three times the requested step size.
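As a sanity check, the adjustments described above can be reproduced for the worked example with a few lines of plain Python (the variable names below are illustrative and not part of the ketos API):

```python
import math

# Input values from the worked example above
rate = 32000        # sampling rate s, in Hz
step = 0.02         # requested step size x, in seconds
freq_min = 10.0     # f_min, in Hz
freq_max = 16000.0  # requested f_max, in Hz
bins_per_oct = 32   # m

# Number of octaves and total number of bins
k = math.ceil(math.log2(freq_max / freq_min))
n = k * bins_per_oct

# Actual maximum frequency implied by n bins
freq_max_actual = 2 ** (n / bins_per_oct) * freq_min

# Step size adjusted to an integer multiple of 2^k
h = math.ceil(rate * step / 2 ** k) * 2 ** k
t_res = h / rate

print(k, n, freq_max_actual, h, t_res)  # 11 352 20480.0 2048 0.064
```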

Args:
x: numpy.array

Audio signal

rate: float

Sampling rate in Hz

step: float

Step size in seconds

bins_per_oct: int

Number of bins per octave

freq_min: float

Minimum frequency in Hz

freq_max: float

Maximum frequency in Hz. If None, it is set equal to half the sampling rate.

window_func: str
Window function (optional). Select between
  • bartlett

  • blackman

  • hamming (default)

  • hanning

Returns:
img: numpy.array

Resulting CQT spectrogram image.

step: float

Adjusted step size in seconds.

ketos.audio.utils.misc.from_decibel(y)[source]
Convert any data array, y, typically a spectrogram, from decibel scale to linear scale by applying the operation 10^{y/20}.

Args:
y: numpy array

Input array

Returns:
x: numpy array

Converted array

Example:
>>> import numpy as np
>>> from ketos.audio.utils.misc import from_decibel 
>>> img = np.array([[10., 20.],[30., 40.]])
>>> img_lin = from_decibel(img)
>>> img_lin = np.around(img_lin, decimals=2) # only keep up to two decimals
>>> print(img_lin)
[[  3.16  10.  ]
 [ 31.62 100.  ]]
ketos.audio.utils.misc.num_samples(time, rate, even=False)[source]

Convert time interval to number of samples.

If the time corresponds to a non-integer number of samples, round to the nearest larger integer value.

Args:
time: float

Time interval in seconds

rate: float

Sampling rate in Hz

even: bool

If True, round up to the nearest even number of samples.

Returns:
n: int

Number of samples

Example:
>>> from ketos.audio.utils.misc import num_samples
>>> print(num_samples(rate=1000., time=0.0))
0
>>> print(num_samples(rate=1000., time=2.0))
2000
>>> print(num_samples(rate=1000., time=2.001))
2001
>>> print(num_samples(rate=1000., time=2.001, even=True))
2002
ketos.audio.utils.misc.pad_reflect(x, pad_left=0, pad_right=0)[source]

Pad array with its own (inverted) reflection along the first axis (0).

Args:
x: numpy.array

The data to be padded.

pad_left: int

Amount of padding on the left

pad_right: int

Amount of padding on the right

Returns:
x_padded: numpy.array

Padded array

Example:
>>> import numpy as np
>>> from ketos.audio.utils.misc import pad_reflect
>>> arr = np.arange(9) #create a simple array
>>> print(arr)
[0 1 2 3 4 5 6 7 8]
>>> arr = pad_reflect(arr, pad_right=3) #pad on the right
>>> print(arr)
[ 0  1  2  3  4  5  6  7  8  9 10 11]
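Note that the "inverted" reflection continues the trend of the array rather than mirroring its values: each padded sample is the point reflection of an interior sample about the edge value. A minimal NumPy sketch of the right-hand padding (an illustration, not the ketos implementation):

```python
import numpy as np

def reflect_pad_right(x, n):
    # Point-reflect the last n interior samples about the final sample:
    # pad[i] = 2 * x[-1] - x[-2 - i]
    return np.concatenate([x, 2 * x[-1] - x[-2 - np.arange(n)]])

arr = np.arange(9)
print(reflect_pad_right(arr, 3))  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
```

For arr ending in ...6, 7, 8, the samples 7, 6, 5 reflect about 8 to give 9, 10, 11, reproducing the example output above.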
ketos.audio.utils.misc.pad_zero(x, pad_left=0, pad_right=0)[source]

Pad array with zeros along the first axis (0).

Args:
x: numpy.array

The data to be padded.

pad_left: int

Amount of padding on the left

pad_right: int

Amount of padding on the right

Returns:
x_padded: numpy.array

Padded array

Example:
>>> import numpy as np
>>> from ketos.audio.utils.misc import pad_zero
>>> arr = np.arange(9) #create a simple array
>>> print(arr)
[0 1 2 3 4 5 6 7 8]
>>> arr = pad_zero(arr, pad_right=3) #pad on the right
>>> print(arr)
[0 1 2 3 4 5 6 7 8 0 0 0]
ketos.audio.utils.misc.segment(x, win_len, step_len, num_segs=None, offset_len=0, pad_mode='reflect', mem_warning=True)[source]

Divide an array into segments of equal length along its first axis (0), each segment being shifted by a fixed amount with respect to the previous segment.

If offset_len is negative the input array will be padded with its own inverted reflection on the left.

If the combined length of the segments exceeds the length of the input array (minus any positive offset), the array will be padded with its own inverted reflection on the right.

Args:
x: numpy.array

The data to be segmented

win_len: int

Window length in no. of samples

step_len: int

Step size in no. of samples

num_segs: int

Number of segments. Optional.

offset_len: int

Position of the first frame in no. of samples. Defaults to 0, if not specified.

pad_mode: str

Padding mode. Select between ‘reflect’ (default) and ‘zero’.

mem_warning: bool

Print warning if the size of the array exceeds 10% of the available memory.

Returns:
segs: numpy.array

Segmented data with shape (num_segs, win_len) + x.shape[1:]

Example:
>>> import numpy as np
>>> from ketos.audio.utils.misc import segment
>>> x = np.arange(10)
>>> print(x)
[0 1 2 3 4 5 6 7 8 9]
>>> y = segment(x, win_len=4, step_len=2, num_segs=3, offset_len=0)    
>>> print(y)
[[0 1 2 3]
 [2 3 4 5]
 [4 5 6 7]]
>>> y = segment(x, win_len=4, step_len=2, num_segs=3, offset_len=-3)    
>>> print(y)
[[-3 -2 -1  0]
 [-1  0  1  2]
 [ 1  2  3  4]]
ketos.audio.utils.misc.segment_args(rate, duration, offset, window, step)[source]

Computes input arguments for audio.utils.misc.segment() to produce a centered spectrogram with properties as close as possible to those specified.

Args:
rate: float

Sampling rate in Hz

duration: float

Duration in seconds

offset: float

Offset in seconds

window: float

Window size in seconds

step: float

Step size in seconds

Returns:
: dict
Dictionary with following keys and values:
  • win_len: Window size in number of samples (int)

  • step_len: Step size in number of samples (int)

  • num_segs: Number of steps (int)

  • offset_len: Offset in number of samples (int)

Example:
>>> from ketos.audio.utils.misc import segment_args
>>> args = segment_args(rate=1000., duration=3., offset=0., window=0.1, step=0.02)
>>> for key,value in sorted(args.items()):
...     print(key,':',value)
num_segs : 150
offset_len : -40
step_len : 20
win_len : 100
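The returned values can be reproduced with plain arithmetic. The centering formula below (shifting the first window left by half the non-overlapping portion) is inferred from the example output, not taken from the ketos source, so treat it as an illustration:

```python
# Values from the example above
rate, duration, offset, window, step = 1000., 3., 0., 0.1, 0.02

step_len = round(rate * step)      # step size in samples
win_len = round(rate * window)     # window size in samples
num_segs = round(duration / step)  # number of steps
# Assumed centering rule: shift the first window left by half
# the difference between window length and step length
offset_len = round(rate * offset) - (win_len - step_len) // 2

print(num_segs, offset_len, step_len, win_len)  # 150 -40 20 100
```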
ketos.audio.utils.misc.spec2wave(image, phase_angle, num_fft, step_len, num_iters, window_func)[source]

Estimate audio signal from magnitude spectrogram.

Implements the algorithm described in

    D. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol. 32, no. 2, pp. 236–243, Apr. 1984.

Closely follows the implementation at https://github.com/tensorflow/magenta/blob/master/magenta/models/nsynth/utils.py

Args:
image: 2d numpy array

Magnitude spectrogram, linear scale

phase_angle: float

Initial condition for phase in degrees

num_fft: int

Number of points used for the Fast-Fourier Transform. Same as window size.

step_len: int

Step size.

num_iters: int

Number of iterations to perform.

window_func: string, tuple, number, function, np.ndarray [shape=(num_fft,)]
  • a window specification (string, tuple, or number); see scipy.signal.get_window

  • a window function, such as scipy.signal.hamming

  • a user-specified window vector of length num_fft

Returns:
audio: 1d numpy array

Audio signal

Example:
>>> #Create a simple sinusoidal audio signal with frequency of 10 Hz
>>> import numpy as np
>>> x = np.arange(1000)
>>> audio = 32600 * np.sin(2 * np.pi * 10 * x / 1000) 
>>> #Compute the Short Time Fourier Transform of the audio signal 
>>> #using a window size of 200, step size of 40, and a Hamming window,
>>> from ketos.audio.utils.misc import stft
>>> win_fun = 'hamming'
>>> mag, freq_max, num_fft, _ = stft(x=audio, rate=1000, seg_args={'win_len':200, 'step_len':40}, window_func=win_fun)
>>> #Estimate the original audio signal            
>>> from ketos.audio.utils.misc import spec2wave
>>> audio_est = spec2wave(image=mag, phase_angle=0, num_fft=num_fft, step_len=40, num_iters=25, window_func=win_fun)
>>> #plot the original and the estimated audio signal
>>> import matplotlib.pyplot as plt
>>> plt.clf()
>>> _ = plt.plot(audio)
>>> plt.savefig("ketos/tests/assets/tmp/sig_orig.png")
>>> _ = plt.plot(audio_est)
>>> plt.savefig("ketos/tests/assets/tmp/sig_est.png")
ketos.audio.utils.misc.stft(x, rate, window=None, step=None, seg_args=None, window_func='hamming', decibel=True)[source]

Compute Short Time Fourier Transform (STFT).

Uses audio.utils.misc.segment_args() to convert the window size and step size into an even integer number of samples.

The number of points used for the Fourier Transform is equal to the number of samples in the window.

Args:
x: numpy.array

Audio signal

rate: float

Sampling rate in Hz

window: float

Window length in seconds

step: float

Step size in seconds

seg_args: dict

Input arguments for audio.utils.misc.segment_args(). Optional. If specified, the arguments window and step are ignored.

window_func: str
Window function (optional). Select between
  • bartlett

  • blackman

  • hamming (default)

  • hanning

decibel: bool

Convert to dB scale

Returns:
img: numpy.array

Short Time Fourier Transform of the input signal.

freq_max: float

Maximum frequency in Hz

num_fft: int

Number of points used for the Fourier Transform.

seg_args: dict

Input arguments used for evaluating audio.utils.misc.segment_args().
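The core computation can be illustrated with NumPy alone. The sketch below is a simplified magnitude STFT (window segmentation followed by an FFT of each windowed segment, with the number of FFT points equal to the window length); it omits the padding and argument adjustment performed by the ketos implementation:

```python
import numpy as np

def simple_stft(x, win_len, step_len, decibel=True):
    """Magnitude STFT of a 1-d signal; num_fft equals win_len."""
    num_segs = 1 + (len(x) - win_len) // step_len
    window = np.hamming(win_len)
    # Stack overlapping segments and apply the window to each
    segs = np.stack([x[i * step_len: i * step_len + win_len]
                     for i in range(num_segs)])
    mag = np.abs(np.fft.rfft(segs * window, axis=1))
    if decibel:
        mag = 20 * np.log10(mag + 1e-12)  # small constant avoids log(0)
    return mag

rate = 1000
t = np.arange(1000) / rate
audio = np.sin(2 * np.pi * 10 * t)  # 10 Hz tone
img = simple_stft(audio, win_len=200, step_len=40)
print(img.shape)  # (21, 101): 21 frames, 200 // 2 + 1 frequency bins
```

With a 200-sample window at 1 kHz, the frequency resolution is 5 Hz per bin, so the 10 Hz tone peaks in bin 2 of each frame.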

ketos.audio.utils.misc.to_decibel(x)[source]
Convert any data array, x, typically a spectrogram, from linear scale to decibel scale by applying the operation 20\log_{10}(x).

Args:
x: numpy array

Input array

Returns:
y: numpy array

Converted array

Example:
>>> import numpy as np
>>> from ketos.audio.utils.misc import to_decibel 
>>> img = np.array([[10., 20.],[30., 40.]])
>>> img_db = to_decibel(img)
>>> img_db = np.around(img_db, decimals=2) # only keep up to two decimals
>>> print(img_db)
[[20.   26.02]
 [29.54 32.04]]