Demo notebook for analyzing audio data

1. Introduction

Audio data - as recorded by smartphones or other portable devices - can carry important information about individuals’ environments, which may give insights into activity, sleep, and social interaction. However, using these data can be tricky due to privacy concerns; conversations, for example, are highly identifiable. A possible solution is to compute more general characteristics (e.g. frequency) and extract features from those instead. To address this, niimpy includes the function extract_features_audio to clean, downsample, and extract features from audio snippets that have already been anonymized. This function employs other functions to extract the following features:

  • audio_count_silent: number of times the environment has been silent (i.e. the snippet was flagged as silent, below the decibel threshold)

  • audio_count_speech: number of times when there has been some sound in the environment that matches the range of human speech frequency (65 - 255 Hz); a conceptual sketch of this computation follows the list

  • audio_count_loud: number of times when there has been some sound in the environment above 70 dB

  • audio_min_freq: minimum frequency of the recorded audio snippets

  • audio_max_freq: maximum frequency of the recorded audio snippets

  • audio_mean_freq: mean frequency of the recorded audio snippets

  • audio_median_freq: median frequency of the recorded audio snippets

  • audio_std_freq: standard deviation of the frequency of the recorded audio snippets

  • audio_min_db: minimum decibels of the recorded audio snippets

  • audio_max_db: maximum decibels of the recorded audio snippets

  • audio_mean_db: mean decibels of the recorded audio snippets

  • audio_median_db: median decibels of the recorded audio snippets

  • audio_std_db: standard deviation of the decibels of the recorded audio snippets
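To make the speech count more concrete, here is a conceptual, pandas-only sketch of how such a count could be derived from anonymized snippets. This is not niimpy’s implementation; it only assumes the column names used later in this notebook (double_frequency and is_silent) and a timestamp index.

import pandas as pd

def count_speech_sketch(df, rule="30T"):
    # A snippet counts as speech if its dominant frequency falls within the
    # human speech range (65-255 Hz) and the environment was not silent.
    # (For multi-user data, group by user before resampling.)
    speech = df["double_frequency"].between(65, 255) & (df["is_silent"] == 0)
    return speech.resample(rule).sum()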

In the following, we will analyze audio snippets provided by niimpy as an example to illustrate the use of niimpy’s audio preprocessing functions.

2. Read data

Let’s start by reading the example data provided with niimpy. These data have already been shaped into a format that meets the requirements of the data schema. First, let’s import the needed modules: the niimpy package itself, and then the module we will use (audio), which we give a short name for convenience.

[1]:
import niimpy
from niimpy import config
import niimpy.preprocessing.audio as au
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Now let’s read the example data provided in niimpy. The example data is in csv format, so we need to use the read_csv function. When reading the data, we can specify the timezone where the data was collected with the argument tz; this makes handling daylight saving time easier. The output is a dataframe. We can also check the number of rows and columns in the dataframe.

[2]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_AUDIO_PATH, tz='Europe/Helsinki')
data.shape
[2]:
(33, 7)

The data was successfully read. We can see that there are 33 datapoints with 7 columns in the dataset. However, we do not know yet what the data really looks like, so let’s have a quick look:

[3]:
data.head()
[3]:
user device time is_silent double_decibels double_frequency datetime
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00
2020-01-09 03:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578534e+09 0 77 9054 2020-01-09 03:38:03.896000+02:00
2020-01-09 04:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578536e+09 0 80 12265 2020-01-09 04:08:03.896000+02:00
[4]:
data.tail()
[4]:
user device time is_silent double_decibels double_frequency datetime
2019-08-13 15:02:17.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565698e+09 1 44 2914 2019-08-13 15:02:17.657999872+03:00
2019-08-13 15:28:59.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565699e+09 1 49 7195 2019-08-13 15:28:59.657999872+03:00
2019-08-13 15:59:01.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565701e+09 0 55 91 2019-08-13 15:59:01.657999872+03:00
2019-08-13 16:29:03.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565703e+09 0 76 3853 2019-08-13 16:29:03.657999872+03:00
2019-08-13 16:59:05.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565705e+09 0 84 7419 2019-08-13 16:59:05.657999872+03:00

By exploring the head and tail of the dataframe we can get a picture of the whole dataset. From the data, we can see that:

  • rows are observations, indexed by timestamps, i.e. each row represents a snippet that has been recorded at a given time and date

  • columns are characteristics for each observation, for example, the user whose data we are analyzing

  • there are at least two different users in the dataframe

  • there are two main columns: double_decibels and double_frequency.

In fact, we can check the first three elements for each user.

[5]:
data.drop_duplicates(['user','time']).groupby('user').head(3)
[5]:
user device time is_silent double_decibels double_frequency datetime
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 0 51 7735 2019-08-13 07:28:27.657999872+03:00
2019-08-13 07:58:29.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 0 90 13609 2019-08-13 07:58:29.657999872+03:00
2019-08-13 08:28:31.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 0 81 7690 2019-08-13 08:28:31.657999872+03:00

Sometimes the data may come in a disordered manner, so, just to make sure, let’s sort the dataframe and compare the results. We will sort by the columns “user” and “datetime”, since we want to order the information first by participant and then chronologically. Luckily, in our dataframe, the index and the datetime column are the same.

[6]:
data.sort_values(by=['user', 'datetime'], inplace=True)
data.drop_duplicates(['user','time']).groupby('user').head(3)
[6]:
user device time is_silent double_decibels double_frequency datetime
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 0 51 7735 2019-08-13 07:28:27.657999872+03:00
2019-08-13 07:58:29.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 0 90 13609 2019-08-13 07:58:29.657999872+03:00
2019-08-13 08:28:31.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 0 81 7690 2019-08-13 08:28:31.657999872+03:00
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00

Ok, it seems our dataframe was already in order. We can start extracting features. However, we need to understand the data format requirements first.

* TIP! Data format requirements (or what should our data look like)

Data can take other shapes and formats. However, the niimpy data schema requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:

1. One row per audio snippet. Each row should store information about one snippet only.

2. Each row’s index should be a timestamp.

3. The following five columns are required:

  • index: date and time when the snippet was recorded (timestamp)

  • user: stores the name of the user whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)

  • is_silent: stores whether the decibel level is below a set threshold (usually 50 dB)

  • double_decibels: stores the decibel level of the recorded snippet

  • double_frequency: stores the frequency of the recorded snippet in Hz

  • NOTE: most of our audio examples come from data recorded with the Aware Framework. If you want to know more about the frequency and decibel values, please read https://github.com/denzilferreira/com.aware.plugin.ambient_noise

4. Additional columns are allowed.

5. The names of the columns do not need to be exactly “user”, “is_silent”, “double_decibels”, or “double_frequency”, as we can pass our own names in an argument (to be explained later).

Below is an example of a dataframe that complies with these minimum requirements

[7]:
example_dataschema = data[['user','is_silent','double_decibels','double_frequency']]
example_dataschema.head(3)
[7]:
user is_silent double_decibels double_frequency
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u 0 51 7735
2019-08-13 07:58:29.657999872+03:00 iGyXetHE3S8u 0 90 13609
2019-08-13 08:28:31.657999872+03:00 iGyXetHE3S8u 0 81 7690
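For completeness, here is a minimal, hand-built dataframe that satisfies the same schema. The user hash and measurement values below are hypothetical, chosen only to illustrate the required shape.

import pandas as pd

index = pd.DatetimeIndex(["2020-01-09 02:08", "2020-01-09 02:38"], tz="Europe/Helsinki")
minimal = pd.DataFrame({
    "user": ["user_1", "user_1"],        # one unique name or hash per user
    "is_silent": [0, 1],                 # 1 if below the silence threshold
    "double_decibels": [84, 44],         # decibel level of the snippet
    "double_frequency": [4935, 2914],    # dominant frequency in Hz
}, index=index)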

4. Extracting features

There are two ways to extract features. We could use each function separately or we could use niimpy’s ready-made wrapper. Both ways will require us to specify arguments to pass to the functions/wrapper in order to customize the way the functions work. These arguments are specified in dictionaries. Let’s first understand how to extract features using stand-alone functions.

4.1 Extract features using stand-alone functions

We can use niimpy’s functions to compute audio features. Each function requires two inputs:

  • (mandatory) a dataframe that complies with the minimum requirements (see the * TIP! Data format requirements section above)

  • (optional) an argument dictionary for stand-alone functions

4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works)

In this dictionary, we can input two main parameters to customize the way a stand-alone function works:

  • the name of the column to be preprocessed: since the dataframe may have several columns, we need to specify which column holds the data we would like to preprocess. To do so, we simply pass the name of the column to the argument audio_column_name.

  • the way we resample: resampling options are specified in niimpy as a dictionary. niimpy’s resampling and aggregating relies on pandas.DataFrame.resample, so mastering this pandas function will help us greatly with niimpy’s preprocessing. Please familiarize yourself with the pandas resample function before continuing. Briefly, to use pandas.DataFrame.resample, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15 seconds, 30 minutes, 1 hour). Nevertheless, we can pass more details to the function to specify the exact sampling we would like. For example, we could use the closed argument to specify which side of each interval is closed, or the offset argument to start our binning with an offset, etc. There are plenty of options, so we strongly recommend having the pandas.DataFrame.resample documentation at hand. All arguments for pandas.DataFrame.resample are specified in a dictionary whose keys are the argument names and whose values are the values for each selected argument. This dictionary is passed as the value of the key resample_args in niimpy. A quick pandas-only illustration follows below.
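The following pandas-only snippet illustrates the rule, origin, and offset arguments on toy data; it is not niimpy-specific.

import pandas as pd

s = pd.Series(range(6), index=pd.date_range("2020-01-09 02:08", periods=6, freq="30T"))
s.resample("1H").sum()                  # rule only: hourly bins aligned to the hour
s.resample("45T", origin="end").sum()   # 45-minute bins counted back from the last timestamp
s.resample("1H", offset="5min").sum()   # hourly bins shifted forward by 5 minutes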

Let’s see some basic examples of these dictionaries:

[8]:
feature_dict1 = {"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}}
feature_dict2 = {"audio_column_name":"random_name","resample_args":{"rule":"30T"}}
feature_dict3 = {"audio_column_name":"other_name","resample_args":{"rule":"45T","origin":"end"}}

Here, we have three basic feature dictionaries.

  • feature_dict1 will be used to analyze the data stored in the column double_frequency in our dataframe. The data will be binned in one-day periods.

  • feature_dict2 will be used to analyze the data stored in the column random_name in our dataframe. The data will be aggregated in 30-minute bins.

  • feature_dict3 will be used to analyze the data stored in the column other_name in our dataframe. The data will be binned in 45-minute bins, but the binning will start from the last timestamp in the dataframe.

Default values: if no arguments are passed, niimpy will aggregate the data in 30-minute bins and select the audio_column_name most suitable for the feature at hand. For example, if we are computing the minimum frequency, niimpy will select double_frequency as the column name. A minimal sketch of such a default call follows.
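For instance, a stand-alone call relying entirely on the defaults could look like the line below; this is a sketch that assumes the function accepts an empty argument dictionary.

my_min_freq = au.audio_min_freq(data, {})   # defaults: double_frequency column, 30-minute bins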

4.1.2 Using the functions

Now that we understand how the functions are customized, it is time to compute our first audio feature. Suppose we are interested in extracting the total number of times our recordings were loud within every 50-minute window. We will need niimpy’s audio_count_loud function, the data, and a dictionary to customize the function. Let’s create the dictionary first.

[9]:
function_features={"audio_column_name":"double_decibels","resample_args":{"rule":"50T"}}

Now let’s use the function to preprocess the data.

[10]:
my_loud_times = au.audio_count_loud(data, function_features)

Let’s look at some values for one of the subjects.

[11]:
my_loud_times[my_loud_times["user"]=="jd9INuQ5BBlW"]
[11]:
user device audio_count_loud
datetime
2020-01-09 01:40:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 02:30:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 03:20:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 04:10:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 05:00:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 05:50:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 06:40:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 07:30:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 08:20:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 09:10:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 10:00:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 2

Let’s recall what the original data looks like for this subject.

[12]:
data[data["user"]=="jd9INuQ5BBlW"].head(7)
[12]:
user device time is_silent double_decibels double_frequency datetime
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00
2020-01-09 03:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578534e+09 0 77 9054 2020-01-09 03:38:03.896000+02:00
2020-01-09 04:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578536e+09 0 80 12265 2020-01-09 04:08:03.896000+02:00
2020-01-09 04:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578537e+09 0 52 7281 2020-01-09 04:38:03.896000+02:00
2020-01-09 05:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578539e+09 0 63 14408 2020-01-09 05:08:03.896000+02:00

We see that the bins are indeed 50-minute bins; however, they are aligned to fixed, predetermined intervals, i.e. the binning does not start at the time of the first datapoint. Instead, pandas starts the binning at 00:00:00 of each day and counts 50-minute intervals from there.

If we want the binning to start from the first datapoint in our dataset, we need the origin parameter and a for loop.

[13]:
users = list(data['user'].unique())
results = []
for user in users:
    # Start each user's binning at their own first timestamp
    start_time = data[data["user"]==user].index.min()
    function_features={"audio_column_name":"double_decibels","resample_args":{"rule":"50T","origin":start_time}}
    results.append(au.audio_count_loud(data[data["user"]==user], function_features))
my_loud_times = pd.concat(results)
[14]:
my_loud_times
[14]:
user device audio_count_loud
datetime
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 08:18:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 09:08:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 09:58:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 10:48:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 11:38:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 12:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 13:18:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0
2019-08-13 14:08:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 14:58:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 15:48:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 16:38:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 02:58:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 03:48:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 04:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 05:28:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 06:18:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 07:08:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 2
2020-01-09 07:58:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 0
2020-01-09 08:48:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 09:38:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 2
2020-01-09 10:28:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1

4.2 Extract features using the wrapper

We can use niimpy’s ready-made wrapper to extract one or several features at the same time. The wrapper requires two inputs:

  • (mandatory) a dataframe that complies with the minimum requirements (see the * TIP! Data format requirements section above)

  • (optional) an argument dictionary for the wrapper

4.2.1 The argument dictionary for wrapper (or how we specify the way the wrapper works)

This argument dictionary will use dictionaries created for stand-alone functions. If you do not know how to create those argument dictionaries, please read the section 4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works) first.

The wrapper dictionary is simple. Its keys are the functions of the features we want to compute, and its values are the argument dictionaries created for each of these stand-alone functions. Let’s see some examples of wrapper dictionaries:

[15]:
wrapper_features1 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1D"}},
                     au.audio_max_freq:{"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}}}
  • wrapper_features1 will be used to analyze two features, audio_count_loud and audio_max_freq. For audio_count_loud, we will use the data stored in the column double_decibels of our dataframe, binned in one-day periods. For audio_max_freq, we will use the data stored in the column double_frequency, also binned in one-day periods.

[16]:
wrapper_features2 = {au.audio_mean_db:{"audio_column_name":"random_name","resample_args":{"rule":"1D"}},
                     au.audio_count_speech:{"audio_column_name":"double_decibels", "audio_freq_name":"double_frequency", "resample_args":{"rule":"5H","offset":"5min"}}}
  • wrapper_features2 will be used to analyze two features, audio_mean_db and audio_count_speech. For audio_mean_db, we will use the data stored in the column random_name, binned in one-day periods. For audio_count_speech, we will use the data stored in the column double_decibels, binned in 5-hour periods with a 5-minute offset. Note that for this feature we also pass a frequency column via the audio_freq_name argument, because speech is defined not only by the amplitude of the recording but also by its frequency range.

[17]:
wrapper_features3 = {au.audio_mean_db:{"audio_column_name":"one_name","resample_args":{"rule":"1D","offset":"5min"}},
                     au.audio_min_freq:{"audio_column_name":"one_name","resample_args":{"rule":"5H"}},
                     au.audio_count_silent:{"audio_column_name":"another_name","resample_args":{"rule":"30T","origin":"end_day"}}}
  • wrapper_features3 will be used to analyze three features: audio_mean_db, audio_min_freq, and audio_count_silent. For audio_mean_db, we will use the data stored in the column one_name, binned in one-day periods with a 5-minute offset. For audio_min_freq, we will use the data stored in the column one_name, binned in 5-hour periods. Finally, for audio_count_silent, we will use the data stored in the column another_name, binned in 30-minute periods whose origin is the ceiling midnight of the last day.

Default values: if no arguments are passed, niimpy’s default value for the audio_column_name is “double_decibels”, “double_frequency”, or “is_silent”, depending on the function called, and the data is aggregated in 30-minute bins. Moreover, in the absence of the argument dictionary, the wrapper will compute all available functions.

4.2.2 Using the wrapper

Now that we understand how the wrapper is customized, it is time to compute our first audio feature using the wrapper. Suppose we are interested in extracting the audio_count_loud feature every 50 minutes. We will need niimpy’s extract_features_audio function, the data, and a dictionary to customize the wrapper. Let’s create the dictionary first.

[18]:
wrapper_features1 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"50T"}}}

Now, let’s use the wrapper

[19]:
results_wrapper = au.extract_features_audio(data, features=wrapper_features1)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
[19]:
user device audio_count_loud
datetime
2019-08-13 06:40:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 07:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 08:20:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 09:10:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 10:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1

Our first attempt was successful. Now, let’s try something more: let’s assume we want to compute audio_count_loud and audio_min_freq in 1-hour bins.

[20]:
wrapper_features2 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1H"}},
                     au.audio_min_freq:{"audio_column_name":"double_frequency", "resample_args":{"rule":"1H"}}}
results_wrapper = au.extract_features_audio(data, features=wrapper_features2)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
[20]:
user device audio_count_loud audio_min_freq
datetime
2019-08-13 07:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 7735.0
2019-08-13 08:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 7690.0
2019-08-13 09:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 756.0
2019-08-13 10:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 3059.0
2019-08-13 11:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 12278.0

Great! Another successful attempt. We see from the results that more columns were added with the required calculations. This is how the wrapper works when all features are computed with the same bins. Now, let’s see how the wrapper performs when each function has different binning requirements. Let’s assume we need to compute the audio_count_loud every day, and the audio_min_freq every 5 hours with an offset of 5 minutes.

[21]:
wrapper_features3 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1D"}},
                     au.audio_min_freq:{"audio_column_name":"double_frequency", "resample_args":{"rule":"5H", "offset":"5min"}}}
results_wrapper = au.extract_features_audio(data, features=wrapper_features3)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
[21]:
user device audio_count_loud audio_min_freq
datetime
2019-08-13 00:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 17.0 NaN
2020-01-09 00:00:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 10.0 NaN
2020-01-09 00:00:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 6.0 NaN
2019-08-13 05:05:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs NaN 756.0
2019-08-13 10:05:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs NaN 2914.0

The output is once again a dataframe. In this case, two aggregations are shown: the daily aggregation computed for the audio_count_loud feature, and the 5-hour aggregation with a 5-minute offset for audio_min_freq. Note that because the audio_min_freq feature is not aggregated daily, it shows NaN values at the daily timestamps; similarly, because audio_count_loud is not aggregated in 5-hour windows, it shows NaN values at those timestamps. A small post-processing sketch follows.
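If we later want each aggregation on its own, one simple post-processing option (a sketch, not part of niimpy) is to keep only the rows where the relevant feature is not NaN:

# Rows belonging to the daily audio_count_loud aggregation
daily_loud = results_wrapper.dropna(subset=["audio_count_loud"])
# Rows belonging to the 5-hour audio_min_freq aggregation
min_freq_5h = results_wrapper.dropna(subset=["audio_min_freq"])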

4.2.3 Wrapper and its default option

The default option will compute all features in 30-minute aggregation windows. To use the extract_features_audio function with its default options, simply call the function.

[22]:
default = au.extract_features_audio(data, features=None)
computing <function audio_count_silent at 0x7f5c3a65f420>...
computing <function audio_count_speech at 0x7f5c3a65f4c0>...
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
computing <function audio_max_freq at 0x7f5c3a65f6a0>...
computing <function audio_mean_freq at 0x7f5c3a65f740>...
computing <function audio_median_freq at 0x7f5c3a65f7e0>...
computing <function audio_std_freq at 0x7f5c3a65f880>...
computing <function audio_min_db at 0x7f5c3a65f920>...
computing <function audio_max_db at 0x7f5c3a65f9c0>...
computing <function audio_mean_db at 0x7f5c3a65fa60>...
computing <function audio_median_db at 0x7f5c3a65fb00>...
computing <function audio_std_db at 0x7f5c3a65fba0>...
[23]:
default.head()
[23]:
user device audio_count_silent audio_count_speech audio_count_loud audio_min_freq audio_max_freq audio_mean_freq audio_median_freq audio_std_freq audio_min_db audio_max_db audio_mean_db audio_median_db audio_std_db
datetime
2019-08-13 07:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 7735.0 7735.0 7735.0 7735.0 NaN 51.0 51.0 51.0 51.0 NaN
2019-08-13 07:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 13609.0 13609.0 13609.0 13609.0 NaN 90.0 90.0 90.0 90.0 NaN
2019-08-13 08:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 7690.0 7690.0 7690.0 7690.0 NaN 81.0 81.0 81.0 81.0 NaN
2019-08-13 08:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 8347.0 8347.0 8347.0 8347.0 NaN 58.0 58.0 58.0 58.0 NaN
2019-08-13 09:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1 NaN 1 13592.0 13592.0 13592.0 13592.0 NaN 36.0 36.0 36.0 36.0 NaN

5. Implementing own features

If none of the provided functions suits our needs, we can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by timestamps and contain the user as a column (see the example output below). To make the feature readily available in the default options, we need to add the audio prefix to the new function’s name (e.g. audio_my_new_feature). Let’s assume we need a new function that sums all frequencies. Let’s first define the function.

[24]:
def audio_sum_freq(df, config=None):
    # Fall back to sensible defaults when no configuration is given
    if config is None:
        config = {}
    col_name = config.get("audio_column_name", "double_frequency")
    if "resample_args" not in config:
        config["resample_args"] = {"rule": "30T"}

    if len(df) > 0:
        # Sum each user's frequencies within every resampled bin
        result = df.groupby('user')[col_name].resample(**config["resample_args"]).sum()
        result = result.to_frame(name='audio_sum_freq')
        result = result.reset_index("user")
        result.index.rename("datetime", inplace=True)
        return result
    return None

Then, we can call our new function either in the stand-alone way or through the extract_features_audio function. Because the stand-alone way is the usual way to call functions in Python, we will not show it. Instead, we will show how to integrate this new function with the wrapper. Let’s use the same data and assume we want the default behavior of the wrapper.

[25]:
customized_features = au.extract_features_audio(data, features={audio_sum_freq: {}})
computing <function audio_sum_freq at 0x7f5c683977e0>...
[26]:
customized_features.head()
[26]:
user audio_sum_freq
datetime
2019-08-13 07:00:00+03:00 iGyXetHE3S8u 7735
2019-08-13 07:30:00+03:00 iGyXetHE3S8u 13609
2019-08-13 08:00:00+03:00 iGyXetHE3S8u 7690
2019-08-13 08:30:00+03:00 iGyXetHE3S8u 8347
2019-08-13 09:00:00+03:00 iGyXetHE3S8u 13592