Demo notebook for analyzing audio data

1. Introduction

Audio data - as recorded by smartphones or other portable devices - can carry important information about individuals’ environments, which may give insights into activity, sleep, and social interaction. However, using these data can be tricky due to privacy concerns; conversations, for example, are highly identifiable. A possible solution is to compute more general characteristics (e.g. frequency) and extract features from those instead. To address this, niimpy includes the function extract_features_audio to clean, downsample, and extract features from audio snippets that have already been anonymized. This function employs other functions to extract the following features:

  • audio_count_silent: number of times the environment has been silent (i.e. the snippet was flagged as silent, below the decibel threshold)

  • audio_count_speech: number of times when there has been some sound in the environment that matches the range of human speech frequency (65 - 255 Hz); a conceptual sketch of this computation follows the list

  • audio_count_loud: number of times when there has been some sound in the environment above 70 dB

  • audio_min_freq: minimum frequency of the recorded audio snippets

  • audio_max_freq: maximum frequency of the recorded audio snippets

  • audio_mean_freq: mean frequency of the recorded audio snippets

  • audio_median_freq: median frequency of the recorded audio snippets

  • audio_std_freq: standard deviation of the frequency of the recorded audio snippets

  • audio_min_db: minimum decibels of the recorded audio snippets

  • audio_max_db: maximum decibels of the recorded audio snippets

  • audio_mean_db: mean decibels of the recorded audio snippets

  • audio_median_db: median decibels of the recorded audio snippets

  • audio_std_db: standard deviation of the decibels of the recorded audio snippets
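To make the speech count more concrete, here is a conceptual, pandas-only sketch of how such a count could be derived from anonymized snippets. This is not niimpy’s implementation; it only assumes the column names used later in this notebook (double_frequency and is_silent) and a timestamp index.

import pandas as pd

def count_speech_sketch(df, rule="30T"):
    # A snippet counts as speech if its dominant frequency falls within the
    # human speech range (65-255 Hz) and the environment was not silent.
    # (For multi-user data, group by user before resampling.)
    speech = df["double_frequency"].between(65, 255) & (df["is_silent"] == 0)
    return speech.resample(rule).sum()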

In the following, we will analyze audio snippets provided by niimpy as an example to illustrate the use of niimpy’s audio preprocessing functions.

2. Read data

Let’s start by reading the example data provided with niimpy. These data have already been shaped into a format that meets the requirements of the data schema. First, let’s import the needed modules: the niimpy package itself, and then the module we will use (audio), which we give a short name for convenience.

[1]:
import niimpy
from niimpy import config
import niimpy.preprocessing.audio as au
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Now let’s read the example data provided in niimpy. The example data is in csv format, so we need to use the read_csv function. When reading the data, we can specify the timezone where the data was collected with the argument tz; this makes handling daylight saving time easier. The output is a dataframe. We can also check the number of rows and columns in the dataframe.

[2]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_AUDIO_PATH, tz='Europe/Helsinki')
data.shape
[2]:
(33, 7)

The data was successfully read. We can see that there are 33 datapoints with 7 columns in the dataset. However, we do not know yet what the data really looks like, so let’s have a quick look:

[3]:
data.head()
[3]:
user device time is_silent double_decibels double_frequency datetime
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00
2020-01-09 03:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578534e+09 0 77 9054 2020-01-09 03:38:03.896000+02:00
2020-01-09 04:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578536e+09 0 80 12265 2020-01-09 04:08:03.896000+02:00
[4]:
data.tail()
[4]:
user device time is_silent double_decibels double_frequency datetime
2019-08-13 15:02:17.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565698e+09 1 44 2914 2019-08-13 15:02:17.657999872+03:00
2019-08-13 15:28:59.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565699e+09 1 49 7195 2019-08-13 15:28:59.657999872+03:00
2019-08-13 15:59:01.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565701e+09 0 55 91 2019-08-13 15:59:01.657999872+03:00
2019-08-13 16:29:03.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565703e+09 0 76 3853 2019-08-13 16:29:03.657999872+03:00
2019-08-13 16:59:05.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565705e+09 0 84 7419 2019-08-13 16:59:05.657999872+03:00

By exploring the head and tail of the dataframe we can get a picture of the whole dataset. From the data, we can see that:

  • rows are observations, indexed by timestamps, i.e. each row represents a snippet that has been recorded at a given time and date

  • columns are characteristics for each observation, for example, the user whose data we are analyzing

  • there are at least two different users in the dataframe

  • there are two main columns: double_decibels and double_frequency.

In fact, we can check the first three elements for each user.

[5]:
data.drop_duplicates(['user','time']).groupby('user').head(3)
[5]:
user device time is_silent double_decibels double_frequency datetime
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 0 51 7735 2019-08-13 07:28:27.657999872+03:00
2019-08-13 07:58:29.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 0 90 13609 2019-08-13 07:58:29.657999872+03:00
2019-08-13 08:28:31.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 0 81 7690 2019-08-13 08:28:31.657999872+03:00

Sometimes the data may come in a disordered manner, so, just to make sure, let’s sort the dataframe and compare the results. We will sort by the columns “user” and “datetime”, since we want to order the information first by participant and then chronologically. Luckily, in our dataframe, the index and the datetime column are the same.

[6]:
data.sort_values(by=['user', 'datetime'], inplace=True)
data.drop_duplicates(['user','time']).groupby('user').head(3)
[6]:
user device time is_silent double_decibels double_frequency datetime
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565671e+09 0 51 7735 2019-08-13 07:28:27.657999872+03:00
2019-08-13 07:58:29.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565672e+09 0 90 13609 2019-08-13 07:58:29.657999872+03:00
2019-08-13 08:28:31.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.565674e+09 0 81 7690 2019-08-13 08:28:31.657999872+03:00
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00

Ok, it seems our dataframe was already in order. We can start extracting features. However, we need to understand the data format requirements first.

* TIP! Data format requirements (or what should our data look like)

Data can take other shapes and formats. However, the niimpy data schema requires it to be in a certain shape. This means the dataframe needs to have at least the following characteristics:

1. One row per audio snippet. Each row should store information about one snippet only.

2. Each row’s index should be a timestamp.

3. The following five columns are required:

  • index: date and time when the snippet was recorded (timestamp)

  • user: stores the name of the user whose data is analyzed. Each user should have a unique name or hash (i.e. one hash for each unique user)

  • is_silent: stores whether the decibel level is below a set threshold (usually 50 dB)

  • double_decibels: stores the decibel level of the recorded snippet

  • double_frequency: stores the frequency of the recorded snippet in Hz

  • NOTE: most of our audio examples come from data recorded with the Aware Framework. If you want to know more about the frequency and decibel values, please read https://github.com/denzilferreira/com.aware.plugin.ambient_noise

4. Additional columns are allowed.

5. The names of the columns do not need to be exactly “user”, “is_silent”, “double_decibels”, or “double_frequency”, as we can pass our own names in an argument (to be explained later).

Below is an example of a dataframe that complies with these minimum requirements

[7]:
example_dataschema = data[['user','is_silent','double_decibels','double_frequency']]
example_dataschema.head(3)
[7]:
user is_silent double_decibels double_frequency
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u 0 51 7735
2019-08-13 07:58:29.657999872+03:00 iGyXetHE3S8u 0 90 13609
2019-08-13 08:28:31.657999872+03:00 iGyXetHE3S8u 0 81 7690
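For completeness, here is a minimal, hand-built dataframe that satisfies the same schema. The user hash and measurement values below are hypothetical, chosen only to illustrate the required shape.

import pandas as pd

index = pd.DatetimeIndex(["2020-01-09 02:08", "2020-01-09 02:38"], tz="Europe/Helsinki")
minimal = pd.DataFrame({
    "user": ["user_1", "user_1"],        # one unique name or hash per user
    "is_silent": [0, 1],                 # 1 if below the silence threshold
    "double_decibels": [84, 44],         # decibel level of the snippet
    "double_frequency": [4935, 2914],    # dominant frequency in Hz
}, index=index)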

4. Extracting features

There are two ways to extract features. We could use each function separately or we could use niimpy’s ready-made wrapper. Both ways will require us to specify arguments to pass to the functions/wrapper in order to customize the way the functions work. These arguments are specified in dictionaries. Let’s first understand how to extract features using stand-alone functions.

4.1 Extract features using stand-alone functions

We can use niimpy’s functions to compute audio features. Each function requires two inputs:

  • (mandatory) a dataframe that complies with the minimum requirements (see the * TIP! Data format requirements section above)

  • (optional) an argument dictionary for stand-alone functions

4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works)

In this dictionary, we can input two main parameters to customize the way a stand-alone function works:

  • the name of the column to be preprocessed: since the dataframe may have several columns, we need to specify which column holds the data we would like to preprocess. To do so, we simply pass the name of the column to the argument audio_column_name.

  • the way we resample: resampling options are specified in niimpy as a dictionary. niimpy’s resampling and aggregating relies on pandas.DataFrame.resample, so mastering this pandas function will help us greatly with niimpy’s preprocessing. Please familiarize yourself with the pandas resample function before continuing. Briefly, to use pandas.DataFrame.resample, we need a rule. This rule states the intervals we would like to use to resample our data (e.g., 15 seconds, 30 minutes, 1 hour). Nevertheless, we can pass more details to the function to specify the exact sampling we would like. For example, we could use the closed argument to specify which side of each interval is closed, or the offset argument to start our binning with an offset, etc. There are plenty of options, so we strongly recommend having the pandas.DataFrame.resample documentation at hand. All arguments for pandas.DataFrame.resample are specified in a dictionary whose keys are the argument names and whose values are the values for each selected argument. This dictionary is passed as the value of the key resample_args in niimpy. A quick pandas-only illustration follows below.
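The following pandas-only snippet illustrates the rule, origin, and offset arguments on toy data; it is not niimpy-specific.

import pandas as pd

s = pd.Series(range(6), index=pd.date_range("2020-01-09 02:08", periods=6, freq="30T"))
s.resample("1H").sum()                  # rule only: hourly bins aligned to the hour
s.resample("45T", origin="end").sum()   # 45-minute bins counted back from the last timestamp
s.resample("1H", offset="5min").sum()   # hourly bins shifted forward by 5 minutes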

Let’s see some basic examples of these dictionaries:

[8]:
feature_dict1 = {"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}}
feature_dict2 = {"audio_column_name":"random_name","resample_args":{"rule":"30T"}}
feature_dict3 = {"audio_column_name":"other_name","resample_args":{"rule":"45T","origin":"end"}}

Here, we have three basic feature dictionaries.

  • feature_dict1 will be used to analyze the data stored in the column double_frequency in our dataframe. The data will be binned in one-day periods.

  • feature_dict2 will be used to analyze the data stored in the column random_name in our dataframe. The data will be aggregated in 30-minute bins.

  • feature_dict3 will be used to analyze the data stored in the column other_name in our dataframe. The data will be binned in 45-minute bins, but the binning will start from the last timestamp in the dataframe.

Default values: if no arguments are passed, niimpy will aggregate the data in 30-minute bins and select the audio_column_name most suitable for the feature at hand. For example, if we are computing the minimum frequency, niimpy will select double_frequency as the column name. A minimal sketch of such a default call follows.
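For instance, a stand-alone call relying entirely on the defaults could look like the line below; this is a sketch that assumes the function accepts an empty argument dictionary.

my_min_freq = au.audio_min_freq(data, {})   # defaults: double_frequency column, 30-minute bins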

4.1.2 Using the functions

Now that we understand how the functions are customized, it is time to compute our first audio feature. Suppose we are interested in extracting the total number of times our recordings were loud within every 50-minute window. We will need niimpy’s audio_count_loud function, the data, and a dictionary to customize the function. Let’s create the dictionary first.

[9]:
function_features={"audio_column_name":"double_decibels","resample_args":{"rule":"50T"}}

Now let’s use the function to preprocess the data.

[10]:
my_loud_times = au.audio_count_loud(data, function_features)

Let’s look at some values for one of the subjects.

[11]:
my_loud_times[my_loud_times["user"]=="jd9INuQ5BBlW"]
[11]:
user device audio_count_loud
datetime
2020-01-09 01:40:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 02:30:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 03:20:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 04:10:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 05:00:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 05:50:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 06:40:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 07:30:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 08:20:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 09:10:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 10:00:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 2

Let’s recall what the original data looks like for this subject.

[12]:
data[data["user"]=="jd9INuQ5BBlW"].head(7)
[12]:
user device time is_silent double_decibels double_frequency datetime
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578528e+09 0 84 4935 2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 0 89 8734 2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578532e+09 0 99 1710 2020-01-09 03:08:03.896000+02:00
2020-01-09 03:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578534e+09 0 77 9054 2020-01-09 03:38:03.896000+02:00
2020-01-09 04:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578536e+09 0 80 12265 2020-01-09 04:08:03.896000+02:00
2020-01-09 04:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578537e+09 0 52 7281 2020-01-09 04:38:03.896000+02:00
2020-01-09 05:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578539e+09 0 63 14408 2020-01-09 05:08:03.896000+02:00

We see that the bins are indeed 50-minute bins; however, they are aligned to fixed, predetermined intervals, i.e. the binning does not start at the time of the first datapoint. Instead, pandas starts the binning at 00:00:00 of each day and counts 50-minute intervals from there.

If we want the binning to start from the first datapoint in our dataset, we need the origin parameter and a for loop.

[13]:
users = list(data['user'].unique())
results = []
for user in users:
    # Start each user's binning at their own first timestamp
    start_time = data[data["user"]==user].index.min()
    function_features={"audio_column_name":"double_decibels","resample_args":{"rule":"50T","origin":start_time}}
    results.append(au.audio_count_loud(data[data["user"]==user], function_features))
my_loud_times = pd.concat(results)
[14]:
my_loud_times
[14]:
user device audio_count_loud
datetime
2019-08-13 07:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 08:18:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 09:08:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 09:58:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 10:48:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 11:38:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 12:28:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 13:18:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0
2019-08-13 14:08:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 14:58:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 15:48:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 16:38:27.657999872+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2020-01-09 02:08:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 02:58:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 03:48:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 04:38:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 05:28:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 2
2020-01-09 06:18:03.896000+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1
2020-01-09 07:08:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 2
2020-01-09 07:58:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 0
2020-01-09 08:48:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1
2020-01-09 09:38:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 2
2020-01-09 10:28:03.896000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1

4.2 Extract features using the wrapper

We can use niimpy’s ready-made wrapper to extract one or several features at the same time. The wrapper requires two inputs:

  • (mandatory) a dataframe that complies with the minimum requirements (see the * TIP! Data format requirements section above)

  • (optional) an argument dictionary for the wrapper

4.2.1 The argument dictionary for wrapper (or how we specify the way the wrapper works)

This argument dictionary will use dictionaries created for stand-alone functions. If you do not know how to create those argument dictionaries, please read the section 4.1.1 The argument dictionary for stand-alone functions (or how we specify the way a function works) first.

The wrapper dictionary is simple. Its keys are the functions of the features we want to compute, and its values are the argument dictionaries created for each of these stand-alone functions. Let’s see some examples of wrapper dictionaries:

[15]:
wrapper_features1 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1D"}},
                     au.audio_max_freq:{"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}}}
  • wrapper_features1 will be used to analyze two features, audio_count_loud and audio_max_freq. For audio_count_loud, we will use the data stored in the column double_decibels of our dataframe, binned in one-day periods. For audio_max_freq, we will use the data stored in the column double_frequency, also binned in one-day periods.

[16]:
wrapper_features2 = {au.audio_mean_db:{"audio_column_name":"random_name","resample_args":{"rule":"1D"}},
                     au.audio_count_speech:{"audio_column_name":"double_decibels", "audio_freq_name":"double_frequency", "resample_args":{"rule":"5H","offset":"5min"}}}
  • wrapper_features2 will be used to analyze two features, audio_mean_db and audio_count_speech. For audio_mean_db, we will use the data stored in the column random_name, binned in one-day periods. For audio_count_speech, we will use the data stored in the column double_decibels, binned in 5-hour periods with a 5-minute offset. Note that for this feature we also pass a frequency column via the audio_freq_name argument, because speech is defined not only by the amplitude of the recording but also by its frequency range.

[17]:
wrapper_features3 = {au.audio_mean_db:{"audio_column_name":"one_name","resample_args":{"rule":"1D","offset":"5min"}},
                     au.audio_min_freq:{"audio_column_name":"one_name","resample_args":{"rule":"5H"}},
                     au.audio_count_silent:{"audio_column_name":"another_name","resample_args":{"rule":"30T","origin":"end_day"}}}
  • wrapper_features3 will be used to analyze three features: audio_mean_db, audio_min_freq, and audio_count_silent. For audio_mean_db, we will use the data stored in the column one_name, binned in one-day periods with a 5-minute offset. For audio_min_freq, we will use the data stored in the column one_name, binned in 5-hour periods. Finally, for audio_count_silent, we will use the data stored in the column another_name, binned in 30-minute periods whose origin is the ceiling midnight of the last day.

Default values: if no arguments are passed, niimpy’s default value for the audio_column_name is “double_decibels”, “double_frequency”, or “is_silent”, depending on the function called, and the data is aggregated in 30-minute bins. Moreover, in the absence of the argument dictionary, the wrapper will compute all available functions.

4.2.2 Using the wrapper

Now that we understand how the wrapper is customized, it is time to compute our first audio feature using the wrapper. Suppose we are interested in extracting the audio_count_loud feature every 50 minutes. We will need niimpy’s extract_features_audio function, the data, and a dictionary to customize the wrapper. Let’s create the dictionary first.

[18]:
wrapper_features1 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"50T"}}}

Now, let’s use the wrapper

[19]:
results_wrapper = au.extract_features_audio(data, features=wrapper_features1)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
[19]:
user device audio_count_loud
datetime
2019-08-13 06:40:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 07:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1
2019-08-13 08:20:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 09:10:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2
2019-08-13 10:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1

Our first attempt was successful. Now, let’s try something more: let’s assume we want to compute audio_count_loud and audio_min_freq in 1-hour bins.

[20]:
wrapper_features2 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1H"}},
                     au.audio_min_freq:{"audio_column_name":"double_frequency", "resample_args":{"rule":"1H"}}}
results_wrapper = au.extract_features_audio(data, features=wrapper_features2)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
[20]:
user device audio_count_loud audio_min_freq
datetime
2019-08-13 07:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 7735.0
2019-08-13 08:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 7690.0
2019-08-13 09:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 756.0
2019-08-13 10:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 3059.0
2019-08-13 11:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 2 12278.0

Great! Another successful attempt. We see from the results that more columns were added with the required calculations. This is how the wrapper works when all features are computed with the same bins. Now, let’s see how the wrapper performs when each function has different binning requirements. Let’s assume we need to compute the audio_count_loud every day, and the audio_min_freq every 5 hours with an offset of 5 minutes.

[21]:
wrapper_features3 = {au.audio_count_loud:{"audio_column_name":"double_decibels","resample_args":{"rule":"1D"}},
                     au.audio_min_freq:{"audio_column_name":"double_frequency", "resample_args":{"rule":"5H", "offset":"5min"}}}
results_wrapper = au.extract_features_audio(data, features=wrapper_features3)
results_wrapper.head(5)
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
[21]:
user device audio_count_loud audio_min_freq
datetime
2019-08-13 00:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 17.0 NaN
2020-01-09 00:00:00+02:00 jd9INuQ5BBlW 3p83yASkOb_B 10.0 NaN
2020-01-09 00:00:00+02:00 jd9INuQ5BBlW OWd1Uau8POix 6.0 NaN
2019-08-13 05:05:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs NaN 756.0
2019-08-13 10:05:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs NaN 2914.0

The output is once again a dataframe. In this case, two aggregations are shown: the daily aggregation computed for the audio_count_loud feature, and the 5-hour aggregation with a 5-minute offset for audio_min_freq. Note that because the audio_min_freq feature is not aggregated daily, it shows NaN values at the daily timestamps; similarly, because audio_count_loud is not aggregated in 5-hour windows, it shows NaN values at those timestamps. A small post-processing sketch follows.
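If we later want each aggregation on its own, one simple post-processing option (a sketch, not part of niimpy) is to keep only the rows where the relevant feature is not NaN:

# Rows belonging to the daily audio_count_loud aggregation
daily_loud = results_wrapper.dropna(subset=["audio_count_loud"])
# Rows belonging to the 5-hour audio_min_freq aggregation
min_freq_5h = results_wrapper.dropna(subset=["audio_min_freq"])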

4.2.3 Wrapper and its default option

The default option will compute all features in 30-minute aggregation windows. To use the extract_features_audio function with its default options, simply call the function.

[22]:
default = au.extract_features_audio(data, features=None)
computing <function audio_count_silent at 0x7f5c3a65f420>...
computing <function audio_count_speech at 0x7f5c3a65f4c0>...
computing <function audio_count_loud at 0x7f5c3a65f560>...
computing <function audio_min_freq at 0x7f5c3a65f600>...
computing <function audio_max_freq at 0x7f5c3a65f6a0>...
computing <function audio_mean_freq at 0x7f5c3a65f740>...
computing <function audio_median_freq at 0x7f5c3a65f7e0>...
computing <function audio_std_freq at 0x7f5c3a65f880>...
computing <function audio_min_db at 0x7f5c3a65f920>...
computing <function audio_max_db at 0x7f5c3a65f9c0>...
computing <function audio_mean_db at 0x7f5c3a65fa60>...
computing <function audio_median_db at 0x7f5c3a65fb00>...
computing <function audio_std_db at 0x7f5c3a65fba0>...
[23]:
default.head()
[23]:
user device audio_count_silent audio_count_speech audio_count_loud audio_min_freq audio_max_freq audio_mean_freq audio_median_freq audio_std_freq audio_min_db audio_max_db audio_mean_db audio_median_db audio_std_db
datetime
2019-08-13 07:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 7735.0 7735.0 7735.0 7735.0 NaN 51.0 51.0 51.0 51.0 NaN
2019-08-13 07:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 13609.0 13609.0 13609.0 13609.0 NaN 90.0 90.0 90.0 90.0 NaN
2019-08-13 08:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 7690.0 7690.0 7690.0 7690.0 NaN 81.0 81.0 81.0 81.0 NaN
2019-08-13 08:30:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 0 NaN 1 8347.0 8347.0 8347.0 8347.0 NaN 58.0 58.0 58.0 58.0 NaN
2019-08-13 09:00:00+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1 NaN 1 13592.0 13592.0 13592.0 13592.0 NaN 36.0 36.0 36.0 36.0 NaN

5. Implementing own features

If none of the provided functions suits our needs, we can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by timestamps and contain the user as a column (see the example output below). To make the feature readily available in the default options, we need to add the audio prefix to the new function’s name (e.g. audio_my_new_feature). Let’s assume we need a new function that sums all frequencies. Let’s first define the function.

[24]:
def audio_sum_freq(df, config=None):
    # Fall back to sensible defaults when no configuration is given
    if config is None:
        config = {}
    col_name = config.get("audio_column_name", "double_frequency")
    if "resample_args" not in config:
        config["resample_args"] = {"rule": "30T"}

    if len(df) > 0:
        # Sum each user's frequencies within every resampled bin
        result = df.groupby('user')[col_name].resample(**config["resample_args"]).sum()
        result = result.to_frame(name='audio_sum_freq')
        result = result.reset_index("user")
        result.index.rename("datetime", inplace=True)
        return result
    return None

Then, we can call our new function either in the stand-alone way or through the extract_features_audio function. Because the stand-alone way is the usual way to call functions in Python, we will not show it. Instead, we will show how to integrate this new function with the wrapper. Let’s use the same data and assume we want the default behavior of the wrapper.

[25]:
customized_features = au.extract_features_audio(data, features={audio_sum_freq: {}})
computing <function audio_sum_freq at 0x7f5c683977e0>...
[26]:
customized_features.head()
[26]:
user audio_sum_freq
datetime
2019-08-13 07:00:00+03:00 iGyXetHE3S8u 7735
2019-08-13 07:30:00+03:00 iGyXetHE3S8u 13609
2019-08-13 08:00:00+03:00 iGyXetHE3S8u 7690
2019-08-13 08:30:00+03:00 iGyXetHE3S8u 8347
2019-08-13 09:00:00+03:00 iGyXetHE3S8u 13592