Demo notebook for Niimpy Exploration layer modules¶

Introduction¶

To study and quantify human behavior using longitudinal multimodal digital data, it is essential to get to know the data well first. Data collected from various sources or sensors, such as smartphones, watches, and activity trackers, have different types and properties. The data may be a mixture of categorical, ordinal, and numerical data, typically consisting of time series measured for multiple subjects from different groups. While the data is typically dense, it is also heterogeneous and contains many missing values. Therefore, the analysis has to be conducted on many different levels.

This notebook introduces the Niimpy toolbox exploration module, which seeks to address the aforementioned issues. The module provides functionality for exploratory data analysis (EDA) of digital behavioral data. It aims to summarize the data characteristics, inspect the structures underlying the data, detect patterns and changes in those patterns, and assess data quality (e.g., missing data, outliers). This information is essential for assessing data validity, for data filtering and selection, and for data preprocessing. The module includes functions for plotting categorical data, data counts, time series lineplots, and punchcards, and for visualizing missing data.

Exploration module functions are intended to be run after data preprocessing, but they can also be run on raw observations. All the functions are implemented using the Plotly Python open source library. Plotly enables interactive visualizations, which in turn make it easier to explore different aspects of the data (e.g., a specific time range or summary statistics).

This notebook uses several sample dataframes for module demonstration. The sample data is already preprocessed, or will be preprocessed in notebook sections before visualizations. When the sample data is loaded, some of the key characteristics of the data are displayed.

All exploration module functions require the data to follow the data schema defined in the Niimpy toolbox documentation. The user must ensure that the input data follows the specified schema.

Sub-module overview¶

The following table shows the accepted data types, visualization functions, and purpose of each exploration sub-module. All sub-modules are located in the niimpy/exploration/eda folder.

Sub-module	Data type	Functions	Purpose
categorical.py	Categorical	Barplot	Observation counts and distributions
countplot.py	Categorical* / Numerical	Barplot / Boxplot	Observation counts and distributions
lineplot.py	Numerical	Lineplot	Trend, cyclicity, patterns
punchcard.py	Categorical* / Numerical	Heatmap	Temporal patterns of counts or values
missingness.py	Categorical / Numerical	Barplot / Heatmap	Missing data patterns

Data types denoted with * are not compatible with every function within the module.

NOTES

This notebook uses the following definitions when referring to data:
* Feature refers to a dataframe column that stores observations (e.g., numerical sensor values, questionnaire answers).
* User refers to the unique identifier of each subject in the data. The dataframe should also have a column named user.
* Group refers to a unique group identifier. If subjects are grouped, the dataframe should have a column named group.
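The column conventions above can be checked before plotting. The following is a minimal sketch only: check_exploration_input is a hypothetical helper, not part of Niimpy, and it covers just the two column names mentioned in the definitions. The full schema is described in the Niimpy documentation.

```python
import pandas as pd


def check_exploration_input(df, require_group=False):
    """Return a list of problems found in df (an empty list means none found).

    Hypothetical helper for illustration; it only checks the column names
    mentioned in the definitions above.
    """
    problems = []
    if "user" not in df.columns:
        problems.append("missing 'user' column")
    if require_group and "group" not in df.columns:
        problems.append("missing 'group' column")
    return problems


# Toy dataframe with a user column but no group column
toy = pd.DataFrame(
    {"user": ["u00", "u01"], "activity": [2, 3]},
    index=pd.to_datetime(["2013-03-27 06:00", "2013-03-27 07:00"]),
)

problems_without_group = check_exploration_input(toy)
problems_with_group = check_exploration_input(toy, require_group=True)
print(problems_without_group, problems_with_group)
```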

Imports¶

Here we import modules needed for running this notebook.

[1]:

import numpy as np
import pandas as pd
import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import warnings
warnings.filterwarnings("ignore")
import niimpy
from niimpy import config
from niimpy.preprocessing import survey
from niimpy.exploration import setup_dataframe
from niimpy.exploration.eda import categorical, countplot, lineplot, missingness, punchcard

Plotly settings¶

The next code block defines the default settings for Plotly visualizations. Feel free to adjust the settings according to your needs.

[2]:

pio.renderers.default = "png"
pio.templates.default = "seaborn"
px.defaults.template = "ggplot2"
px.defaults.color_continuous_scale = px.colors.sequential.RdBu
px.defaults.width = 1200
px.defaults.height = 482
warnings.filterwarnings("ignore")

1) Categorical plot¶

This section introduces the categorical plot module, which visualizes categorical data such as questionnaire responses. We will demonstrate the functions using a mock survey dataframe containing answers to the following questionnaires:
* Patient Health Questionnaire-2 (PHQ-2)
* Perceived Stress Scale (PSS10)
* Generalized Anxiety Disorder-2 (GAD-2)

The data will be preprocessed, and then its basic characteristics will be summarized before the visualizations.

1.1) Reading the data¶

We’ll start by importing the data:

[3]:

df = niimpy.read_csv(config.SURVEY_PATH, tz='Europe/Helsinki')
df.head()

[3]:

	user	age	gender	Little interest or pleasure in doing things.	Feeling down; depressed or hopeless.	Feeling nervous; anxious or on edge.	Not being able to stop or control worrying.	In the last month; how often have you felt that you were unable to control the important things in your life?	In the last month; how often have you felt confident about your ability to handle your personal problems?	In the last month; how often have you felt that things were going your way?	In the last month; how often have you been able to control irritations in your life?	In the last month; how often have you felt that you were on top of things?	In the last month; how often have you been angered because of things that were outside of your control?	In the last month; how often have you felt difficulties were piling up so high that you could not overcome them?
0	1	20	Male	several-days	more-than-half-the-days	not-at-all	nearly-every-day	almost-never	sometimes	fairly-often	never	sometimes	very-often	fairly-often
1	2	32	Male	more-than-half-the-days	more-than-half-the-days	not-at-all	several-days	never	never	very-often	sometimes	never	fairly-often	never
2	3	15	Male	more-than-half-the-days	not-at-all	several-days	not-at-all	never	very-often	very-often	fairly-often	never	never	almost-never
3	4	35	Female	not-at-all	nearly-every-day	not-at-all	several-days	very-often	fairly-often	very-often	never	sometimes	never	fairly-often
4	5	23	Male	more-than-half-the-days	not-at-all	more-than-half-the-days	several-days	almost-never	very-often	almost-never	sometimes	sometimes	very-often	never

Then check some basic descriptive statistics:

[4]:

df.describe()

[4]:

	user	age
count	1000.000000	1000.000000
mean	500.500000	26.911000
std	288.819436	4.992595
min	1.000000	12.000000
25%	250.750000	23.000000
50%	500.500000	27.000000
75%	750.250000	30.000000
max	1000.000000	43.000000

The dataframe’s columns are raw questions from a survey. Some questions belong to a specific category, so we will annotate them with ids. The id is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI etc.), followed by the question number (1,2,3). Similarly, we will also convert the answers to meaningful numerical values.

Note: It’s important that the dataframe follows the schema below before it is passed to Niimpy.

[5]:

# Convert column name to id, based on provided mappers from niimpy
column_map = {**survey.PHQ2_MAP, **survey.PSQI_MAP, **survey.PSS10_MAP, **survey.PANAS_MAP, **survey.GAD2_MAP}
df = survey.clean_survey_column_names(df)
df = df.rename(column_map, axis = 1)
df.head()

[5]:

	user	age	gender	PHQ2_1	PHQ2_2	GAD2_1	GAD2_2	PSS10_2	PSS10_4	PSS10_5	PSS10_6	PSS10_7	PSS10_8	PSS10_9
0	1	20	Male	several-days	more-than-half-the-days	not-at-all	nearly-every-day	almost-never	sometimes	fairly-often	never	sometimes	very-often	fairly-often
1	2	32	Male	more-than-half-the-days	more-than-half-the-days	not-at-all	several-days	never	never	very-often	sometimes	never	fairly-often	never
2	3	15	Male	more-than-half-the-days	not-at-all	several-days	not-at-all	never	very-often	very-often	fairly-often	never	never	almost-never
3	4	35	Female	not-at-all	nearly-every-day	not-at-all	several-days	very-often	fairly-often	very-often	never	sometimes	never	fairly-often
4	5	23	Male	more-than-half-the-days	not-at-all	more-than-half-the-days	several-days	almost-never	very-often	almost-never	sometimes	sometimes	very-often	never

We can use the convert_survey_to_numerical_answer helper method to convert the answers into a numerical value. We use the ID_MAP_PREFIX mapping dictionary provided by Niimpy, which describes how each text answer should be mapped to a number.

[6]:

# Transform raw answers to numerical values
num_df = survey.convert_survey_to_numerical_answer(
    df, id_map=survey.ID_MAP_PREFIX, use_prefix=True
)
num_df.head()

[6]:

	user	age	gender	PHQ2_1	PHQ2_2	GAD2_1	GAD2_2	PSS10_2	PSS10_4	PSS10_5	PSS10_6	PSS10_7	PSS10_8	PSS10_9
0	1	20	Male	1	2	0	3	1	2	3	0	2	4	3
1	2	32	Male	2	2	0	1	0	0	4	2	0	3	0
2	3	15	Male	2	0	1	0	0	4	4	3	0	0	1
3	4	35	Female	0	3	0	1	4	3	4	0	2	0	3
4	5	23	Male	2	0	2	1	1	4	1	2	2	4	0
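To make the conversion concrete, the sketch below maps PHQ-2 text answers to numbers with plain pandas. The mapping values are read off the PHQ2 columns in the tables above; the dictionary name phq2_answers and the toy dataframe are illustrative assumptions and are not Niimpy objects.

```python
import pandas as pd

# Answer-to-score mapping as it appears in the PHQ2 columns above.
# The dictionary itself is an illustrative stand-in, not survey.ID_MAP_PREFIX.
phq2_answers = {
    "not-at-all": 0,
    "several-days": 1,
    "more-than-half-the-days": 2,
    "nearly-every-day": 3,
}

# First two rows of the PHQ2 columns shown above
text_df = pd.DataFrame({
    "user": [1, 2],
    "PHQ2_1": ["several-days", "more-than-half-the-days"],
    "PHQ2_2": ["more-than-half-the-days", "more-than-half-the-days"],
})

# Map each answer column to its numerical score
numeric_df = text_df.copy()
for col in ["PHQ2_1", "PHQ2_2"]:
    numeric_df[col] = text_df[col].map(phq2_answers)

print(numeric_df)
```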

For each of these surveys, the overall score is calculated as the sum of the numerical values. We can calculate this for each survey using the sum_survey_scores function.

[7]:

sum_df = survey.sum_survey_scores(num_df, ["PHQ2", "PSS10", "GAD2"])
sum_df.head()

[7]:

	user	PHQ2	PSS10	GAD2
0	1	3	15	3
1	2	4	9	1
2	3	2	12	1
3	4	3	16	1
4	5	2	14	3
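The same sums can be reproduced by hand, which is a useful sanity check. The sketch below sums the columns that share a questionnaire prefix, using the first two rows of the PHQ2 and GAD2 columns from the numerical table above. It illustrates the idea only and is not the implementation of sum_survey_scores.

```python
import pandas as pd

# First two rows of the numerical PHQ2 and GAD2 columns shown above
num = pd.DataFrame({
    "user": [1, 2],
    "PHQ2_1": [1, 2], "PHQ2_2": [2, 2],
    "GAD2_1": [0, 0], "GAD2_2": [3, 1],
})

# Sum all columns that start with the questionnaire prefix
sums = pd.DataFrame({"user": num["user"]})
for prefix in ["PHQ2", "GAD2"]:
    cols = [c for c in num.columns if c.startswith(prefix + "_")]
    sums[prefix] = num[cols].sum(axis=1)

print(sums)
```

The resulting PHQ2 and GAD2 values for users 1 and 2 agree with the first two rows of the summed table above.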

1.1. Questionnaire summary¶

We can now make some plots for the preprocessed dataframe. First, we display the summary for a specific question (the first PHQ-2 question).

[8]:

fig = categorical.questionnaire_summary(num_df,
                                        question = 'PHQ2_1',
                                        title='PHQ2 question: Little interest or pleasure in doing things / <br> answer value frequencies',
                                        xlabel='value',
                                        ylabel='count',
                                        width=600,
                                        height=400)
fig.show()

../../_images/demo_notebooks_Exploration_22_0.png

The figure shows that the answer values (from 0 to 3) are almost uniformly distributed.
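The frequencies behind such a barplot can also be inspected as numbers. A minimal sketch with pandas value_counts, using only the five PHQ2_1 values visible in the table above (so the counts here are not the full-sample counts):

```python
import pandas as pd

# PHQ2_1 values of the first five rows shown above
answers = pd.Series([1, 2, 2, 0, 2], name="PHQ2_1")

# Count how often each answer value occurs, ordered by answer value
counts = answers.value_counts().sort_index()
print(counts)
```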

1.2. Questionnaire grouped summary¶

We can also display the summary for each subgroup (gender).

[9]:

fig = categorical.questionnaire_grouped_summary(num_df,
                                                question='PSS10_9',
                                                group='gender',
                                                title='PSS10_9 Question / <br> Score frequency distributions by group',
                                                xlabel='score',
                                                ylabel='count',
                                                width=800,
                                                height=400)
fig.show()

../../_images/demo_notebooks_Exploration_26_0.png

The figure shows that the differences between subgroups are not very large.
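A grouped frequency table gives the same information in tabular form. The sketch below uses pandas crosstab on the five PSS10_9 values and genders visible in the tables above; it is an illustration of the grouping idea and not the code used by questionnaire_grouped_summary.

```python
import pandas as pd

# gender and PSS10_9 values of the first five rows shown above
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Male"],
    "PSS10_9": [3, 0, 1, 3, 0],
})

# Rows: gender, columns: answer value, cells: number of respondents
table = pd.crosstab(toy["gender"], toy["PSS10_9"])
print(table)
```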

1.3. Questionnaire grouped summary score distribution¶

With some quick preprocessing, we can display the score distribution of each questionnaire.

We’ll extract PSS-10 questionnaire answers from the dataframe, using the sum_survey_scores function from the niimpy.preprocessing.survey module, and set the gender variable from the original dataframe.

[9]:

sum_df = survey.sum_survey_scores(num_df, ["PSS10"])
sum_df["gender"] = num_df["gender"]
sum_df.head()

[9]:

	user	PSS10	gender
0	1	15	Male
1	2	9	Male
2	3	12	Male
3	4	16	Female
4	5	14	Male

And then visualize aggregated summary score distributions, grouped by gender:

[10]:

fig = categorical.questionnaire_grouped_summary(sum_df,
                                                question='PSS10',
                                                group='gender',
                                                title='PSS10',
                                                xlabel='score',
                                                ylabel='count',
                                                width=800,
                                                height=400)
fig.show()

../../_images/demo_notebooks_Exploration_32_0.png

The figure shows that the grouped summary score distributions are close to each other.

2) Countplot¶

This section introduces the Countplot module. The module contains functions for visualizing user and group level observation counts (number of datapoints per user or group) and observation value distributions. Observation counts are shown as barplots at the user level and as boxplots at the group level. Boxplots are also used for group level value distributions. The module assumes that the visualized data is numerical.

Data¶

We will use a sample from the StudentLife dataset to demonstrate the module functions. The sample contains hourly aggregated activity data (values from 0 to 5, where 0 corresponds to no activity and 5 to high activity) and group information based on pre- and post-study PHQ-9 test scores. Study subjects have been grouped by depression symptom severity into the following groups: none, mild, moderate, moderately severe, and severe. The preprocessed data sample is included in the Niimpy toolbox sampledata folder.

[11]:

# Load data
sl = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz='Europe/Helsinki')
sl.set_index('timestamp',inplace=True)
sl.index = pd.to_datetime(sl.index)
sl_loc = sl.tz_localize(None)
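The last line of the cell above removes the timezone from the index. A small sketch of what tz_localize(None) does in pandas, on a hypothetical two-row dataframe: the timezone information is dropped and the local wall-clock times are kept.

```python
import pandas as pd

# Toy timezone-aware index in the same timezone as the notebook data
idx = pd.to_datetime(["2013-03-27 06:00", "2013-03-27 07:00"]).tz_localize("Europe/Helsinki")
aware = pd.DataFrame({"activity": [2, 1]}, index=idx)

# Drop the timezone; the local wall-clock times stay unchanged
naive = aware.tz_localize(None)

print(aware.index.tz, naive.index.tz)
print(naive.index[0])
```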

[12]:

sl_loc.head()

[12]:

	user	activity	group
timestamp
2013-03-27 06:00:00	u00	2	none
2013-03-27 07:00:00	u00	1	none
2013-03-27 08:00:00	u00	2	none
2013-03-27 09:00:00	u00	3	none
2013-03-27 10:00:00	u00	4	none

Before visualizations, we’ll inspect the data.

[13]:

sl_loc.describe()

[13]:

	activity
count	55907.000000
mean	0.750264
std	1.298238
min	0.000000
25%	0.000000
50%	0.000000
75%	1.000000
max	5.000000

[14]:

sl_loc.group.unique()

[14]:

array(['none', 'severe', 'mild', 'moderately severe', 'moderate'],
      dtype=object)

2.1. User level observation count¶

First, we visualize the number of observations for each subject.

[15]:

fig = countplot.countplot(sl,
                          fig_title='Activity event counts by user',
                          plot_type='count',
                          points='all',
                          aggregation='user',
                          user=None,
                          column=None,
                          binning=False)

fig.show()

../../_images/demo_notebooks_Exploration_42_0.png

The barplot shows that there are differences in user total activity counts. The user u24 has the lowest event count of 710 and users u02 and u59 have the highest count of 1584.
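The exact counts behind the bars can be checked with a plain pandas groupby. A minimal sketch on a hypothetical toy dataframe (the user ids and values are made up for illustration):

```python
import pandas as pd

# Toy observations: one row per activity event
events = pd.DataFrame({
    "user": ["u00", "u00", "u01", "u02", "u02", "u02"],
    "activity": [2, 1, 0, 3, 4, 0],
})

# Number of observations per user
per_user = events.groupby("user").size()
print(per_user)
```

On the real data, the same call on the user column would be expected to give the counts read from the barplot.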

2.2. Group level observation count¶

Next, we’ll inspect group level daily activity event count distributions using boxplots. For improved clarity, we select a time range of one week from the data.

[16]:

sl_one_week = sl_loc.loc['2013-03-28':'2013-4-3']

fig = countplot.countplot(sl_one_week,
                          fig_title='Group level daily activity event count distributions',
                          plot_type='value',
                          points='all',
                          aggregation='group',
                          user=None,
                          column='activity',
                          binning='D')

fig.show()

../../_images/demo_notebooks_Exploration_45_0.png

The boxplot shows some variability in group level event count distributions across the days spanning from Mar 28 to Apr 3 2013.

2.3. Group level value distributions¶

Finally, we visualize group level activity value distributions for the whole time range.

[17]:

fig = countplot.countplot(sl,
                          fig_title='Group level activity score distributions',
                          plot_type='value',
                          points='outliers',
                          aggregation='group',
                          user=None,
                          column='activity',
                          binning=False)

fig.show()

../../_images/demo_notebooks_Exploration_48_0.png

The boxplot shows that the activity score distributions for the groups mild and moderately severe differ from the rest.

3. Lineplot¶

This section introduces the Lineplot module functions. We use the same activity data, derived from the StudentLife dataset, as in the previous section.

3.1. Lineplot¶

Lineplot functions display numerical feature values on time axis. The user can optionally resample (downsample) and smoothen the data for better visual clarity.
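Resampling and rolling-window smoothing can be illustrated with plain pandas. The sketch below is an assumption about the general technique, not the internals of lineplot.timeplot: a toy 30-minute series is downsampled to hourly means and then smoothed with a two-sample rolling mean.

```python
import pandas as pd

# Toy series sampled every 30 minutes
idx = pd.date_range("2013-03-27 00:00", periods=8, freq="30min")
series = pd.Series([0, 2, 4, 2, 0, 2, 4, 2], index=idx, dtype=float)

# Downsample: mean of the observations within each hour
hourly = series.resample("60min").mean()

# Smooth: rolling mean over two consecutive hourly values
smoothed = hourly.rolling(window=2, min_periods=1).mean()

print(hourly.tolist())
print(smoothed.tolist())
```

Larger windows flatten short-term variation more strongly, which is why the 24-hour and 7-day windows used later in this section bring out daily patterns and longer trends.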

3.1.1. Single user single feature¶

First, we’ll visualize single user single feature data, without resampling or smoothing.

[18]:

fig = lineplot.timeplot(sl_loc,
                        users=['u01'],
                        columns=['activity'],
                        title='User: {} activity lineplot'.format('u01'),
                        xlabel='Date',
                        ylabel='Value',
                        resample=False,
                        interpolate=False,
                        window=1,
                        reset_index=False)

fig.show()

../../_images/demo_notebooks_Exploration_55_0.png

The figure showing all the activity datapoints is difficult to interpret. When zooming in on the time range, the daily patterns become apparent. There is little or no activity during the night.

3.1.2. Single user single feature, reset index¶

Next, we’ll visualize the data using hourly resampling and 24-hour rolling window smoothing for improved clarity. We also reset the index, so that the plot now shows hours from the first activity feature observation.

[19]:

fig = lineplot.timeplot(sl_loc,
                        users=['u00'],
                        columns=['activity'],
                        title='User: {} activity lineplot / <br> resetted index'.format('u00'),
                        xlabel='Date',
                        ylabel='Value',
                        resample='H',
                        interpolate=True,
                        window=24,
                        reset_index=True)

fig.show()

../../_images/demo_notebooks_Exploration_58_0.png

By zooming in the smoothed lineplot, daily activity patterns are easier to detect.

3.1.3. Single user single feature, aggregated by day¶

The next visualization shows resampling by day and 7-day rolling window smoothing, making the activity time series trend visible.

[20]:

fig = lineplot.timeplot(sl_loc,
                        users=['u00'],
                        columns=['activity'],
                        title='User: {} activity lineplot / <br> rolling window (7 days) smoothing'.format('u00'),
                        xlabel='Date',
                        ylabel='Value',
                        resample='D',
                        interpolate=True,
                        window=7)

fig.show()

../../_images/demo_notebooks_Exploration_61_0.png

Daily aggregated and smoothed data makes the user activity trend visible. There is a peak on May 9 and a crest on May 23.

3.2. Multiple subjects single feature¶

The following visualization superimposes two subjects’ activity on the same figure.

[21]:

fig = lineplot.timeplot(sl_loc,
                        users=['u00','u01'],
                        columns=['activity'],
                        title='User: {}, {} activity lineplot / <br> rolling window (7 days) smoothing'.format('u00','u01'),
                        xlabel='Date',
                        ylabel='Value',
                        resample='D',
                        interpolate=True,
                        window=7)

fig.show()

../../_images/demo_notebooks_Exploration_64_0.png

The figure shows that the users’ daily averaged activity is quite similar at the beginning of the inspected time range. In the first two weeks of May, the activity shows opposing trends (user u00 activity increases and user u01 activity decreases).

3.3. Group level hourly averages¶

Next we’ll compare group level hourly average activity.

[22]:

fig = lineplot.timeplot(sl_loc,
                        users='Group',
                        columns=['activity'],
                        title='User group activity / <br> hourly averages',
                        xlabel='Date',
                        ylabel='Value',
                        resample='D',
                        interpolate=True,
                        window=7,
                        reset_index=False,
                        by='hour')

fig.show()

../../_images/demo_notebooks_Exploration_67_0.png

The time plot reveals that the hourly averaged group level activity follows a circadian rhythm (less activity during the night). The moderately severe group seems to be the least active group during the latter half of the day.
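Hour-of-day averages of this kind can be computed by grouping on the group label and the hour of the timestamp. The following is a sketch of the idea on hypothetical data; it is not the code behind the by='hour' option.

```python
import pandas as pd

# Toy observations for two groups on two days
idx = pd.to_datetime([
    "2013-03-27 03:00", "2013-03-28 03:00",
    "2013-03-27 15:00", "2013-03-28 15:00",
    "2013-03-27 15:00", "2013-03-28 15:00",
])
obs = pd.DataFrame({
    "group": ["none", "none", "none", "none", "mild", "mild"],
    "activity": [0, 1, 3, 5, 2, 4],
}, index=idx)

# Hour of day for each observation
obs["hour"] = obs.index.hour

# Average activity for each group and hour of day, pooled over days
by_hour = obs.groupby(["group", "hour"])["activity"].mean()
print(by_hour)
```

Replacing the hour with obs.index.dayofweek would give weekday averages in the same way.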

3.4. Group level weekday averages¶

And finally, we’ll compare group level weekday average activity.

[23]:

fig = lineplot.timeplot(sl_loc,
                        users='Group',
                        columns=['activity'],
                        title='User Activity',
                        xlabel='Date',
                        ylabel='Value',
                        resample='D',
                        interpolate=True,
                        window=7,
                        reset_index=False,
                        by='weekday')

fig.show()

../../_images/demo_notebooks_Exploration_70_0.png

The timeplot shows that there are some differences in the average group level activity, e.g., the group mild is more active than the group moderately severe. Additionally, activity on Sundays is lower than on weekdays.

4. Punchcard¶

This section introduces the Punchcard module functions. The functions aggregate the data and show the averaged value for each timepoint. We use the same activity data, derived from the StudentLife dataset, as in the two previous sections.

4.1. Single user punchcard¶

First, we visualize the daily aggregated mean activity for a single subject. We’ll change the plot color to grayscale for improved clarity.

[24]:

px.defaults.color_continuous_scale = px.colors.sequential.gray

[25]:

fig = punchcard.punchcard_plot(sl,
                               user_list=['u00'],
                               columns=['activity'],
                               title="User {} activity punchcard".format('u00'),
                               resample='D',
                               normalize=False,
                               agg_func=np.mean,
                               timerange=False)

fig.show()

../../_images/demo_notebooks_Exploration_75_0.png

The punchcard reveals that May 5th has the highest average activity and May 18th, 20th, and 21st have the lowest activity.
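The grid of values behind a punchcard can be sketched as a pivot table of daily means, with one row per day and one column per user. This is an illustration with hypothetical data and is not taken from the punchcard_plot implementation.

```python
import pandas as pd

# Toy observations for two users on two days
idx = pd.to_datetime([
    "2013-05-05 08:00", "2013-05-05 20:00",
    "2013-05-06 08:00", "2013-05-06 20:00",
    "2013-05-05 12:00", "2013-05-06 12:00",
])
obs = pd.DataFrame({
    "user": ["u00", "u00", "u00", "u00", "u01", "u01"],
    "activity": [4, 2, 0, 1, 1, 3],
}, index=idx)

# Calendar day of each observation
obs["day"] = obs.index.floor("D")

# Mean activity per day (rows) and user (columns)
grid = obs.pivot_table(index="day", columns="user", values="activity", aggfunc="mean")
print(grid)
```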

4.2. Multiple user punchcard¶

Next, we’ll visualize mean activity for multiple subjects.

[26]:

fig = punchcard.punchcard_plot(sl,
                               user_list=['u00','u01','u02'],
                               columns=['activity'],
                               title="Users {}, {}, and {} activity punchcard".format('u00','u01','u02'),
                               resample='D',
                               normalize=False,
                               agg_func=np.mean,
                               timerange=False)

fig.show()

../../_images/demo_notebooks_Exploration_78_0.png

The punchcard allows comparison of daily average activity for multiple subjects. There seems to be no evident common pattern in the activity.

4.3. Single user punchcard showing two features¶

Lastly, we’ll visualize daily aggregated single user activity side by side with the activity of the previous week. We start by shifting the activity by one week and adding it to the original dataframe.

[27]:

sl_loc['previous_week_activity'] = sl_loc['activity'].shift(periods=7, fill_value=0)
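Note that shift(periods=7) counts rows, not days, so on hourly rows it corresponds to an offset of seven rows. If an offset of exactly one calendar week is intended, a time-based shift is one option. The following is a minimal sketch on a hypothetical single-user daily series; with several users, the shift would additionally have to be applied per user.

```python
import pandas as pd

# Toy daily series for a single user
idx = pd.date_range("2013-03-27", periods=10, freq="D")
daily = pd.DataFrame({"user": "u00", "activity": list(range(10))}, index=idx)

# Move the timestamps forward by seven days, then align on the original index.
# Days without a value seven days earlier are filled with 0.
previous = daily["activity"].shift(freq=pd.Timedelta(days=7))
daily["previous_week_activity"] = previous.reindex(daily.index).fillna(0)

print(daily)
```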

[28]:

fig = punchcard.punchcard_plot(sl_loc,
                               user_list=['u00'],
                               columns=['activity','previous_week_activity'],
                               title="User {} activity and previous week activity punchcard".format('u00'),
                               resample='D',
                               normalize=False,
                               agg_func=np.mean,
                               timerange=False)

fig.show()

../../_images/demo_notebooks_Exploration_82_0.png

The punchcard shows weekly repeating patterns in the subject’s activity.

5) Missingness¶

This section introduces the Missingness module for missing data inspection. The module features data missingness visualizations by frequency and by timepoint. Additionally, it offers an option for visualizing missing data correlations.

Data¶

For data missingness visualizations, we’ll create a mock dataframe with missing values using the create_missing_dataframe function from niimpy.exploration.setup_dataframe.

[29]:

df_m = setup_dataframe.create_missing_dataframe(nrows=2*24*14, ncols=5, density=0.7, index_type='dt', freq='10T')
df_m.columns = ['User_1','User_2','User_3','User_4','User_5',]

We will quickly inspect the dataframe before the visualizations.

[30]:

df_m

[30]:

	User_1	User_2	User_3	User_4	User_5
2022-01-01 00:00:00	95.449550	NaN	63.620984	84.779703	10.134786
2022-01-01 00:10:00	NaN	NaN	16.542671	30.059590	48.531147
2022-01-01 00:20:00	NaN	39.576178	62.931902	14.169682	30.127549
2022-01-01 00:30:00	NaN	NaN	3.209067	76.309938	64.975323
2022-01-01 00:40:00	8.108314	68.563101	11.451206	67.939258	51.267042
...	...	...	...	...	...
2022-01-05 15:10:00	29.951958	87.768113	3.413069	NaN	71.765351
2022-01-05 15:20:00	66.755720	54.265762	NaN	41.359552	85.334592
2022-01-05 15:30:00	65.078709	NaN	NaN	25.390743	27.214379
2022-01-05 15:40:00	43.437510	93.718360	3.737043	93.463233	61.265460
2022-01-05 15:50:00	52.235363	3.903921	NaN	NaN	13.510993

672 rows × 5 columns

[31]:

df_m.describe()

[31]:

	User_1	User_2	User_3	User_4	User_5
count	467.000000	459.000000	505.000000	460.000000	461.000000
mean	52.139713	49.797069	49.664041	47.506335	51.783715
std	28.177069	29.022774	29.374702	28.842758	27.133421
min	1.115046	1.356182	1.018417	1.055085	1.089756
25%	28.128912	24.049961	23.363264	22.883851	28.179801
50%	54.304329	52.259609	48.483288	43.471076	52.341558
75%	75.519421	74.290567	77.026377	74.314602	74.746701
max	99.517549	99.943515	99.674461	99.967264	99.863755

5.1. Data frequency by feature¶

First, we create a bar plot to visualize the data frequency per column. Here, a frequency of 1 indicates no missing data points, and 0 indicates that all data points are missing.

[32]:

fig = missingness.bar(df_m,
                      xaxis_title='User',
                      yaxis_title='Frequency')
fig.show()

../../_images/demo_notebooks_Exploration_92_0.png

The data frequency is similar for each user; based on the counts in the summary table above, User_3 has the highest frequency.
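The data frequency described above is the fraction of non-missing values per column. One way to compute it with pandas is sketched below on a hypothetical toy dataframe; whether missingness.bar computes it in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values
toy = pd.DataFrame({
    "User_1": [1.0, np.nan, 3.0, 4.0],
    "User_2": [np.nan, np.nan, 3.0, 4.0],
})

# Fraction of non-missing values per column: 1 = complete, 0 = all missing
frequency = toy.notna().mean()
print(frequency)
```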

5.2. Average frequency by user¶

Next, we will show average data frequency for all users.

[33]:

fig = missingness.bar(df_m,
                      sampling_freq='30T',
                      xaxis_title='Time',
                      yaxis_title='Frequency')
fig.show()

../../_images/demo_notebooks_Exploration_95_0.png

The overall data frequency suggests no clear pattern for data missingness.
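The frequency over time can be sketched by resampling the non-missing indicator into time bins and averaging over users. The example below uses hypothetical data and the "30min" alias, which newer pandas versions prefer over "30T". It shows the general idea and is not the missingness.bar implementation.

```python
import numpy as np
import pandas as pd

# Toy data sampled every 10 minutes for two users
idx = pd.date_range("2022-01-01", periods=6, freq="10min")
toy = pd.DataFrame({
    "User_1": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0],
    "User_2": [np.nan, np.nan, 3.0, 4.0, 5.0, np.nan],
}, index=idx)

# 1.0 where a value is present, 0.0 where it is missing
present = toy.notna().astype(float)

# Fraction of present values per 30-minute bin, averaged over users
per_bin = present.resample("30min").mean().mean(axis=1)
print(per_bin)
```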

5.3. Missingness matrix¶

We can also create a missingness matrix visualization for the dataframe. The nullity matrix shows data missingness by timepoint.

[34]:

fig = missingness.matrix(df_m,
                         sampling_freq='30T',
                         xaxis_title="User ID",
                         yaxis_title="Time")
fig.show()

../../_images/demo_notebooks_Exploration_98_0.png

5.4. Missing data correlations¶

Finally, we plot a heatmap to display the correlations between missing data.

Correlation ranges from -1 to 1:
* -1 means that if one variable appears, then the other will be missing.
* 0 means that there is no correlation between the missingness of the two variables.
* 1 means that the two variables will always appear together.
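The two extreme cases can be reproduced with a small example. The sketch below computes the Pearson correlation of the missing-value indicators on a hypothetical dataframe; that missingness.heatmap uses exactly this computation is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # missing in the same rows as a
    "c": [np.nan, 2.0, np.nan, 4.0],  # present exactly when a is missing
})

# 1 where a value is missing, 0 where it is present
missing = toy.isna().astype(int)

# Pairwise correlation between the missingness indicators
nullity_corr = missing.corr()
print(nullity_corr)
```

Here a and b always appear together, giving a correlation of 1, while a and c never appear together, giving a correlation of -1.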

Data¶

For the correlations, we use the NYC collision factors sample data.

[35]:

collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")

First, we’ll inspect the data frame.

[36]:

collisions.head()

[36]:

	DATE	TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 1	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
0	11/10/2016	16:11:00	BROOKLYN	11208.0	40.662514	-73.872007	(40.6625139, -73.8720068)	WORTMAN AVENUE	MONTAUK AVENUE	NaN	...	Failure to Yield Right-of-Way	Unspecified	NaN	NaN	NaN	TAXI	PASSENGER VEHICLE	NaN	NaN	NaN
1	11/10/2016	05:11:00	MANHATTAN	10013.0	40.721323	-74.008344	(40.7213228, -74.0083444)	HUBERT STREET	HUDSON STREET	NaN	...	Failure to Yield Right-of-Way	NaN	NaN	NaN	NaN	PASSENGER VEHICLE	NaN	NaN	NaN	NaN
2	04/16/2016	09:15:00	BROOKLYN	11201.0	40.687999	-73.997563	(40.6879989, -73.9975625)	HENRY STREET	WARREN STREET	NaN	...	Lost Consciousness	Lost Consciousness	NaN	NaN	NaN	PASSENGER VEHICLE	VAN	NaN	NaN	NaN
3	04/15/2016	10:20:00	QUEENS	11375.0	40.719228	-73.854542	(40.7192276, -73.8545422)	NaN	NaN	67-64 FLEET STREET	...	Failure to Yield Right-of-Way	Failure to Yield Right-of-Way	Failure to Yield Right-of-Way	NaN	NaN	PASSENGER VEHICLE	PASSENGER VEHICLE	PASSENGER VEHICLE	NaN	NaN
4	04/15/2016	10:35:00	BROOKLYN	11210.0	40.632147	-73.952731	(40.6321467, -73.9527315)	BEDFORD AVENUE	CAMPUS ROAD	NaN	...	Failure to Yield Right-of-Way	Failure to Yield Right-of-Way	NaN	NaN	NaN	PASSENGER VEHICLE	PASSENGER VEHICLE	NaN	NaN	NaN

5 rows × 26 columns

[37]:

collisions.dtypes

[37]:

DATE                              object
TIME                              object
BOROUGH                           object
ZIP CODE                         float64
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED          int64
NUMBER OF PERSONS KILLED           int64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLISTS INJURED       float64
NUMBER OF CYCLISTS KILLED        float64
CONTRIBUTING FACTOR VEHICLE 1     object
CONTRIBUTING FACTOR VEHICLE 2     object
CONTRIBUTING FACTOR VEHICLE 3     object
CONTRIBUTING FACTOR VEHICLE 4     object
CONTRIBUTING FACTOR VEHICLE 5     object
VEHICLE TYPE CODE 1               object
VEHICLE TYPE CODE 2               object
VEHICLE TYPE CODE 3               object
VEHICLE TYPE CODE 4               object
VEHICLE TYPE CODE 5               object
dtype: object

We will then inspect the basic statistics.

[38]:

collisions.describe()

[38]:

	ZIP CODE	LATITUDE	LONGITUDE	NUMBER OF PERSONS INJURED	NUMBER OF PERSONS KILLED	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLISTS INJURED	NUMBER OF CYCLISTS KILLED
count	6919.000000	7303.000000	7303.000000	7303.000000	7303.000000	7303.000000	7303.000000	0.0	0.0
mean	10900.746640	40.717653	-73.921406	0.350678	0.000959	0.133644	0.000822	NaN	NaN
std	551.568724	0.069437	0.083317	0.707873	0.030947	0.362129	0.028653	NaN	NaN
min	10001.000000	40.502341	-74.248277	0.000000	0.000000	0.000000	0.000000	NaN	NaN
25%	10310.000000	40.670865	-73.980744	0.000000	0.000000	0.000000	0.000000	NaN	NaN
50%	11211.000000	40.723260	-73.933888	0.000000	0.000000	0.000000	0.000000	NaN	NaN
75%	11355.000000	40.759527	-73.864463	1.000000	0.000000	0.000000	0.000000	NaN	NaN
max	11694.000000	40.909628	-73.702590	16.000000	1.000000	3.000000	1.000000	NaN	NaN

Finally, we will visualize the nullity correlations (how strongly the presence or absence of one variable affects the presence of another) with a heatmap and a dendrogram.

[39]:

fig = missingness.heatmap(collisions)
fig.show()

../../_images/demo_notebooks_Exploration_108_0.png

The nullity heatmap and dendrogram reveal a correlation structure in the data, e.g., the missingness of the vehicle type code columns and the contributing factor vehicle columns is highly correlated. Features with complete data are not shown in the figure.