Basic transformations

This page shows some basic transformations you can do once you have read the data. It is essentially a pandas crash course: pandas already provides all the operations you are likely to need, so there is no need for us to re-invent them. Pandas gives us a solid yet flexible base on which to build more advanced operations.

You can read more in the pandas documentation.

Extracting single rows and columns

Let’s first import mobile phone battery status data.

[1]:
TZ = 'Europe/Helsinki'
[2]:
import niimpy
from niimpy import config
import warnings
warnings.filterwarnings("ignore")
[3]:
# Read the data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')

Then check the first rows of the dataframe.

[4]:
df.head()
[4]:
user device time battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999936+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000192+02:00

Get a single column, in this case all users:

[5]:
df['user']
[5]:
2020-01-09 02:20:02.924999936+02:00    jd9INuQ5BBlW
2020-01-09 02:21:30.405999872+02:00    jd9INuQ5BBlW
2020-01-09 02:24:12.805999872+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.561000192+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.953000192+02:00    jd9INuQ5BBlW
                                           ...
2019-08-09 00:30:48.073999872+03:00    iGyXetHE3S8u
2019-08-09 00:32:40.717999872+03:00    iGyXetHE3S8u
2019-08-09 00:34:23.114000128+03:00    iGyXetHE3S8u
2019-08-09 00:36:05.505000192+03:00    iGyXetHE3S8u
2019-08-09 00:37:37.671000064+03:00    iGyXetHE3S8u
Name: user, Length: 505, dtype: object

Get a single row, in this case the fifth (indexing starts at zero, so we use position 4):

[6]:
df.iloc[4]
[6]:
user                                      jd9INuQ5BBlW
device                                    3p83yASkOb_B
time                                    1578530138.953
battery_level                                       72
battery_status                                       2
battery_health                                       2
battery_adaptor                                      2
datetime           2020-01-09 02:35:38.953000192+02:00
Name: 2020-01-09 02:35:38.953000192+02:00, dtype: object
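
Position-based .iloc has a label-based counterpart, .loc, which selects rows by index value. A minimal sketch on a toy dataframe (the timestamps below are invented for illustration, not taken from the battery data):

```python
import pandas as pd

# A toy dataframe with a datetime index, invented for illustration
toy = pd.DataFrame(
    {"battery_level": [74, 73, 72]},
    index=pd.to_datetime([
        "2020-01-09 02:20",
        "2020-01-09 02:21",
        "2020-01-09 02:24",
    ]),
)

# .iloc selects by integer position, .loc by index label;
# here both pick the same (second) row
by_position = toy.iloc[1]
by_label = toy.loc[pd.Timestamp("2020-01-09 02:21")]
```

Both return the same row as a Series; which one to use depends on whether you know the row's position or its index label.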

Listing unique users

We can list unique users using the pandas.Series.unique() function.

[7]:
df['user'].unique()
[7]:
array(['jd9INuQ5BBlW', 'iGyXetHE3S8u'], dtype=object)

List unique values

The same applies to other data features (columns).

[8]:
df['battery_status'].unique()
[8]:
array([ 3,  2,  5, -1, -3])

Extract data of only one subject

We can extract the data of a single subject using a boolean mask:

[9]:
df[df['user'] == 'jd9INuQ5BBlW']
[9]:
user device time battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999936+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000192+02:00
... ... ... ... ... ... ... ... ...
2020-01-09 23:02:13.938999808+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 73 3 2 0 2020-01-09 23:02:13.938999808+02:00
2020-01-09 23:10:37.262000128+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578604e+09 73 3 2 0 2020-01-09 23:10:37.262000128+02:00
2020-01-09 23:22:13.966000128+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578605e+09 72 3 2 0 2020-01-09 23:22:13.966000128+02:00
2020-01-09 23:32:13.959000064+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 71 3 2 0 2020-01-09 23:32:13.959000064+02:00
2020-01-09 23:39:06.800000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578606e+09 71 3 2 0 2020-01-09 23:39:06.800000+02:00

373 rows × 8 columns

Renaming a column or columns

A dataframe column can be renamed using the pandas.DataFrame.rename() function.

[10]:
df.rename(columns={'time': 'timestamp'}, inplace=True)
df.head()
[10]:
user device timestamp battery_level battery_status battery_health battery_adaptor datetime
2020-01-09 02:20:02.924999936+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 74 3 2 0 2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 73 3 2 0 2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578529e+09 72 3 2 0 2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 0 2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00 jd9INuQ5BBlW 3p83yASkOb_B 1.578530e+09 72 2 2 2 2020-01-09 02:35:38.953000192+02:00

Change datatypes

Let’s then check the dataframe datatypes:

[11]:
df.dtypes
[11]:
user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                               int64
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object

We can change the datatypes with the pandas.DataFrame.astype() function. Here we change the battery_health datatype to categorical:

[12]:
df.astype({'battery_health': 'category'}).dtypes
[12]:
user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                            category
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object

Transforming a column to a new value

Dataframe values can be transformed (decoded, etc.) into new values using the pandas.Series.transform() function.

Here we add one to the column values.

[13]:
df['battery_adaptor'].transform(lambda x: x + 1)
[13]:
2020-01-09 02:20:02.924999936+02:00    1
2020-01-09 02:21:30.405999872+02:00    1
2020-01-09 02:24:12.805999872+02:00    1
2020-01-09 02:35:38.561000192+02:00    1
2020-01-09 02:35:38.953000192+02:00    3
                                      ..
2019-08-09 00:30:48.073999872+03:00    2
2019-08-09 00:32:40.717999872+03:00    2
2019-08-09 00:34:23.114000128+03:00    2
2019-08-09 00:36:05.505000192+03:00    2
2019-08-09 00:37:37.671000064+03:00    2
Name: battery_adaptor, Length: 505, dtype: int64
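
When the goal is to decode coded values rather than do arithmetic, pandas.Series.map() with a dictionary is often clearer than a lambda. A small sketch with a hypothetical code-to-label mapping (the labels below are illustrative, not the actual AWARE battery_status specification):

```python
import pandas as pd

# Hypothetical status codes and labels, for illustration only
status = pd.Series([3, 2, 5, -1], name="battery_status")
labels = {3: "discharging", 2: "charging", 5: "full"}

# map() replaces each value via the dict; codes missing from
# the dict become NaN, which fillna() relabels afterwards
decoded = status.map(labels).fillna("unknown")
```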

Resample

Dataframe down/upsampling can be done with the pandas.DataFrame.resample() function.

Here we downsample the data by hour and aggregate the mean:

[14]:
df['battery_level'].resample('H').agg("mean")
[14]:
2019-08-05 14:00:00+03:00    46.000000
2019-08-05 15:00:00+03:00    44.000000
2019-08-05 16:00:00+03:00    43.000000
2019-08-05 17:00:00+03:00    42.000000
2019-08-05 18:00:00+03:00    41.000000
                               ...
2020-01-09 19:00:00+02:00    86.166667
2020-01-09 20:00:00+02:00    82.000000
2020-01-09 21:00:00+02:00    78.428571
2020-01-09 22:00:00+02:00    75.000000
2020-01-09 23:00:00+02:00    72.000000
Freq: H, Name: battery_level, Length: 3779, dtype: float64
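
Resampling is not limited to the mean; each bin can report several statistics at once. A self-contained sketch on synthetic battery levels (the values are invented, and "60min" is used here as an hourly frequency alias):

```python
import pandas as pd

# Synthetic battery levels sampled every 20 minutes
idx = pd.date_range("2020-01-09 00:00", periods=6, freq="20min")
levels = pd.Series([80, 78, 76, 74, 72, 70], index=idx)

# Downsample into hourly bins and compute several
# aggregates per bin in one pass
hourly = levels.resample("60min").agg(["mean", "min", "max"])
```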

Groupby

For groupwise data inspection, we can use the pandas.DataFrame.groupby() function.

Let's first load a dataframe with several subjects belonging to different groups.

[15]:
df = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz='Europe/Helsinki')
df.set_index('timestamp', inplace=True)
df
[15]:
user activity group
timestamp
2013-03-27 06:00:00-05:00 u00 2 none
2013-03-27 07:00:00-05:00 u00 1 none
2013-03-27 08:00:00-05:00 u00 2 none
2013-03-27 09:00:00-05:00 u00 3 none
2013-03-27 10:00:00-05:00 u00 4 none
... ... ... ...
2013-05-31 18:00:00-05:00 u59 5 mild
2013-05-31 19:00:00-05:00 u59 5 mild
2013-05-31 20:00:00-05:00 u59 4 mild
2013-05-31 21:00:00-05:00 u59 5 mild
2013-05-31 22:00:00-05:00 u59 1 mild

55907 rows × 3 columns

We can summarize the data by grouping the observations by group and user, and then aggregating the mean:

[16]:
df.groupby(['group','user']).agg("mean")
[16]:
activity
group user
mild u02 0.922348
u04 1.466960
u07 0.914457
u16 0.702918
u20 0.277946
u24 0.938028
u27 0.653724
u31 0.929495
u35 0.519455
u43 0.809045
u49 1.159767
u58 0.620621
u59 1.626263
moderate u18 0.445323
u52 1.051735
moderately severe u17 0.489510
u23 0.412884
none u00 1.182973
u03 0.176737
u05 0.606742
u09 1.095908
u10 0.662612
u14 1.005859
u15 0.295990
u30 0.933036
u32 1.113593
u36 0.936281
u42 0.378851
u44 0.292580
u47 0.396026
u51 0.828662
u56 0.840967
severe u01 1.063660
u19 0.571792
u33 0.733115
u34 0.454789
u45 0.441134
u53 0.389404

Summary statistics

There are many ways you may want to get an overview of your data.

Let’s first load mobile phone screen activity data.

[17]:
df = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')
df
[17]:
user device time screen_status datetime
2020-01-09 02:06:41.573999872+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578528e+09 0 2020-01-09 02:06:41.573999872+02:00
2020-01-09 02:09:29.152000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 1 2020-01-09 02:09:29.152000+02:00
2020-01-09 02:09:32.790999808+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 3 2020-01-09 02:09:32.790999808+02:00
2020-01-09 02:11:41.996000+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 0 2020-01-09 02:11:41.996000+02:00
2020-01-09 02:16:19.010999808+02:00 jd9INuQ5BBlW OWd1Uau8POix 1.578529e+09 1 2020-01-09 02:16:19.010999808+02:00
... ... ... ... ... ...
2019-09-08 17:17:14.216000+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.567952e+09 1 2019-09-08 17:17:14.216000+03:00
2019-09-08 17:17:31.966000128+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.567952e+09 0 2019-09-08 17:17:31.966000128+03:00
2019-09-08 20:50:07.360000+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.567965e+09 3 2019-09-08 20:50:07.360000+03:00
2019-09-08 20:50:08.139000064+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.567965e+09 1 2019-09-08 20:50:08.139000064+03:00
2019-09-08 20:53:12.960000+03:00 iGyXetHE3S8u Cq9vueHh3zVs 1.567965e+09 0 2019-09-08 20:53:12.960000+03:00

277 rows × 5 columns

Hourly data

It is easy to get the number of data points (observations) in each hour:

[18]:
hourly = df.groupby([df.index.date, df.index.hour]).size()
hourly
[18]:
2019-08-05  14    19
2019-08-08  21     6
            22    12
2019-08-09  7      6
2019-08-10  15     3
2019-08-12  22     3
2019-08-13  7     12
            8      3
            9      5
2019-08-14  23     3
2019-08-15  12     3
2019-08-17  15     6
2019-08-18  19     3
2019-08-24  8      3
            9      3
            12     3
            13     3
2019-08-25  11     5
            12     4
2019-08-26  11     6
2019-08-31  19     3
2019-09-05  23     3
2019-09-07  8      3
2019-09-08  11     3
            17     6
            20     3
2020-01-09  2     27
            10     6
            11     3
            12     3
            14    17
            15    35
            16     4
            17     8
            18     4
            20     4
            21    19
            22     3
            23    12
dtype: int64
[19]:
# The index is the (day, hour) pairs and the
# value is the number at that time
print('%s had %d data points'%(hourly.index[0], hourly.iloc[0]))
(datetime.date(2019, 8, 5), 14) had 19 data points

Occurrence

In niimpy, occurrence is a way to see the completeness of the data.

Occurrence is defined as follows:

* Divide all time into hours.
* Divide each hour into five 12-minute intervals.
* Count the number of 12-minute intervals that contain data; this is the occurrence.
* For each hour, report the occurrence. A value of 5 means that data is present fairly regularly, while 0 means there is no data.

This isn't a perfect measure, but it is reasonably effective and simple to calculate. For data which isn't continuous (like the screen data we are using here), it shows how often the sensor has been used.

Column meanings: day is the date, hour is the hour of day, occurrence is the measure described above, count is the total number of data points in that hour, and withdata lists which of the 12-minute intervals (0-4) contain data.

Note that not all data sources produce data uniformly, so occurrence should be interpreted with the source's sampling behavior in mind.
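
The definition above can be sketched directly in pandas; this is a simplified illustration of the idea, not niimpy's actual implementation:

```python
import pandas as pd

# Synthetic timestamps: three of the five 12-minute
# intervals of the 14:00 hour contain data
times = pd.DatetimeIndex([
    "2020-01-09 14:01", "2020-01-09 14:05",  # interval 0 (14:00-14:12)
    "2020-01-09 14:15",                      # interval 1 (14:12-14:24)
    "2020-01-09 14:50",                      # interval 4 (14:48-15:00)
])

# Floor each timestamp to its 12-minute interval, then count
# the distinct non-empty intervals within each hour
intervals = times.floor("12min")
occurrence = pd.Series(intervals).groupby(intervals.floor("60min")).nunique()
```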

[20]:
occurrences = niimpy.util.occurrence(df.index)
occurrences.head()
[20]:
day hour occurrence
2019-08-05 14:00:00 2019-08-05 14 4
2019-08-08 21:00:00 2019-08-08 21 1
2019-08-08 22:00:00 2019-08-08 22 2
2019-08-09 07:00:00 2019-08-09 7 2
2019-08-10 15:00:00 2019-08-10 15 1

We can create a simplified presentation (a pivot table) of the data using the pandas.DataFrame.pivot() function:

[21]:
occurrences.head().pivot(columns=['hour', 'day'])
[21]:
occurrence
hour 14 21 22 7 15
day 2019-08-05 2019-08-08 2019-08-08 2019-08-09 2019-08-10
2019-08-05 14:00:00 4.0 NaN NaN NaN NaN
2019-08-08 21:00:00 NaN 1.0 NaN NaN NaN
2019-08-08 22:00:00 NaN NaN 2.0 NaN NaN
2019-08-09 07:00:00 NaN NaN NaN 2.0 NaN
2019-08-10 15:00:00 NaN NaN NaN NaN 1.0