Basic transformations¶
This page shows some basic transformations you can do once you have read in data. In practice it is a short pandas crash course: pandas already provides all the operations you are likely to need, so there is no reason for niimpy to re-invent them. Pandas gives us a solid but flexible base to build more advanced operations on top of.
You can read more in the pandas documentation.
Extracting single rows and columns¶
Let’s first read in some mobile phone battery status data.
[1]:
TZ = 'Europe/Helsinki'
[2]:
import niimpy
from niimpy import config
import warnings
warnings.filterwarnings("ignore")
[3]:
# Read the data
df = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz=TZ)
Then check the first rows of the dataframe.
[4]:
df.head()
[4]:
 | user | device | time | battery_level | battery_status | battery_health | battery_adaptor | datetime
---|---|---|---|---|---|---|---|---
2020-01-09 02:20:02.924999936+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999936+02:00 |
2020-01-09 02:21:30.405999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999872+02:00 |
2020-01-09 02:24:12.805999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999872+02:00 |
2020-01-09 02:35:38.561000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000192+02:00 |
2020-01-09 02:35:38.953000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000192+02:00 |
Get a single column, in this case the user column:
[5]:
df['user']
[5]:
2020-01-09 02:20:02.924999936+02:00 jd9INuQ5BBlW
2020-01-09 02:21:30.405999872+02:00 jd9INuQ5BBlW
2020-01-09 02:24:12.805999872+02:00 jd9INuQ5BBlW
2020-01-09 02:35:38.561000192+02:00 jd9INuQ5BBlW
2020-01-09 02:35:38.953000192+02:00 jd9INuQ5BBlW
...
2019-08-09 00:30:48.073999872+03:00 iGyXetHE3S8u
2019-08-09 00:32:40.717999872+03:00 iGyXetHE3S8u
2019-08-09 00:34:23.114000128+03:00 iGyXetHE3S8u
2019-08-09 00:36:05.505000192+03:00 iGyXetHE3S8u
2019-08-09 00:37:37.671000064+03:00 iGyXetHE3S8u
Name: user, Length: 505, dtype: object
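You can also select several columns at once by passing a list of column names; for example:
[ ]:
# Select a subset of columns (returns a new dataframe)
df[['user', 'battery_level']].head()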
Get a single row, in this case the 5th (row numbering starts at zero):
[6]:
df.iloc[4]
[6]:
user jd9INuQ5BBlW
device 3p83yASkOb_B
time 1578530138.953
battery_level 72
battery_status 2
battery_health 2
battery_adaptor 2
datetime 2020-01-09 02:35:38.953000192+02:00
Name: 2020-01-09 02:35:38.953000192+02:00, dtype: object
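Rows can also be selected by label with pandas.DataFrame.loc. Since the index here is a datetime, partial string indexing works as well; for example, all rows from a single day (any date present in the data works):
[ ]:
# Label-based selection: all rows whose index falls on 2020-01-09
df.loc['2020-01-09'].head()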
Listing unique users¶
We can list the unique users using the pandas.Series.unique() function.
[7]:
df['user'].unique()
[7]:
array(['jd9INuQ5BBlW', 'iGyXetHE3S8u'], dtype=object)
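If you only need the number of unique users, pandas.Series.nunique() gives it directly:
[ ]:
# Count the number of distinct users
df['user'].nunique()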
List unique values¶
The same applies to other data features/columns.
[8]:
df['battery_status'].unique()
[8]:
array([ 3, 2, 5, -1, -3])
Extract data of only one subject¶
We can extract the data of a single subject as follows:
[9]:
df[df['user'] == 'jd9INuQ5BBlW']
[9]:
 | user | device | time | battery_level | battery_status | battery_health | battery_adaptor | datetime
---|---|---|---|---|---|---|---|---
2020-01-09 02:20:02.924999936+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999936+02:00 |
2020-01-09 02:21:30.405999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999872+02:00 |
2020-01-09 02:24:12.805999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999872+02:00 |
2020-01-09 02:35:38.561000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000192+02:00 |
2020-01-09 02:35:38.953000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000192+02:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-01-09 23:02:13.938999808+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578604e+09 | 73 | 3 | 2 | 0 | 2020-01-09 23:02:13.938999808+02:00 |
2020-01-09 23:10:37.262000128+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578604e+09 | 73 | 3 | 2 | 0 | 2020-01-09 23:10:37.262000128+02:00 |
2020-01-09 23:22:13.966000128+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578605e+09 | 72 | 3 | 2 | 0 | 2020-01-09 23:22:13.966000128+02:00 |
2020-01-09 23:32:13.959000064+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578606e+09 | 71 | 3 | 2 | 0 | 2020-01-09 23:32:13.959000064+02:00 |
2020-01-09 23:39:06.800000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578606e+09 | 71 | 3 | 2 | 0 | 2020-01-09 23:39:06.800000+02:00 |
373 rows × 8 columns
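Several subjects can be selected at once in the same way, for example with pandas.Series.isin():
[ ]:
# Keep only rows belonging to the listed users
df[df['user'].isin(['jd9INuQ5BBlW', 'iGyXetHE3S8u'])]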
Renaming a column or columns¶
Dataframe columns can be renamed using the pandas.DataFrame.rename() function.
[10]:
df.rename(columns={'time': 'timestamp'}, inplace=True)
df.head()
[10]:
 | user | device | timestamp | battery_level | battery_status | battery_health | battery_adaptor | datetime
---|---|---|---|---|---|---|---|---
2020-01-09 02:20:02.924999936+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 74 | 3 | 2 | 0 | 2020-01-09 02:20:02.924999936+02:00 |
2020-01-09 02:21:30.405999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 73 | 3 | 2 | 0 | 2020-01-09 02:21:30.405999872+02:00 |
2020-01-09 02:24:12.805999872+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578529e+09 | 72 | 3 | 2 | 0 | 2020-01-09 02:24:12.805999872+02:00 |
2020-01-09 02:35:38.561000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 0 | 2020-01-09 02:35:38.561000192+02:00 |
2020-01-09 02:35:38.953000192+02:00 | jd9INuQ5BBlW | 3p83yASkOb_B | 1.578530e+09 | 72 | 2 | 2 | 2 | 2020-01-09 02:35:38.953000192+02:00 |
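Note that inplace=True modified the dataframe directly. rename() can also return a renamed copy that you assign back yourself; for example (time_epoch is just an illustrative name):
[ ]:
# Return a renamed copy instead of modifying df in place
df_renamed = df.rename(columns={'timestamp': 'time_epoch'})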
Change datatypes¶
Let’s then check the dataframe datatypes:
[11]:
df.dtypes
[11]:
user object
device object
timestamp float64
battery_level int64
battery_status int64
battery_health int64
battery_adaptor int64
datetime datetime64[ns, Europe/Helsinki]
dtype: object
We can change the datatypes with the pandas.DataFrame.astype() function. Here we change the battery_health column to a categorical type:
[12]:
df.astype({'battery_health': 'category'}).dtypes
[12]:
user object
device object
timestamp float64
battery_level int64
battery_status int64
battery_health category
battery_adaptor int64
datetime datetime64[ns, Europe/Helsinki]
dtype: object
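Another common type change is turning the raw timestamp column (seconds since the epoch) into timezone-aware datetimes; one way to sketch this with pandas.to_datetime():
[ ]:
import pandas as pd

# Convert the float epoch seconds to tz-aware datetimes in the local timezone
pd.to_datetime(df['timestamp'], unit='s', utc=True).dt.tz_convert(TZ)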
Transforming a column to a new value¶
Dataframe values can be transformed (decoded, etc.) into new values using the pandas.Series.transform() function. Here we add one to the column values.
[13]:
df['battery_adaptor'].transform(lambda x: x + 1)
[13]:
2020-01-09 02:20:02.924999936+02:00 1
2020-01-09 02:21:30.405999872+02:00 1
2020-01-09 02:24:12.805999872+02:00 1
2020-01-09 02:35:38.561000192+02:00 1
2020-01-09 02:35:38.953000192+02:00 3
..
2019-08-09 00:30:48.073999872+03:00 2
2019-08-09 00:32:40.717999872+03:00 2
2019-08-09 00:34:23.114000128+03:00 2
2019-08-09 00:36:05.505000192+03:00 2
2019-08-09 00:37:37.671000064+03:00 2
Name: battery_adaptor, Length: 505, dtype: int64
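The result above is a new series and the original dataframe is unchanged. To keep the transformed values, you could assign them to a new column (battery_adaptor_plus_one is just an illustrative name):
[ ]:
# Store the transformed values in a new column
df['battery_adaptor_plus_one'] = df['battery_adaptor'].transform(lambda x: x + 1)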
Resample¶
Dataframes can be down- or upsampled with the pandas resample() function. Here we downsample the data to hourly frequency and aggregate the mean:
[14]:
df['battery_level'].resample('H').agg("mean")
[14]:
2019-08-05 14:00:00+03:00 46.000000
2019-08-05 15:00:00+03:00 44.000000
2019-08-05 16:00:00+03:00 43.000000
2019-08-05 17:00:00+03:00 42.000000
2019-08-05 18:00:00+03:00 41.000000
...
2020-01-09 19:00:00+02:00 86.166667
2020-01-09 20:00:00+02:00 82.000000
2020-01-09 21:00:00+02:00 78.428571
2020-01-09 22:00:00+02:00 75.000000
2020-01-09 23:00:00+02:00 72.000000
Freq: H, Name: battery_level, Length: 3779, dtype: float64
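Since the dataframe contains several users, you may want to resample each user separately; one way is to combine groupby() with resample(), for example:
[ ]:
# Hourly mean battery level, computed within each user
df.groupby('user')['battery_level'].resample('H').mean()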
Groupby¶
For groupwise data inspection, we can use the pandas.DataFrame.groupby() function. Let’s first load a dataframe with several subjects belonging to different groups.
[15]:
df = niimpy.read_csv(config.SL_ACTIVITY_PATH, tz=TZ)
df.set_index('timestamp', inplace=True)
df
[15]:
timestamp | user | activity | group
---|---|---|---
2013-03-27 06:00:00-05:00 | u00 | 2 | none |
2013-03-27 07:00:00-05:00 | u00 | 1 | none |
2013-03-27 08:00:00-05:00 | u00 | 2 | none |
2013-03-27 09:00:00-05:00 | u00 | 3 | none |
2013-03-27 10:00:00-05:00 | u00 | 4 | none |
... | ... | ... | ... |
2013-05-31 18:00:00-05:00 | u59 | 5 | mild |
2013-05-31 19:00:00-05:00 | u59 | 5 | mild |
2013-05-31 20:00:00-05:00 | u59 | 4 | mild |
2013-05-31 21:00:00-05:00 | u59 | 5 | mild |
2013-05-31 22:00:00-05:00 | u59 | 1 | mild |
55907 rows × 3 columns
We can summarize the data by grouping the observations by group and user, and then aggregating the mean:
[16]:
df.groupby(['group','user']).agg("mean")
[16]:
group | user | activity
---|---|---
mild | u02 | 0.922348 |
u04 | 1.466960 | |
u07 | 0.914457 | |
u16 | 0.702918 | |
u20 | 0.277946 | |
u24 | 0.938028 | |
u27 | 0.653724 | |
u31 | 0.929495 | |
u35 | 0.519455 | |
u43 | 0.809045 | |
u49 | 1.159767 | |
u58 | 0.620621 | |
u59 | 1.626263 | |
moderate | u18 | 0.445323 |
u52 | 1.051735 | |
moderately severe | u17 | 0.489510 |
u23 | 0.412884 | |
none | u00 | 1.182973 |
u03 | 0.176737 | |
u05 | 0.606742 | |
u09 | 1.095908 | |
u10 | 0.662612 | |
u14 | 1.005859 | |
u15 | 0.295990 | |
u30 | 0.933036 | |
u32 | 1.113593 | |
u36 | 0.936281 | |
u42 | 0.378851 | |
u44 | 0.292580 | |
u47 | 0.396026 | |
u51 | 0.828662 | |
u56 | 0.840967 | |
severe | u01 | 1.063660 |
u19 | 0.571792 | |
u33 | 0.733115 | |
u34 | 0.454789 | |
u45 | 0.441134 | |
u53 | 0.389404 |
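The same grouping can report several statistics at once, for example:
[ ]:
# Mean, standard deviation and number of observations per group and user
df.groupby(['group', 'user'])['activity'].agg(['mean', 'std', 'count'])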
Summary statistics¶
There are many ways you may want to get an overview of your data.
Let’s first load mobile phone screen activity data.
[17]:
df = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz=TZ)
df
[17]:
 | user | device | time | screen_status | datetime
---|---|---|---|---|---
2020-01-09 02:06:41.573999872+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578528e+09 | 0 | 2020-01-09 02:06:41.573999872+02:00 |
2020-01-09 02:09:29.152000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 1 | 2020-01-09 02:09:29.152000+02:00 |
2020-01-09 02:09:32.790999808+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 3 | 2020-01-09 02:09:32.790999808+02:00 |
2020-01-09 02:11:41.996000+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 0 | 2020-01-09 02:11:41.996000+02:00 |
2020-01-09 02:16:19.010999808+02:00 | jd9INuQ5BBlW | OWd1Uau8POix | 1.578529e+09 | 1 | 2020-01-09 02:16:19.010999808+02:00 |
... | ... | ... | ... | ... | ... |
2019-09-08 17:17:14.216000+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.567952e+09 | 1 | 2019-09-08 17:17:14.216000+03:00 |
2019-09-08 17:17:31.966000128+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.567952e+09 | 0 | 2019-09-08 17:17:31.966000128+03:00 |
2019-09-08 20:50:07.360000+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.567965e+09 | 3 | 2019-09-08 20:50:07.360000+03:00 |
2019-09-08 20:50:08.139000064+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.567965e+09 | 1 | 2019-09-08 20:50:08.139000064+03:00 |
2019-09-08 20:53:12.960000+03:00 | iGyXetHE3S8u | Cq9vueHh3zVs | 1.567965e+09 | 0 | 2019-09-08 20:53:12.960000+03:00 |
277 rows × 5 columns
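For a quick first overview you could, for example, count the observations per user and screen status:
[ ]:
# Number of screen events of each status, separately for each user
df.groupby('user')['screen_status'].value_counts()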
Hourly data¶
It is easy to get the number of data points (observations) in each hour:
[18]:
hourly = df.groupby([df.index.date, df.index.hour]).size()
hourly
[18]:
2019-08-05 14 19
2019-08-08 21 6
22 12
2019-08-09 7 6
2019-08-10 15 3
2019-08-12 22 3
2019-08-13 7 12
8 3
9 5
2019-08-14 23 3
2019-08-15 12 3
2019-08-17 15 6
2019-08-18 19 3
2019-08-24 8 3
9 3
12 3
13 3
2019-08-25 11 5
12 4
2019-08-26 11 6
2019-08-31 19 3
2019-09-05 23 3
2019-09-07 8 3
2019-09-08 11 3
17 6
20 3
2020-01-09 2 27
10 6
11 3
12 3
14 17
15 35
16 4
17 8
18 4
20 4
21 19
22 3
23 12
dtype: int64
[19]:
# The index is the (day, hour) pairs and the
# value is the number at that time
print('%s had %d data points'%(hourly.index[0], hourly.iloc[0]))
(datetime.date(2019, 8, 5), 14) had 19 data points
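If you prefer a day-by-hour grid, the same counts can be reshaped with unstack(), for example:
[ ]:
# Rows are days, columns are hours of the day; 0 where there is no data
hourly.unstack(fill_value=0)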
Occurrence¶
In niimpy, occurrence is a way to assess the completeness of data.
Occurrence is defined as follows:
* Divide all time into hours.
* Divide each hour into five 12-minute intervals.
* Count the number of 12-minute intervals that have data. This is \(occurrence\).
* For each hour, report \(occurrence\). A value of “5” is taken to mean that data is present somewhat regularly, while “0” means we have no data.
This isn’t a perfect measure, but it is reasonably effective and simple to calculate. For data that isn’t continuous (like the screen data used here), it shows how much the sensor has been used.
Column meanings: `day` is the date, `hour` is the hour of day, `occurrence` is the measure described above, `count` is the total number of data points in that hour, and `withdata` lists which of the 12-minute intervals (0–4) have data.
Note that the assumption of uniformly present data does not hold for all data sources.
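As a rough illustration of the definition above, the measure can be sketched in plain pandas (this is only a sketch, not the actual niimpy implementation, which is used in the next cell):
[ ]:
import pandas as pd

# For each hour, count how many of its five 12-minute intervals
# contain at least one observation.
intervals = df.index.floor('12min')   # 12-minute interval of each observation
hours = df.index.floor('h')           # hour of each observation
pd.Series(intervals, index=hours).groupby(level=0).nunique()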
[20]:
occurrences = niimpy.util.occurrence(df.index)
occurrences.head()
[20]:
 | day | hour | occurrence
---|---|---|---
2019-08-05 14:00:00 | 2019-08-05 | 14 | 4 |
2019-08-08 21:00:00 | 2019-08-08 | 21 | 1 |
2019-08-08 22:00:00 | 2019-08-08 | 22 | 2 |
2019-08-09 07:00:00 | 2019-08-09 | 7 | 2 |
2019-08-10 15:00:00 | 2019-08-10 | 15 | 1 |
We can create a simplified presentation (a pivot table) of the data using the pandas.DataFrame.pivot() function:
[21]:
occurrences.head().pivot(columns=['hour', 'day'])
[21]:
occurrence | |||||
---|---|---|---|---|---|
hour | 14 | 21 | 22 | 7 | 15 |
day | 2019-08-05 | 2019-08-08 | 2019-08-08 | 2019-08-09 | 2019-08-10 |
2019-08-05 14:00:00 | 4.0 | NaN | NaN | NaN | NaN |
2019-08-08 21:00:00 | NaN | 1.0 | NaN | NaN | NaN |
2019-08-08 22:00:00 | NaN | NaN | 2.0 | NaN | NaN |
2019-08-09 07:00:00 | NaN | NaN | NaN | 2.0 | NaN |
2019-08-10 15:00:00 | NaN | NaN | NaN | NaN | 1.0 |
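For the full dataset, a day-by-hour grid may be easier to read; one way to build it is with pandas.DataFrame.pivot_table(), for example:
[ ]:
# One row per day, one column per hour, occurrence as the cell value
occurrences.pivot_table(index='day', columns='hour', values='occurrence')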