Surveys

Survey data consists of the following columns:

* id: the question identifier
* answer: the answer to the question
* q: the text of the question as presented to the user (optional)

As usual, the DataFrame index is the timestamp of the answer. By convention, all responses in a single survey instance share the same timestamp, which is how the responses of one instance are linked together.

The raw on-disk format is “long”: one row per answer, also known as “tidy data”. This is the most flexible format, but analyses often require further transformations.
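
As a minimal sketch of one such transformation, the snippet below builds a tiny long-format table and pivots it to a wide one with plain pandas. The values are made up for illustration, and the column names follow the schema listed above; this is not a Niimpy function.

```python
import pandas as pd

# Hypothetical long-format survey data: one row per answer.
# Both answers of one survey instance share the same timestamp.
ts = pd.Timestamp("2020-01-01 12:00", tz="Europe/Helsinki")
long_df = pd.DataFrame(
    {
        "user": [1, 1],
        "id": ["PHQ2_1", "PHQ2_2"],
        "answer": [1, 2],
    },
    index=[ts, ts],
)

# Pivot to a wide table: one row per (time, user), one column per question id.
wide_df = (
    long_df.rename_axis("time")
    .reset_index()
    .pivot(index=["time", "user"], columns="id", values="answer")
)
print(wide_df)
```

The shared timestamp is what lets the two answers end up on the same row of the wide table.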

Load data

[1]:

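# Artificial example survey data
import warnings

import pandas as pd  # needed later for pd.Timestamp and pd.DataFrame

import niimpy
from niimpy import config
import niimpy.preprocessing.survey as survey
# The wildcard import exposes PHQ2_MAP, ID_MAP_PREFIX, sum_survey_scores, etc.
from niimpy.preprocessing.survey import *

warnings.filterwarnings("ignore")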

[2]:

df = niimpy.read_csv(config.SURVEY_PATH, tz='Europe/Helsinki')
df.head()

[2]:

	user	age	gender	Little interest or pleasure in doing things.	Feeling down; depressed or hopeless.	Feeling nervous; anxious or on edge.	Not being able to stop or control worrying.	In the last month; how often have you felt that you were unable to control the important things in your life?	In the last month; how often have you felt confident about your ability to handle your personal problems?	In the last month; how often have you felt that things were going your way?	In the last month; how often have you been able to control irritations in your life?	In the last month; how often have you felt that you were on top of things?	In the last month; how often have you been angered because of things that were outside of your control?	In the last month; how often have you felt difficulties were piling up so high that you could not overcome them?
0	1	20	Male	several-days	more-than-half-the-days	not-at-all	nearly-every-day	almost-never	sometimes	fairly-often	never	sometimes	very-often	fairly-often
1	2	32	Male	more-than-half-the-days	more-than-half-the-days	not-at-all	several-days	never	never	very-often	sometimes	never	fairly-often	never
2	3	15	Male	more-than-half-the-days	not-at-all	several-days	not-at-all	never	very-often	very-often	fairly-often	never	never	almost-never
3	4	35	Female	not-at-all	nearly-every-day	not-at-all	several-days	very-often	fairly-often	very-often	never	sometimes	never	fairly-often
4	5	23	Male	more-than-half-the-days	not-at-all	more-than-half-the-days	several-days	almost-never	very-often	almost-never	sometimes	sometimes	very-often	never

Preprocessing

Currently the dataframe columns are raw questions and answers from the survey. We will use Niimpy to convert them to a numerical format, but first the dataframe should follow the general Niimpy Schema. The rows should be indexed by a datetime index, rather than a number.

Since the data does not contain a timestamp, we must assume that each user has only completed the survey once. If the surveys were completed on January 1st 2020, for example, we would replace the index with this date.

[3]:

# Assign the same time index to all survey responses
df.index = [pd.Timestamp("2020-01-01", tz='Europe/Helsinki')] * len(df)
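
One consequence of the shared-timestamp assignment above is that the index alone no longer identifies a row. The following sketch, on a hypothetical three-row stand-in for the survey table, shows that rows are then told apart by the user column.

```python
import pandas as pd

# Toy stand-in for the survey table (hypothetical values).
toy = pd.DataFrame({"user": [1, 2, 3], "age": [20, 32, 15]})

# One shared, timezone-aware timestamp repeated for every row.
stamp = pd.Timestamp("2020-01-01", tz="Europe/Helsinki")
toy.index = pd.DatetimeIndex([stamp] * len(toy))

# The time index is not unique, but (time, user) is.
time_is_unique = toy.index.is_unique
time_user_is_unique = toy.set_index("user", append=True).index.is_unique
print(time_is_unique, time_user_is_unique)
```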

Next we will convert the questions to a standard identifier format that Niimpy understands. The questions come from the standard PHQ2, GAD2 and PSS10 surveys, and Niimpy provides mappings from raw question text to question ids for them. Each identifier is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI, etc.) followed by the question number (1, 2, 3), for example PHQ2_1. You can define your own identifiers or use the ones provided by Niimpy.

Before applying the mapping, the column names should be cleaned using the clean_survey_column_names function. This removes punctuation in the question text.
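
To illustrate the kind of cleaning described above, here is a rough pure-pandas sketch that strips ASCII punctuation from column names. It is an illustrative stand-in, not the implementation of clean_survey_column_names, whose exact rules may differ.

```python
import string

import pandas as pd

# Raw column names copied from the example data above (empty table, names only).
raw = pd.DataFrame(
    columns=[
        "Little interest or pleasure in doing things.",
        "Feeling down; depressed or hopeless.",
    ]
)

# Drop every ASCII punctuation character from the column names.
# Assumption: this approximates the cleaning step; it is not Niimpy's code.
table = str.maketrans("", "", string.punctuation)
cleaned = raw.rename(columns=lambda name: name.translate(table).strip())
print(list(cleaned.columns))
```

After this step the names contain no periods or semicolons, which is the form the mapping dictionaries use as keys.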

[4]:

# For example, the mapping dictionary for PHQ2 is
PHQ2_MAP

[4]:

{'Little interest or pleasure in doing things': 'PHQ2_1',
 'Feeling down depressed or hopeless': 'PHQ2_2'}
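
A custom mapping can follow the same shape as the dictionary above. The sketch below uses a hypothetical questionnaire prefix, MYQ, with invented question texts; neither is shipped with Niimpy.

```python
import pandas as pd

# Hypothetical questionnaire "MYQ" with two questions; the names are
# made up for illustration and are not part of Niimpy.
MYQ_MAP = {
    "How well did you sleep last night": "MYQ_1",
    "How rested do you feel today": "MYQ_2",
}

answers = pd.DataFrame(
    {
        "user": [1, 2],
        "How well did you sleep last night": ["well", "poorly"],
        "How rested do you feel today": ["rested", "tired"],
    }
)

# Same renaming step as with the bundled maps: question text -> identifier.
answers = answers.rename(columns=MYQ_MAP)
print(list(answers.columns))
```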

[5]:

# Convert column name to id, based on provided mappers from niimpy
column_map = {**PHQ2_MAP, **PSS10_MAP, **GAD2_MAP}
df = survey.clean_survey_column_names(df)
df = df.rename(column_map, axis = 1)
df.head()

[5]:

	user	age	gender	PHQ2_1	PHQ2_2	GAD2_1	GAD2_2	PSS10_2	PSS10_4	PSS10_5	PSS10_6	PSS10_7	PSS10_8	PSS10_9
2020-01-01 00:00:00+02:00	1	20	Male	several-days	more-than-half-the-days	not-at-all	nearly-every-day	almost-never	sometimes	fairly-often	never	sometimes	very-often	fairly-often
2020-01-01 00:00:00+02:00	2	32	Male	more-than-half-the-days	more-than-half-the-days	not-at-all	several-days	never	never	very-often	sometimes	never	fairly-often	never
2020-01-01 00:00:00+02:00	3	15	Male	more-than-half-the-days	not-at-all	several-days	not-at-all	never	very-often	very-often	fairly-often	never	never	almost-never
2020-01-01 00:00:00+02:00	4	35	Female	not-at-all	nearly-every-day	not-at-all	several-days	very-often	fairly-often	very-often	never	sometimes	never	fairly-often
2020-01-01 00:00:00+02:00	5	23	Male	more-than-half-the-days	not-at-all	more-than-half-the-days	several-days	almost-never	very-often	almost-never	sometimes	sometimes	very-often	never

Now the dataframe follows the Niimpy standard schema. Next we will use Niimpy to convert the raw answers to numerical values for further analysis. For this, we need a mapping {raw_answer: numerical_answer}, which Niimpy provides within the survey module. You can also use your own mapping.

Based on the prefix of each question id, Niimpy maps the raw answers to their numerical representation.

[6]:

# The mapping dictionary included in Niimpy is
ID_MAP_PREFIX

[6]:

{'PSS': {'never': 0,
  'almost never': 1,
  'sometimes': 2,
  'fairly often': 3,
  'very often': 4},
 'PHQ2': {'not at all': 0,
  'several days': 1,
  'more than half the days': 2,
  'nearly every day': 3},
 'GAD2': {'not at all': 0,
  'several days': 1,
  'more than half the days': 2,
  'nearly every day': 3}}

[7]:

# Transform raw answers to numerical values
transformed_df = survey.convert_survey_to_numerical_answer(
    df, id_map=ID_MAP_PREFIX, use_prefix=True
)
transformed_df.head()

[7]:

	user	age	gender	PHQ2_1	PHQ2_2	GAD2_1	GAD2_2	PSS10_2	PSS10_4	PSS10_5	PSS10_6	PSS10_7	PSS10_8	PSS10_9
2020-01-01 00:00:00+02:00	1	20	Male	1	2	0	3	1	2	3	0	2	4	3
2020-01-01 00:00:00+02:00	2	32	Male	2	2	0	1	0	0	4	2	0	3	0
2020-01-01 00:00:00+02:00	3	15	Male	2	0	1	0	0	4	4	3	0	0	1
2020-01-01 00:00:00+02:00	4	35	Female	0	3	0	1	4	3	4	0	2	0	3
2020-01-01 00:00:00+02:00	5	23	Male	2	0	2	1	1	4	1	2	2	4	0
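
The prefix-based conversion above can be approximated in plain pandas. The sketch below is only an illustration of the idea, not Niimpy's implementation; in particular, it assumes that hyphens in the raw answers stand for the spaces used in the mapping keys.

```python
import pandas as pd

# Answer mapping keyed by questionnaire prefix, same shape as ID_MAP_PREFIX.
answer_map = {
    "PHQ2": {
        "not at all": 0,
        "several days": 1,
        "more than half the days": 2,
        "nearly every day": 3,
    },
}

# First two PHQ2 rows of the example data.
raw = pd.DataFrame(
    {
        "user": [1, 2],
        "PHQ2_1": ["several-days", "more-than-half-the-days"],
        "PHQ2_2": ["more-than-half-the-days", "more-than-half-the-days"],
    }
)

converted = raw.copy()
for column in raw.columns:
    for prefix, mapping in answer_map.items():
        if column.startswith(prefix):
            # Assumption: hyphens in the raw answers stand for spaces.
            normalized = raw[column].str.replace("-", " ", regex=False)
            converted[column] = normalized.map(mapping)

print(converted)
```

Columns whose name matches no prefix, such as user, are left unchanged.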

Survey score sums

Next we can calculate the sum score of each survey, using the survey ID in the column names to select its questions.

[8]:

sum_df = sum_survey_scores(transformed_df, ["PHQ2", "PSS10", "GAD2"])
sum_df.head()

[8]:

	user	PHQ2	PSS10	GAD2
2020-01-01 00:00:00+02:00	1	3	15	3
2020-01-01 00:00:00+02:00	2	4	9	1
2020-01-01 00:00:00+02:00	3	2	12	1
2020-01-01 00:00:00+02:00	4	3	16	1
2020-01-01 00:00:00+02:00	5	2	14	3
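
The sums above can be checked by hand. As a minimal sketch (plain pandas, not the sum_survey_scores implementation), the snippet below selects the columns that start with each survey ID and adds them row by row, using the PHQ2 and GAD2 answers of the first two users.

```python
import pandas as pd

# PHQ2 and GAD2 answers of users 1 and 2 from the converted table.
scores = pd.DataFrame(
    {
        "user": [1, 2],
        "PHQ2_1": [1, 2],
        "PHQ2_2": [2, 2],
        "GAD2_1": [0, 0],
        "GAD2_2": [3, 1],
    }
)

sums = scores[["user"]].copy()
for prefix in ["PHQ2", "GAD2"]:
    # Select every question column whose name starts with the survey ID.
    question_columns = [c for c in scores.columns if c.startswith(prefix + "_")]
    sums[prefix] = scores[question_columns].sum(axis=1)

print(sums)
```

The results agree with the PHQ2 and GAD2 columns of the first two rows in the output above.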

Survey statistics

Another common preprocessing step is to resample results to reduce noise or simplify the data. The survey.survey_statistic function splits the results into time intervals and returns the relevant statistics (mean, minimum, maximum and standard deviation) of each survey sum or each question column over each interval.

Note that since the example data contains a single time point for each participant, the standard deviation is NaN, and the mean, minimum and maximum all equal that single value.

[9]:

survey.survey_statistic(sum_df, {
    "columns": ["PHQ2", "PSS10", "GAD2"]
})

[9]:

	user	PHQ2_mean	PHQ2_min	PHQ2_max	PHQ2_std	PSS10_mean	PSS10_min	PSS10_max	PSS10_std	GAD2_mean	GAD2_min	GAD2_max	GAD2_std
2020-01-01 00:00:00+02:00	1	3.0	3.0	3.0	NaN	15.0	15.0	15.0	NaN	3.0	3.0	3.0	NaN
2020-01-01 00:00:00+02:00	2	4.0	4.0	4.0	NaN	9.0	9.0	9.0	NaN	1.0	1.0	1.0	NaN
2020-01-01 00:00:00+02:00	3	2.0	2.0	2.0	NaN	12.0	12.0	12.0	NaN	1.0	1.0	1.0	NaN
2020-01-01 00:00:00+02:00	4	3.0	3.0	3.0	NaN	16.0	16.0	16.0	NaN	1.0	1.0	1.0	NaN
2020-01-01 00:00:00+02:00	5	2.0	2.0	2.0	NaN	14.0	14.0	14.0	NaN	3.0	3.0	3.0	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...
2020-01-01 00:00:00+02:00	996	3.0	3.0	3.0	NaN	17.0	17.0	17.0	NaN	2.0	2.0	2.0	NaN
2020-01-01 00:00:00+02:00	997	0.0	0.0	0.0	NaN	13.0	13.0	13.0	NaN	1.0	1.0	1.0	NaN
2020-01-01 00:00:00+02:00	998	2.0	2.0	2.0	NaN	13.0	13.0	13.0	NaN	2.0	2.0	2.0	NaN
2020-01-01 00:00:00+02:00	999	4.0	4.0	4.0	NaN	21.0	21.0	21.0	NaN	5.0	5.0	5.0	NaN
2020-01-01 00:00:00+02:00	1000	4.0	4.0	4.0	NaN	14.0	14.0	14.0	NaN	2.0	2.0	2.0	NaN

1000 rows × 13 columns
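
The NaN values come from the sample standard deviation, which is undefined for a single observation. The sketch below uses hypothetical repeated PHQ2 sums for one user to contrast the two cases in plain pandas; it does not call survey_statistic.

```python
import pandas as pd

# Hypothetical repeated PHQ2 sums for one user on three days.
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]).tz_localize(
    "Europe/Helsinki"
)
repeated = pd.DataFrame({"user": [1, 1, 1], "PHQ2": [3, 4, 2]}, index=index)

# A single observation has no sample standard deviation (ddof=1) ...
single = repeated.iloc[:1]["PHQ2"].agg(["mean", "min", "max", "std"])
# ... while several observations per user do.
several = repeated.groupby("user")["PHQ2"].agg(["mean", "min", "max", "std"])

print(single)
print(several)
```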

survey_statistic also works for individual questions. You can select the questionnaire you want statistics for by passing a value to the prefix parameter, or pass a list of questions as the columns parameter.

[10]:

d = survey.survey_statistic(transformed_df, {
    "prefix":'PHQ',
})
pd.DataFrame(d)

[10]:

	user	PHQ2_1_mean	PHQ2_1_min	PHQ2_1_max	PHQ2_1_std	PHQ2_2_mean	PHQ2_2_min	PHQ2_2_max	PHQ2_2_std
2020-01-01 00:00:00+02:00	1	1.0	1.0	1.0	NaN	2.0	2.0	2.0	NaN
2020-01-01 00:00:00+02:00	2	2.0	2.0	2.0	NaN	2.0	2.0	2.0	NaN
2020-01-01 00:00:00+02:00	3	2.0	2.0	2.0	NaN	0.0	0.0	0.0	NaN
2020-01-01 00:00:00+02:00	4	0.0	0.0	0.0	NaN	3.0	3.0	3.0	NaN
2020-01-01 00:00:00+02:00	5	2.0	2.0	2.0	NaN	0.0	0.0	0.0	NaN
...	...	...	...	...	...	...	...	...	...
2020-01-01 00:00:00+02:00	996	0.0	0.0	0.0	NaN	3.0	3.0	3.0	NaN
2020-01-01 00:00:00+02:00	997	0.0	0.0	0.0	NaN	0.0	0.0	0.0	NaN
2020-01-01 00:00:00+02:00	998	1.0	1.0	1.0	NaN	1.0	1.0	1.0	NaN
2020-01-01 00:00:00+02:00	999	2.0	2.0	2.0	NaN	2.0	2.0	2.0	NaN
2020-01-01 00:00:00+02:00	1000	2.0	2.0	2.0	NaN	2.0	2.0	2.0	NaN

1000 rows × 9 columns
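
With more than one response per participant, per-interval statistics become informative. The following sketch selects question columns by prefix and averages them per user and per day with plain pandas. The data, the one-day interval, and the grouping approach are assumptions for illustration and do not necessarily reflect how survey_statistic is implemented.

```python
import pandas as pd

# Hypothetical repeated answers from one user: two on the first day, one on the second.
index = pd.to_datetime(
    ["2020-01-01 09:00", "2020-01-01 18:00", "2020-01-02 09:00"]
).tz_localize("Europe/Helsinki")
answers = pd.DataFrame(
    {
        "user": [1, 1, 1],
        "PHQ2_1": [1, 3, 2],
        "PHQ2_2": [2, 2, 0],
        "GAD2_1": [0, 1, 1],
    },
    index=index,
)

# Keep only the questions whose id starts with the chosen prefix.
question_columns = [c for c in answers.columns if c.startswith("PHQ")]

# Mean answer per user and per day (assumed interval: one day).
daily = answers.groupby("user")[question_columns].resample("1D").mean()
print(daily)
```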