Surveys
Surveys consist of the following columns:
* id: the question identifier
* answer: the answer to the question
* q: the text of the question presented to the user (optional)

As usual, the DataFrame index is the timestamp of the answer. By convention, all responses in a single survey instance have the same timestamp, and this is used to link surveys together.
The raw on-disk format is “long”, that is, one row per answer, which makes it “tidy data”. This is the most flexible format, but you will often need to transform it further before analysis.
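As a rough illustration of such a transformation, the sketch below reshapes a small long-format table into a wide one (one row per survey instance, one column per question) using plain pandas. The data, the timestamps, and the index name `time` are made up for this example and are not part of niimpy.

```python
import pandas as pd

# Hypothetical long-format survey data: one row per answer.
# Both answers of one survey instance share the same timestamp.
timestamps = pd.to_datetime(
    ["2024-01-01 09:00", "2024-01-01 09:00",
     "2024-01-02 09:00", "2024-01-02 09:00"]
)
long_df = pd.DataFrame(
    {
        "id": ["PHQ2_1", "PHQ2_2", "PHQ2_1", "PHQ2_2"],
        "answer": [1, 2, 0, 3],
    },
    index=timestamps,
)

# Name the index so it can be used as the pivot key, then reshape:
# one row per timestamp (survey instance), one column per question id.
wide_df = (
    long_df.rename_axis("time")
    .reset_index()
    .pivot(index="time", columns="id", values="answer")
)
print(wide_df)
```

Because the shared timestamp identifies a survey instance, it serves as the row key of the wide table.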
Load data
[1]:
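# Imports for loading and preprocessing the artificial example survey data
import warnings

import pandas as pd  # used below for pd.melt and pd.DataFrame

import niimpy
from niimpy import config
import niimpy.preprocessing.survey as survey
# The wildcard import brings in the question and answer mappers (e.g. PHQ2_MAP)
from niimpy.preprocessing.survey import *

# Silence warnings to keep the notebook output short
warnings.filterwarnings("ignore")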
[2]:
df = niimpy.read_csv(config.SURVEY_PATH, tz='Europe/Helsinki')
df.head()
[2]:
index | user | age | gender | Little interest or pleasure in doing things. | Feeling down; depressed or hopeless. | Feeling nervous; anxious or on edge. | Not being able to stop or control worrying. | In the last month; how often have you felt that you were unable to control the important things in your life? | In the last month; how often have you felt confident about your ability to handle your personal problems? | In the last month; how often have you felt that things were going your way? | In the last month; how often have you been able to control irritations in your life? | In the last month; how often have you felt that you were on top of things? | In the last month; how often have you been angered because of things that were outside of your control? | In the last month; how often have you felt difficulties were piling up so high that you could not overcome them? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 20 | Male | several-days | more-than-half-the-days | not-at-all | nearly-every-day | almost-never | sometimes | fairly-often | never | sometimes | very-often | fairly-often |
1 | 2 | 32 | Male | more-than-half-the-days | more-than-half-the-days | not-at-all | several-days | never | never | very-often | sometimes | never | fairly-often | never |
2 | 3 | 15 | Male | more-than-half-the-days | not-at-all | several-days | not-at-all | never | very-often | very-often | fairly-often | never | never | almost-never |
3 | 4 | 35 | Female | not-at-all | nearly-every-day | not-at-all | several-days | very-often | fairly-often | very-often | never | sometimes | never | fairly-often |
4 | 5 | 23 | Male | more-than-half-the-days | not-at-all | more-than-half-the-days | several-days | almost-never | very-often | almost-never | sometimes | sometimes | very-often | never |
Preprocessing
The dataframe’s columns are the raw questions from a survey. Some questions belong to a specific questionnaire category, so we will annotate them with ids. Each id is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI, etc.) followed by the question number (1, 2, 3). Similarly, we will also convert the answers to meaningful numerical values.
Note: It’s important that the dataframe follows the schema below before it is passed to niimpy.
[3]:
# Map each column name (question text) to an id, using the mappers provided by niimpy
col_id = {**PHQ2_MAP, **PSQI_MAP, **PSS10_MAP, **PANAS_MAP, **GAD2_MAP}
selected_cols = [col for col in df.columns if col in col_id]
# Convert from wide to long format: one row per (user, question) pair
transformed_df = pd.melt(
    df,
    id_vars=['user', 'age', 'gender'],
    value_vars=selected_cols,
    var_name='question',
    value_name='raw_answer',
)
# Assign each question its id
transformed_df['id'] = transformed_df['question'].replace(col_id)
transformed_df.head()
[3]:
index | user | age | gender | question | raw_answer | id |
---|---|---|---|---|---|---|
0 | 1 | 20 | Male | Little interest or pleasure in doing things. | several-days | PHQ2_1 |
1 | 2 | 32 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 |
2 | 3 | 15 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 |
3 | 4 | 35 | Female | Little interest or pleasure in doing things. | not-at-all | PHQ2_1 |
4 | 5 | 23 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 |
Moreover, niimpy can convert the raw answers to numerical values for further analysis. For this, we need a mapping {raw_answer: numerical_answer}, which niimpy provides within the survey module and which you can easily adjust to your own needs.
Based on the question’s id, niimpy maps the raw answers to their numerical representation.
[4]:
# Transform raw answers to numerical values
transformed_df['answer'] = survey.survey_convert_to_numerical_answer(
    transformed_df,
    answer_col='raw_answer',
    question_id='id',
    id_map=ID_MAP_PREFIX,
    use_prefix=True,
)
transformed_df.head()
[4]:
index | user | age | gender | question | raw_answer | id | answer |
---|---|---|---|---|---|---|---|
0 | 1 | 20 | Male | Little interest or pleasure in doing things. | several-days | PHQ2_1 | 1 |
1 | 2 | 32 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 | 2 |
2 | 3 | 15 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 | 2 |
3 | 4 | 35 | Female | Little interest or pleasure in doing things. | not-at-all | PHQ2_1 | 0 |
4 | 5 | 23 | Male | Little interest or pleasure in doing things. | more-than-half-the-days | PHQ2_1 | 2 |
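To make the idea of a prefix-based lookup concrete, here is a minimal sketch in plain pandas. It is not niimpy's implementation: the helper `to_numeric` and the dictionary `answer_maps` are hypothetical names introduced for illustration, and the PHQ2 values follow the output above plus the standard PHQ scoring for "nearly-every-day".

```python
import pandas as pd

# Hypothetical answer mappings keyed by questionnaire prefix.
# These are illustrative values, not niimpy's own constants.
answer_maps = {
    "PHQ2": {
        "not-at-all": 0,
        "several-days": 1,
        "more-than-half-the-days": 2,
        "nearly-every-day": 3,
    },
}


def to_numeric(df, answer_col, id_col, maps):
    """Map raw answers to numbers, choosing the mapping by question-id prefix."""
    # The prefix is everything before the last underscore: "PHQ2_1" -> "PHQ2".
    prefixes = df[id_col].str.rsplit("_", n=1).str[0]
    # Start with NaN everywhere; rows with an unknown prefix stay NaN.
    result = pd.Series(float("nan"), index=df.index)
    for prefix, mapping in maps.items():
        mask = prefixes == prefix
        result.loc[mask] = df.loc[mask, answer_col].map(mapping)
    return result


example = pd.DataFrame(
    {
        "id": ["PHQ2_1", "PHQ2_1", "PHQ2_2", "XYZ_1"],
        "raw_answer": [
            "several-days",
            "more-than-half-the-days",
            "not-at-all",
            "sometimes",
        ],
    }
)
example["answer"] = to_numeric(example, "raw_answer", "id", answer_maps)
print(example)
```

Leaving unmapped rows as NaN makes it easy to spot questions or answers for which no mapping has been defined.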
Print survey statistics
Now that we have preprocessed the survey, we can extract some meaningful statistics from it.
First, we can compute the mean, standard deviation, minimum, and maximum values for each questionnaire.
[5]:
d = survey.survey_print_statistic(transformed_df, question_id_col = 'id', answer_col = 'answer')
pd.DataFrame(d)
[5]:
statistic | PHQ2 | PSS10 | GAD2 |
---|---|---|---|
min | 0.0000 | 4.000000 | 0.000000 |
max | 6.0000 | 27.000000 | 6.000000 |
avg | 3.0520 | 14.006000 | 3.042000 |
std | 1.5855 | 3.687759 | 1.536423 |
To get statistics for a single questionnaire, pass its prefix to the prefix parameter.
[6]:
d = survey.survey_print_statistic(transformed_df, question_id_col = 'id', answer_col = 'answer', prefix='PHQ')
pd.DataFrame(d)
[6]:
statistic | PHQ |
---|---|
avg | 3.0520 |
max | 6.0000 |
min | 0.0000 |
std | 1.5855 |
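For intuition, a summary of this kind can be approximated in plain pandas. The sketch below assumes that the statistics are computed on per-respondent sum scores for each questionnaire, which the PHQ2 range of 0 to 6 suggests but the text does not state. It is not niimpy's implementation, and the data are made up.

```python
import pandas as pd

# Hypothetical numeric answers for two users and one questionnaire.
scores = pd.DataFrame(
    {
        "user": [1, 1, 2, 2],
        "id": ["PHQ2_1", "PHQ2_2", "PHQ2_1", "PHQ2_2"],
        "answer": [1, 2, 0, 1],
    }
)

# Questionnaire prefix: everything before the last underscore.
scores["prefix"] = scores["id"].str.rsplit("_", n=1).str[0]

# Assumption: one sum score per user and questionnaire.
totals = scores.groupby(["prefix", "user"])["answer"].sum()

# Summary statistics of the sum scores, one row per questionnaire.
summary = totals.groupby(level="prefix").agg(["min", "max", "mean", "std"])
print(summary)
```

Restricting the summary to one questionnaire then amounts to filtering on the `prefix` column before aggregating.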