Using Dates to Match Assessment Data to Project Stays

Share on:

In this post, I am going to offer up some sample data and the code I have come up with for comment and/or other ideas or maybe it will help others. If you aim to run my sample data and code, you will need R (free) and RStudio (the free and open source version is all you need!) installed on your computer.

Some backstory- our Homeless Management Information System (HMIS) data is collected in a product called ServicePoint. Being that it is data about humans, sometimes it can get very complex, as humans are complex beings with ever changing needs and situations, not to mention there's actual humans entering the data. :) This particular dataset we are working with comes to us in the following format:

In an object called "Assessment_Data", are the following variables: Client_ID, Data_Element, Value, Date_Effective, Date_Added. It is almost 1 million records with Data_Elements ranging from Residence Prior, to Domestic Violence data. These questions are asked of clients at various stages of their involvement with the homeless services agencies in our area. Obviously over time, the answers will vary, as things change for us complicated humans.

In another object called "Enrollment", are the following variables: Client_ID, Enrollment_ID, Entry_Date, Exit_Date, Project_ID, Data_Collection_Stage, (and some unrelated other things.) The "key" of this table would really be the Enrollment_ID and the Data_Collection_Stage, as an Enrollment_ID could be repeated in the rows if any of the data collected at entry changed during the stay.

For this exercise, I joined the Assessment_Data and Enrollment data so that only the relevant columns are shown. Then I populated it with fake data.

So the problem here is that there is no ID field tying any of the Assessment data to an Enrollment ID. AND even if there was, we would also need to know what Data Collection Stage it applies to. If, when a client enters a project, the answer to "Receiving any income?" is "No", but during the project stay, they gain an income, then the client will have two answers for that question, with the "Effective Date" being different for each answer. Any time we go to look at what the answer is for any given point in time (like an Entry into a project or an Exit from a project), we want to see the most recent answer from the assessment data.

So with all that in mind, I'm going to share my sample data:

 1library("tidyverse")
 2library("lubridate")
 3client <- c("Mary", "Mary", "Mary", "Mary", "Rex", "Rex", "Jennifer", "Jennifer", "Jennifer", "Jennifer", "Jennifer", "Jennifer", "Hannah", "Hannah", "Hannah", "Hannah", "Hannah", "Hannah", "Lydia", "Lydia")
 4enrollment_id <- c(1552, 1552, 4828, 4828, 3831, 3171, 3336, 3336, 3336, 2554, 2554, 2554, 1778, 1778, 1131, 1131, 4062, 4062, 3199, 3199)
 5entry <- c("1/1/2016","1/1/2016","1/1/2018","1/1/2018","5/1/2017","3/1/2018","2/1/2018","2/1/2018","2/1/2018","3/15/2018","3/15/2018","3/15/2018","1/1/2018","1/1/2018","9/1/2016", "9/1/2016","1/1/2017","1/1/2017","6/1/2018","6/1/2018")
 6exit <- c("9/1/2018","9/1/2018","3/1/2018","3/1/2018","6/1/2017","5/1/2018","3/1/2018","3/1/2018","3/1/2018", "5/1/2018", "5/1/2018","5/1/2018","9/1/2018","9/1/2018","10/1/2016","10/1/2016","9/1/2017","9/1/2017","9/1/2018","9/1/2018")
 7monthly_income <- c(600, 0, 600, 0, 300, 700, 1400, 200, 1400, 1400, 200, 1400, 1200, 1500, 1200, 1500, 1200, 1500, 1600, 1800)
 8date_eff<- c("1/1/2013","1/1/2018","1/1/2013","1/1/2018","5/1/2017","3/1/2018","1/1/2016","2/1/2018","4/1/2018","1/1/2016","2/1/2018","4/1/2018","9/1/2016","9/1/2017","9/1/2016","9/1/2017","9/1/2016","9/1/2017","6/1/2018","9/1/2018")
 9manual <- c(1, 2, 0, 1, 1, 1, 0, 1, 0, 0, 1, 2, 0, 1, 1, 0, 1, 3, 1, 3)
10testdata <- data.frame(client, enrollment_id, entry, exit, monthly_income, date_eff, manual)
11rm(client, enrollment_id, entry, exit, monthly_income, date_eff, manual)

A client can have more than one Enrollment, and an Enrollment can have three Data Collection Stages. An Enrollment can only have one Entry Date and one Exit Date. The Exit Date can be null (though that is not represented in this sample data). The "manual" column represents what I'm saying the Data Collection Stage should be.

First, I tried a LOT of things. What I landed on is creating a tmp table, where I iteratively add a column to the testdata object that returns the date of the answer we want for that particular data collection stage. There are differing rules for each stage, and it was easier to separate them out than try to do them all in the same statement. Please offer suggestions if I'm missing something here. Anyway, the first Data Collection Stage is 1 and it says "this is the data element that was true as of the most recent assessment data as compared to the ENTRY date." Here's the code I wrote for that one:

1tmp <- testdata %>%
2group_by(enrollment_id) %>%
3mutate(datacollectionstage1 = max(mdy(date_eff) \ [mdy(entry) >= mdy(date_eff)]))

For the second Data Collection Stage, we are looking for the data element that changed DURING the program stay, so between the ENTRY and EXIT date. And we want the most recent one, not all of them. This one throws a warning that makes me nervous but I think it's just saying there's only ever one assessment answer between the Entry and Exit, and that is true, but in the live data, it is possible for there to be multiple answers between the Entry and Exit. Anyway, here's that code:

1tmp <- tmp %>%
2group_by(enrollment_id) %>% 
3mutate(datacollectionstage2 = max(mdy(date_eff) \ [mdy(entry) < mdy(date_eff) & 
4                                                          mdy(exit) > mdy(date_eff)]))
5
6# The warning I'm getting is:
7# In max.default(numeric(0), na.rm = FALSE): no non-missing arguments to max; returning -Inf

For the third Data Collection Stage, we are looking for any assessment data that changed as of the EXIT Date. Any data that changed AFTER the Exit Date is dead to us, so this makes this one kind of easy.

1tmp <- tmp %>%
2group_by(enrollment_id) %>%
3mutate(datacollectionstage3 = case_when(mdy(date_eff) == mdy(exit) ~ mdy(date_eff)))

So if you're watching the tmp table as you're going, you can see that it is placing dates into the rows where it makes sense and leaving all the rest as null.

(images/tmp-1024x408.png)

So how to get all of this so that it no longer has the three columns, but one column indicating which Enrollment and Data Collection Stage should have which assessment value??? I used this to overwrite my "testdata" file, added the collectionstage variable, matching the assessment effective dates to the datacollectionstage dates, and then removed all the unnecessary variables. If you leave the "manual" column in, you can see they line up perfectly.

 1testdata <- tmp %>%
 2mutate(collectionstage = 
 3case_when(
 4mdy(date_eff) == ymd(datacollectionstage1) ~ 1,
 5mdy(date_eff) == ymd(datacollectionstage2) ~ 2,
 6mdy(date_eff) == ymd(datacollectionstage3) ~ 3
 7),
 8datacollectionstage1 = NULL,
 9datacollectionstage2 = NULL,
10datacollectionstage3 = NULL,
11manual = NULL) %>% # leave this column in if you want to compare
12filter(!is.na(collectionstage))
13rm(tmp)

(images/final.png)

I think this is all right, but I will have to apply this logic to a LOT of assessment questions- this was obviously only one! Also I will need to think about how to make it so that in each space where there was no change for that data collection stage, it will show the previous data collection stage's answer. Once I'm done with that piece, I am thinking I will create a function that does this so that I can just put in the name of the data element and have it do this for each one in turn.

Thanks for reading! Hope this helps someone and please let me know if you can see room for improvement or have ideas about how to apply this across the other data elements!