Hard-coded dates in your code make date-related errors easy to make and hard to find, even if you are good about commenting your code. Recently I reorganized how I track all the dates I use, and I wanted to share that process here!
The immediate problem
When I began R minor, I started by pulling in two years of data, and I wrote all my reports with the assumption that there would be exactly two years of data to work with. With fiscal years and reporting deadlines fluctuating, though, I quickly realized this was going to have to change and that I needed more control over the process.
After that, I switched to a "we're going back to the beginning of the year two years ago, until federal reporting is done" kind of system. Recently, federal reporting was done and it was time to stop including the data from three years ago.
But I felt pretty nervous about it because I knew I had some dates that depended on the way things were. I could not have told you, off the top of my head, exactly what effects moving up that start date would have on everything. And that is what people mean by "brittle code".
The broad strategy
If I want to be able to consciously decide not only what date to pull data back to, but also, say, what date we need users to check their Data Quality back to, or the date our CoC decided to start collecting a certain piece of data, then I need to know where all of that information lives. It should really all be in one place.
So I created a dates.R script that simply gives names to various dates. It is organized into different kinds of dates:
- dates that are specific to our particular circumstances, that we decide based on events, deadlines, etc.
- dates that are calculated
- dates that are pulled from metadata, about files or datasets
I also decided we would need a good naming convention, so that whenever I want to refer to an existing date somewhere in my code, I can type a prefix and see what is available. Given the three kinds of dates listed above, I landed on three prefixes: "hc" for hard-coded dates, "calc" for calculated dates, and "meta" for dates pulled from metadata.
Here are a few examples from each type of date:
library(lubridate)

# hc = hard-coded here, used elsewhere
# meta = result comes from meta data
# calc = result is calculated

# Hard-coded Dates --------------------------------------------------------

hc_data_goes_back_to <- mdy("01012019")

hc_check_dq_back_to <- mdy("10012019") # the default ReportStart for DQ reporting

### cut ###

# Dates from Metadata -----------------------------------------------------

meta_HUDCSV_Export_Date <- read_csv("data/Export.csv",
                                    col_types = c("iicccccccTDDcciii")) %>%
  mutate(ExportDate = ymd_hms(ExportDate)) %>%
  pull(ExportDate)

### cut ###

# Calculated Dates --------------------------------------------------------

calc_full_date_range <- interval(ymd(meta_HUDCSV_Export_End),
                                 ymd(calc_data_goes_back_to))

### cut ###
This script is sourced into the script that builds the main data set I start from when writing reports, so that I always have access to these dates. Go here to see the full script.
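As a minimal sketch of what that sourcing looks like (hc_check_dq_back_to is one of the dates defined in the script above; the surrounding report code is hypothetical):

```r
library(lubridate)

# Load the shared date definitions before anything else
source("dates.R")

# Every named date is now in scope, so a report can use the
# default DQ ReportStart instead of a hard-coded literal
report_start <- hc_check_dq_back_to
report_end <- today()
```

The payoff is that a report never states a date itself; it only refers to a name whose value is set in exactly one file.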
The easiest step was creating the script; the more interesting part was finding all the places where I had hard-coded a date!
Did you know you can do a project-wide search in RStudio? You can! Press Ctrl-Shift-F and RStudio will search every script and file in your project for your search string and show the results in a Find tab. Click the line you're interested in and RStudio will open the script and take you right to that line.
So, over the course of several days, I did project-wide searches for things like '2017', '2018', and on up. I didn't really replace anything at first; I just used the search to find which dates I was hard-coding so I could move them into my dates script. I kept refining the script as I found new dates I had completely forgotten about.
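The same kind of project-wide search can be scripted in base R, which is handy if you want to audit for hard-coded years outside RStudio. A sketch (the helper name and the demo project directory are made up for illustration):

```r
# Mimic RStudio's Ctrl-Shift-F: search every .R file under a
# directory for a string and report file, line number, and text
find_hardcoded <- function(root, pattern) {
  files <- list.files(root, pattern = "\\.R$",
                      recursive = TRUE, full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines, fixed = TRUE)
    if (length(idx) == 0) return(NULL)
    data.frame(file = f, line = idx, text = lines[idx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Demo on a throwaway project directory
proj <- file.path(tempdir(), "demo_project")
dir.create(proj, showWarnings = FALSE)
writeLines('served <- filter(df, Date >= ymd("20190101"))',
           file.path(proj, "report.R"))
find_hardcoded(proj, "2019")
```

Running the same helper once per year ('2017', '2018', ...) gives a checklist of every hard-coded date left to migrate.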
I found that I was at least good about commenting what the hard-coded dates represented, but I was surprised at how many there were, spread across so many scripts. I also found that setting "ReportStart" and "ReportEnd" at the tops of many of my scripts (which were often run one after the other) was definitely a bad practice, even if it made running a single script convenient.
Once I had found all the dates in the three main repos I use (COHHIO_HMIS, Rminor, and Rminor_elevated) and had them organized in the dates.R script, I started sourcing the script everywhere and putting the newly named dates into use.
Here's what some [simplified] code looked like before and after:
# BEFORE:
missing_vaccine_exited <- served_in_date_range %>%
  filter(is.na(ExitDate) |
           ymd(ExitDate) >= ymd("20210205")) # BoS started collecting this

# AFTER:
missing_vaccine_exited <- served_in_date_range %>%
  filter(is.na(ExitDate) |
           ymd(ExitDate) >= ymd(hc_bos_start_vaccine_data))
At the end of it, I had no stray dates in my code, and I felt confident about moving the Export Start Date up a year. My users were super relieved that we moved the default Data Quality Start Date up a year as well.
My favorite thing about this change is knowing all my dates are in one place: if something does change, I can change it there and know it will propagate correctly. All of this makes my code less brittle and easier to understand.