R : Tidyverse :: Minecraft : Return to Middle Earth
I'm writing this for my homeless services colleagues who may be considering R. A basic understanding of what the Tidyverse is can help you decide where to begin learning the data manipulation part of what we do.
R is a language and environment actively maintained by contributors to the project and the R Foundation. Its plotting and data analysis tools are relied upon by an estimated 1 to 2 million users worldwide. It's powerful and well-respected.
A cool thing about R (and many open source projects) is one's ability to create and share packages that are useful to others. For example, one could write an R package that contains common functions we tend to need in homeless services data, like calculating chronicity or filtering clients based on when they were served.
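As a rough sketch of what one such function might look like, here is a filter that keeps enrollments overlapping a reporting period. The function name `served_between()` and the column names `PersonalID`, `EntryDate`, and `ExitDate` are my own placeholders for illustration, not part of any existing package, and the data is made up.

```r
library(dplyr)

# Hypothetical helper: keep enrollments that overlap a reporting period.
# Assumes Date columns EntryDate and ExitDate; an NA ExitDate means "still enrolled".
served_between <- function(enrollments, start_date, end_date) {
  enrollments %>%
    filter(
      EntryDate <= end_date,
      is.na(ExitDate) | ExitDate >= start_date
    )
}

# Made-up example data
enrollments <- tibble(
  PersonalID = c(1, 2, 3),
  EntryDate = as.Date(c("2019-01-15", "2018-06-01", "2019-08-01")),
  ExitDate = as.Date(c("2019-03-01", "2018-12-31", NA))
)

# Only client 1 was enrolled at some point in the first half of 2019
served <- served_between(enrollments, as.Date("2019-01-01"), as.Date("2019-06-30"))
```

Bundling a handful of helpers like this into a package is what lets a whole team reuse the same definitions instead of rewriting them in every script.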
If you ever played Minecraft in the old days of Mojang, it's like that: there's "vanilla" Minecraft, or the base game, where there are just a few biomes, trees, water, some basic mobs, your basic ores, and recipes for basic tools. And then there are "mods" created by players that you can add to your game to do things like add different kinds of ores, new recipes, new biomes, different sounds, anything you can think of. The problem is that not all mods work well together in the same game, and many times mods are created and then abandoned, not keeping up with the Minecraft updates.
Carrying on with the Minecraft comparison, there are also "modpacks", which are basically collections of individual mods that are known to work well together in the same game. If you're familiar with Minecraft and you've ever played Hexxit or the Minecraft Pokemon or Return to Middle Earth, then you've used a modpack to increase the capabilities of Minecraft.
Back to R, the Tidyverse is a collection of available R packages that work well together to increase the usability of R. (Kind of like a Minecraft modpack.) Just like Return to Middle Earth is centered around the Lord of the Rings trilogy, the Tidyverse's core philosophy is centered around the concept of "tidy" data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable. There are, of course, many ways to collect, gather, and store data. The Tidyverse helps you get your data into the most efficient shape.
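To make "tidy" a little more concrete, here is a small sketch with made-up numbers. The table and its column names (`project`, `jan`, `feb`, `clients_served`) are assumptions for illustration only. In the untidy version, the month is hidden in the column names; `pivot_longer()` from tidyr moves it into a column of its own.

```r
library(tidyr)
library(tibble)

# Untidy: one column per month, so the "month" variable lives in the column names
untidy <- tibble(
  project = c("Shelter A", "Shelter B"),
  jan = c(40, 25),
  feb = c(38, 30)
)

# Tidy: every variable (project, month, clients_served) gets its own column
tidy <- pivot_longer(
  untidy,
  cols = c(jan, feb),
  names_to = "month",
  values_to = "clients_served"
)
```

The tidy version has four rows (one per project per month) and three columns, which is the shape the other Tidyverse packages expect.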
The packages included in the Tidyverse (core) are:
- ggplot2 - creating graphics based on the Grammar of Graphics
- dplyr - provides a grammar of data manipulation
- tidyr - provides a set of functions that helps you get to "tidy data"
- readr - helps to parse data found in the wild
- purrr - set of tools for working with functions and vectors
- tibble - Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code
- stringr - designed to make working with strings as easy as possible
- forcats - tools that solve common problems with factors and categorical data
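To give a feel for what a "grammar of data manipulation" means in practice, here is a minimal dplyr sketch. The data and column names (`ProjectType`, `DaysEnrolled`) are invented for illustration; the point is only that each verb does one thing and they chain together.

```r
library(dplyr)

# Made-up enrollment data
enrollments <- tibble(
  PersonalID = 1:5,
  ProjectType = c("Emergency Shelter", "Emergency Shelter",
                  "Rapid Rehousing", "Rapid Rehousing", "Rapid Rehousing"),
  DaysEnrolled = c(10, 30, 90, 120, 60)
)

summary_by_type <- enrollments %>%
  filter(DaysEnrolled >= 30) %>%          # keep stays of 30 days or more
  group_by(ProjectType) %>%               # one group per project type
  summarise(
    clients = n(),                        # how many enrollments in each group
    avg_days = mean(DaysEnrolled)         # average length of stay
  ) %>%
  arrange(desc(clients))                  # largest group first
```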
There are other packages that are included in the tidyverse as well, but they must be loaded separately if you want the functionality they bring. I will just name a few that have been relevant in my work with HMIS:
- lubridate - for working with dates, date-times, and date intervals
- readxl - for reading in Excel files
- xml2 - for reading in xml data
You can install the core packages of the tidyverse by running the following:
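Assuming you're installing from CRAN with R's default settings, it's a single line:

```r
# Installs the core packages along with the other tidyverse packages (e.g. readxl, xml2)
install.packages("tidyverse")
```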
To load the core packages of the tidyverse, you would run:
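This attaches the core packages; the non-core ones need their own `library()` call:

```r
# Attaches the core tidyverse packages for the current R session
library(tidyverse)
```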
Often my scripts begin with:
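Something along these lines, though the exact list here is an assumption and changes from script to script:

```r
# Assumed typical header; the packages beyond tidyverse vary by script
library(tidyverse)
library(lubridate)
```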
Lubridate is very important to our work because so many of our calculations involve dates and intervals. For a long time lubridate was not included in the core set of Tidyverse packages (it joined the core in tidyverse 2.0.0), presumably because much of the scientific and academic community that uses R does not work with dates as heavily as we do. Lubridate is especially useful to those of us in homeless services for its ability to work with date intervals like program stays and overlaps.
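Here is a small sketch of the kind of interval work I mean, using made-up dates for two hypothetical program stays. `interval()`, `int_overlaps()`, `%within%`, and `time_length()` are all lubridate functions; the variable names are just for illustration.

```r
library(lubridate)

# Two made-up program stays for the same client
stay_a <- interval(ymd("2019-01-15"), ymd("2019-03-01"))
stay_b <- interval(ymd("2019-02-20"), ymd("2019-04-10"))

# Do the two stays overlap?
stays_overlap <- int_overlaps(stay_a, stay_b)

# Was the client in stay_a on a given date?
served_on_feb_1 <- ymd("2019-02-01") %within% stay_a

# Length of stay_a in days
stay_a_days <- time_length(stay_a, unit = "day")
```

With these dates, the stays overlap (the second begins before the first ends), February 1 falls inside the first stay, and the first stay is 45 days long.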
I know I've mentioned this before, but I highly recommend working through the R for Data Science book for a really good grounding in the Tidyverse. Buy it online or refer to it here: https://r4ds.had.co.nz/