R | Alan Yeung

Trying out timeplyr

The timeplyr R package, created by my colleague Nick, was accepted on CRAN in October 2023. To quote the CRAN page directly, it provides "a set of fast tidy functions for wrangling, completing and summarising date and date-time data". It looks like a really neat package for working with time series data in a way that is consistent with what people have become used to in the tidyverse. From my chats with Nick, I believe some of the ideas for this package were inspired by problems that came up repeatedly while working with COVID-19 data.

Grouped Sequences in dplyr Part 2

I just wrote a post about grouped sequences in dplyr and, following that, I’ve been made aware of another couple of solutions to this problem (credit: John Mackintosh). The solution involves using the consecutive_id() function, available in dplyr since v1.1.0. The help page for this function mentions that it was inspired by the rleid() function from the data.table package. These functions work similarly to the rle() function I used last time (in what I called ‘the complicated solution’) but provide neater outputs.
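As a rough illustration of the idea (not the code from the post itself), here is a minimal sketch of a grouped sequence built with consecutive_id(). The data frame, the team and venue columns and the run_id helper column are all made up for this example, and dplyr v1.1.0 or later is assumed.

```r
library(dplyr)

# Hypothetical dummy data: one team, with the venue of each match in order
matches <- tibble(
  team  = "A",
  venue = c("home", "away", "away", "home", "away", "away", "away")
)

result <- matches %>%
  group_by(team) %>%
  # consecutive_id() gives each run of identical venue values its own id
  mutate(run_id = consecutive_id(venue)) %>%
  group_by(team, run_id) %>%
  # count up within each run, keeping the counter for away runs only
  mutate(num_matches_away = if_else(venue == "away", row_number(), 0L)) %>%
  ungroup()

print(result)
```

Grouping by the run id is what makes the counter restart every time the venue changes.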

Grouped Sequences in dplyr

For a piece of work I had to calculate the number of matches that a team plays away from home in a row, which we will call days_on_the_road. I was not sure how to do this with dplyr but it’s basically a ‘grouped sequence’. For this post, I’ve created some dummy data to illustrate this idea. The num_matches_away variable is what we want to mimic using some data manipulation.
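To make the target concrete, here is a small base R sketch of a grouped sequence for a single team. The venue vector is invented for illustration, and this is only one possible approach, not necessarily the one used in the post.

```r
# Hypothetical dummy data: venue of each match for one team, in date order
venue <- c("home", "away", "away", "home", "away", "away", "away")

# rle() finds runs of identical values; sequence() counts 1..n within each run
runs <- rle(venue)
position_in_run <- sequence(runs$lengths)

# Keep the counter for away runs only; home matches are set to 0
num_matches_away <- ifelse(venue == "away", position_in_run, 0L)

print(data.frame(venue, num_matches_away))
```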

A couple of case_when() tricks

Combining case_when() and across()
If you want to apply case_when() across several variables using across(), here is an example that does this with the help of the get() and cur_column() functions.
library(tidyverse)
iris_df <- as_tibble(iris) %>%
  mutate(flag_Petal.Length = as.integer(Petal.Length > 1.5),
         flag_Petal.Width = as.integer(Petal.Width > 0.2))
iris_df %>%
  mutate(across(c(Petal.Length, Petal.Width), ~case_when(
    get(glue::glue("flag_{cur_column()}")) == 1 ~ NA_real_,
    TRUE ~ .x
  ))) %>%
  select(contains("Petal"))
## # A tibble: 150 × 4
## Petal.

Summarising Dates with Missing Values

This blog post is just a note that when you do a grouped summary of a date variable and some groups contain only missing values, the summary returns Inf for those groups. The result will therefore not show up as an NA, which can cause issues in analysis if you are not careful.
library(tidyverse)
df <- tibble::tribble(
  ~id, ~dt,
  1L, "01/01/2001",
  1L, NA,
  2L, NA,
  2L, NA
) %>%
  mutate(dt = dmy(dt))
z1 <- df %>%
  group_by(id) %>%
  summarise(dt_min = min(dt, na.
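The excerpt above is cut short, so the following is not the author's code. It is a minimal sketch of one way to guard against the problem, using a hypothetical helper called safe_min() that returns NA when a group has no non-missing dates. The data mirrors the example above.

```r
library(dplyr)
library(lubridate)

df <- tibble(
  id = c(1L, 1L, 2L, 2L),
  dt = dmy(c("01/01/2001", NA, NA, NA))
)

# Hypothetical helper: if every value is missing, return an NA of the same
# class as the input; otherwise take the minimum of the non-missing values
safe_min <- function(x) {
  if (all(is.na(x))) x[1] else min(x, na.rm = TRUE)
}

z_safe <- df %>%
  group_by(id) %>%
  summarise(dt_min = safe_min(dt))

print(z_safe)
```

With this helper, the group with only missing dates gets a proper NA that is.na() will pick up in later steps.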

Rafa 21 Grand Slams and gganimate

I’ve been a Nadal fan for a long time – right back to the days of the pirate-pants so yeah, really a long time. In all this time, Rafa has never been ahead in the grand slam race vs his biggest rivals… but that finally changed after the 2022 Australian Open! The win there was unexpected and came out of nowhere. The final against Medvedev has to go down as one of the best comebacks ever.

Filtering with string statements in dplyr

A question came up recently at work about how to use a filter statement entered as a complete string variable inside dplyr’s filter() function – for example dplyr::filter(my_data, "var1 == 'a'"). There does not seem to be much out there on this and I was not sure how to do it either, but luckily jakeybob had a neat solution that seems to work well.
some_data %>%
  filter(eval(rlang::parse_expr(selection_statement)))
Let’s see it in action using the iris flowers dataset.
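A minimal sketch of that pattern on iris might look like the following. The filter string is an arbitrary condition chosen for illustration, and the last line simply checks the result against the same condition written directly.

```r
library(dplyr)

# Illustrative filter condition stored as a single string
selection_statement <- "Species == 'setosa' & Sepal.Length > 5.5"

# parse_expr() turns the string into an expression; eval() then evaluates it
# inside filter(), where the column names can be found
filtered <- iris %>%
  filter(eval(rlang::parse_expr(selection_statement)))

# The same filter written directly, for comparison
direct <- iris %>%
  filter(Species == "setosa" & Sepal.Length > 5.5)

print(identical(filtered, direct))
```

Because the string is evaluated as code, this approach is best kept to filter statements from a trusted source.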

Updating packages on a drat repo

This is just a small note (mainly for myself, but hopefully of some use to a few others!) as a reminder of how to update a package on a drat repo.
1. Create the source file for the package you want to host on the drat repo using devtools::build().
2. Clone the drat repo hosting the package (in my case https://github.com/alan-y/drat).
3. Use drat::insertPackage("package-source.tar.gz", getwd()) to add the package to the drat repo (getwd() works for me if my working directory is at the top level of the drat repo).

Scotland's Most Popular Babynames

Sections: Downloading the data; Shiny App
I recently saw this great post on Nathan Yau’s FlowingData website which guesses a person’s name based on what the name starts with. You also need to select a gender and the decade you were born in before it can guess. Of course, it isn’t really a guess; it is just based on proportions calculated after restricting the data to what has been selected.

Trying the ckanr Package

Sections: How resources are grouped in CKAN; Initialising ckanr and exploring groups of resources; Connect to CKAN with dplyr and download from one resource; Downloading all resources from a dataset
In previous blog posts (Hacking dbplyr for CKAN, Getting Open Data into R from CKAN) I have been exploring how to download data from the NHS Scotland open data platform into R. I’ve recently discovered that rOpenSci has a package called ckanr to help with just this, and I wish I’d known about it earlier as it is really pretty handy!