Summarising Dates with Missing Values

This blog post is just a note that when you try to do a grouped summary of a date variable but some groups have all missing values, it will return Inf. This means that the summary will not show up as an NA and this can cause issues in analysis if you are not careful.

library(tidyverse)

df <- tibble::tribble(
    ~id,          ~dt,
    1L, "01/01/2001",
    1L,           NA,
    2L,           NA,
    2L,           NA
) %>% 
    mutate(dt = dmy(dt))

z1 <- df %>% 
    group_by(id) %>% 
    summarise(dt_min = min(dt, na.rm = TRUE),
              .groups = "drop")

z1
# A tibble: 2 × 2
#      id dt_min    
#   <int> <date>    
# 1     1 2001-01-01
# 2     2 Inf

sum(is.na(z1$dt_min))
# [1] 0

There are a couple of ways around this. Firstly you can use an if() statement.

z2 <- df %>% 
    group_by(id) %>% 
    summarise(dt_min = if (all(is.na(dt))) NA_Date_ else min(dt, na.rm = TRUE),
              .groups = "drop")

z2
# A tibble: 2 × 2
#      id dt_min    
#   <int> <date>    
# 1     1 2001-01-01
# 2     2 NA  
sum(is.na(z2$dt_min))
# [1] 1

Or you can summary functions from the hablar package.

z3 <- df %>% 
    group_by(id) %>% 
    summarise(dt_min = hablar::min_(dt),
              .groups = "drop")

z3
# A tibble: 2 × 2
#      id dt_min    
#   <int> <date>    
# 1     1 2001-01-01
# 2     2 NA  
sum(is.na(z3$dt_min))
# [1] 1

Is there a reason why R decides to return Inf when summarising dates? Are there any other solutions to summarising date variables that contain missing values? Leave me a comment if you know thanks.

Avatar
Alan Yeung
Research Fellow and Healthcare Scientist

Applied statistician, currently working mainly on blood borne viruses and drugs. Supporter of all things R.

Related