Filtering with string statements in dplyr

A question came up recently at work about how to use a filter statement entered as a complete string variable inside dplyr’s filter() function – for example dplyr::filter(my_data, "var1 == 'a'"). There does not seem to be much out there on this and I was not sure how to do it either but luckily jakeybob had a neat solution that seems to work well.

some_data %>% 
    filter(eval(rlang::parse_expr(selection_statement)))

Let’s see it in action using the iris flowers dataset. First note how many records there are for each species (n = 50 for each) so we can check that the filtering has worked later.

library(tidyverse)

iris2 <- as_tibble(iris)
count(iris2, Species)
# # A tibble: 3 x 2
#   Species        n
#   <fct>      <int>
# 1 setosa        50
# 2 versicolor    50
# 3 virginica     50

Now filter to get only setosa records and we can see only 50 records so that’s worked.

selection_statement <- "Species == 'setosa'"

iris2 %>% 
    filter(rlang::eval_tidy(rlang::parse_expr(selection_statement)))
# # A tibble: 50 x 5
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1          5.1         3.5          1.4         0.2 setosa 
#  2          4.9         3            1.4         0.2 setosa 
#  3          4.7         3.2          1.3         0.2 setosa 
#  4          4.6         3.1          1.5         0.2 setosa 
#  5          5           3.6          1.4         0.2 setosa

I thought this method might fail if we create a variable called Species in the global environment but it still works completely fine which is great!

Species <- "abc"

iris2 %>% 
    filter(rlang::eval_tidy(rlang::parse_expr(selection_statement)))
# # A tibble: 50 x 5
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1          5.1         3.5          1.4         0.2 setosa 
#  2          4.9         3            1.4         0.2 setosa 
#  3          4.7         3.2          1.3         0.2 setosa 
#  4          4.6         3.1          1.5         0.2 setosa 
#  5          5           3.6          1.4         0.2 setosa

So it makes me wonder why there is nothing much out there on this? My feeling is that something will make this method fail but what is it? Where does it fail? Let me know in the comments if you know please, thanks!

Avatar
Alan Yeung
Research Fellow and Healthcare Scientist

Applied statistician, currently working mainly on blood borne viruses and drugs. Supporter of all things R.