R: filtering with NA values
NA (Not Available / Not Applicable) is R’s way of denoting empty or missing values. When doing comparisons - such as equal to, greater than, etc. - extra care and thought need to go into how missing values (NAs) are handled. More explanation can be found in Chapter 2: R basics of our book, which is freely available at the HealthyR website.
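To see why this needs care (a quick base-R illustration, not from the original post): comparisons involving NA return NA rather than TRUE or FALSE, so conditions built with ==, >, etc. do not behave as you might expect for missing values.

NA == "Value"
## [1] NA
NA > 1
## [1] NA
is.na(NA)
## [1] TRUE

This is why is.na() (rather than a plain comparison) is used in the filtering examples below.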
This post lists a couple of different ways of keeping or discarding rows based on how important the variables with missing values are to you.
For example, I want to keep rows that have a value for important_a and/or important_b (so rows 1, 3, 4). I don’t care whether whatever_c is empty or not, but I do want to keep the column.
library(tidyverse)
mydata = tibble(important_a = c("Value", NA, "Value", NA, NA),
important_b = c(NA, NA, "Value", "Value", NA),
whatever_c = c(NA, "Value", NA, NA, NA))
mydata %>% knitr::kable()
| important_a | important_b | whatever_c |
|---|---|---|
| Value | NA | NA |
| NA | NA | Value |
| Value | Value | NA |
| NA | Value | NA |
| NA | NA | NA |
Some functions for missing values are very useful, but don’t do quite what I want here:
- This keeps complete cases based on all columns:
mydata %>%
drop_na()
## # A tibble: 0 × 3
## # … with 3 variables: important_a <chr>,
## # important_b <chr>, whatever_c <chr>
(Returns 0 rows, as we don’t have any rows where all 3 columns have a value.)
- This keeps complete cases based on specified columns:
mydata %>%
drop_na(important_a, important_b)
## # A tibble: 1 × 3
## important_a important_b whatever_c
## <chr> <chr> <chr>
## 1 Value Value <NA>
This only keeps the row where both important_a and important_b have a value.
- This keeps rows that have a value in any column:
mydata %>%
filter_all(any_vars(! is.na(.)))
## # A tibble: 4 × 3
## important_a important_b whatever_c
## <chr> <chr> <chr>
## 1 Value <NA> <NA>
## 2 <NA> <NA> Value
## 3 Value Value <NA>
## 4 <NA> Value <NA>
The third example is better achieved using the janitor package:
mydata %>%
janitor::remove_empty()
## # A tibble: 4 × 3
## important_a important_b whatever_c
## <chr> <chr> <chr>
## 1 Value <NA> <NA>
## 2 <NA> <NA> Value
## 3 Value Value <NA>
## 4 <NA> Value <NA>
Now, (3) is pretty close, but I’m still not interested in row 2, where both important_a and important_b are empty; it is only kept because whatever_c has a value.
- Simple solution
A quick solution is to use ! is.na() for each variable inside a filter():
mydata %>%
filter(! is.na(important_a) | ! is.na(important_b))
## # A tibble: 3 × 3
## important_a important_b whatever_c
## <chr> <chr> <chr>
## 1 Value <NA> <NA>
## 2 Value Value <NA>
## 3 <NA> Value <NA>
And this is definitely what I do when I only have a couple of these variables. But if you have tens of them, the filtering logic becomes horrendously long and it’s easy to miss one out or make a mistake.
- Powerful solution:
A scalable solution is to use filter_at() with vars() and a select helper (e.g., starts_with()), combined with the any_vars(! is.na(.)) condition that was introduced in (3).
mydata %>%
filter_at(vars(starts_with("important_")), any_vars(! is.na(.)))
## # A tibble: 3 × 3
## important_a important_b whatever_c
## <chr> <chr> <chr>
## 1 Value <NA> <NA>
## 2 Value Value <NA>
## 3 <NA> Value <NA>
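As an aside (not part of the original post): in more recent dplyr versions (1.0.4 and later), filter_at() and any_vars() are superseded, and the same result can be achieved with if_any() inside an ordinary filter(). A rough equivalent would be:

mydata %>%
  filter(if_any(starts_with("important_"), ~ ! is.na(.x)))

which should return the same three rows as above.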