install.packages("mdbplyr")mdbplyr: a tidy interface to MongoDB
mdbplyr is a new R package that provides a tidy interface to MongoDB, a popular NoSQL database. With mdbplyr, you can easily connect to your MongoDB database, query data using a tidy syntax, and manipulate the results using tidyverse tools. In this post, we will explore the features of mdbplyr and how it can simplify your workflow when working with MongoDB in R.
Introduction
We are developing mdbplyr, an R package that provides a disciplined, lazy dplyr-style interface for MongoDB aggregation pipelines.

mdbplyr sits between raw mongolite usage and broader compatibility layers. Compared with writing aggregation JSON by hand, it lets you express supported queries with tidy verbs and inspect the generated pipeline before execution. Compared with approaches that try to hide MongoDB behind table-like semantics, it stays explicit about scope, keeps translation conservative, and fails clearly when a verb or expression is outside the supported subset.
The practical advantages are:
- lazy query construction until
collect(), - direct translation to MongoDB aggregation stages,
show_query()for inspectable pipeline output,- explicit unsupported-feature errors instead of silent fallback,
- a smaller, testable semantic surface aligned with MongoDB rather than SQL.
This post is a quick introduction to mdbplyr with examples. The package is still in early development, so expect some rough edges and missing features. The API is not yet stable, and the documentation is still a work in progress. If you’re interested in trying it out or contributing, check out the GitHub repository.
Usage examples
Installation
Since April 22, 20226, mdbplyr is available on CRAN. You can install it with:
Since mdbplyr is still in early development, you can alternatively install the latest development version from GitHub using the devtools package:
install.packages("devtools")
devtools::install_github("pbosetti/mdbplyr")Loading data
Suppose there is a running MongoDB instance on localhost with default port and no authentication. The code below loads dplyr::starwars into a collection named starwars.
library(dplyr)
library(mongolite)
library(mdbplyr)
starwars_collection <- mongolite::mongo(
collection = "starwars",
db = "mdbplyr"
)
starwars_collection$drop()
starwars_collection$insert(dplyr::starwars)List of 5
$ nInserted : num 87
$ nMatched : num 0
$ nRemoved : num 0
$ nUpserted : num 0
$ writeErrors: list()
starwars_tbl <- tbl_mongo(
starwars_collection,
schema = names(dplyr::starwars)
)Once the collection is loaded, starwars_tbl is the lazy table used in the examples below.
library(dplyr)
library(mdbplyr)
starwars_collection <- mongolite::mongo(
collection = "starwars",
db = "mdbplyr"
)
starwars_tbl <- tbl_mongo(starwars_collection) %>%
infer_schema()Basic inspection
Inspect the known schema and the generated pipeline without executing the query.
schema_fields(starwars_tbl) [1] "_id" "name" "height" "mass" "hair_color"
[6] "skin_color" "eye_color" "birth_year" "sex" "gender"
[11] "homeworld" "species" "films" "vehicles" "starships"
starwars_tbl |>
filter(species == "Human", height > 180) |>
select(name, height, mass) |>
show_query()[
{
"$match": {
"$expr": {
"$and": {
"1": {
"$eq": [
"$species",
"Human"
]
},
"2": {
"$gt": [
"$height",
180
]
}
}
}
}
},
{
"$project": {
"name": 1,
"height": 1,
"mass": 1,
"_id": 0
}
}
]
Supported verbs
Each subsection below shows one of the supported dplyr-like verbs on the starwars collection.
filter()
starwars_tbl |>
filter(species == "Droid", height > 100) |>
collect()# A tibble: 87 × 15
`_id` name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 69e8ee44… Luke… 172 77 blond fair blue 19 male
2 69e8ee44… C-3PO 167 75 <NA> gold yellow 112 none
3 69e8ee44… R2-D2 96 32 <NA> white, bl… red 33 none
4 69e8ee44… Dart… 202 136 none white yellow 41.9 male
5 69e8ee44… Leia… 150 49 brown light brown 19 fema…
6 69e8ee44… Owen… 178 120 brown, gr… light blue 52 male
7 69e8ee44… Beru… 165 75 brown light blue 47 fema…
8 69e8ee44… R5-D4 97 32 <NA> white, red red NA none
9 69e8ee44… Bigg… 183 84 black light brown 24 male
10 69e8ee44… Obi-… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
select()
starwars_tbl |>
select(name, species, homeworld) |>
collect()# A tibble: 87 × 3
name homeworld species
<chr> <chr> <chr>
1 Luke Skywalker Tatooine Human
2 C-3PO Tatooine Droid
3 R2-D2 Naboo Droid
4 Darth Vader Tatooine Human
5 Leia Organa Alderaan Human
6 Owen Lars Tatooine Human
7 Beru Whitesun Lars Tatooine Human
8 R5-D4 Tatooine Droid
9 Biggs Darklighter Tatooine Human
10 Obi-Wan Kenobi Stewjon Human
# ℹ 77 more rows
rename()
starwars_tbl |>
rename(character_name = name, planet = homeworld) |>
collect()# A tibble: 87 × 15
`_id` character_name height mass hair_color skin_color eye_color birth_year
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
1 69e8e… Luke Skywalker 172 77 blond fair blue 19
2 69e8e… C-3PO 167 75 <NA> gold yellow 112
3 69e8e… R2-D2 96 32 <NA> white, bl… red 33
4 69e8e… Darth Vader 202 136 none white yellow 41.9
5 69e8e… Leia Organa 150 49 brown light brown 19
6 69e8e… Owen Lars 178 120 brown, gr… light blue 52
7 69e8e… Beru Whitesun… 165 75 brown light blue 47
8 69e8e… R5-D4 97 32 <NA> white, red red NA
9 69e8e… Biggs Darklig… 183 84 black light brown 24
10 69e8e… Obi-Wan Kenobi 182 77 auburn, w… fair blue-gray 57
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, planet <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
mutate()
starwars_tbl |>
mutate(height_m = height / 100) |>
mutate(bmi_like = mass / height_m^2 ) |> # can't be in the same mutate()!
select(name, height, mass, height_m, bmi_like) |>
collect()# A tibble: 87 × 5
name height mass height_m bmi_like
<chr> <int> <dbl> <dbl> <dbl>
1 Luke Skywalker 172 77 1.72 26.0
2 C-3PO 167 75 1.67 26.9
3 R2-D2 96 32 0.96 34.7
4 Darth Vader 202 136 2.02 33.3
5 Leia Organa 150 49 1.5 21.8
6 Owen Lars 178 120 1.78 37.9
7 Beru Whitesun Lars 165 75 1.65 27.5
8 R5-D4 97 32 0.97 34.0
9 Biggs Darklighter 183 84 1.83 25.1
10 Obi-Wan Kenobi 182 77 1.82 23.2
# ℹ 77 more rows
Currently, mdbplyr implements mapping between MongoDB operators and:
- Arithmetic/predicate:
%%,^,%in% - Logs/math:
log(x, base),log10(),floor(),ceiling(),trunc() - Trig/hyperbolic:
sin(),cos(),tan(),asin(),acos(),atan(),atan2(),sinh(),cosh(),tanh(),asinh(),acosh(),atanh() - Scalar min/max:
pmin(),pmax() - String ops:
tolower(),toupper(),nchar(),paste(),paste0(),substr(),substring() - Literal vector support for expressions like
x %in% c(1, 3, 5)
Note though, that within the same mutate call you cannot use a newly created field in another expression, as the translation is currently one-to-one with MongoDB’s $addFields stage. So you need to split the BMI-like calculation into two mutate() calls, as in te above example.
transmute()
starwars_tbl |>
transmute(name = name, height_m = height / 100) |>
collect()# A tibble: 87 × 2
name height_m
<chr> <dbl>
1 Luke Skywalker 1.72
2 C-3PO 1.67
3 R2-D2 0.96
4 Darth Vader 2.02
5 Leia Organa 1.5
6 Owen Lars 1.78
7 Beru Whitesun Lars 1.65
8 R5-D4 0.97
9 Biggs Darklighter 1.83
10 Obi-Wan Kenobi 1.82
# ℹ 77 more rows
arrange()
starwars_tbl |>
arrange(desc(height), name) |>
select(name, height) |>
slice_head(n = 10) |>
collect()# A tibble: 10 × 2
name height
<chr> <int>
1 Yarael Poof 264
2 Tarfful 234
3 Lama Su 229
4 Chewbacca 228
5 Roos Tarpals 224
6 Grievous 216
7 Taun We 213
8 Rugor Nass 206
9 Tion Medon 206
10 Darth Vader 202
group_by()
starwars_tbl |>
group_by(species)<tbl_mongo> collection
Filters: 0
Projection: <all>
Computed: <none>
Groups: species
Summaries: <none>
Slice: <none>
Manual stages: 0
summarise()
starwars_tbl |>
group_by(species) |>
summarise(
n = n(),
avg_height = mean(height),
max_mass = max(mass)
) |>
arrange(desc(n)) |>
collect()# A tibble: 38 × 4
species n avg_height max_mass
<chr> <int> <dbl> <dbl>
1 Human 35 178 136
2 Droid 6 131. 140
3 <NA> 4 175 110
4 Gungan 3 209. 82
5 Twi'lek 2 179 55
6 Kaminoan 2 221 88
7 Wookiee 2 231 136
8 Zabrak 2 173 80
9 Mirialan 2 168 56.2
10 Muun 1 191 NA
# ℹ 28 more rows
slice_head()
starwars_tbl |>
select(name, species) |>
slice_head(n = 5) |>
collect()# A tibble: 5 × 2
name species
<chr> <chr>
1 Luke Skywalker Human
2 C-3PO Droid
3 R2-D2 Droid
4 Darth Vader Human
5 Leia Organa Human
head()
head(starwars_tbl, 5) |>
collect()# A tibble: 5 × 15
`_id` name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 69e8ee442… Luke… 172 77 blond fair blue 19 male
2 69e8ee442… C-3PO 167 75 <NA> gold yellow 112 none
3 69e8ee442… R2-D2 96 32 <NA> white, bl… red 33 none
4 69e8ee442… Dart… 202 136 none white yellow 41.9 male
5 69e8ee442… Leia… 150 49 brown light brown 19 fema…
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
Notes
The examples above stay within the currently supported subset:
- explicit field references,
- scalar expressions for
mutate()andtransmute(), - conservative grouped summaries,
- no joins, window functions, or automatic client-side fallback.
If a query falls outside that subset, mdbplyr is designed to fail explicitly rather than guess or silently change execution semantics.
That’s all, folks!