mdbplyr: a tidy interface to MongoDB

post

database

MongoDB

packages

milestone

Author

Paolo Bosetti

Published

April 16, 2026

Abstract

mdbplyr is a new R package that provides a tidy interface to MongoDB, a popular NoSQL database. With mdbplyr, you can easily connect to your MongoDB database, query data using a tidy syntax, and manipulate the results using tidyverse tools. In this post, we will explore the features of mdbplyr and how it can simplify your workflow when working with MongoDB in R.

Introduction

We are developing mdbplyr, an R package that provides a disciplined, lazy dplyr-style interface for MongoDB aggregation pipelines.

mdbplyr sits between raw mongolite usage and broader compatibility layers. Compared with writing aggregation JSON by hand, it lets you express supported queries with tidy verbs and inspect the generated pipeline before execution. Compared with approaches that try to hide MongoDB behind table-like semantics, it stays explicit about scope, keeps translation conservative, and fails clearly when a verb or expression is outside the supported subset.

The practical advantages are:

lazy query construction until collect(),
direct translation to MongoDB aggregation stages,
show_query() for inspectable pipeline output,
explicit unsupported-feature errors instead of silent fallback,
a smaller, testable semantic surface aligned with MongoDB rather than SQL.

Important

This post is a quick introduction to mdbplyr with examples. The package is still in early development, so expect some rough edges and missing features. The API is not yet stable, and the documentation is still a work in progress. If you’re interested in trying it out or contributing, check out the GitHub repository.

Usage examples

Installation

Since April 22, 20226, mdbplyr is available on CRAN. You can install it with:

install.packages("mdbplyr")

Since mdbplyr is still in early development, you can alternatively install the latest development version from GitHub using the devtools package:

install.packages("devtools")
devtools::install_github("pbosetti/mdbplyr")

Loading data

Suppose there is a running MongoDB instance on localhost with default port and no authentication. The code below loads dplyr::starwars into a collection named starwars.

library(dplyr)
library(mongolite)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_collection$drop()
starwars_collection$insert(dplyr::starwars)

List of 5
 $ nInserted  : num 87
 $ nMatched   : num 0
 $ nRemoved   : num 0
 $ nUpserted  : num 0
 $ writeErrors: list()

starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = names(dplyr::starwars)
)

Once the collection is loaded, starwars_tbl is the lazy table used in the examples below.

library(dplyr)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_tbl <- tbl_mongo(starwars_collection) %>% 
  infer_schema()

Basic inspection

Inspect the known schema and the generated pipeline without executing the query.

schema_fields(starwars_tbl)

 [1] "_id"        "name"       "height"     "mass"       "hair_color"
 [6] "skin_color" "eye_color"  "birth_year" "sex"        "gender"    
[11] "homeworld"  "species"    "films"      "vehicles"   "starships"

starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  show_query()

[
  {
    "$match": {
      "$expr": {
        "$and": {
          "1": {
            "$eq": [
              "$species",
              "Human"
            ]
          },
          "2": {
            "$gt": [
              "$height",
              180
            ]
          }
        }
      }
    }
  },
  {
    "$project": {
      "name": 1,
      "height": 1,
      "mass": 1,
      "_id": 0
    }
  }
]

Supported verbs

Each subsection below shows one of the supported dplyr-like verbs on the starwars collection.

`filter()`

starwars_tbl |>
  filter(species == "Droid", height > 100) |>
  collect()

# A tibble: 87 × 15
   `_id`     name  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>     <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 6a1ffb87… Luke…    172    77 blond      fair       blue            19   male 
 2 6a1ffb87… C-3PO    167    75 <NA>       gold       yellow         112   none 
 3 6a1ffb87… R2-D2     96    32 <NA>       white, bl… red             33   none 
 4 6a1ffb87… Dart…    202   136 none       white      yellow          41.9 male 
 5 6a1ffb87… Leia…    150    49 brown      light      brown           19   fema…
 6 6a1ffb87… Owen…    178   120 brown, gr… light      blue            52   male 
 7 6a1ffb87… Beru…    165    75 brown      light      blue            47   fema…
 8 6a1ffb87… R5-D4     97    32 <NA>       white, red red             NA   none 
 9 6a1ffb87… Bigg…    183    84 black      light      brown           24   male 
10 6a1ffb87… Obi-…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

`select()`

starwars_tbl |>
  select(name, species, homeworld) |>
  collect()

# A tibble: 87 × 3
   name               homeworld species
   <chr>              <chr>     <chr>  
 1 Luke Skywalker     Tatooine  Human  
 2 C-3PO              Tatooine  Droid  
 3 R2-D2              Naboo     Droid  
 4 Darth Vader        Tatooine  Human  
 5 Leia Organa        Alderaan  Human  
 6 Owen Lars          Tatooine  Human  
 7 Beru Whitesun Lars Tatooine  Human  
 8 R5-D4              Tatooine  Droid  
 9 Biggs Darklighter  Tatooine  Human  
10 Obi-Wan Kenobi     Stewjon   Human  
# ℹ 77 more rows

`rename()`

starwars_tbl |>
  rename(character_name = name, planet = homeworld) |>
  collect()

# A tibble: 87 × 15
   `_id`  character_name height  mass hair_color skin_color eye_color birth_year
   <chr>  <chr>           <int> <dbl> <chr>      <chr>      <chr>          <dbl>
 1 6a1ff… Luke Skywalker    172    77 blond      fair       blue            19  
 2 6a1ff… C-3PO             167    75 <NA>       gold       yellow         112  
 3 6a1ff… R2-D2              96    32 <NA>       white, bl… red             33  
 4 6a1ff… Darth Vader       202   136 none       white      yellow          41.9
 5 6a1ff… Leia Organa       150    49 brown      light      brown           19  
 6 6a1ff… Owen Lars         178   120 brown, gr… light      blue            52  
 7 6a1ff… Beru Whitesun…    165    75 brown      light      blue            47  
 8 6a1ff… R5-D4              97    32 <NA>       white, red red             NA  
 9 6a1ff… Biggs Darklig…    183    84 black      light      brown           24  
10 6a1ff… Obi-Wan Kenobi    182    77 auburn, w… fair       blue-gray       57  
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, planet <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

`mutate()`

starwars_tbl |>
  mutate(height_m = height / 100) |>
  mutate(bmi_like = mass / height_m^2 ) |> # can't be in the same mutate()!
  select(name, height, mass, height_m, bmi_like) |>
  collect()

# A tibble: 87 × 5
   name               height  mass height_m bmi_like
   <chr>               <int> <dbl>    <dbl>    <dbl>
 1 Luke Skywalker        172    77     1.72     26.0
 2 C-3PO                 167    75     1.67     26.9
 3 R2-D2                  96    32     0.96     34.7
 4 Darth Vader           202   136     2.02     33.3
 5 Leia Organa           150    49     1.5      21.8
 6 Owen Lars             178   120     1.78     37.9
 7 Beru Whitesun Lars    165    75     1.65     27.5
 8 R5-D4                  97    32     0.97     34.0
 9 Biggs Darklighter     183    84     1.83     25.1
10 Obi-Wan Kenobi        182    77     1.82     23.2
# ℹ 77 more rows

Important

Currently, mdbplyr implements mapping between MongoDB operators and:

Arithmetic/predicate: %%, ^, %in%
Logs/math: log(x, base), log10(), floor(), ceiling(), trunc()
Trig/hyperbolic: sin(), cos(), tan(), asin(), acos(), atan(), atan2(), sinh(), cosh(), tanh(), asinh(), acosh(), atanh()
Scalar min/max: pmin(), pmax()
String ops: tolower(), toupper(), nchar(), paste(), paste0(), substr(), substring()
Literal vector support for expressions like x %in% c(1, 3, 5)

Note though, that within the same mutate call you cannot use a newly created field in another expression, as the translation is currently one-to-one with MongoDB’s $addFields stage. So you need to split the BMI-like calculation into two mutate() calls, as in te above example.

`transmute()`

starwars_tbl |>
  transmute(name = name, height_m = height / 100) |>
  collect()

# A tibble: 87 × 2
   name               height_m
   <chr>                 <dbl>
 1 Luke Skywalker         1.72
 2 C-3PO                  1.67
 3 R2-D2                  0.96
 4 Darth Vader            2.02
 5 Leia Organa            1.5 
 6 Owen Lars              1.78
 7 Beru Whitesun Lars     1.65
 8 R5-D4                  0.97
 9 Biggs Darklighter      1.83
10 Obi-Wan Kenobi         1.82
# ℹ 77 more rows

`arrange()`

starwars_tbl |>
  arrange(desc(height), name) |>
  select(name, height) |>
  slice_head(n = 10) |>
  collect()

# A tibble: 10 × 2
   name         height
   <chr>         <int>
 1 Yarael Poof     264
 2 Tarfful         234
 3 Lama Su         229
 4 Chewbacca       228
 5 Roos Tarpals    224
 6 Grievous        216
 7 Taun We         213
 8 Rugor Nass      206
 9 Tion Medon      206
10 Darth Vader     202

`group_by()`

starwars_tbl |>
  group_by(species)

<tbl_mongo> collection 
  Filters: 0 
  Projection: <all> 
  Computed: <none> 
  Groups: species 
  Summaries: <none> 
  Slice: <none> 
  Manual stages: 0

`summarise()`

starwars_tbl |>
  group_by(species) |>
  summarise(
    n = n(),
    avg_height = mean(height),
    max_mass = max(mass)
  ) |>
  arrange(desc(n)) |>
  collect()

# A tibble: 38 × 4
   species      n avg_height max_mass
   <chr>    <int>      <dbl>    <dbl>
 1 Human       35       178     136  
 2 Droid        6       131.    140  
 3 <NA>         4       175     110  
 4 Gungan       3       209.     82  
 5 Zabrak       2       173      80  
 6 Twi'lek      2       179      55  
 7 Mirialan     2       168      56.2
 8 Wookiee      2       231     136  
 9 Kaminoan     2       221      88  
10 Cerean       1       198      82  
# ℹ 28 more rows

`slice_head()`

starwars_tbl |>
  select(name, species) |>
  slice_head(n = 5) |>
  collect()

# A tibble: 5 × 2
  name           species
  <chr>          <chr>  
1 Luke Skywalker Human  
2 C-3PO          Droid  
3 R2-D2          Droid  
4 Darth Vader    Human  
5 Leia Organa    Human

`head()`

head(starwars_tbl, 5) |>
  collect()

# A tibble: 5 × 15
  `_id`      name  height  mass hair_color skin_color eye_color birth_year sex  
  <chr>      <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
1 6a1ffb87d… Luke…    172    77 blond      fair       blue            19   male 
2 6a1ffb87d… C-3PO    167    75 <NA>       gold       yellow         112   none 
3 6a1ffb87d… R2-D2     96    32 <NA>       white, bl… red             33   none 
4 6a1ffb87d… Dart…    202   136 none       white      yellow          41.9 male 
5 6a1ffb87d… Leia…    150    49 brown      light      brown           19   fema…
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Notes

The examples above stay within the currently supported subset:

explicit field references,
scalar expressions for mutate() and transmute(),
conservative grouped summaries,
no joins, window functions, or automatic client-side fallback.

If a query falls outside that subset, mdbplyr is designed to fail explicitly rather than guess or silently change execution semantics.

That’s all, folks!