R Bindings

GlareDB can be used in R environments along with Pandas and Polars to work with distributed data thanks to eitsupi’s great contribution.

Install

The glaredb package can be installed from R-multiverse.

Currently, Windows is not supported. Please use WSL2.

To install the module:

Sys.setenv(NOT_CRAN = "true")
install.packages("glaredb", repos = "https://community.r-multiverse.org")

Usage

Embedded Connection

Once installed, GlareDB can be used inside your R code by importing the glaredb package, and creating a connection using glaredb_connect.

library(glaredb)

con <- glaredb_connect()

# Select from a remote Parquet file hosted on GitHub
glaredb_sql(
    "SELECT * FROM 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'",
    con
  ) |>
  as_glaredb_table()

Converting to a dataframe

In the previous statement, the results are output to a readable GlareDB table. It’s also possible to output your table to an R or Polars dataframe. You will need to install the Polars R bindings.

library(glaredb)
library(polars)

con <- glaredb_connect()

# Output as an R dataframe
r_dataframe <- glaredb_sql(
    "SELECT * FROM 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'",
    con
) |>
  as.data.frame()

# Output as a Polars dataframe
polars_dataframe <- glaredb_sql(
    "SELECT * FROM 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'",
    con
  ) |>
  as_polars_df()

Locally Persisted Data

By default, connect will use an in-memory database. To persist data locally, a path may be provided to connect. Once you’ve created your connection object, be sure to pass it into the call to glaredb_sql.

library(glaredb)

con <- glaredb_connect("./my_db_path")
glaredb_execute("CREATE TABLE my_table AS SELECT 1", con)

# After closing your session, you can re-open a connection to the same directory
con <- glaredb_connect("./my_db_path")
glaredb_sql("SELECT * FROM my_table", con) |>
  as.data.frame()

Cloud Connection

Provide a connection string to connect to GlareDB Cloud.

library(glaredb)

con <- glaredb_connect("glaredb://<user>:<password>@<org>.remote.glaredb.com:6443/<deployment-name>")
glaredb_sql("SELECT * FROM previously_created_cloud_table", con) |>
  as.data.frame()

In addition to a shared workspace with data that is accessible to multiple users, connecting to GlareDB Cloud with R enables Hybrid execution. Queries are optimized to use both cloud and local compute resources.

Querying Data

Querying Files

GlareDB can query a variety of file types as tables directly.

Download sample data for the following examples here:

You can also query the raw files hosted on GitHub directly as shown above.

library(glaredb)

con <- glaredb_connect()

# Query CSV files.
glaredb_sql("SELECT * FROM './userdata1.csv'", con) |>
  as.data.frame()

# Query JSON files.
glaredb_sql("SELECT * FROM './userdata1.json'", con) |>
  as.data.frame()

# Query Parquet files.
glaredb_sql("SELECT * FROM './userdata1.parquet'", con) |>
  as.data.frame()

# Absolute file paths work also
glaredb_sql("SELECT * FROM 'Users/my_user/Downloads/userdata1.json'", con) |>
  as.data.frame()

Querying In-Memory Data

Query Polars and Arrow dataframes:

library(glaredb)

# Query a Polars dataframe
library(polars)

polars_df <- pl$DataFrame(
  A = 1:5,
  fruits = c("banana", "banana", "apple", "apple", "banana"),
  B = 5:1,
  C = c("beetle", "audi", "beetle", "beetle", "beetle")
)

result <- glaredb_sql("SELECT * FROM polars_df where fruits = 'banana'") |>
  as.data.frame()

result

# Query an Arrow table
library(arrow)

arrow_df <- arrow_table(
  A = 1:5,
  fruits = c("banana", "banana", "apple", "apple", "banana"),
  B = 5:1,
  C = c("beetle", "audi", "beetle", "beetle", "beetle")
)

result <- glaredb_sql("SELECT * FROM arrow_df where fruits = 'banana'") |>
  as.data.frame()

As with files, GlareDB can access these dataframes directly once they are variables in the scope of your R environment.

GlareDB cannot select from R dataframe, since they are not based on the Apache Arrow format. If you would like to select from a native R dataframe directly with GlareDB, you must first coerce it to Polars or Arrow.

Lazy evaluation

The glaredb_sql method returns the logical plan of the query passed in as an argument. This means that the query is not executed until one of the dataframe conversion methods, like as_glaredb_table(), as.data.frame(), or as_polars_dataframe() are called.

This can be used to incrementally build up sql queries by referencing previously defined logical plans created with glaredb_sql.

library(glaredb)

con <- glaredb_connect()

# Note: This uses a raw string in R (available in R >= 4.0.0) to make
# it possible to write the SQL statement on multiple lines.
intermediate <- glaredb_sql(R"(
    SELECT
        *
    FROM
        'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'
    )",
    con
)

result <- glaredb_sql(R"(
    SELECT
        first_name,
        last_name,
        birthdate
    FROM
        intermediate
    WHERE
        country = 'Canada'
    LIMIT 5
    )",
    con
) |> as.data.frame()

result
  first_name last_name birthdate
1     Albert   Freeman 1/16/1968
2    Deborah Armstrong  4/8/1969
3     Gloria  Hamilton  3/9/1988
4      Aaron    Torres 4/18/1980
5      Peter   Russell 9/26/1976

Eager evalutation

Eagerly evaulating a query can be done via the glaredb_execute method. This will execute the provided query to completion. This is useful when the result of a query doesn’t matter (e.g. when creating a table, inserting data into a table, or other DDL operations).

library(glaredb)

con <- glaredb_connect()

# Create a table.
glaredb_execute("CREATE TABLE my_table (a INT)", con)

# Insert some data.
glaredb_execute("INSERT INTO my_table VALUES (1), (2)", con)

glaredb_sql("SELECT * FROM my_table", con) |>
  as.data.frame()
  a
1 1
2 2

Example

The following is an example that joins a polars data frame with data from a hosted demo Postgres instance.

library(glaredb)
library(polars)

# Create Polars DataFrame
polars_df <- pl$DataFrame(
  region = c(0, 1, 2, 3, 4),
  population = c(10, 20, 30, 40, 50)
)

con <- glaredb_connect()

# Join the above Polars DataFrame on data from our demo Postgres instance

result <- glaredb_sql(R"(
    SELECT
        t1.r_regionkey,
        t1.r_name,
        t2.Population
    FROM
        read_postgres('postgres://demo:demo@pg.demo.glaredb.com/postgres', 'public', 'region') AS t1
    JOIN
        polars_df AS t2
    ON t1.r_regionkey = t2.region)", con
  ) |> as_polars_df()

result
shape: (5, 3)
┌─────────────┬───────────────────────────┬────────────┐
│ r_regionkey ┆ r_name                    ┆ population │
│ ---         ┆ ---                       ┆ ---        │
│ i32         ┆ str                       ┆ f64        │
╞═════════════╪═══════════════════════════╪════════════╡
│ 0           ┆ AFRICA                    ┆ 10.0       │
│ 2           ┆ ASIA                      ┆ 30.0       │
│ 1           ┆ AMERICA                   ┆ 20.0       │
│ 4           ┆ MIDDLE EAST               ┆ 50.0       │
│ 3           ┆ EUROPE                    ┆ 40.0       │
└─────────────┴───────────────────────────┴────────────┘