R Bindings
GlareDB can be used in R environments along with Pandas and Polars to work with distributed data thanks to eitsupi’s great contribution.
Install
The glaredb
package can be installed from R-multiverse.
Currently, Windows is not supported. Please use WSL2.
To install the module:
Sys.setenv(NOT_CRAN = "true")
install.packages("glaredb", repos = "https://community.r-multiverse.org")
Usage
Embedded Connection
Once installed, GlareDB can be used inside your R code by importing the glaredb
package, and creating a connection using glaredb_connect
.
library(glaredb)
con <- glaredb_connect()
# Select from a remote Parquet file hosted on GitHub
glaredb_sql(
"SELECT * FROM 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'",
con
) |>
as_glaredb_table()
Converting to a dataframe
In the previous statement, the results are output to a readable GlareDB table. It’s also possible to output your table to an R or Polars dataframe. You will need to install the Polars R bindings.
library(glaredb)
library(polars)
con <- glaredb_connect()
# Output as an R dataframe
r_dataframe <- glaredb_sql(
"SELECT * FROM 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'",
con
) |>
as.data.frame()
# Output as a Polars dataframe
polars_dataframe <- glaredb_sql(
"SELECT * FROM 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'",
con
) |>
as_polars_df()
Locally Persisted Data
By default, connect
will use an in-memory database. To persist data locally, a path may be provided to connect
. Once you’ve created your connection object, be sure to pass it into the call to glaredb_sql
.
library(glaredb)
con <- glaredb_connect("./my_db_path")
glaredb_execute("CREATE TABLE my_table AS SELECT 1", con)
# After closing your session, you can re-open a connection to the same directory
con <- glaredb_connect("./my_db_path")
glaredb_sql("SELECT * FROM my_table", con) |>
as.data.frame()
Cloud Connection
Provide a connection string to connect to GlareDB Cloud.
library(glaredb)
con <- glaredb_connect("glaredb://<user>:<password>@<org>.remote.glaredb.com:6443/<deployment-name>")
glaredb_sql("SELECT * FROM previously_created_cloud_table", con) |>
as.data.frame()
In addition to a shared workspace with data that is accessible to multiple users, connecting to GlareDB Cloud with R enables Hybrid execution. Queries are optimized to use both cloud and local compute resources.
Querying Data
Querying Files
GlareDB can query a variety of file types as tables directly.
Download sample data for the following examples here:
You can also query the raw files hosted on GitHub directly as shown above.
library(glaredb)
con <- glaredb_connect()
# Query CSV files.
glaredb_sql("SELECT * FROM './userdata1.csv'", con) |>
as.data.frame()
# Query JSON files.
glaredb_sql("SELECT * FROM './userdata1.json'", con) |>
as.data.frame()
# Query Parquet files.
glaredb_sql("SELECT * FROM './userdata1.parquet'", con) |>
as.data.frame()
# Absolute file paths work also
glaredb_sql("SELECT * FROM 'Users/my_user/Downloads/userdata1.json'", con) |>
as.data.frame()
Querying In-Memory Data
Query Polars and Arrow dataframes:
library(glaredb)
# Query a Polars dataframe
library(polars)
polars_df <- pl$DataFrame(
A = 1:5,
fruits = c("banana", "banana", "apple", "apple", "banana"),
B = 5:1,
C = c("beetle", "audi", "beetle", "beetle", "beetle")
)
result <- glaredb_sql("SELECT * FROM polars_df where fruits = 'banana'") |>
as.data.frame()
result
# Query an Arrow table
library(arrow)
arrow_df <- arrow_table(
A = 1:5,
fruits = c("banana", "banana", "apple", "apple", "banana"),
B = 5:1,
C = c("beetle", "audi", "beetle", "beetle", "beetle")
)
result <- glaredb_sql("SELECT * FROM arrow_df where fruits = 'banana'") |>
as.data.frame()
As with files, GlareDB can access these dataframes directly once they are variables in the scope of your R environment.
GlareDB cannot select from R dataframe, since they are not based on the Apache Arrow format. If you would like to select from a native R dataframe directly with GlareDB, you must first coerce it to Polars or Arrow.
Lazy evaluation
The glaredb_sql
method returns the logical plan of the query passed in as an argument. This means that the query is not executed until one of the dataframe conversion methods, like as_glaredb_table()
, as.data.frame()
, or as_polars_dataframe()
are called.
This can be used to incrementally build up sql queries by referencing previously defined logical plans created with glaredb_sql
.
library(glaredb)
con <- glaredb_connect()
# Note: This uses a raw string in R (available in R >= 4.0.0) to make
# it possible to write the SQL statement on multiple lines.
intermediate <- glaredb_sql(R"(
SELECT
*
FROM
'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'
)",
con
)
result <- glaredb_sql(R"(
SELECT
first_name,
last_name,
birthdate
FROM
intermediate
WHERE
country = 'Canada'
LIMIT 5
)",
con
) |> as.data.frame()
result
first_name last_name birthdate
1 Albert Freeman 1/16/1968
2 Deborah Armstrong 4/8/1969
3 Gloria Hamilton 3/9/1988
4 Aaron Torres 4/18/1980
5 Peter Russell 9/26/1976
Eager evalutation
Eagerly evaulating a query can be done via the glaredb_execute
method. This will execute the provided query to completion. This is useful when the result of a query doesn’t matter (e.g. when creating a table, inserting data into a table, or other DDL operations).
library(glaredb)
con <- glaredb_connect()
# Create a table.
glaredb_execute("CREATE TABLE my_table (a INT)", con)
# Insert some data.
glaredb_execute("INSERT INTO my_table VALUES (1), (2)", con)
glaredb_sql("SELECT * FROM my_table", con) |>
as.data.frame()
a
1 1
2 2
Example
The following is an example that joins a polars data frame with data from a hosted demo Postgres instance.
library(glaredb)
library(polars)
# Create Polars DataFrame
polars_df <- pl$DataFrame(
region = c(0, 1, 2, 3, 4),
population = c(10, 20, 30, 40, 50)
)
con <- glaredb_connect()
# Join the above Polars DataFrame on data from our demo Postgres instance
result <- glaredb_sql(R"(
SELECT
t1.r_regionkey,
t1.r_name,
t2.Population
FROM
read_postgres('postgres://demo:demo@pg.demo.glaredb.com/postgres', 'public', 'region') AS t1
JOIN
polars_df AS t2
ON t1.r_regionkey = t2.region)", con
) |> as_polars_df()
result
shape: (5, 3)
┌─────────────┬───────────────────────────┬────────────┐
│ r_regionkey ┆ r_name ┆ population │
│ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 │
╞═════════════╪═══════════════════════════╪════════════╡
│ 0 ┆ AFRICA ┆ 10.0 │
│ 2 ┆ ASIA ┆ 30.0 │
│ 1 ┆ AMERICA ┆ 20.0 │
│ 4 ┆ MIDDLE EAST ┆ 50.0 │
│ 3 ┆ EUROPE ┆ 40.0 │
└─────────────┴───────────────────────────┴────────────┘