`read_bson`

Reads one or more BSON files from the local filesystem or supported object store.

Syntax

-- Location of the file or files:
read_bson(<url>);
-- Explicitly set the number of documents considered to infer the
-- schema (default is 100):
read_bson(<url>, schema_sample_size => 250);

Like other functions that produces table objects in queries, read_bson works with all supported cloud providers for remote storage: GCS, S3, Azure, and compatible APIs.

-- Using a cloud credentials object.
read_bson(<url>, <credential_object>);
-- Required named argument for S3 buckets.
read_bson(<url>, <credentials_object>, region => '<aws_region>');
-- Pass S3 credentials using named arguments.
read_bson(<url>, access_key_id => '<aws_access_key_id>', secret_access_key => '<aws_secret_access_key>', region => '<aws_region>');
-- Pass GCS credentials using named arguments.
read_bson(<url>, service_account_key => '<gcp_service_account_key>');

Behavior

Multiple Files

read_bson will expand glob patterns in the <url> argument and will treat the resulting list of files as partitions of the same table.

Schema Inference

By default, read_bson, sorts the files lexicographically and scans the first 100 documents to infer the schema. Every field that appears is added to the schema in the order that it appears as a nullable field. The first type observed becomes the field’s type.

After inferring a schema, the remaining data and files may be read in any order.

The schema_sample_size option allows you to change the number of documents considered when inferring the schema.