Skip to main content

Data Plane

The data plane is the execution path from source ingestion to trusted Silver outputs.

Shared pipeline model​

  • Contracts define expected structure and behavior.
  • Landing handles raw ingress.
  • Bronze preserves operational history.
  • Silver applies casting, validation, and contract-driven silver write strategies.
  • When a Silver write strategy needs deduplication before write, it happens after lineage-based renaming so the Silver contract primary keys are evaluated on Silver column names.
  • Gold exposes the built OLAP models.

::: callout note Authoring starts in the contract hub at datacontracts.rockdata.nl. Confirm the source model, schema intent, and validation expectations there before changing ingestion, Bronze, or Silver logic. :::

Where to work first​

  • The data contract hub defines the source model and expected behavior.
  • The example implementation contains the contract files, notebooks, jobs, and test data your team will update.
  • Platform-specific notebooks and orchestration should stay in the example implementation unless a true framework change is required.

Fabric notebooks can split execution concerns across two paths. Spark reads and writes still use the notebook's default lakehouse, while file-based framework assets should be resolved from an explicit Carabine files root. In the example implementation this runtime root is carried through the carabine_files_root variable, and contract loading stays explicit through data_contract_location.

Deletions handling​

Soft and Hard deletions (explicit and inferred) are not a cosmetic feature. They change record survivorship and downstream truth.

  • Contract metadata determines whether deletions are enabled for certain model.
  • Mapping utilities convert deletion signals into a Silver-compatible representation.
  • The implementation must distinguish between snapshot-style deletion detection and log-based deletion sources.

Silver validation pipeline​

Silver validation is intentionally split into stages so operators can see whether failures came from contract-driven transforms, physical casting, business-rule constraints or data quality checks.

Step 1: Contract-driven transforms​

  • Optional field-level transformLogic expressions run after flattening and lineage-based renaming, so they reference Silver column names.
  • Each transform must output a string value. Physical typing still happens in the cast stage.
  • Invalid SQL authoring fails the pipeline immediately.
  • Row-level transform failures are quarantined with explicit failure reasons before casting continues on the remaining rows.

Step 2: Casting​

  • Cast each field to the contract's declared physical type.
  • Normalize null-like values before type conversion.
  • Treat failed casts as quarantine candidates with explicit failure reasons.

Step 3: Constraint validation​

  • Validate required fields, primary-key expectations, enums, and value bounds.
  • Run these checks only on rows that already passed casting.
  • Preserve a human-readable failure reason so quarantine output can be acted on.

Step 3: Quality checks​

  • Apply SQL-based quality rules defined in the quality block of the data contract.
  • Only type: sql checks are evaluated; each check's query expression identifies rows that fail.
  • Failing rows are subtracted from the valid set and quarantined with the check name and query as the failure reason.
  • Quality checks run after constraint validation, so their input is already cast and structurally valid.

Where to look in the repo​

  • Start with the relevant contract in the contract hub and the matching contract files in the example implementation.
  • Check the example notebooks and job wiring for the platform-specific execution path.
  • Use the Accelerator capabilities browser if you need to inspect deeper implementation surfaces.

Contract validation​

Before running a pipeline stage, validate the contract to confirm all required custom properties are present and correct. Each stage has its own validation method that checks only what that stage needs.

contract = OpenDataContractStandard.from_file("data_contracts/swapi/swapi_people.odcs.yml")

contract.validate_for_landing() # source → landing
contract.validate_for_bronze() # landing → bronze
contract.validate_for_silver() # bronze → silver
contract.validate_for_gold(workspace_path="/workspace/root") # silver → gold

Each method returns True when the contract is valid. If any required property is missing or invalid, it raises ContractValidationError and lists every issue at once so you can fix them in one pass rather than discovering them one-by-one at runtime.

The workspace_path argument on validate_for_gold is optional. When provided, it also checks that the generator_notebook file exists on disk at that path.

For source-to-landing contracts, keep ingestion-specific settings such as connector, ingestion_read_strategy, json_multiline, and content_path on the source schema object's customProperties block. Keep contract-level customProperties for shared contract metadata such as source_system, landing_directory, and landing_file_format.

Contract-driven SQL ingestion​

Shared SQL ingestion now has two SQL Server connector paths, each with a distinct contract role.

Use connector: sqlserver on the source schema for regular SQL Server table reads. Use connector: sqlserver_cdc on the source schema for SQL Server CDC reads.

  • ingestion_read_strategy controls whether the logical run is full_snapshot or incremental.
  • sql_extraction_mode controls whether the logical run is drained with single_query or batched reads.
  • ingestion_filtering_column defines the strict > boundary for incremental SQL ingestion.
  • ingestion_ordering_columns defines deterministic ordering for batched SQL ingestion.
  • batch_size is optional; when it is missing, the shared SQL connector uses its default.

For sqlserver_cdc, Carabine reads from the SQL Server CDC change functions instead of a base table query. That connector uses a timestamp-based checkpoint and does not require sql_extraction_mode.

Landing-push source contracts​

Use connector: landing_push on the source schema when files are already delivered into the landing directory by an external process and Carabine should not pull or copy them.

  • Keep landing_directory and landing_file_format on the source schema as usual.
  • No endpoint URL is required for this connector.
  • The source-to-landing stage still runs and records an inert checkpoint so the workflow stays consistent with other connectors.
  • A shared sample contract lives at carabine/libs/carabine_shared/_06_e2e_tests/data_contracts/landing_push/landing_push_orders.odcs.yml.

Checkpoint semantics stay split by purpose:

  • checkpoint_value stores only the next-run SQL boundary.
  • source_specific_state_json stores only in-run traversal state for batched continuation.

This means batched SQL runs do not use checkpoint_value for offsets, page counters, or other continuation mechanics. The committed boundary moves only after the logical run has completed successfully. For sqlserver_cdc, the committed boundary is the latest CDC commit timestamp and source_specific_state_json remains empty.