Reference docs for pydantic-fixturegen.

pfg gen dataset

Capabilities

pfg gen dataset emits large CSV, Parquet, or Arrow datasets using the same deterministic generator that powers JSON output. It is optimized for bulk export: records stream in batches, optional compression keeps artifacts small, and sharding keeps individual files from growing to multiple gigabytes.
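
For orientation, a minimal invocation might look like the sketch below; the module path, model name, record count, and seed are illustrative.

# Minimal sketch: stream a hypothetical app.models.User model to CSV with a fixed seed
pfg gen dataset ./app/models.py \
  --include app.models.User \
  --format csv --out datasets/{model}.csv \
  --n 10000 --seed 42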

Typical use cases

  • Build reproducible datasets for analytics or lakehouse ingestion.
  • Produce Parquet snapshots for DuckDB/Athena tests without touching production data.
  • Export CSV slices for QA workflows that must mirror fixture seeds.
  • Validate cross-model relations under realistic volume before seeding a database.

Inputs & outputs

  • Target: module path exposing Pydantic models, stdlib @dataclass types, or TypedDicts (or use --schema for JSON Schema). Required unless a schema is provided.
  • Format: --format csv|parquet|arrow (default csv). Columnar formats require the dataset extra so PyArrow is available.
  • Output: --out accepts file templates; include {model} whenever more than one model is selected.
  • Compression: --compression selects a codec. CSV supports gzip; Parquet supports snappy, zstd, brotli, and lz4; Arrow supports zstd and lz4. Leave unset for uncompressed output. See the sketch after this list for how templating and compression combine.
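
A sketch of the output template and compression working together; the module path, glob, and record count are illustrative.

# Sketch: every model matched by the glob gets its own zstd-compressed Parquet file via {model}
pfg gen dataset ./app/models.py \
  --include "app.models.*" \
  --format parquet --compression zstd \
  --out datasets/{model}.parquet --n 5000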

Flag reference

Volume + layout

  • --n/-n: number of records (default 1).
  • --shard-size: max records per file. Helps when --n is huge.
  • --format, --compression as described above.
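
The volume flags combine as in this sketch; the counts are illustrative, and 100,000 records with a shard size of 25,000 should land in four files.

# Sketch: 100,000 records split into four 25,000-row shards
pfg gen dataset ./app/models.py \
  --include app.models.Order \
  --format parquet --compression snappy \
  --n 100000 --shard-size 25000 \
  --out datasets/{model}/{timestamp}.parquet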

Discovery & schema ingestion

  • --include/-i, --exclude/-e: glob filters.
  • --schema PATH: ingest JSON Schema instead of importing a module.
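
A sketch of both discovery styles; the glob patterns and paths are illustrative.

# Sketch: restrict discovery with globs when importing a module
pfg gen dataset ./app/models.py \
  --include "app.schemas.*" --exclude "app.schemas.*Internal" \
  --format csv --out artifacts/{model}.csv --n 500

# Sketch: skip the import entirely and generate from a JSON Schema document
pfg gen dataset --schema spec/schema.json \
  --format csv --out artifacts/{model}.csv --n 500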

Determinism + privacy

  • --seed, --now, --preset, --profile, --rng-mode: same semantics as pfg gen json.
  • --freeze-seeds, --freeze-seeds-file: persist per-model seeds between runs.
  • --field-hints: prefer Field defaults/examples before providers (modes match pfg gen json).
  • --locale / --locale-map pattern=locale: override the Faker locale globally or remap specific models/fields for international datasets without editing pyproject.toml.
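
A sketch of these flags together; the model name, seed-file path, pattern, and locale codes are illustrative.

# Sketch: pin seeds across runs and remap one model to a German Faker locale
pfg gen dataset ./app/models.py \
  --include app.models.Customer \
  --seed 7 --freeze-seeds --freeze-seeds-file .pfg-seeds.json \
  --locale en_US --locale-map "app.models.Customer*=de_DE" \
  --format csv --out artifacts/{model}.csv --n 2000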

Collection controls

  • --collection-min-items / --collection-max-items: clamp global collection lengths before field-level constraints. Handy when you need denser arrays or when CSV tests only expect a couple of elements per list.
  • --collection-distribution: skew sampling toward smaller (min-heavy) or larger (max-heavy) collections, or keep it uniform.
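
A sketch of the collection flags in one run; the bounds and model name are illustrative.

# Sketch: bound list fields to 1-3 items and bias sampling toward the maximum
pfg gen dataset ./app/models.py \
  --include app.models.Order \
  --collection-min-items 1 --collection-max-items 3 \
  --collection-distribution max-heavy \
  --format csv --out artifacts/{model}.csv --n 1000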

Quality controls

  • --respect-validators, --validator-max-retries: enforce Pydantic validators before rows hit disk.
  • --link, --max-depth, --on-cycle: keep relations and recursion behavior aligned with other emitters.
  • -O/--override: per-field overrides using the same JSON snippet semantics as config files.
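
A sketch of the validator and recursion flags; the retry budget, depth, and model name are illustrative.

# Sketch: enforce validators with a bounded retry budget and cap recursion depth
pfg gen dataset ./app/models.py \
  --include app.models.Invoice \
  --respect-validators --validator-max-retries 5 \
  --max-depth 3 \
  --format parquet --compression zstd \
  --out datasets/{model}.parquet --n 10000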

Watch mode

  • --watch: monitor the module, schema (if provided), config, and output directory. --watch-debounce throttles reruns.
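
A sketch of a local watch loop; the debounce value (and its unit) is illustrative.

# Sketch: regenerate a small CSV whenever the models, config, or schema change
pfg gen dataset ./app/models.py \
  --include app.models.Order \
  --format csv --out artifacts/{model}.csv --n 100 \
  --watch --watch-debounce 500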

Example workflows

Parquet warehouse export

pfg gen dataset ./app/models.py \
  --out warehouse/{model}/{timestamp}.parquet \
  --format parquet --compression zstd \
  --n 1_000_000 --shard-size 250_000 \
  --include app.schemas.Order --seed 7 \
  --preset boundary-max --profile realistic
Produces four Zstandard-compressed Parquet shards with deterministic `Order` rows.

Sample output

[dataset_emit] format=parquet compression=zstd shard=250000
warehouse/app.schemas.Order/2024-06-01T12-00-00Z-000.parquet (250000 rows)
warehouse/app.schemas.Order/2024-06-01T12-00-00Z-001.parquet (250000 rows)
warehouse/app.schemas.Order/2024-06-01T12-00-00Z-002.parquet (250000 rows)
warehouse/app.schemas.Order/2024-06-01T12-00-00Z-003.parquet (250000 rows)
constraint_summary:
  total_fields=42 constrained=9 warnings=0
**Parquet preview via `pyarrow.parquet.read_table(...).to_pandas().head(2)`**
   order_id                              total_cents  status    user_id
0  3fa85f64-5717-4562-b3fc-2c963f66afa6  1999         PENDING   1b111c11-...
1  8d6548ba-52a5-41b4-80f3-4f28c1c9dcef  4599         CAPTURED  2a222d22-...

Schema-driven CSV for integration tests

pfg gen dataset --schema spec/schema.json \
  --out artifacts/{model}.csv --format csv --compression gzip \
  --n 1000 --respect-validators --preset edge
Ingests a JSON Schema file instead of importing a module and writes one gzipped CSV file per model defined in the schema.

Sample output

[schema_ingest] spec=spec/schema.json cached_path=.pfg-cache/schema/spec_schema_models.py
[csv_emit] path=artifacts/Event.csv.gz rows=1000 compression=gzip
[csv_emit] path=artifacts/Address.csv.gz rows=1000 compression=gzip
warnings: 0 validator_retries: 3
**Excerpt from `artifacts/Event.csv.gz` (after `gunzip`)**
id,timestamp,event_type,metadata
3014d4e8-75de-4b9a-8a5d-a3d4ab610e1a,2025-11-08T12:00:00Z,USER_SIGNED_IN,"{'ip':'203.0.113.42'}"
fe71b998-6af3-4f68-90a2-615efa170f29,2025-11-08T12:00:00Z,PASSWORD_RESET,"{'ip':'203.0.113.87'}"

Additional examples

# Parquet export from a module that mixes BaseModel, dataclass, and TypedDict types
pfg gen dataset examples/models.py \
  --include examples.Order \
  --format parquet --compression zstd \
  --n 25000 --shard-size 5000 \
  --out warehouse/{model}/{timestamp}.parquet \
  --collection-min-items 1 --collection-max-items 3

# CSV + gzip plus field hints for defaults/examples
pfg gen dataset examples/models.py \
  --format csv --compression gzip \
  --n 1000 --out artifacts/{model}.csv.gz \
  --field-hints defaults-then-examples

Python API equivalent:

from pathlib import Path
from pydantic_fixturegen.api import generate_dataset
from pydantic_fixturegen.core.path_template import OutputTemplate

generate_dataset(
    target=Path("examples/models.py"),
    output_template=OutputTemplate("warehouse/{model}-{case_index}.parquet"),
    count=5000,
    format="parquet",
    compression="zstd",
    include=["examples.Order"],
    collection_distribution="max-heavy",
)

More ready-to-run shell examples and notebooks live in docs/examples.md.

Operational notes

  • Each run logs constraint summaries plus the resolved dataset format. For Parquet/Arrow the logger also captures the PyArrow version (when verbose logging is enabled) to aid reproducibility.
  • When multiple schemas are selected and the output template omits {model}, the command exits early with a DiscoveryError so you never accidentally clobber files.
  • CI tip: combine --watch with entr/watchexec locally, but disable it in automation because the process will stay alive.