Parquet Import with DuckDB
FastTransfer leverages DuckDB's read_parquet function to efficiently import Parquet files into any supported target database. Parquet is a columnar storage format optimized for analytics, and DuckDB provides a fast, feature-rich reader for it.
DuckDB offers several advantages for Parquet processing:
- Columnar access - Read only the columns you need for optimal performance
- Predicate pushdown - Filter data at the file level before loading
- Automatic schema detection - Column types preserved from Parquet metadata
- High performance - Optimized Parquet reader with parallel processing
- Compression support - Handles all Parquet compression codecs (Snappy, Gzip, LZ4, Zstd)
- Nested structure support - Handles nested structs, arrays, and maps
- Multi-file support - Process multiple Parquet files with glob patterns
- Partition support - Hive-style partitioning for efficient data access
Basic Parquet Import
Single Parquet File
Import a single Parquet file using DuckDB's in-memory mode:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_parquet('D:\DATA\sales.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "SalesDB" `
--targetschema "dbo" `
--targettable "sales" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
For optimal performance, select only the columns you need:
--query "SELECT customer_id, order_date, total_amount FROM read_parquet('file.parquet')"
This reduces I/O and memory usage significantly.
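If you are not sure which columns a file contains, you can list them first in the DuckDB CLI (or any DuckDB client) with the parquet_schema function, using the sales file from the example above:
SELECT name, type FROM parquet_schema('D:\DATA\sales.parquet');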
Multiple Parquet Files - Single Thread
Import multiple Parquet files using glob pattern:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_parquet('D:\DATA\orders\*.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "warehouse" `
--targetschema "public" `
--targettable "orders" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Parallel Parquet Import
DataDriven Method with Calculated Column
Use a calculated distribution key for parallel processing:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_parquet('D:\DATA\transactions\*.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "finance" `
--targetschema "public" `
--targettable "transactions" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(transaction_id/100000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
The expression (transaction_id/100000)::int divides each ID by 100,000 and casts the result to an integer, grouping rows into buckets roughly 100,000 IDs wide. When IDs are dense and sequential, this yields evenly sized chunks for the parallel threads.
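A quick sanity check in the DuckDB CLI, using the same files as above, previews how the expression buckets the rows; roughly equal counts per bucket indicate a good distribution key:
SELECT (transaction_id/100000)::int AS bucket, count(*) AS row_count
FROM read_parquet('D:\DATA\transactions\*.parquet')
GROUP BY bucket
ORDER BY bucket;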
Random Method on Large Parquet Files
Process large Parquet files with Random parallelization:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\events\*.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "analytics" `
--targetschema "dbo" `
--targettable "events" `
--parallelmethod "Random" `
--distributeKeyColumn "event_id" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
DataDriven Method with Custom Query
Use a custom DataDriven query to calculate optimal distribution:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\logs\*.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "logging" `
--targetschema "dbo" `
--targettable "application_logs" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "CAST(event_date AS VARCHAR)" `
--datadrivenquery "select distinct CAST(event_date AS VARCHAR) from read_parquet('D:\DATA\logs\*.parquet')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 10
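The distinct values returned by --datadrivenquery drive how the work is split, so it is worth confirming there are at least as many distinct keys as --degree. A quick check against the same log files:
SELECT count(DISTINCT CAST(event_date AS VARCHAR)) AS distinct_keys
FROM read_parquet('D:\DATA\logs\*.parquet');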
Multiple Parquet Files
Parallel Processing by Filename
Process multiple Parquet files in parallel using the filename as the distribution key:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * exclude filename FROM read_parquet('D:\DATA\daily\*.parquet', filename=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "reports" `
--targetschema "dbo" `
--targettable "daily_reports" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "filename" `
--datadrivenquery "select file from glob('D:\DATA\daily\*.parquet')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
- Use filename=true in read_parquet() to add a filename column
- Use exclude filename to remove it from the final output
- The glob() function returns the list of files for parallel processing
- Each thread processes one or more complete files (a quick balance check follows below)
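Because work is split at file granularity here, very uneven file sizes lead to uneven thread workloads. A quick way to check row counts per file in the DuckDB CLI before the transfer:
SELECT filename, count(*) AS row_count
FROM read_parquet('D:\DATA\daily\*.parquet', filename=true)
GROUP BY filename
ORDER BY row_count DESC;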
Hive-Style Partitioning
Reading Partitioned Data
Read Parquet files with Hive-style partitioning:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\partitioned\**\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "warehouse" `
--targetschema "dbo" `
--targettable "partitioned_data" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
For data organized like year=2024/month=01/data.parquet, DuckDB automatically extracts partition columns:
- Partition columns become regular columns in the result
- Use hive_partitioning=true to enable this feature
- Excellent for partition pruning and efficient queries
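To confirm which partition columns and values DuckDB picks up, you can list them in the DuckDB CLI before running the transfer (column names follow the year=/month= layout above):
SELECT DISTINCT year, month
FROM read_parquet('D:\DATA\partitioned\**\*.parquet', hive_partitioning=true)
ORDER BY year, month;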
Parallel Processing on Partitioned Data
Use partition columns for parallel distribution:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\sales\year=*/month=*\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "sales_data" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "year || '-' || month" `
--datadrivenquery "select distinct year || '-' || month as partition_key from read_parquet('D:\DATA\sales\year=*/month=*\*.parquet', hive_partitioning=true)" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
Nested Structures in Parquet
Accessing Nested Fields
Extract specific fields from nested Parquet structures:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id, customer.name as customer_name, customer.email as customer_email, order_date, total FROM read_parquet('D:\DATA\orders.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "sales" `
--targetschema "dbo" `
--targettable "orders_flat" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Flattening Nested Arrays with UNNEST
Flatten arrays within Parquet files:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT order_id, order_date, unnest(line_items, recursive := true) FROM read_parquet('D:\DATA\orders_nested.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "order_items" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Complex Nested Structures
Handle complex nested Parquet data with multiple levels:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT day, unnest(contexte_chantier, recursive := true), unnest(detail_chantier, recursive := true) FROM read_parquet('D:\DATA\chantier\**\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "construction" `
--targetschema "dbo" `
--targettable "site_details" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "day" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 8
Advanced Filtering
Predicate Pushdown
Apply filters to reduce data read from Parquet files:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\events\*.parquet') WHERE event_date >= '2024-01-01' AND event_type = 'purchase'" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "analytics" `
--targetschema "dbo" `
--targettable "purchases" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Append"
DuckDB pushes filters down to the Parquet reader, which:
- Skips entire row groups that don't match the filter
- Reduces memory usage
- Significantly improves performance for selective queries
- Works best with filters on partition columns or columns with statistics
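To verify that a filter is pushed down, run EXPLAIN on the same query in the DuckDB CLI; when pushdown applies, the filter conditions appear under the Parquet scan operator in the plan:
EXPLAIN SELECT * FROM read_parquet('D:\DATA\events\*.parquet')
WHERE event_date >= '2024-01-01' AND event_type = 'purchase';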
Column Pruning with Selection
Select only needed columns for maximum performance:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT customer_id, order_date, total_amount, status FROM read_parquet('D:\DATA\orders_large.parquet') WHERE order_date >= '2024-01-01'" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "recent_orders" `
--parallelmethod "Random" `
--distributeKeyColumn "customer_id" `
--mapmethod "Name" `
--loadmode "Append" `
--degree 12
Schema and Metadata
Union Multiple Parquet Files
Union Parquet files with compatible schemas:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet(['D:\DATA\2023\*.parquet', 'D:\DATA\2024\*.parquet'], union_by_name=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "archive" `
--targetschema "dbo" `
--targettable "historical_data" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
union_by_name=true allows reading files with different column orders or optional columns:
- Columns are matched by name instead of position
- Missing columns are filled with NULL
- Extra columns are ignored
- Useful for schema evolution scenarios
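As an illustration, suppose the 2024 files add a column that the 2023 files do not have (called channel here purely as a hypothetical example). With union_by_name=true the column is readable across both sets, and rows coming from the 2023 files carry NULL:
SELECT channel, count(*) AS row_count
FROM read_parquet(['D:\DATA\2023\*.parquet', 'D:\DATA\2024\*.parquet'], union_by_name=true)
GROUP BY channel;
-- rows from the 2023 files (which lack the hypothetical channel column) show channel = NULL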
Type Conversion
Convert Parquet data types during import:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id, CAST(created_at AS TIMESTAMP) as created_at, amount::DECIMAL(18,2) as amount, status FROM read_parquet('D:\DATA\transactions.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "finance" `
--targetschema "dbo" `
--targettable "transactions" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"