
Parquet Import with DuckDB

FastTransfer leverages DuckDB's read_parquet function to efficiently import Parquet files into any supported target database. Parquet is a columnar storage format optimized for analytics, and DuckDB provides a fast, feature-rich reader for it.

DuckDB offers several advantages for Parquet processing:

  • Columnar access - Read only the columns you need for optimal performance
  • Predicate pushdown - Filter data at the file level before loading
  • Automatic schema detection - Column types preserved from Parquet metadata (see the DESCRIBE sketch after this list)
  • High performance - Optimized Parquet reader with parallel processing
  • Compression support - Handles all Parquet compression codecs (Snappy, Gzip, LZ4, Zstd)
  • Nested structure support - Handle complex nested objects, arrays, and maps
  • Multi-file support - Process multiple Parquet files with glob patterns
  • Partition support - Hive-style partitioning for efficient data access
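
The automatic schema detection can be verified before running a transfer by asking DuckDB for the inferred column types. A minimal sketch, assuming an illustrative file path:

-- Show the column names and types DuckDB infers from the Parquet metadata
DESCRIBE SELECT * FROM read_parquet('D:\DATA\sales.parquet');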

Basic Parquet Import

Single Parquet File

Import a single Parquet file using DuckDB's in-memory mode:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\sales.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "SalesDB" `
--targetschema "dbo" `
--targettable "sales" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Column Selection

For optimal performance, select only the columns you need:

--query "SELECT customer_id, order_date, total_amount FROM read_parquet('file.parquet')"

This reduces I/O and memory usage significantly.

Multiple Parquet Files - Single Thread

Import multiple Parquet files using a glob pattern:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\orders\*.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "warehouse" `
--targetschema "public" `
--targettable "orders" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Parallel Parquet Import

DataDriven Method with Calculated Column

Use a calculated distribution key for parallel processing:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\transactions\*.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "finance" `
--targetschema "public" `
--targettable "transactions" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(transaction_id/100000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12

Distribution Key Calculation

The expression (transaction_id/100000)::int divides the ID by 100,000 and casts the result to an integer, so each bucket covers roughly 100,000 consecutive IDs and the parallel threads receive evenly sized chunks.
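
How evenly such an expression splits the data can be checked in DuckDB before launching the transfer. A minimal sketch, reusing the same illustrative path:

-- Count rows per computed bucket to confirm the chunks are reasonably balanced
SELECT (transaction_id/100000)::int AS bucket, count(*) AS rows_in_bucket
FROM read_parquet('D:\DATA\transactions\*.parquet')
GROUP BY bucket
ORDER BY bucket;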

Random Method on Large Parquet Files

Process large Parquet files with Random parallelization:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\events\*.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "analytics" `
--targetschema "dbo" `
--targettable "events" `
--parallelmethod "Random" `
--distributeKeyColumn "event_id" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16

DataDriven Method with Custom Query

Use a custom DataDriven query that returns the distinct distribution values:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\logs\*.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "logging" `
--targetschema "dbo" `
--targettable "application_logs" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "CAST(event_date AS VARCHAR)" `
--datadrivenquery "select distinct CAST(event_date AS VARCHAR) from read_parquet('D:\DATA\logs\*.parquet')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 10

Multiple Parquet Files

Parallel Processing by Filename

Process multiple Parquet files in parallel using the filename as the distribution key:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * EXCLUDE filename FROM read_parquet('D:\DATA\daily\*.parquet', filename=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "reports" `
--targetschema "dbo" `
--targettable "daily_reports" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "filename" `
--datadrivenquery "select file from glob('D:\DATA\daily\*.parquet')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16

Filename as Distribution Key
  • Use filename=true in read_parquet() to add a filename column
  • Use exclude filename to remove it from the final output
  • The glob() function returns the list of files for parallel processing
  • Each thread processes one or more complete files
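
Both pieces can be previewed standalone in DuckDB: the file list that drives the parallel threads, and the filename column that read_parquet adds. A minimal sketch with the same illustrative paths:

-- Files returned by glob(); each one becomes a unit of parallel work
SELECT file FROM glob('D:\DATA\daily\*.parquet');

-- Row counts per file, using the filename column added by filename=true
SELECT filename, count(*) AS row_count
FROM read_parquet('D:\DATA\daily\*.parquet', filename=true)
GROUP BY filename;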

Hive-Style Partitioning

Reading Partitioned Data

Read Parquet files with Hive-style partitioning:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\partitioned\**\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "warehouse" `
--targetschema "dbo" `
--targettable "partitioned_data" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Hive Partitioning

For data organized like year=2024/month=01/data.parquet, DuckDB automatically extracts partition columns:

  • Partition columns become regular columns in the result
  • Use hive_partitioning=true to enable this feature
  • Excellent for partition pruning and efficient queries
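
Because partition columns behave like regular columns, filtering on them lets DuckDB skip entire partition directories. A minimal sketch, assuming the same illustrative layout (partition values may be read as text or auto-cast to numbers depending on the DuckDB version):

-- Only files under year=2024/month=01 are read; other partitions are pruned
SELECT *
FROM read_parquet('D:\DATA\partitioned\**\*.parquet', hive_partitioning=true)
WHERE year = '2024' AND month = '01';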

Parallel Processing on Partitioned Data

Use partition columns for parallel distribution:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\sales\year=*\month=*\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "sales_data" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "year || '-' || month" `
--datadrivenquery "select distinct year || '-' || month as partition_key from read_parquet('D:\DATA\sales\year=*\month=*\*.parquet', hive_partitioning=true)" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12

Nested Structures in Parquet

Accessing Nested Fields

Extract specific fields from nested Parquet structures:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id, customer.name as customer_name, customer.email as customer_email, order_date, total FROM read_parquet('D:\DATA\orders.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "sales" `
--targetschema "dbo" `
--targettable "orders_flat" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Flattening Nested Arrays with UNNEST

Flatten arrays within Parquet files:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT order_id, order_date, unnest(line_items, recursive := true) FROM read_parquet('D:\DATA\orders_nested.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "order_items" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Complex Nested Structures

Handle complex nested Parquet data with multiple levels:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT day, unnest(contexte_chantier, recursive := true), unnest(detail_chantier, recursive := true) FROM read_parquet('D:\DATA\chantier\**\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "construction" `
--targetschema "dbo" `
--targettable "site_details" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "day" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 8

Advanced Filtering

Predicate Pushdown

Apply filters to reduce data read from Parquet files:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\events\*.parquet') WHERE event_date >= '2024-01-01' AND event_type = 'purchase'" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "analytics" `
--targetschema "dbo" `
--targettable "purchases" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Append"

Predicate Pushdown Benefits

DuckDB pushes filters down to the Parquet reader, which:

  • Skips entire row groups that don't match the filter
  • Reduces memory usage
  • Significantly improves performance for selective queries
  • Works best with filters on partition columns or columns with statistics
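
The per-row-group statistics that make this skipping possible can be inspected with DuckDB's parquet_metadata function. A minimal sketch, reusing the illustrative path (the exact set of metadata columns can vary between DuckDB versions):

-- Min/max statistics per row group; filters outside these ranges let DuckDB skip the whole group
SELECT file_name, row_group_id, path_in_schema, stats_min_value, stats_max_value
FROM parquet_metadata('D:\DATA\events\*.parquet');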

Column Pruning with Selection

Select only needed columns for maximum performance:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT customer_id, order_date, total_amount, status FROM read_parquet('D:\DATA\orders_large.parquet') WHERE order_date >= '2024-01-01'" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "recent_orders" `
--parallelmethod "Random" `
--distributeKeyColumn "customer_id" `
--mapmethod "Name" `
--loadmode "Append" `
--degree 12

Schema and Metadata

Union Multiple Parquet Files

Union Parquet files whose schemas may differ in column order or optional columns:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet(['D:\DATA\2023\*.parquet', 'D:\DATA\2024\*.parquet'], union_by_name=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "archive" `
--targetschema "dbo" `
--targettable "historical_data" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Union By Name

union_by_name=true allows reading files with different column orders or optional columns:

  • Columns are matched by name instead of position
  • Missing columns are filled with NULL
  • Extra columns are ignored
  • Useful for schema evolution scenarios
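
A minimal self-contained sketch of this behaviour, using two tiny files written by DuckDB itself (file names are illustrative):

-- Write two small Parquet files with overlapping but different column sets
COPY (SELECT 1 AS id, 'alice' AS name) TO 'part_a.parquet' (FORMAT PARQUET);
COPY (SELECT 2 AS id, 'bob@example.com' AS email) TO 'part_b.parquet' (FORMAT PARQUET);

-- Columns are matched by name; missing values come back as NULL
SELECT * FROM read_parquet(['part_a.parquet', 'part_b.parquet'], union_by_name=true);
-- id | name  | email
--  1 | alice | NULL
--  2 | NULL  | bob@example.com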

Type Conversion

Convert Parquet data types during import:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id, CAST(created_at AS TIMESTAMP) as created_at, amount::DECIMAL(18,2) as amount, status FROM read_parquet('D:\DATA\transactions.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "finance" `
--targetschema "dbo" `
--targettable "transactions" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"