Parquet Import with DuckDB
FastTransfer leverages DuckDB's read_parquet function to efficiently import Parquet files into any supported target database. Parquet is a columnar storage format optimized for analytics, and DuckDB provides a fast, feature-rich reader for it.
DuckDB offers several advantages for Parquet processing:
- Columnar access - Read only the columns you need for optimal performance
- Predicate pushdown - Filter data at the file level before loading
- Automatic schema detection - Column types preserved from Parquet metadata
- High performance - Optimized Parquet reader with parallel processing
- Compression support - Handles all Parquet compression codecs (Snappy, Gzip, LZ4, Zstd)
- Nested structure support - Handles nested structs, arrays, and maps
- Multi-file support - Process multiple Parquet files with glob patterns
- Partition support - Hive-style partitioning for efficient data access
Basic Parquet Import
Single Parquet File
Import a single Parquet file using DuckDB's in-memory mode:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_parquet('D:\DATA\sales.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "SalesDB" `
--targetschema "dbo" `
--targettable "sales" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
For optimal performance, select only the columns you need:
--query "SELECT customer_id, order_date, total_amount FROM read_parquet('file.parquet')"
This reduces I/O and memory usage significantly.
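If you are not sure which columns a file contains, you can list them first in the DuckDB CLI (or any DuckDB client) with the parquet_schema function, using the sales file from the example above:
SELECT name, type FROM parquet_schema('D:\DATA\sales.parquet');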
Multiple Parquet Files - Single Thread
Import multiple Parquet files using glob pattern:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_parquet('D:\DATA\orders\*.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "warehouse" `
--targetschema "public" `
--targettable "orders" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Parallel Parquet Import
DataDriven Method with Calculated Column
Use a calculated distribution key for parallel processing:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_parquet('D:\DATA\transactions\*.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "finance" `
--targetschema "public" `
--targettable "transactions" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(transaction_id/100000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
The expression (transaction_id/100000)::int divides each ID by 100,000 and casts the result to an integer, grouping rows into buckets roughly 100,000 IDs wide. When IDs are dense and sequential, this yields evenly sized chunks for the parallel threads.
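A quick sanity check in the DuckDB CLI, using the same files as above, previews how the expression buckets the rows; roughly equal counts per bucket indicate a good distribution key:
SELECT (transaction_id/100000)::int AS bucket, count(*) AS row_count
FROM read_parquet('D:\DATA\transactions\*.parquet')
GROUP BY bucket
ORDER BY bucket;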
Random Method on Large Parquet Files
Process large Parquet files with Random parallelization:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\events\*.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "analytics" `
--targetschema "dbo" `
--targettable "events" `
--parallelmethod "Random" `
--distributeKeyColumn "event_id" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
DataDriven Method with Custom Query
Use a custom DataDriven query to calculate optimal distribution:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\logs\*.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "logging" `
--targetschema "dbo" `
--targettable "application_logs" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "CAST(event_date AS VARCHAR)" `
--datadrivenquery "select distinct CAST(event_date AS VARCHAR) from read_parquet('D:\DATA\logs\*.parquet')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 10
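The distinct values returned by --datadrivenquery drive how the work is split, so it is worth confirming there are at least as many distinct keys as --degree. A quick check against the same log files:
SELECT count(DISTINCT CAST(event_date AS VARCHAR)) AS distinct_keys
FROM read_parquet('D:\DATA\logs\*.parquet');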
Multiple Parquet Files
Parallel Processing by Filename
Process multiple Parquet files in parallel using the filename as the distribution key:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * exclude filename FROM read_parquet('D:\DATA\daily\*.parquet', filename=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "reports" `
--targetschema "dbo" `
--targettable "daily_reports" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "filename" `
--datadrivenquery "select file from glob('D:\DATA\daily\*.parquet')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
- Use filename=true in read_parquet() to add a filename column
- Use exclude filename to remove it from the final output
- The glob() function returns the list of files for parallel processing
- Each thread processes one or more complete files (a quick balance check follows below)
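Because work is split at file granularity here, very uneven file sizes lead to uneven thread workloads. A quick way to check row counts per file in the DuckDB CLI before the transfer:
SELECT filename, count(*) AS row_count
FROM read_parquet('D:\DATA\daily\*.parquet', filename=true)
GROUP BY filename
ORDER BY row_count DESC;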
Hive-Style Partitioning
Reading Partitioned Data
Read Parquet files with Hive-style partitioning:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\partitioned\**\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "warehouse" `
--targetschema "dbo" `
--targettable "partitioned_data" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
For data organized like year=2024/month=01/data.parquet, DuckDB automatically extracts partition columns:
- Partition columns become regular columns in the result
- Use hive_partitioning=true to enable this feature
- Excellent for partition pruning and efficient queries
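To confirm which partition columns and values DuckDB picks up, you can list them in the DuckDB CLI before running the transfer (column names follow the year=/month= layout above):
SELECT DISTINCT year, month
FROM read_parquet('D:\DATA\partitioned\**\*.parquet', hive_partitioning=true)
ORDER BY year, month;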
Parallel Processing on Partitioned Data
Use partition columns for parallel distribution:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\sales\year=*/month=*\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "sales_data" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "year || '-' || month" `
--datadrivenquery "select distinct year || '-' || month as partition_key from read_parquet('D:\DATA\sales\year=*/month=*\*.parquet', hive_partitioning=true)" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
Nested Structures in Parquet
Accessing Nested Fields
Extract specific fields from nested Parquet structures:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id, customer.name as customer_name, customer.email as customer_email, order_date, total FROM read_parquet('D:\DATA\orders.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "sales" `
--targetschema "dbo" `
--targettable "orders_flat" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Flattening Nested Arrays with UNNEST
Flatten arrays within Parquet files:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT order_id, order_date, unnest(line_items, recursive := true) FROM read_parquet('D:\DATA\orders_nested.parquet')" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "order_items" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Complex Nested Structures
Handle complex nested Parquet data with multiple levels:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT day, unnest(contexte_chantier, recursive := true), unnest(detail_chantier, recursive := true) FROM read_parquet('D:\DATA\chantier\**\*.parquet', hive_partitioning=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "construction" `
--targetschema "dbo" `
--targettable "site_details" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "day" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 8
Advanced Filtering
Predicate Pushdown
Apply filters to reduce data read from Parquet files:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet('D:\DATA\events\*.parquet') WHERE event_date >= '2024-01-01' AND event_type = 'purchase'" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "analytics" `
--targetschema "dbo" `
--targettable "purchases" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Append"
DuckDB pushes filters down to the Parquet reader, which:
- Skips entire row groups that don't match the filter
- Reduces memory usage
- Significantly improves performance for selective queries
- Works best with filters on partition columns or columns with statistics
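To verify that a filter is pushed down, run EXPLAIN on the same query in the DuckDB CLI; when pushdown applies, the filter conditions appear under the Parquet scan operator in the plan:
EXPLAIN SELECT * FROM read_parquet('D:\DATA\events\*.parquet')
WHERE event_date >= '2024-01-01' AND event_type = 'purchase';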
Column Pruning with Selection
Select only needed columns for maximum performance:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT customer_id, order_date, total_amount, status FROM read_parquet('D:\DATA\orders_large.parquet') WHERE order_date >= '2024-01-01'" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:5432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "sales" `
--targetschema "public" `
--targettable "recent_orders" `
--parallelmethod "Random" `
--distributeKeyColumn "customer_id" `
--mapmethod "Name" `
--loadmode "Append" `
--degree 12
Schema and Metadata
Union Multiple Parquet Files
Union Parquet files with compatible schemas:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_parquet(['D:\DATA\2023\*.parquet', 'D:\DATA\2024\*.parquet'], union_by_name=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "archive" `
--targetschema "dbo" `
--targettable "historical_data" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
union_by_name=true allows reading files with different column orders or optional columns:
- Columns are matched by name instead of position
- Missing columns are filled with NULL
- Extra columns are ignored
- Useful for schema evolution scenarios
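As an illustration, suppose the 2024 files add a column that the 2023 files do not have (called channel here purely as a hypothetical example). With union_by_name=true the column is readable across both sets, and rows coming from the 2023 files carry NULL:
SELECT channel, count(*) AS row_count
FROM read_parquet(['D:\DATA\2023\*.parquet', 'D:\DATA\2024\*.parquet'], union_by_name=true)
GROUP BY channel;
-- rows from the 2023 files (which lack the hypothetical channel column) show channel = NULL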
Type Conversion
Convert Parquet data types during import:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id, CAST(created_at AS TIMESTAMP) as created_at, amount::DECIMAL(18,2) as amount, status FROM read_parquet('D:\DATA\transactions.parquet')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "finance" `
--targetschema "dbo" `
--targettable "transactions" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"