Version: 0.14

CSV Import with DuckDB

FastTransfer leverages DuckDB's powerful read_csv function to efficiently import CSV files into any supported database. DuckDB provides advanced features like automatic schema detection, parallel processing, and support for complex CSV formats.

DuckDB offers several advantages for CSV processing:

  • Automatic schema detection - No need to manually define column types
  • High performance - Optimized CSV parser with parallel reading
  • Flexible formatting - Handle various delimiters, quotes, and escape characters
  • Multi-file support - Process multiple CSV files with glob patterns
  • In-memory processing - No need for a persistent database file
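
For example, before running a transfer you can check what DuckDB's sniffer infers from a file. The snippet below is a minimal sketch meant to be run directly in the DuckDB CLI (the path is a placeholder); DESCRIBE reports the detected column names and types:

DESCRIBE SELECT * FROM read_csv('D:\DATA\example.csv');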

Basic CSV Import

Single Small CSV File

Import a single CSV file using DuckDB's in-memory mode:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\DATA\arborescence.csv') rcsv" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "Edilians" `
--targetschema "dbo" `
--targettable "arborescence" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Using :memory:

The :memory: server parameter tells DuckDB to operate in-memory without creating a database file. This is perfect for one-time CSV imports.

Medium CSV File - Single Thread

Import a medium-sized CSV file to PostgreSQL:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"

Parallel CSV Import

DataDriven Method with Calculated Column

Use a calculated distribution key for parallel processing:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(id/10000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12

Distribution Key Calculation

The expression (id/10000)::int divides the id column by 10,000 and casts the result to an integer, bucketing rows into groups of roughly 10,000 consecutive IDs that can be distributed evenly across the parallel workers.
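
To verify that the chunks are reasonably balanced before launching the transfer, you can run a quick aggregation directly in DuckDB. This is a sketch reusing the file path and id column from the example above:

SELECT (id/10000)::int AS chunk, count(*) AS rows_per_chunk
FROM read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv')
GROUP BY 1
ORDER BY 1;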

Random Method on Large CSV

Process a large CSV file with Random parallelization:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\temp\lineitem*.CSV')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "Random" `
--distributeKeyColumn "l_orderkey" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16

DataDriven Method with Custom Query

Use a custom DataDriven query to calculate optimal distribution:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\temp\lineitem*.CSV') rcsv" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(l_orderkey/5000000)::int" `
--datadrivenquery "select distinct (l_orderkey/5000000)::int l_orderkey_group from read_csv('D:\temp\lineitem*.CSV') rcsv" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 13
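
The --datadrivenquery returns the distinct values of the distribution key, which FastTransfer uses to split the load across workers. Before settling on a divisor such as 5,000,000, it can help to check how many groups the expression produces relative to --degree; having noticeably more groups than workers generally smooths out load balancing. A sketch reusing the example's path:

SELECT count(DISTINCT (l_orderkey/5000000)::int) AS group_count
FROM read_csv('D:\temp\lineitem*.CSV');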

Multiple CSV Files

Parallel Processing by Filename

Process multiple CSV files in parallel using the filename as the distribution key:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * exclude filename FROM read_csv('D:\temp\FastBCP_lineitem_Extract\lineitem*.CSV', filename=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "filename" `
--datadrivenquery "select file from glob('D:\temp\FastBCP_lineitem_Extract\*.CSV')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16

Filename as Distribution Key
  • Use filename=true in read_csv() to add a filename column
  • Use exclude filename to remove it from the final output
  • The glob() function returns the list of files for parallel processing
  • Each thread processes one or more complete files
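
To confirm which files the glob pattern will match before launching the transfer, you can run the glob() part of the data-driven query on its own (a sketch reusing the path from the example):

SELECT file FROM glob('D:\temp\FastBCP_lineitem_Extract\*.CSV');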

Advanced CSV Formats

Explicit Format Specification

When automatic detection doesn't work, specify the CSV format explicitly:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv', auto_detect=false, delim=',', quote='""', escape='""', new_line='\n', skip=0, comment='', header=true, columns={'id': 'BIGINT', 'url': 'VARCHAR', 'title': 'VARCHAR', 'text': 'VARCHAR', 'title_vector': 'VARCHAR', 'content_vector': 'VARCHAR', 'vector_id': 'BIGINT'}) rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(id/10000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12

Format Parameters:

  • auto_detect=false - Disable automatic schema detection
  • delim=',' - Field delimiter (comma)
  • quote='"' - Quote character for fields containing delimiters
  • escape='"' - Escape character for quotes within quoted fields
  • new_line='\n' - Line terminator
  • skip=0 - Number of lines to skip at the start
  • comment='' - Character indicating comment lines to ignore
  • header=true - First line contains column names
  • columns={...} - Explicit column names and types
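
You rarely need to disable auto-detection entirely: options passed explicitly to read_csv take precedence over the sniffer, so you can override only the settings that are detected incorrectly. A minimal sketch ('file.csv' is a placeholder):

--query "SELECT * FROM read_csv('file.csv', delim=';', header=true)"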

CSV with Vector Data to SQL Server

Import a CSV file containing vector embeddings, casting the embedding columns to VARCHAR:

.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id,url,title,text,title_vector::varchar tv,content_vector::varchar cv,vector_id from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') limit 100" `
--targetconnectiontype "msbulk" `
--targetserver "localhost\SS2025" `
--targettrusted `
--targetdatabase "wikipedia" `
--targetschema "dbo" `
--targettable "wikipedia_articles_embeddings_as_vector" `
--mapmethod "Name" `
--loadmode "Append"

Common CSV Scenarios

Tab-Delimited Files

For tab-separated values (TSV):

--query "SELECT * FROM read_csv('file.tsv', delim='\t')"

CSV with Different Encodings

--query "SELECT * FROM read_csv('file.csv', encoding='UTF-8')"

CSV with Custom Quote Character

--query "SELECT * FROM read_csv('file.csv', quote='''', escape='''')"

CSV Without Headers

--query "SELECT * FROM read_csv('file.csv', header=false, columns={'col1': 'VARCHAR', 'col2': 'INTEGER'})"

Skip Header Lines

--query "SELECT * FROM read_csv('file.csv', skip=3)"