CSV Import with DuckDB
FastTransfer leverages DuckDB's powerful read_csv function to efficiently import CSV files into any supported database. DuckDB provides advanced features like automatic schema detection, parallel processing, and support for complex CSV formats.
DuckDB offers several advantages for CSV processing:
- Automatic schema detection - No need to manually define column types (see the preview sketch after this list)
- High performance - Optimized CSV parser with parallel reading
- Flexible formatting - Handle various delimiters, quotes, and escape characters
- Multi-file support - Process multiple CSV files with glob patterns
- In-memory processing - No need for a persistent database file
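To preview the schema DuckDB will infer before running a transfer, you can run a DESCRIBE over the same read_csv call in any DuckDB client (a quick sanity check; the file path here is the sample file used in the first example below):
DESCRIBE SELECT * FROM read_csv('D:\DATA\arborescence.csv');
The output lists the detected column names and types, which helps when defining the target table and choosing the column mapping.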
Basic CSV Import
Single Small CSV File
Import a single CSV file using DuckDB's in-memory mode:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\DATA\arborescence.csv') rcsv" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "Edilians" `
--targetschema "dbo" `
--targettable "arborescence" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
The :memory: server parameter tells DuckDB to operate in-memory without creating a database file. This is perfect for one-time CSV imports.
Medium CSV File - Single Thread
Import a medium-sized CSV file to PostgreSQL:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Parallel CSV Import
DataDriven Method with Calculated Column
Use a calculated distribution key for parallel processing:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(id/10000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
The expression (id/10000)::int divides the id column by 10,000 and casts the result to an integer, so each distinct value covers a block of 10,000 consecutive ids. When ids are roughly sequential, this yields evenly sized chunks for parallel processing.
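Before launching the transfer, it can be worth checking that the key really produces balanced chunks. A minimal sketch, run in any DuckDB client against the same file (it assumes the id column detected from the CSV header):
SELECT (id/10000)::int AS chunk, count(*) AS row_count
FROM read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv')
GROUP BY ALL
ORDER BY chunk;
If some chunks are much larger than others, adjust the divisor or switch to another key column.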
Random Method on Large CSV
Process a large CSV file with Random parallelization:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\temp\lineitem*.CSV')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "Random" `
--distributeKeyColumn "l_orderkey" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
DataDriven Method with Custom Query
Use a custom DataDriven query to calculate optimal distribution:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\temp\lineitem*.CSV') rcsv" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(l_orderkey/5000000)::int" `
--datadrivenquery "select distinct (l_orderkey/5000000)::int l_orderkey_group from read_csv('D:\temp\lineitem*.CSV') rcsv" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 13
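To choose a sensible --degree, it can help to know how many distinct groups the datadrivenquery returns; if there are far fewer groups than threads, some threads will sit idle. A minimal sketch, run in any DuckDB client, using the same expression:
SELECT count(DISTINCT (l_orderkey/5000000)::int) AS group_count
FROM read_csv('D:\temp\lineitem*.CSV');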
Multiple CSV Files
Parallel Processing by Filename
Process multiple CSV files in parallel using the filename as the distribution key:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * exclude filename FROM read_csv('D:\temp\FastBCP_lineitem_Extract\lineitem*.CSV', filename=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "filename" `
--datadrivenquery "select file from glob('D:\temp\FastBCP_lineitem_Extract\*.CSV')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
- Use filename=true in read_csv() to add a filename column
- Use exclude filename to remove it from the final output
- The glob() function returns the list of files for parallel processing (see the sketch after this list)
- Each thread processes one or more complete files
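To preview how the work will be split, the filename column added by filename=true can be inspected directly in DuckDB. A minimal sketch, run in any DuckDB client against the same file pattern:
SELECT filename, count(*) AS row_count
FROM read_csv('D:\temp\FastBCP_lineitem_Extract\lineitem*.CSV', filename=true)
GROUP BY ALL
ORDER BY filename;
Each distinct filename corresponds to a unit of work that a thread will process.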
Advanced CSV Formats
Explicit Format Specification
When automatic detection doesn't work, specify the CSV format explicitly:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv', auto_detect=false, delim=',', quote='""', escape='""', new_line='\n', skip=0, comment='', header=true, columns={'id': 'BIGINT', 'url': 'VARCHAR', 'title': 'VARCHAR', 'text': 'VARCHAR', 'title_vector': 'VARCHAR', 'content_vector': 'VARCHAR', 'vector_id': 'BIGINT'}) rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(id/10000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
Format Parameters:
- auto_detect=false - Disable automatic schema detection
- delim=',' - Field delimiter (comma)
- quote='"' - Quote character for fields containing delimiters
- escape='"' - Escape character for quotes within quoted fields
- new_line='\n' - Line terminator
- skip=0 - Number of lines to skip at the start
- comment='' - Character indicating comment lines to ignore
- header=true - First line contains column names
- columns={...} - Explicit column names and types
CSV with Vector Data to SQL Server
Import a CSV containing vector embeddings, casting the vector columns to VARCHAR:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id,url,title,text,title_vector::varchar tv,content_vector::varchar cv,vector_id from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') limit 100" `
--targetconnectiontype "msbulk" `
--targetserver "localhost\SS2025" `
--targettrusted `
--targetdatabase "wikipedia" `
--targetschema "dbo" `
--targettable "wikipedia_articles_embeddings_as_vector" `
--mapmethod "Name" `
--loadmode "Append"
Common CSV Scenarios
Tab-Delimited Files
For tab-separated values (TSV):
--query "SELECT * FROM read_csv('file.tsv', delim='\t')"
CSV with Different Encodings
--query "SELECT * FROM read_csv('file.csv', encoding='UTF-8')"
CSV with Custom Quote Character
--query "SELECT * FROM read_csv('file.csv', quote='''', escape='''')"
CSV Without Headers
--query "SELECT * FROM read_csv('file.csv', header=false, columns={'col1': 'VARCHAR', 'col2': 'INTEGER'})"
Skip Header Lines
--query "SELECT * FROM read_csv('file.csv', skip=3)"