CSV Import with DuckDB
FastTransfer leverages DuckDB's powerful read_csv function to efficiently import CSV files into any supported database. DuckDB provides advanced features like automatic schema detection, parallel processing, and support for complex CSV formats.
DuckDB offers several advantages for CSV processing:
- Automatic schema detection - No need to manually define column types (see the preview sketch after this list)
- High performance - Optimized CSV parser with parallel reading
- Flexible formatting - Handle various delimiters, quotes, and escape characters
- Multi-file support - Process multiple CSV files with glob patterns
- In-memory processing - No need for a persistent database file
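To preview the schema DuckDB will infer before running a transfer, you can run a DESCRIBE over the same read_csv call in any DuckDB client (a quick sanity check; the file path here is the sample file used in the first example below):
DESCRIBE SELECT * FROM read_csv('D:\DATA\arborescence.csv');
The output lists the detected column names and types, which helps when defining the target table and choosing the column mapping.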
Basic CSV Import
Single Small CSV File
Import a single CSV file using DuckDB's in-memory mode:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\DATA\arborescence.csv') rcsv" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "Edilians" `
--targetschema "dbo" `
--targettable "arborescence" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
The :memory: server parameter tells DuckDB to operate in-memory without creating a database file. This is perfect for one-time CSV imports.
Medium CSV File - Single Thread
Import a medium-sized CSV file to PostgreSQL:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "None" `
--mapmethod "Name" `
--loadmode "Truncate"
Parallel CSV Import
DataDriven Method with Calculated Column
Use a calculated distribution key for parallel processing:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(id/10000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
The expression (id/10000)::int divides the id column by 10,000 and casts the result to an integer, so each distinct value covers a block of 10,000 consecutive ids. When ids are roughly sequential, this yields evenly sized chunks for parallel processing.
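Before launching the transfer, it can be worth checking that the key really produces balanced chunks. A minimal sketch, run in any DuckDB client against the same file (it assumes the id column detected from the CSV header):
SELECT (id/10000)::int AS chunk, count(*) AS row_count
FROM read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv')
GROUP BY ALL
ORDER BY chunk;
If some chunks are much larger than others, adjust the divisor or switch to another key column.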
Random Method on Large CSV
Process a large CSV file with Random parallelization:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\temp\lineitem*.CSV')" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "Random" `
--distributeKeyColumn "l_orderkey" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
DataDriven Method with Custom Query
Use a custom DataDriven query to calculate optimal distribution:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\temp\lineitem*.CSV') rcsv" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(l_orderkey/5000000)::int" `
--datadrivenquery "select distinct (l_orderkey/5000000)::int l_orderkey_group from read_csv('D:\temp\lineitem*.CSV') rcsv" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 13
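To choose a sensible --degree, it can help to know how many distinct groups the datadrivenquery returns; if there are far fewer groups than threads, some threads will sit idle. A minimal sketch, run in any DuckDB client, using the same expression:
SELECT count(DISTINCT (l_orderkey/5000000)::int) AS group_count
FROM read_csv('D:\temp\lineitem*.CSV');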
Multiple CSV Files
Parallel Processing by Filename
Process multiple CSV files in parallel using the filename as the distribution key:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * exclude filename FROM read_csv('D:\temp\FastBCP_lineitem_Extract\lineitem*.CSV', filename=true)" `
--targetconnectiontype "msbulk" `
--targetserver "localhost" `
--targettrusted `
--targetdatabase "tpch_test" `
--targetschema "dbo" `
--targettable "lineitem" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "filename" `
--datadrivenquery "select file from glob('D:\temp\FastBCP_lineitem_Extract\*.CSV')" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 16
- Use filename=true in read_csv() to add a filename column
- Use exclude filename to remove it from the final output
- The glob() function returns the list of files for parallel processing (see the sketch after this list)
- Each thread processes one or more complete files
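To preview how the work will be split, the filename column added by filename=true can be inspected directly in DuckDB. A minimal sketch, run in any DuckDB client against the same file pattern:
SELECT filename, count(*) AS row_count
FROM read_csv('D:\temp\FastBCP_lineitem_Extract\lineitem*.CSV', filename=true)
GROUP BY ALL
ORDER BY filename;
Each distinct filename corresponds to a unit of work that a thread will process.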
Advanced CSV Formats
Explicit Format Specification
When automatic detection doesn't work, specify the CSV format explicitly:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT * FROM read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv', auto_detect=false, delim=',', quote='""', escape='""', new_line='\n', skip=0, comment='', header=true, columns={'id': 'BIGINT', 'url': 'VARCHAR', 'title': 'VARCHAR', 'text': 'VARCHAR', 'title_vector': 'VARCHAR', 'content_vector': 'VARCHAR', 'vector_id': 'BIGINT'}) rcsv" `
--targetconnectiontype "pgcopy" `
--targetserver "localhost:35432" `
--targetuser "postgres" `
--targetpassword "YourPassword" `
--targetdatabase "wikipedia" `
--targetschema "public" `
--targettable "articles" `
--parallelmethod "DataDriven" `
--distributeKeyColumn "(id/10000)::int" `
--mapmethod "Name" `
--loadmode "Truncate" `
--degree 12
Format Parameters:
- auto_detect=false - Disable automatic schema detection
- delim=',' - Field delimiter (comma)
- quote='"' - Quote character for fields containing delimiters
- escape='"' - Escape character for quotes within quoted fields
- new_line='\n' - Line terminator
- skip=0 - Number of lines to skip at the start
- comment='' - Character indicating comment lines to ignore
- header=true - First line contains column names
- columns={...} - Explicit column names and types
CSV with Vector Data to SQL Server
Import a CSV containing vector embeddings, casting the vector columns to VARCHAR:
.\FastTransfer.exe `
--sourceconnectiontype "duckdbstream" `
--sourceserver ":memory:" `
--query "SELECT id,url,title,text,title_vector::varchar tv,content_vector::varchar cv,vector_id from read_csv('D:\OpenData\wikidata\vector_database_wikipedia_articles_embedded.csv') limit 100" `
--targetconnectiontype "msbulk" `
--targetserver "localhost\SS2025" `
--targettrusted `
--targetdatabase "wikipedia" `
--targetschema "dbo" `
--targettable "wikipedia_articles_embeddings_as_vector" `
--mapmethod "Name" `
--loadmode "Append"
Common CSV Scenarios
Tab-Delimited Files
For tab-separated values (TSV):
--query "SELECT * FROM read_csv('file.tsv', delim='\t')"
CSV with Different Encodings
--query "SELECT * FROM read_csv('file.csv', encoding='UTF-8')"
CSV with Custom Quote Character
--query "SELECT * FROM read_csv('file.csv', quote='''', escape='''')"
CSV Without Headers
--query "SELECT * FROM read_csv('file.csv', header=false, columns={'col1': 'VARCHAR', 'col2': 'INTEGER'})"
Skip Header Lines
--query "SELECT * FROM read_csv('file.csv', skip=3)"