Parallel Parameters

FastTransfer can parallelize data export to significantly improve performance. This section covers the parameters that control parallel execution.

Parallel Method

Use the -m or --parallelmethod parameter to specify how data will be split across parallel threads.

Windows:

.\FastTransfer.exe `
...
 --parallelmethod Ntile `
...

Linux:

./FastTransfer \
...
 --parallelmethod Ntile \
...

Syntax:

Short form: -m <method>
Long form: --parallelmethod <method>

Available Methods

None - No parallel processing. Data is exported sequentially.

DataDriven - Splits the export by the values of the distribute key column (or by the list of values returned by --datadrivenquery). If there are more values than the degree of parallelism, throttling is applied so that the number of concurrent exports stays within the degree of parallelism.

tip

You can use an expression in the distribute key column instead of a column name.
Example: YEAR(o_orderdate)
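The throttling behavior described for DataDriven can be pictured with a small worker pool. The sketch below is only an illustration of the idea, not FastTransfer's implementation: `export_all`, `ConcurrencyProbe`, and the shape of the generated predicate are all hypothetical names and assumptions. It runs one unit of work per value while never running more than `dop` at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def export_all(values, dop, export_one):
    """Run export_one(value) for every value, at most `dop` at a time."""
    with ThreadPoolExecutor(max_workers=dop) as pool:
        # pool.map keeps results in the same order as `values`.
        return list(pool.map(export_one, values))


class ConcurrencyProbe:
    """Stand-in for a real export; records the peak number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, value):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # simulate some work
        with self._lock:
            self._active -= 1
        # Assumed predicate shape: one equality filter per value.
        return f"o_ordermonth = '{value}'"


# Twelve monthly values processed with a degree of parallelism of 10.
months = [f"1998{m:02d}" for m in range(1, 13)]
probe = ConcurrencyProbe()
predicates = export_all(months, dop=10, export_one=probe)
```

With 12 values and a degree of 10, all 12 values are still processed; the extra values simply wait for a free worker.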

Ntile - Uses the distributed column field and ntile values to retrieve evenly distributed chunks of data. Each parallel thread exports a portion based on a range built using the distributed column values. The uniqueness of the distributed column is not mandatory.

RangeId - Uses the distributed column field and its min/max values to retrieve chunks of data. Each parallel thread exports a portion based on a range built using the distributed column values.
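To make the min/max idea concrete, here is a minimal sketch of how an integer key range could be cut into contiguous, non-overlapping chunks, one per thread. It is an assumption about the general technique, not the queries FastTransfer actually generates; `range_predicates` and the `o_orderkey` column are hypothetical.

```python
def range_predicates(column, lo, hi, parts):
    """Split the inclusive integer range [lo, hi] into up to `parts` predicates."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1
    parts = min(parts, total)  # never create empty ranges
    base, extra = divmod(total, parts)
    predicates = []
    start = lo
    for i in range(parts):
        # The first `extra` ranges take one more value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        predicates.append(f"{column} BETWEEN {start} AND {end}")
        start = end + 1
    return predicates


for predicate in range_predicates("o_orderkey", 1, 100, 4):
    print(predicate)
```

Because the ranges are built from the key values rather than from row counts, chunks can be uneven when the key is skewed; that is the case Ntile is described as addressing.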

Random - Requires a distribution column that must be an integer/bigint with many values (at least as many as the degree of parallelism).

Ctid - Uses an internal hidden field to retrieve chunks of data. Each parallel thread exports a portion based on the CTID range.

note

Ctid parallel method is for PostgreSQL databases only (and some PostgreSQL-compatible databases).

Physloc - Uses an internal hidden field to retrieve chunks of data. Each parallel thread exports a portion based on the Physloc range.

note

Physloc parallel method is for SQL Server databases only.

Rowid - Uses an internal hidden field to retrieve chunks of data. Each parallel thread exports a portion based on the ROWID range.

note

Rowid parallel method is for Oracle databases only.

Methods Comparison

Method	Parallel	Needs Distributed Column	Database Source Type
`None`	No	No	Any
`Random`	Yes	Yes	Any
`DataDriven`	Yes	Yes	Any
`RangeId`	Yes	Yes	Any
`Ntile`	Yes	Yes	Any
`Ctid`	Yes	No	PostgreSQL (pgsql/pgcopy)
`Physloc`	Yes	No	SQL Server (mssql)
`Rowid`	Yes	No	Oracle (oraodp)

Distribute Key Column

Use the -c or --distributekeycolumn parameter to define the column (or computation) on the data source that will be used to split the data into several parts.

FastTransfer will use SQL queries that run in parallel against the source. Each query will have a WHERE clause that retrieves a part of the total data.

Information

This parameter is mandatory when using methods that require a distributed column: Random, DataDriven, RangeId, or Ntile.
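If you wrap FastTransfer in your own scripts, the rule above can be checked before launching a job. The helper below is a hypothetical pre-flight check, not part of FastTransfer; it assumes method names are matched exactly as spelled in this section.

```python
# Methods that need --distributekeycolumn, per the note above.
METHODS_NEEDING_KEY = {"Random", "DataDriven", "RangeId", "Ntile"}
# Every method name listed under Available Methods.
KNOWN_METHODS = METHODS_NEEDING_KEY | {"None", "Ctid", "Physloc", "Rowid"}


def check_parallel_args(method, distribute_key=None):
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    if method not in KNOWN_METHODS:
        problems.append(f"unknown parallel method: {method}")
    elif method in METHODS_NEEDING_KEY and not distribute_key:
        problems.append(f"{method} requires --distributekeycolumn")
    return problems


print(check_parallel_args("Ntile"))
print(check_parallel_args("Ntile", "order_date"))
```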

Windows:

# Using a column name
.\FastTransfer.exe `
...
 --distributekeycolumn order_date `
...

# Using an expression
.\FastTransfer.exe `
...
 --distributekeycolumn "YEAR(order_date)" `
...

Linux:

# Using a column name
./FastTransfer \
...
 --distributekeycolumn order_date \
...

# Using an expression
./FastTransfer \
...
 --distributekeycolumn "YEAR(order_date)" \
...

Syntax:

Short form: -c <column_name>
Long form: --distributekeycolumn <column_name>

Degree of Parallelism

Use the -p or --paralleldegree parameter to control how many parallel threads will be used for the export.

DOP Values

Positive value (e.g., 4) - Uses exactly that number of parallel threads. If greater than the number of CPU cores/threads, it will be downscaled to match available cores.

0 - Automatically aligns with the number of cores (or threads if Hyper-Threading is enabled) on the machine.

Negative value (e.g., -2) - Computed as number of cores / abs(dop). For example, if you have 16 cores and set DOP to -2, the actual DOP will be 8.
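The three cases above can be summarized in a few lines of code. This is a sketch of the documented rules only; `effective_dop` is a hypothetical helper, and the integer division and the floor of 1 for negative values are assumptions, since the rounding behavior is not specified here.

```python
def effective_dop(requested, cores):
    """Resolve a --paralleldegree value against the number of available cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if requested == 0:
        # 0: align with the number of cores.
        return cores
    if requested < 0:
        # Negative: cores / abs(dop). Rounding down and never going
        # below 1 are assumptions made for this sketch.
        return max(1, cores // abs(requested))
    # Positive: use the value as given, downscaled to the available cores.
    return min(requested, cores)


print(effective_dop(-2, 16))
```

On a 16-core machine this resolves the default of -2 to 8, matching the example given for negative values.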

Windows:

# Use 8 parallel threads
.\FastTransfer.exe `
...
 --paralleldegree 8 `
...

# Auto-detect based on CPU cores
.\FastTransfer.exe `
...
 --paralleldegree 0 `
...

# Use half of available cores
.\FastTransfer.exe `
...
 --paralleldegree -2 `
...

Linux:

# Use 8 parallel threads
./FastTransfer \
...
 --paralleldegree 8 \
...

# Auto-detect based on CPU cores
./FastTransfer \
...
 --paralleldegree 0 \
...

# Use half of available cores
./FastTransfer \
...
 --paralleldegree -2 \
...

Syntax:

Short form: -p <value>
Long form: --paralleldegree <value>

Default: -2

Data Driven Query

With the DataDriven method, use the --datadrivenquery parameter to supply a query that returns the list of values used to split the data. This lets you filter which values are exported instead of splitting on every value of the distribute key column.

Windows:

.\FastTransfer.exe `
...
 --datadrivenquery "SELECT tagname FROM tags" `
...

Linux:

./FastTransfer \
...
 --datadrivenquery "SELECT tagname FROM tags" \
...

Windows:

.\FastTransfer.exe `
...
 --datadrivenquery "SELECT o_orderdate FROM dim_date where ref_date > getdate() - 30" `
...

Linux:

./FastTransfer \
...
 --datadrivenquery "SELECT o_orderdate FROM dim_date where ref_date > getdate() - 30" \
...

Syntax:

Long form only: --datadrivenquery "<query>"

Merge

Use the -M or --merge flag to specify whether the temporary files generated by the parallel export are merged into a single output file or kept split.

Without this flag, distributed files are merged for local CSV and Parquet files and kept distributed for cloud destinations.

warning

The current version supports merging for CSV and Parquet formats only.

warning

Merge is not available for cloud destinations.

warning

Merge is automatic for local file destinations.

Windows:

# Merge files after parallel export (default for local files)
.\FastTransfer.exe `
...
 --merge true `
...

# Keep distributed files
.\FastTransfer.exe `
...
 --merge false `
...

Linux:

# Merge files after parallel export (default for local files)
./FastTransfer \
...
 --merge true \
...

# Keep distributed files
./FastTransfer \
...
 --merge false \
...

Syntax:

Short form: -M
Long form: --merge

Default: true

Complete Example

Here's a complete example using parallel parameters with the DataDriven method:

Windows:

.\FastTransfer.exe `
 --connectiontype "pgcopy" `
 --server "localhost:15432" `
 --database "tpch" `
 --user "FastUser" `
 --password "FastPassword" `
 --query "with T1 AS (select *, to_char(o_orderdate, 'YYYYMM') o_ordermonth from tpch_10.orders) SELECT * FROM T1" `
 --directory "D:\temp\tpch\orders\" `
 --fileoutput "pgcopy_orders.parquet" `
 --parallelmethod "DataDriven" `
 --distributekeycolumn "o_ordermonth" `
 --datadrivenquery "SELECT to_char(d, 'YYYYMM') AS month FROM generate_series(DATE '1998-01-01', DATE '1998-12-01',INTERVAL '1 month') AS d" `
 --paralleldegree 10 `
 --merge "False" `
 --runid "pgcopy_to_parquet_parallel12_DataDriven"

Linux:

./FastTransfer \
 --connectiontype "pgcopy" \
 --server "localhost:15432" \
 --database "tpch" \
 --user "FastUser" \
 --password "FastPassword" \
 --query "with T1 AS (select *, to_char(o_orderdate, 'YYYYMM') o_ordermonth from tpch_10.orders) SELECT * FROM T1" \
 --directory "/data/export/tpch/orders/" \
 --fileoutput "pgcopy_orders.parquet" \
 --parallelmethod "DataDriven" \
 --distributekeycolumn "o_ordermonth" \
 --datadrivenquery "SELECT to_char(d, 'YYYYMM') AS month FROM generate_series(DATE '1998-01-01', DATE '1998-12-01',INTERVAL '1 month') AS d" \
 --paralleldegree 10 \
 --merge "False" \
 --runid "pgcopy_to_parquet_parallel12_DataDriven"

This example:

Connects to a PostgreSQL database on localhost:15432
Exports data from the tpch_10.orders table
Uses the DataDriven distribution method
Distributes work on o_ordermonth, the month of the order date computed by an expression in the source query
Uses a lightweight custom query to generate the month values to extract (instead of running SELECT DISTINCT o_ordermonth against the source query)
Uses 10 parallel threads (even though there are 12 months to export)
Keeps distributed files without merging (--merge false)
Assigns a custom Run ID for tracking
