Parallel Parameters
FastTransfer can parallelize data export to significantly improve performance. This section covers the parameters that control parallel execution.
Parallel Method
Use the -m or --parallelmethod parameter to specify how data will be split across parallel threads.
Windows:
.\FastTransfer.exe `
...
--parallelmethod Ntile `
...
Linux:
./FastTransfer \
...
--parallelmethod Ntile \
...
Syntax:
- Short form: -m <method>
- Long form: --parallelmethod <method>
Available Methods
None - No parallel processing. Data is exported sequentially.
DataDriven - Uses all values of a column (or a given list provided by --datadrivenquery) to split the export. If the number of values is greater than the degree of parallelism, throttling will be applied.
You can use an expression in the distribute key column instead of a column name.
Example: YEAR(o_orderdate)
Ntile - Uses the distributed column field and ntile values to retrieve evenly distributed chunks of data. Each parallel thread exports a portion based on a range built using the distributed column values. The uniqueness of the distributed column is not mandatory.
RangeId - Uses the distributed column field and its min/max values to retrieve chunks of data. Each parallel thread exports a portion based on a range built using the distributed column values.
Random - Requires a distribution column that must be an integer/bigint with many values (at least as many as the degree of parallelism).
Ctid - Uses an internal hidden field to retrieve chunks of data. Each parallel thread exports a portion based on the CTID range.
Ctid parallel method is for PostgreSQL databases only (and some PostgreSQL-compatible databases).
Physloc - Uses an internal hidden field to retrieve chunks of data. Each parallel thread exports a portion based on the Physloc range.
Physloc parallel method is for SQL Server databases only.
Rowid - Uses an internal hidden field to retrieve chunks of data. Each parallel thread exports a portion based on the ROWID range.
RowId parallel method is for Oracle databases only.
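The difference between RangeId (ranges built from the min/max of the key) and Ntile (ranges holding a similar number of rows) is easiest to see on skewed data. The following is a minimal Python sketch of the two ideas, not FastTransfer's implementation: the function names are hypothetical, the keys are assumed to be distinct integers, and the actual queries the tool runs may differ.

```python
def range_id_bounds(lo: int, hi: int, dop: int) -> list[tuple[int, int]]:
    """Split the closed interval [lo, hi] into up to `dop` ranges of near-equal width."""
    span = hi - lo + 1
    bounds = []
    start = lo
    for i in range(dop):
        # Spread the remainder over the first ranges so widths differ by at most 1.
        width = span // dop + (1 if i < span % dop else 0)
        if width == 0:
            continue
        bounds.append((start, start + width - 1))
        start += width
    return bounds


def ntile_bounds(values: list[int], dop: int) -> list[tuple[int, int]]:
    """Sort the key values and cut them into up to `dop` tiles of near-equal row count.

    Assumes distinct values for simplicity; duplicates that straddle a tile
    boundary would need extra handling that is not shown here.
    """
    ordered = sorted(values)
    n = len(ordered)
    bounds = []
    start = 0
    for i in range(dop):
        size = n // dop + (1 if i < n % dop else 0)
        if size == 0:
            continue
        tile = ordered[start:start + size]
        bounds.append((tile[0], tile[-1]))
        start += size
    return bounds


def rows_per_range(values: list[int], bounds: list[tuple[int, int]]) -> list[int]:
    """Count how many key values fall into each closed range."""
    return [sum(lo <= v <= hi for v in values) for lo, hi in bounds]


# Skewed keys: nine small values and one outlier.
keys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
by_range = range_id_bounds(min(keys), max(keys), 4)
by_ntile = ntile_bounds(keys, 4)
print(rows_per_range(keys, by_range))  # [9, 0, 0, 1]
print(rows_per_range(keys, by_ntile))  # [3, 3, 2, 2]
```

With these keys, the min/max split leaves two of the four threads with nothing to export, while the row-count split gives every thread a similar share. When key values are spread evenly, both approaches produce comparable chunks.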
Methods Comparison
| Method | Parallel | Needs Distributed Column | Database Source Type |
|---|---|---|---|
| None | No | No | Any |
| Random | Yes | Yes | Any |
| DataDriven | Yes | Yes | Any |
| RangeId | Yes | Yes | Any |
| Ntile | Yes | Yes | Any |
| Ctid | Yes | No | PostgreSQL (pgsql/pgcopy) |
| Physloc | Yes | No | SQL Server (mssql) |
| Rowid | Yes | No | Oracle (oraodp) |
Distribute Key Column
Use the -c or --distributekeycolumn parameter to define the column (or computation) on the data source that will be used to split the data into several parts.
FastTransfer will use SQL queries that run in parallel against the source. Each query will have a WHERE clause that retrieves a part of the total data.
This parameter is mandatory when using methods that require a distributed column: Random, DataDriven, RangeId, or Ntile.
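To make the "one WHERE clause per parallel query" idea concrete, here is a small Python sketch that builds one predicate per slice of work. It is an illustration only: the helper names are hypothetical, and the exact SQL that FastTransfer generates is not documented in this section and may look different.

```python
def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (numbers as-is, anything else quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quote so the literal stays well-formed.
    return "'" + str(value).replace("'", "''") + "'"


def range_predicates(key_expr: str, bounds: list[tuple[int, int]]) -> list[str]:
    """One predicate per closed range, as a range-based method might use."""
    return [f"{key_expr} BETWEEN {lo} AND {hi}" for lo, hi in bounds]


def value_predicates(key_expr: str, values: list) -> list[str]:
    """One predicate per distinct value, as a value-based method might use."""
    return [f"{key_expr} = {sql_literal(v)}" for v in values]


# The key can be a plain column name or an expression.
print(range_predicates("o_orderkey", [(1, 250), (251, 500)]))
print(value_predicates("YEAR(o_orderdate)", [1997, 1998]))
```

Each predicate selects a disjoint part of the source data, so the parallel queries together cover the rows once without overlap.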
Windows:
# Using a column name
.\FastTransfer.exe `
...
--distributekeycolumn order_date `
...
# Using an expression
.\FastTransfer.exe `
...
--distributekeycolumn "YEAR(order_date)" `
...
Linux:
# Using a column name
./FastTransfer \
...
--distributekeycolumn order_date \
...
# Using an expression
./FastTransfer \
...
--distributekeycolumn "YEAR(order_date)" \
...
Syntax:
- Short form: -c <column_name>
- Long form: --distributekeycolumn <column_name>
Degree of Parallelism
Use the -p or --paralleldegree parameter to control how many parallel threads will be used for the export.
DOP Values
Positive value (e.g., 4) - Uses exactly that number of parallel threads. If greater than the number of CPU cores/threads, it will be downscaled to match available cores.
0 - Automatically aligns with the number of cores (or threads if Hyper-Threading is enabled) on the machine.
Negative value (e.g., -2) - Computed as number of cores / abs(dop). For example, if you have 16 cores and set DOP to -2, the actual DOP will be 8.
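The three rules above can be summarized in a few lines of Python. This is a sketch of the documented behavior, not the tool's code: the function name is hypothetical, and the use of integer division and a floor of one thread for negative values are assumptions.

```python
import os


def effective_dop(requested: int, cores: int) -> int:
    """Return the number of parallel threads implied by a --paralleldegree value."""
    if requested > 0:
        # Positive: use the value as given, capped at the available cores.
        return min(requested, cores)
    if requested == 0:
        # Zero: align with the number of cores.
        return cores
    # Negative: cores / abs(dop). Integer division and the floor of 1 are assumptions.
    return max(1, cores // abs(requested))


print(effective_dop(4, 16))    # 4
print(effective_dop(32, 16))   # 16 (downscaled to the core count)
print(effective_dop(0, 16))    # 16
print(effective_dop(-2, 16))   # 8

# On the current machine (os.cpu_count() can return None, hence the fallback):
print(effective_dop(-2, os.cpu_count() or 1))
```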
Windows:
# Use 8 parallel threads
.\FastTransfer.exe `
...
--paralleldegree 8 `
...
# Auto-detect based on CPU cores
.\FastTransfer.exe `
...
--paralleldegree 0 `
...
# Use half of available cores
.\FastTransfer.exe `
...
--paralleldegree -2 `
...
Linux:
# Use 8 parallel threads
./FastTransfer \
...
--paralleldegree 8 \
...
# Auto-detect based on CPU cores
./FastTransfer \
...
--paralleldegree 0 \
...
# Use half of available cores
./FastTransfer \
...
--paralleldegree -2 \
...
Syntax:
- Short form: -p <value>
- Long form: --paralleldegree <value>
Default: -2
Data Driven Query
Use the --datadrivenquery parameter when using the DataDriven method to provide a query that returns the list of values that will be used to split the data. This allows you to filter the values that will be exported and used to split the data.
Windows:
.\FastTransfer.exe `
...
--datadrivenquery "SELECT tagname FROM tags" `
...
Linux:
./FastTransfer \
...
--datadrivenquery "SELECT tagname FROM tags" \
...
Windows:
.\FastTransfer.exe `
...
--datadrivenquery "SELECT o_orderdate FROM dim_date where ref_date > getdate() - 30" `
...
Linux:
./FastTransfer \
...
--datadrivenquery "SELECT o_orderdate FROM dim_date where ref_date > getdate() - 30" \
...
Syntax:
- Long form only: --datadrivenquery "<query>"
Merge
Use the -M or --merge flag to specify whether the files generated by the parallel export should be merged into a single file or kept split.
Without this flag, distributed files are merged for local CSV and Parquet files and kept distributed for cloud destinations.
The current version supports merging for CSV and Parquet formats only.
Merge is not available for cloud destinations.
Merge is automatic for local file destinations.
Windows:
# Merge files after parallel export (default for local files)
.\FastTransfer.exe `
...
--merge true `
...
# Keep distributed files
.\FastTransfer.exe `
...
--merge false `
...
Linux:
# Merge files after parallel export (default for local files)
./FastTransfer \
...
--merge true \
...
# Keep distributed files
./FastTransfer \
...
--merge false \
...
Syntax:
- Short form: -M
- Long form: --merge
Default: true
Complete Example
Here's a complete example using parallel parameters with the DataDriven method:
Windows:
.\FastTransfer.exe `
--connectiontype "pgcopy" `
--server "localhost:15432" `
--database "tpch" `
--user "FastUser" `
--password "FastPassword" `
--query "with T1 AS (select *, to_char(o_orderdate, 'YYYYMM') o_ordermonth from tpch_10.orders) SELECT * FROM T1" `
--directory "D:\temp\tpch\orders\" `
--fileoutput "pgcopy_orders.parquet" `
--method "DataDriven" `
--distributekeycolumn "o_ordermonth" `
--datadrivenquery "SELECT to_char(d, 'YYYYMM') AS month FROM generate_series(DATE '1998-01-01', DATE '1998-12-01',INTERVAL '1 month') AS d" `
--paralleldegree 10 `
--merge "False" `
--runid "pgcopy_to_parquet_parallel12_DataDriven"
./FastTransfer \
--connectiontype "pgcopy" \
--server "localhost:15432" \
--database "tpch" \
--user "FastUser" \
--password "FastPassword" \
--query "with T1 AS (select *, to_char(o_orderdate, 'YYYYMM') o_ordermonth from tpch_10.orders) SELECT * FROM T1" \
--directory "/data/export/tpch/orders/" \
--fileoutput "pgcopy_orders.parquet" \
--method "DataDriven" \
--distributekeycolumn "o_ordermonth" \
--datadrivenquery "SELECT to_char(d, 'YYYYMM') AS month FROM generate_series(DATE '1998-01-01', DATE '1998-12-01',INTERVAL '1 month') AS d" \
--paralleldegree 10 \
--merge "False" \
--runid "pgcopy_to_parquet_parallel12_DataDriven"
This example:
- Connects to a PostgreSQL database on localhost:15432
- Exports data from the tpch_10.orders table
- Uses the DataDriven distribution method
- Distributes work based on the month of the order date using an expression
- Uses a lightweight custom query to generate the month values to extract (instead of running a SELECT DISTINCT o_ordermonth against the source query)
- Uses 10 parallel threads (even though there are 12 months to export)
- Keeps distributed files without merging (--merge false)
- Assigns a custom Run ID for tracking
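The example exports 12 month values with only 10 threads, which is the throttling case described for the DataDriven method. A rough Python sketch of that scheduling pattern follows, assuming a simple worker pool: `run_throttled` and `fake_export` are hypothetical stand-ins, and FastTransfer's internal scheduling may work differently.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(values, dop, export_one):
    """Process every value, with at most `dop` exports running at the same time."""
    with ThreadPoolExecutor(max_workers=dop) as pool:
        # map() preserves input order; extra values wait until a worker is free.
        return list(pool.map(export_one, values))


# Twelve month keys, as produced by the data-driven query in the example.
months = [f"1998{m:02d}" for m in range(1, 13)]

# Track how many fake exports run concurrently.
stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_export(month):
    """Stand-in for exporting one slice of data; it only sleeps briefly."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)
    with lock:
        stats["active"] -= 1
    return month


results = run_throttled(months, 10, fake_export)
print(len(results), "slices exported, peak concurrency:", stats["peak"])
```

All 12 slices are processed, but never more than 10 at once: the last values start as soon as earlier ones finish.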