Import XLSX Files to Citus Data

    Fast, parallel file import using DuckDBStream

    .\FastTransfer.exe `
      --sourceconnectiontype "duckdbstream" `
      --sourceserver ":memory:" `
      --sourceuser "your-username" `
      --sourcepassword "your-password" `
      --query "SELECT * FROM read_xlsx('D:\path\to\files\*.xlsx', filename=true)" `
      --targetconnectiontype "pgcopy" `
      --targetserver "your-server" `
      --targetuser "your-username" `
      --targetpassword "your-password" `
      --targetdatabase "your-database" `
      --targetschema "your-schema" `
      --targettable "your-table" `
      --method "DataDriven" `
      --distributekeycolumn "filename" `
      --datadrivenquery "select file from glob('D:\path\to\files\*.xlsx')" `
      --degree -2 `
      --loadmode "Truncate" `
      --mapmethod "Name"
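
    When the command is launched from a script rather than typed in a terminal, it can help to assemble the arguments programmatically so that PowerShell line continuations and nested quotes are not a concern. The following is a minimal sketch, not an official wrapper: the flags and placeholder values are copied from the command above (with the closing quote of the file pattern placed right after the path), and the names OPTIONS and build_command are hypothetical.

```python
from typing import Dict, List

# Flags and placeholder values copied from the command above. Nothing here is
# verified against the FastTransfer CLI beyond what that command shows.
OPTIONS: Dict[str, str] = {
    "--sourceconnectiontype": "duckdbstream",
    "--sourceserver": ":memory:",
    "--sourceuser": "your-username",
    "--sourcepassword": "your-password",
    "--query": r"SELECT * FROM read_xlsx('D:\path\to\files\*.xlsx', filename=true)",
    "--targetconnectiontype": "pgcopy",
    "--targetserver": "your-server",
    "--targetuser": "your-username",
    "--targetpassword": "your-password",
    "--targetdatabase": "your-database",
    "--targetschema": "your-schema",
    "--targettable": "your-table",
    "--method": "DataDriven",
    "--distributekeycolumn": "filename",
    "--datadrivenquery": r"select file from glob('D:\path\to\files\*.xlsx')",
    "--degree": "-2",
    "--loadmode": "Truncate",
    "--mapmethod": "Name",
}


def build_command(exe: str, options: Dict[str, str]) -> List[str]:
    """Flatten {flag: value} into an argv list: [exe, flag, value, ...]."""
    argv = [exe]
    for flag, value in options.items():
        argv.extend([flag, value])
    return argv


command = build_command(r".\FastTransfer.exe", OPTIONS)

# To actually run it (requires FastTransfer to be present on the machine):
#   import subprocess
#   subprocess.run(command, check=True)
```

    Because the options live in a dictionary, each flag appears only once in the resulting command, and passing the list to subprocess.run avoids a second layer of shell quoting around the SQL strings.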

    Source - Excel (XLSX)

    The Excel XLSX format is ubiquitous in enterprise environments. FastTransfer can directly read Excel files without prior conversion.

    Features:

    • Direct reading through DuckDB's read_xlsx() function, with no Excel installation required
    • Support for multiple sheets
    • Automatic header detection
    • Data type preservation

    Processing - DuckDBStream with DataDriven

    DuckDB is a fast and efficient in-process analytical database. FastTransfer uses DuckDBStream to read multiple file formats with exceptional performance.

    Parallel Method: DataDriven (Files)

    For files, FastTransfer uses the filename as the distribution key, so that multiple files are processed in parallel.

    • Concurrent processing of multiple files
    • Ideal for batch imports
    • Automatic horizontal scaling
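
    The idea behind a data-driven split can be illustrated with a short sketch. This is not FastTransfer's implementation: it only assumes that the data-driven query returns one key value per row (here, a file name, as the glob() query in the command suggests) and that each key value is handled as an independent task. The names load_file and data_driven_load, the file names, and the fake row counts are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def load_file(path: str) -> int:
    """Stand-in for reading one file and copying its rows to the target.

    Returns a fake row count (the length of the name) so the sketch runs
    without any real files or database.
    """
    return len(path)


def data_driven_load(
    keys: List[str], worker: Callable[[str], int], degree: int
) -> Dict[str, int]:
    """Process each distinct key value as its own task,
    running at most `degree` tasks at a time."""
    with ThreadPoolExecutor(max_workers=degree) as pool:
        results = list(pool.map(worker, keys))  # map preserves input order
    return dict(zip(keys, results))


files = ["jan.xlsx", "feb.xlsx", "march.xlsx"]  # hypothetical key values
loaded = data_driven_load(files, load_file, degree=2)
```

    Since every task touches a different file, the workers never contend for the same input, which is why a per-file key suits batch imports of many workbooks.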

    Destination - Citus Data

    FastTransfer uses PostgreSQL's binary COPY protocol to load Citus when the source is PostgreSQL-compatible and pgcopy is set as both the source and target connection type.

    Loading method:

    Binary COPY Protocol (Distributed)

    Advantages:

    • Binary COPY for maximum performance (PostgreSQL-compatible source only, with pgcopy as both source and target connection type)
    • Automatic distribution across shards
    • Optimized for distributed tables