The glob module is part of the Python Standard Library, so no separate installation is needed.
Step-by-Step Guide
Step 1: Import the Necessary Modules
First, we need to import the modules required for our task.
    import duckdb
    import glob
Step 2: Define the File Path Pattern with Glob
The glob module allows you to use wildcard characters to match multiple files. This is extremely useful when you want to process all files of a certain type in a directory.
- Use *.csv to match all CSV files in a specific folder (e.g., ./data/*.csv).
- Use sales_*.csv to match files that start with “sales_” and end with “.csv”.
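The pattern still has to be expanded into a list of paths before the later steps can loop over it. A minimal sketch, assuming the files live under ./data (the location used in the complete script); find_csv_files is a hypothetical helper name, and the sort is added only because glob.glob returns matches in arbitrary order:

```python
import glob


def find_csv_files(pattern):
    """Return the paths matching a glob pattern, sorted for a stable order."""
    # glob.glob returns matches in arbitrary order, so sort them explicitly.
    return sorted(glob.glob(pattern))


# Assumed location of the input files; adjust the pattern to your layout.
csv_files = find_csv_files("./data/sales_*.csv")
print(f"Files to process: {csv_files}")
```

If the pattern matches nothing, csv_files is simply an empty list, so it is worth checking the printed output before loading.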
Step 3: Connect to DuckDB
Create a connection to a DuckDB database. You can connect to an in-memory database (the default) or a persistent file.
    # Connect to an in-memory database
    conn = duckdb.connect()

    # Or, connect to a persistent database file
    # conn = duckdb.connect('my_database.duckdb')
Step 4: Create the Target Table (Optional but Recommended)
While DuckDB can create a table on the fly, it is often good practice to define the table structure first, especially if you know the schema of your CSV files. This ensures data type consistency.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS all_sales_data (
            sale_id INTEGER,
            product_name VARCHAR,
            sale_amount DECIMAL(10,2),
            sale_date DATE
        )
    """)
Step 5: The Magic – Looping and Inserting Data
Now for the main event. We will loop through the list of CSV files we obtained from the glob pattern and insert each file’s data into our DuckDB table. DuckDB makes this incredibly simple with its read_csv_auto function.
    for file_path in csv_files:
        # Use DuckDB's read_csv_auto to read the file and insert it into the table
        conn.execute(f"""
            INSERT INTO all_sales_data
            SELECT * FROM read_csv_auto('{file_path}')
        """)
        print(f"Data from {file_path} loaded successfully.")

    # Commit the transaction
    conn.commit()
Step 6: Verify the Results
Finally, let’s run a query to confirm that all the data has been loaded correctly into our single table.
    result = conn.execute("SELECT COUNT(*) AS total_records FROM all_sales_data").fetchall()
    print(f"Total records in the consolidated table: {result[0][0]}")

    # Preview the data
    preview = conn.execute("SELECT * FROM all_sales_data LIMIT 5").fetchall()
    print("Preview of the data:")
    for row in preview:
        print(row)
Step 7: Don’t Forget to Close the Connection
It is good practice to close the database connection once you are done.
    conn.close()
Complete Example Script
Here is the complete script put together for your reference.
    import duckdb
    import glob

    # Step 1: Find all CSV files
    file_pattern = './data/sales_*.csv'
    csv_files = glob.glob(file_pattern)
    print(f"Files to process: {csv_files}")

    # Step 2: Connect to DuckDB
    conn = duckdb.connect('my_data.duckdb')

    # Step 3: Create the target table
    conn.execute("""
        CREATE TABLE IF NOT EXISTS consolidated_sales (
            sale_id INTEGER,
            product VARCHAR,
            amount DECIMAL(10,2),
            date DATE
        )
    """)

    # Step 4: Load each CSV file into the table
    for file_path in csv_files:
        conn.execute(f"""
            INSERT INTO consolidated_sales
            SELECT * FROM read_csv_auto('{file_path}')
        """)

    # Step 5: Commit and verify
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM consolidated_sales").fetchone()
    print(f"Total rows loaded: {count[0]}")

    # Step 6: Close the connection
    conn.close()
Important Considerations and Best Practices
Handling Different Schemas
read_csv_auto detects each file’s column names and types automatically, but INSERT ... SELECT * matches columns by position. If your CSV files have slightly different schemas (e.g., different column orders or a few extra columns), the loop can therefore fail or put values in the wrong columns. For more control, you can:
- Explicitly list the column names (and cast types where needed) in the SELECT statement.
- Use the union_by_name option, available in newer versions of DuckDB, to match columns by name when several files with different column orders are read in a single read_csv call.
Performance
- For a very large number of files, consider wrapping the insert operations in a single transaction to speed up the process.
- DuckDB is fast, but reading a huge number of very large files sequentially can still take time. Measure load times for your specific use case.
Error Handling
In a production script, you should add try-except blocks to handle potential errors with individual files, such as a file being missing, corrupt, or having an incompatible format.
Summary: This guide demonstrated an efficient method for bulk loading multiple CSV files into a single DuckDB table using Python’s glob module for file discovery and DuckDB’s powerful SQL capabilities. This technique is perfect for data consolidation, ETL pipelines, and analytical workflows, leveraging DuckDB’s high performance for complex queries on the combined dataset.