Distributed Computing
The distributed computing module provides utilities for working with Hadoop Distributed File System (HDFS) and Apache Spark.
Module Overview
Distributed functions package (lazy-loaded).
Contains Spark utilities, HDFS configuration, and HDFS operations. All submodules load on first attribute access via PEP 562 __getattr__.
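For readers unfamiliar with the pattern, below is a minimal sketch of how PEP 562 lazy loading typically works. The submodule names mirror this page, but the body is illustrative, not the package's actual source:

# siege_utilities/distributed/__init__.py -- illustrative sketch only
import importlib

_SUBMODULES = {'spark_utils', 'hdfs_config', 'hdfs_operations'}

def __getattr__(name):
    # PEP 562: called only when normal attribute lookup fails, so each
    # submodule is imported on first access rather than at package import.
    if name in _SUBMODULES:
        module = importlib.import_module(f'.{name}', __name__)
        globals()[name] = module  # cache so __getattr__ is not called again
        return module
    raise AttributeError(f'module {__name__!r} has no attribute {name!r}')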
HDFS Configuration
Abstract HDFS configuration: configurable settings for HDFS operations, with no hard-coded project dependencies.
Functions
- siege_utilities.distributed.hdfs_config.create_cluster_config(data_path, **kwargs)
Create config optimized for cluster deployment
- siege_utilities.distributed.hdfs_config.create_geocoding_config(data_path, **kwargs)
Create config optimized for geocoding workloads
- siege_utilities.distributed.hdfs_config.create_hdfs_config(**kwargs)
Factory function to create HDFS configuration
- siege_utilities.distributed.hdfs_config.create_local_config(data_path, **kwargs)
Create config optimized for local development
Usage Examples
Basic HDFS configuration:
import siege_utilities

# Create local HDFS configuration; data_path is required per the
# documented signature (the path shown is an example placeholder)
local_config = siege_utilities.create_local_config('/tmp/local_data')
print(local_config)

# Create cluster configuration
cluster_config = siege_utilities.create_cluster_config(
    '/data/cluster',  # data_path (example placeholder)
    namenode='hdfs://namenode:8020',
    username='hdfs'
)

# Create geocoding-specific configuration
geo_config = siege_utilities.create_geocoding_config(
    '/data/geocoding',  # data_path (example placeholder)
    api_key='your_api_key',
    rate_limit=1000
)
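The generic create_hdfs_config factory listed above takes only keyword arguments. A hedged sketch follows; the keyword names data_path and replication are assumptions for illustration, not confirmed parameters:

import siege_utilities

# Generic factory: settings passed as keyword overrides.
# The keyword names below are illustrative assumptions.
custom_config = siege_utilities.create_hdfs_config(
    data_path='/data/custom',   # assumed keyword
    replication=2,              # assumed keyword
)
print(custom_config)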
HDFS Operations
Abstract HDFS operations: fully configurable and reusable, with zero hard-coded project dependencies.
Functions
- siege_utilities.distributed.hdfs_operations.create_hdfs_operations(config)
Factory function to create HDFS operations instance
- siege_utilities.distributed.hdfs_operations.setup_distributed_environment(config, data_path=None, dependency_paths=None)
Convenience function to set up distributed environment
Usage Examples
HDFS operations:
import siege_utilities

# Build a configuration first; the factories below require one
config = siege_utilities.create_local_config('/tmp/local_data')  # example path

# Create HDFS operations instance
hdfs_ops = siege_utilities.create_hdfs_operations(config)

# Set up the distributed environment
siege_utilities.setup_distributed_environment(config)

# Check HDFS status (method on the operations instance)
status = hdfs_ops.check_hdfs_status()
print(f"HDFS Status: {status}")

# Get file signature (from the files module)
signature = siege_utilities.get_quick_file_signature('/path/to/file')
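This page does not show get_quick_file_signature's internals; one common approach to a quick signature is to hash file size, modification time, and a small sample of content rather than the whole file. A sketch under that assumption, not necessarily what siege_utilities does:

import hashlib
import os

def quick_file_signature(path, sample_bytes=65536):
    # Illustrative only: combine size, mtime, and the first chunk of the
    # file into one digest, avoiding a full read of large files.
    st = os.stat(path)
    h = hashlib.sha256()
    h.update(str(st.st_size).encode())
    h.update(str(st.st_mtime_ns).encode())
    with open(path, 'rb') as f:
        h.update(f.read(sample_bytes))
    return h.hexdigest()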
Unit Tests
The distributed computing modules have comprehensive test coverage:
✓ test_spark_utils.py - Spark utilities tests pass
Test Coverage:
- HDFS configuration creation
- HDFS operations setup
- Distributed environment configuration
- File signature generation
- HDFS status checking
Test Results: All distributed computing tests pass successfully.
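As a minimal pytest sketch of the kind of coverage listed above (only the factory signatures come from this page; the assertions are deliberately loose because the config and operations types are not documented here):

import siege_utilities

def test_create_local_config(tmp_path):
    # Factory takes a data_path positionally, per the documented signature.
    config = siege_utilities.create_local_config(str(tmp_path))
    assert config is not None

def test_create_hdfs_operations(tmp_path):
    config = siege_utilities.create_local_config(str(tmp_path))
    ops = siege_utilities.create_hdfs_operations(config)
    assert ops is not None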