File Utilities

Logging helpers, file operations, hashing, path management, remote file handling, and shell execution.

siege_utilities.files.configure_shared_logging(log_file_path, level='INFO', max_bytes=5000000, backup_count=5)

Configure all loggers to write to the same shared log file. Intended for distributed jobs in which every worker should log to one file.

Parameters:
  • log_file_path (str) – Path to shared log file

  • level (str) – Log level for the shared file

  • max_bytes (int) – Max file size before rotation

  • backup_count (int) – Number of backup files to keep

Example

>>> # In Spark job - all workers write to same file
>>> configure_shared_logging("/shared/logs/spark_job.log", level="INFO")
>>> log_info("Worker started", logger_name="worker_1")
>>> log_info("Processing data", logger_name="worker_2")

siege_utilities.files.disable_shared_logging()

Disable shared logging configuration. Loggers will revert to individual file handling.

siege_utilities.files.init_logger(name='siege_utilities', log_to_file=False, log_dir='logs', level='INFO', max_bytes=5000000, backup_count=5, shared_log_file=None)

Initialize and configure a named logger.

Parameters:
  • name (str) – Logger name. Each component can have its own logger.

  • log_to_file (bool) – If True, creates an individual log file (unless shared_log_file is specified).

  • log_dir (str) – Directory for individual log files.

  • level (str|int) – Logging level for this logger.

  • max_bytes (int) – Max size for rotating file handler.

  • backup_count (int) – How many backup logs to keep.

  • shared_log_file (str) – If provided, this logger writes to the shared file instead of an individual file.

Returns:

Configured logger instance.

Return type:

logging.Logger

Examples

>>> # Individual loggers with separate files
>>> db_logger = init_logger("database", log_to_file=True, level="DEBUG")
>>> api_logger = init_logger("api", log_to_file=True, level="INFO")
>>> # Multiple loggers sharing one file (great for Spark!)
>>> worker1 = init_logger("worker_1", shared_log_file="spark_workers.log")
>>> worker2 = init_logger("worker_2", shared_log_file="spark_workers.log")
>>> # Use global shared configuration
>>> configure_shared_logging("/shared/logs/app.log")
>>> logger1 = init_logger("component_1")  # Automatically uses shared file
>>> logger2 = init_logger("component_2")  # Automatically uses shared file

siege_utilities.files.get_logger(name=None)

Return a logger instance.

Parameters:

name (str, optional) – Logger name. If None, returns/creates the default logger.

Returns:

Logger instance.

Return type:

logging.Logger

Examples

>>> logger = get_logger()  # Gets default logger
>>> db_logger = get_logger("database")  # Gets database logger
>>> api_logger = get_logger("api")  # Gets API logger

siege_utilities.files.get_all_loggers()

Get all initialized loggers.

Returns:

Dictionary of logger_name -> logger_instance

Return type:

dict

Example

>>> loggers = get_all_loggers()
>>> print(f"Active loggers: {list(loggers.keys())}")

siege_utilities.files.set_default_logger_name(name)

Set the default logger name used by convenience functions.

Parameters:

name (str) – New default logger name

Example

>>> set_default_logger_name("spark_master")
>>> log_info("This will use 'spark_master' logger")

siege_utilities.files.cleanup_logger(name)

Remove a logger and clean up its handlers.

Parameters:

name (str) – Logger name to remove

Returns:

True if logger was removed, False if it didn’t exist

Return type:

bool

siege_utilities.files.cleanup_all_loggers()

Clean up all loggers and their handlers. Useful for testing or application shutdown.

siege_utilities.files.log_debug(message, logger_name=None)

Log a debug message.

Parameters:
  • message – Message to log

  • logger_name (str, optional) – Specific logger to use. Uses default if None.

Examples

>>> log_debug("Debug information")
>>> log_debug("Database query details", logger_name="database")

siege_utilities.files.log_info(message: str, logger_name=None) → None

Log an info message.

Parameters:
  • message (str) – Message to log

  • logger_name (str, optional) – Specific logger to use. Uses default if None.

Examples

>>> log_info("Application started")
>>> log_info("Worker processing task", logger_name="worker_1")

siege_utilities.files.log_warning(message, logger_name=None)

Log a warning message.

Parameters:
  • message – Message to log

  • logger_name (str, optional) – Specific logger to use. Uses default if None.

Examples

>>> log_warning("This is a warning")
>>> log_warning("Cache miss detected", logger_name="cache")

siege_utilities.files.log_error(message, logger_name=None)

Log an error message.

Parameters:
  • message – Message to log

  • logger_name (str, optional) – Specific logger to use. Uses default if None.

Examples

>>> log_error("An error occurred")
>>> log_error("Database connection failed", logger_name="database")

siege_utilities.files.log_critical(message, logger_name=None)

Log a critical message.

Parameters:
  • message – Message to log

  • logger_name (str, optional) – Specific logger to use. Uses default if None.

Examples

>>> log_critical("Critical system error")
>>> log_critical("Spark cluster failure", logger_name="spark_master")

siege_utilities.files.parse_log_level(level)

Convert a string or numeric level into a logging level constant.
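
A conversion of this kind can be sketched with the standard `logging` module. `to_logging_level` below is a hypothetical stand-in, not the library's code; how `parse_log_level` treats unknown names is not documented here, so raising `ValueError` is an assumption.

```python
import logging


def to_logging_level(level):
    """Hypothetical stand-in: map 'debug', 'INFO', 10, '20', ... to an int level."""
    if isinstance(level, int):
        return level
    text = str(level).strip()
    if text.isdigit():
        # Numeric strings such as "20" are taken as the level number itself.
        return int(text)
    value = getattr(logging, text.upper(), None)
    if not isinstance(value, int):
        raise ValueError(f"Unknown log level: {level!r}")
    return value
```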

Hashing

Hash management functions. Provides standardized file-hashing helpers.

siege_utilities.files.hashing.generate_sha256_hash_for_file(file_path) → str | None

Generate the SHA256 hash of a file, reading it in chunks so that large files are not loaded into memory at once.

Parameters:

file_path – Path to the file (str or Path object)

Returns:

SHA256 hash as hexadecimal string, or None if error
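
Chunked hashing is straightforward with `hashlib`. The following is a minimal sketch of the technique rather than the library's source; the name `sha256_of_file` and the 64 KiB chunk size are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(file_path, chunk_size=65536):
    """Hypothetical sketch: hex SHA256 of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(Path(file_path), "rb") as handle:
            # Read fixed-size chunks so large files never sit fully in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    except OSError:
        return None
    return digest.hexdigest()
```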

siege_utilities.files.hashing.get_file_hash(file_path, algorithm='sha256') → str | None

Generate a hash for a file using the specified algorithm.

Parameters:
  • file_path – Path to the file (str or Path object)

  • algorithm – Hash algorithm to use (‘sha256’, ‘md5’, ‘sha1’, etc.)

Returns:

Hash as hexadecimal string, or None if error

siege_utilities.files.hashing.calculate_file_hash(file_path) → str | None

Alias for get_file_hash with SHA256, kept for backward compatibility.

siege_utilities.files.hashing.get_quick_file_signature(file_path) → str

Generate a quick file signature from file stats plus a partial hash. Faster than a full hash for change detection, but not cryptographically secure.

Parameters:

file_path – Path to the file

Returns:

Quick signature string
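
The idea of combining file stats with a partial hash might look like the sketch below. Which stats and how many bytes the library actually uses are not stated, so the size, modification time, and first 4096 bytes chosen here are assumptions, as is the name `quick_signature`.

```python
import hashlib
import os


def quick_signature(file_path, sample_bytes=4096):
    """Hypothetical sketch: size + mtime + hash of the first bytes of the file."""
    stats = os.stat(file_path)
    with open(file_path, "rb") as handle:
        head = handle.read(sample_bytes)  # only the start of the file is hashed
    partial = hashlib.sha256(head).hexdigest()[:16]
    return f"{stats.st_size}-{stats.st_mtime_ns}-{partial}"
```

Because only the head of the file is hashed, two files that differ later on but share size and modification time would collide, which is why such a signature suits change detection and not integrity checks.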

siege_utilities.files.hashing.verify_file_integrity(file_path, expected_hash, algorithm='sha256') → bool

Verify file integrity by comparing the file's hash with an expected hash.

Parameters:
  • file_path – Path to the file

  • expected_hash – Expected hash value

  • algorithm – Hash algorithm used

Returns:

True if file matches expected hash, False otherwise
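
An integrity check of this shape can be sketched with `hashlib.new`, which selects the algorithm by name. `file_matches_hash` is a hypothetical name, and normalising the expected hash to lowercase before comparing is an assumption, not documented behaviour.

```python
import hashlib
import hmac


def file_matches_hash(file_path, expected_hash, algorithm="sha256"):
    """Hypothetical sketch: True if the file's hex digest equals expected_hash."""
    digest = hashlib.new(algorithm)  # raises ValueError for unknown algorithms
    try:
        with open(file_path, "rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    except OSError:
        return False
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(digest.hexdigest(), expected_hash.strip().lower())
```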

siege_utilities.files.hashing.test_hash_functions()

Test the hash functions with a temporary file

Operations

siege_utilities.files.operations.rmtree(f: Path)

Utility function: rmtree.

Part of the Siege Utilities Utilities module. Auto-discovered and available at package level.

Returns:

Not documented.

Example

>>> import siege_utilities
>>> from pathlib import Path
>>> result = siege_utilities.rmtree(Path("path/to/remove"))  # illustrative path
>>> print(result)

Note

This function is auto-discovered and available without imports across all siege_utilities modules.

siege_utilities.files.operations.check_if_file_exists_at_path(target_file_path: Path) → bool

Parameters:

target_file_path – Path to check for an existing file

Returns:

True if file exists, False otherwise

siege_utilities.files.operations.delete_existing_file_and_replace_it_with_an_empty_file(target_file_path: Path) → Path

Delete the existing file and replace it with an empty file.

Parameters:

target_file_path – pathlib.Path object to interact with

Returns:

pathlib.Path object to interact with

siege_utilities.files.operations.count_total_rows_in_file_pythonically(target_file_path: Path) → int

Parameters:

target_file_path – pathlib.Path of the file whose rows are counted

Returns:

count of total rows in file

siege_utilities.files.operations.count_empty_rows_in_file_pythonically(target_file_path: Path) → int

Parameters:

target_file_path – pathlib.Path of the file whose empty rows are counted

Returns:

count of empty rows in file
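
Counting total and empty rows in pure Python can be sketched as a single streaming pass. `count_rows` is a hypothetical helper that returns both counts at once, and treating whitespace-only lines as empty is an assumption; the library functions may define "empty" differently.

```python
from pathlib import Path


def count_rows(target_file_path: Path):
    """Hypothetical sketch: return (total_rows, empty_rows) for a text file."""
    total = empty = 0
    with open(target_file_path, "r", encoding="utf-8") as handle:
        for line in handle:  # streams the file one line at a time
            total += 1
            if not line.strip():
                empty += 1
    return total, empty
```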

siege_utilities.files.operations.count_duplicate_rows_in_file_using_awk(target_file_path: Path) → int

Count duplicate rows in a file using an awk pattern from Justin Hernandez.

Parameters:

target_file_path – pathlib.Path of the file whose duplicate rows are counted

Returns:

count of duplicate rows in file
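
For comparison, duplicate counting can also be sketched in Python with `collections.Counter` instead of awk. This swaps in a different technique from the one the function uses, and the definition of "duplicate" (every occurrence of a row after its first) is an assumption; the awk pattern may count differently.

```python
from collections import Counter
from pathlib import Path


def count_duplicate_rows(target_file_path: Path) -> int:
    """Hypothetical sketch: number of rows that repeat an earlier row."""
    with open(target_file_path, "r", encoding="utf-8") as handle:
        counts = Counter(line.rstrip("\n") for line in handle)
    # Every occurrence after the first counts as one duplicate.
    return sum(n - 1 for n in counts.values())
```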

siege_utilities.files.operations.count_total_rows_in_file_using_sed(target_file_path: Path) → int

Parameters:

target_file_path – pathlib.Path of the file whose rows are counted

Returns:

count of total rows in file

siege_utilities.files.operations.count_empty_rows_in_file_using_awk(target_file_path: Path) → int

Parameters:

target_file_path – pathlib.Path of the file whose empty rows are counted

Returns:

count of empty rows in file

siege_utilities.files.operations.remove_empty_rows_in_file_using_sed(target_file_path: Path, fixed_file_path: Path = None)

Parameters:
  • target_file_path – pathlib.Path of the file to remove empty rows from

  • fixed_file_path – pathlib.Path where the fixed file is saved

Returns:

siege_utilities.files.operations.write_data_to_a_new_empty_file(target_file_path: Path, data: str) → Path

Parameters:
  • target_file_path – file path to write data to

  • data – what to write

Returns:

the path to the file

siege_utilities.files.operations.write_data_to_an_existing_file(target_file_path: Path, data: str) → Path

Parameters:
  • target_file_path – file path to write data to

  • data – what to write

Returns:

the path to the file

siege_utilities.files.operations.check_for_file_type_in_directory(target_file_path: Path, file_type: str) → bool

Parameters:
  • target_file_path

  • file_type

Returns:

bool

Paths

siege_utilities.files.paths.ensure_path_exists(desired_path: Path) → Path

Perform file operations: ensure path exists.

Part of the Siege Utilities File Operations module. Auto-discovered and available at package level.

Returns:

Not documented.

Example

>>> import siege_utilities
>>> from pathlib import Path
>>> result = siege_utilities.ensure_path_exists(Path("data/output"))  # illustrative path
>>> print(result)

Note

This function is auto-discovered and available without imports across all siege_utilities modules.
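
Ensuring that a directory path exists is commonly done with `Path.mkdir`. The sketch below shows that standard-library idiom under the assumption that `desired_path` names a directory; `ensure_dir` is a hypothetical name and not the library's implementation.

```python
from pathlib import Path


def ensure_dir(desired_path: Path) -> Path:
    """Hypothetical sketch: create the directory (and parents) if missing."""
    path = Path(desired_path)
    # parents=True creates intermediate directories; exist_ok=True makes the
    # call a no-op when the directory is already there.
    path.mkdir(parents=True, exist_ok=True)
    return path
```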

siege_utilities.files.paths.unzip_file_to_its_own_directory(path_to_zipfile: Path, new_dir_name=None, new_dir_parent=None)

Perform file operations: unzip file to its own directory.

Part of the Siege Utilities File Operations module. Auto-discovered and available at package level.

Returns:

Not documented.

Example

>>> import siege_utilities
>>> from pathlib import Path
>>> result = siege_utilities.unzip_file_to_its_own_directory(Path("archive.zip"))  # illustrative path
>>> print(result)

Note

This function is auto-discovered and available without imports across all siege_utilities modules.
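
Extracting an archive into a directory of its own can be sketched with `zipfile`. The defaults used here (directory named after the archive's stem, placed beside the archive) are assumptions about what `new_dir_name=None` and `new_dir_parent=None` mean, and `unzip_to_own_dir` is a hypothetical name.

```python
import zipfile
from pathlib import Path


def unzip_to_own_dir(path_to_zipfile, new_dir_name=None, new_dir_parent=None) -> Path:
    """Hypothetical sketch: extract a zip into its own directory and return it."""
    zip_path = Path(path_to_zipfile)
    # Assumed defaults: sibling directory named after the archive (minus .zip).
    parent = Path(new_dir_parent) if new_dir_parent else zip_path.parent
    target = parent / (new_dir_name or zip_path.stem)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```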

Remote

siege_utilities.files.remote.download_file(url, local_filename)

Download a file from a URL to a local file with progress bar

Parameters:
  • url – The URL to download from

  • local_filename – The local path where the file should be saved

Returns:

The local filename if successful, False otherwise

siege_utilities.files.remote.generate_local_path_from_url(url: str, directory_path: Path, as_string: bool = True)

Perform file operations: generate local path from url.

Part of the Siege Utilities File Operations module. Auto-discovered and available at package level.

Returns:

Not documented.

Example

>>> import siege_utilities
>>> from pathlib import Path
>>> result = siege_utilities.generate_local_path_from_url("https://example.com/data.csv", Path("downloads"))  # illustrative values
>>> print(result)

Note

This function is auto-discovered and available without imports across all siege_utilities modules.
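
One plausible way to derive a local path from a URL is to take the last component of the URL path and join it to the target directory. The sketch below assumes that behaviour, which the documentation does not confirm; `local_path_from_url` is a hypothetical name.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def local_path_from_url(url: str, directory_path: Path, as_string: bool = True):
    """Hypothetical sketch: directory_path joined with the URL's file name."""
    # urlparse drops the query string and fragment from the path component.
    filename = Path(unquote(urlparse(url).path)).name
    local_path = Path(directory_path) / filename
    return str(local_path) if as_string else local_path
```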

Shell

siege_utilities.files.shell.run_subprocess(command_list)

Run a shell command as a subprocess and handle the output.

Parameters:

command_list – The command to run, as a list or string

Returns:

The command output (stdout if successful, stderr if failed)
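
The described contract (stdout on success, stderr on failure) can be sketched with `subprocess.run`. `run_command` is a hypothetical name, and splitting string commands with `shlex.split` instead of invoking a shell is an assumption about how string input is handled.

```python
import shlex
import subprocess


def run_command(command_list):
    """Hypothetical sketch: return stdout if the command succeeds, else stderr."""
    if isinstance(command_list, str):
        # Assumed handling of string input: tokenize without starting a shell.
        command_list = shlex.split(command_list)
    result = subprocess.run(command_list, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else result.stderr
```

Returning both streams through one value means the caller cannot tell success from failure by the return value alone, so code that needs the exit status would have to check it separately.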