DVC Integration
OSeMOSYS-RDM uses DVC (Data Version Control) for reproducible pipeline automation, dependency tracking, and data versioning.
Why DVC?
DVC provides:
Automatic dependency tracking: Re-runs only what changed
Caching: Saves time by skipping unchanged stages
Reproducibility: Ensures consistent results across machines
Data versioning: Track large files without Git bloat
Remote storage: Share data across teams
Main Pipelines
OSeMOSYS-RDM provides three main pipeline commands for easy execution:
python run.py rdm
Executes the complete RDM (Robust Decision Making) pipeline:
Base future generation (Future 0)
RDM experiment with uncertainty sampling
Postprocessing and result consolidation
python run.py prim
Executes only the PRIM (Patient Rule Induction Method) analysis:
Requires RDM results to be available
Performs scenario discovery
Generates predominant parameter ranges
python run.py all
Executes both pipelines sequentially:
First runs the complete RDM pipeline
Then runs the PRIM analysis
Provides end-to-end results from modeling to scenario discovery
Pipeline Structure
The pipeline is defined in dvc.yaml:
stages:
base_future:
cmd: python scripts/run_base_future.py
deps:
- src/workflow/0_Scenarios/
- src/Interface_RDM.xlsx
outs:
- src/workflow/1_Experiment/Executables/
rdm_experiment:
cmd: python scripts/run_rdm_experiment.py
deps:
- src/Interface_RDM.xlsx
- src/workflow/1_Experiment/0_From_Confection/B1_Model_Structure.xlsx
outs:
- src/workflow/1_Experiment/Experimental_Platform/
- src/Results/ # Energy_Input.csv is generated here
postprocess:
cmd: python scripts/run_postprocess.py
deps:
- src/workflow/1_Experiment/Experimental_Platform/
- src/workflow/1_Experiment/Executables/
- src/Results/ # Reads Energy_Input.csv from rdm_experiment
prim_files_creator:
cmd: python scripts/run_prim_files_creator.py
deps:
- src/Results/
outs:
- src/workflow/4_PRIM/t3b_sdiscovery/experiment_data/
prim_analysis:
cmd: python scripts/run_prim_analysis.py
deps:
- src/workflow/4_PRIM/t3b_sdiscovery/experiment_data/
outs:
- src/workflow/4_PRIM/t3b_sdiscovery/sd_ana_*.csv
- src/workflow/4_PRIM/t3b_sdiscovery/t3f4_predominant_ranges_*.xlsx
Visualizing the DAG
View the pipeline structure:
conda run -n AFR-RDM-env dvc dag
Output:
+-------------+
| base_future |
+-------------+
*
*
*
+----------------+
| rdm_experiment |
+----------------+
*
*
*
+-------------+
| postprocess |
+-------------+
*
*
*
+--------------------+
| prim_files_creator |
+--------------------+
*
*
*
+---------------+
| prim_analysis |
+---------------+
The dvc.lock File
dvc.lock records the exact state of each pipeline run:
schema: '2.0'
stages:
base_future:
cmd: python scripts/run_base_future.py
deps:
- path: src/Interface_RDM.xlsx
hash: md5
md5: 0b1b34692ccd178946420e941782389f
size: 30436
outs:
- path: src/workflow/1_Experiment/Executables/
hash: md5
md5: f3064203b41fc19bcde70c161515d026.dir
size: 17887476
nfiles: 3
Important: Commit dvc.lock to Git for full reproducibility.
Ignoring Files
The .dvcignore file specifies what DVC should ignore:
# Temporary solver files
*.lp
*.log
*.sol
# Python cache
__pycache__/
*.py[cod]
# Virtual environments
venv/
env/
# Large dashboard files
*.twbx
Workflow with Git + DVC
Typical Development Cycle
# 1. Make changes to code or configuration
vim src/Interface_RDM.xlsx
# 2. Run pipeline
python run.py rdm
# 3. Check what changed
dvc status
git status
# 4. Commit to Git
git add dvc.yaml dvc.lock src/Interface_RDM.xlsx
git commit -m "Updated uncertainty ranges"
# 5. Push data to DVC remote
dvc push
# 6. Push code to Git
git push
Reproducing on Another Machine
# 1. Clone repository
git clone https://github.com/yourrepo/osemosys-rdm.git
cd osemosys-rdm
# 2. Pull data from DVC remote
dvc pull
# 3. (Optional) Re-run pipeline to verify
python run.py rdm
Running Without Git
OSeMOSYS-RDM can run on machines without Git installed:
# Download repository as ZIP and extract
# Navigate to directory
# DVC will initialize in standalone mode
python run.py rdm
DVC will create .dvc/ with --no-scm flag automatically.
Limitations without Git:
No code version history
Cannot push to Git remotes
Pipeline still works normally
Metrics Tracking
Each stage generates metrics files:
Note
The rdm_experiment stage now generates the consolidated Energy_Input.csv in src/Results/,
while the postprocess stage generates Energy_Output.csv. This means src/Results/ is an
output of rdm_experiment and a dependency of postprocess.
Stage |
Metrics File |
|---|---|
base_future |
|
rdm_experiment |
|
postprocess |
|
prim_files_creator |
|
prim_analysis |
|
Example metrics:
{
"stage": "rdm_experiment",
"timestamp": "2025-01-08 10:30:00",
"futures_generated": 100,
"total_parquet_files": 200,
"scenarios_processed": 1
}
Best Practices
1. Commit dvc.lock
Always commit dvc.lock to track exact pipeline state.
2. Use Remotes for Large Files
Don’t store GB-sized files in Git. Use DVC remotes.
3. Document Parameter Changes
Use descriptive commit messages:
git commit -m "RDM: Increased futures to 200, added fuel uncertainty"
4. Regular Pushes
Push to DVC remote after successful runs:
dvc push
git push
5. Clean Up Cache
Periodically clean old cached data:
dvc gc -w # Remove unused cache (keep workspace)