Curator
synapseclient.extensions.curator
Synapse Curator Extensions
This module provides library functions for metadata curation tasks in Synapse.
Functions
create_file_based_metadata_task
create_file_based_metadata_task(folder_id: str, curation_task_name: str, instructions: str, attach_wiki: bool = False, entity_view_name: str = 'JSON Schema view', schema_uri: Optional[str] = None, enable_derived_annotations: bool = False, *, synapse_client: Optional[Synapse] = None) -> Tuple[str, str]
Create a file view for a schema-bound folder using schematic.
Creating a file-based metadata curation task with schema binding
In this example, we create an EntityView and CurationTask for file-based metadata curation. If a schema_uri is provided, it will be bound to the folder.
import synapseclient
from synapseclient.extensions.curator import create_file_based_metadata_task

syn = synapseclient.Synapse()
syn.login()

entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn12345678",
    curation_task_name="BiospecimenMetadataTemplate",
    instructions="Please curate this metadata according to the schema requirements",
    attach_wiki=False,
    entity_view_name="Biospecimen Metadata View",
    schema_uri="sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"
)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| folder_id | str | The Synapse Folder ID to create the file view for. |
| curation_task_name | str | Name for the CurationTask (used as data_type field). Must be unique within the project, otherwise if it matches an existing CurationTask, that task will be updated with new data. |
| instructions | str | Instructions for the curation task. |
| attach_wiki | bool | Whether or not to attach a Synapse Wiki (default: False). |
| entity_view_name | str | Name for the created entity view (default: "JSON Schema view"). |
| schema_uri | Optional[str] | Optional JSON schema URI to bind to the folder. If provided, the schema will be bound to the folder before creating the entity view. (e.g., 'sage.schemas.v2571-amp.Biospecimen.schema-0.0.1') |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |
| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, str] | A tuple containing the Synapse ID of the entity view created and the task ID of the curation task created. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If required parameters are missing. |
| SynapseError | If there are issues with Synapse operations. |
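Since missing parameters surface as a ValueError, it can help to check the obvious inputs before making any network call. The following is a minimal sketch, not part of the library: `check_file_task_args` and `try_create_task` are hypothetical helpers, and the only fact assumed about Synapse is that IDs have the form "syn" followed by digits. The creation function is passed in as an argument so the sketch runs without a Synapse login.

```python
import re
from typing import Callable, Optional, Tuple

# Synapse IDs are the prefix "syn" followed by digits, e.g. "syn12345678".
_SYN_ID = re.compile(r"^syn\d+$")


def check_file_task_args(folder_id: str, curation_task_name: str, instructions: str) -> None:
    """Hypothetical pre-flight check: raise ValueError on obviously bad input."""
    if not _SYN_ID.match(folder_id or ""):
        raise ValueError(f"folder_id does not look like a Synapse ID: {folder_id!r}")
    if not (curation_task_name or "").strip():
        raise ValueError("curation_task_name must be a non-empty string")
    if not (instructions or "").strip():
        raise ValueError("instructions must be a non-empty string")


def try_create_task(
    create_fn: Callable[..., Tuple[str, str]], **kwargs
) -> Optional[Tuple[str, str]]:
    """Run the checks, then call create_fn(**kwargs); return None on ValueError."""
    try:
        check_file_task_args(
            kwargs.get("folder_id", ""),
            kwargs.get("curation_task_name", ""),
            kwargs.get("instructions", ""),
        )
        return create_fn(**kwargs)
    except ValueError as err:
        print(f"Task not created: {err}")
        return None
```

In real use, `create_fn` would be `create_file_based_metadata_task`; a SynapseError raised by the call itself is deliberately left to propagate.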
Source code in synapseclient/extensions/curator/file_based_metadata_task.py
create_record_based_metadata_task
create_record_based_metadata_task(project_id: str, folder_id: str, record_set_name: str, record_set_description: str, curation_task_name: str, upsert_keys: List[str], instructions: str, schema_uri: str, bind_schema_to_record_set: bool = True, enable_derived_annotations: bool = False, *, synapse_client: Optional[Synapse] = None) -> Tuple[RecordSet, CurationTask, Grid]
Generate and upload CSV templates as a RecordSet for record-based metadata, create a CurationTask, and also create a Grid to bootstrap the ValidationStatistics.
A number of schema URIs that are already registered to Synapse can be found at:
If you have yet to create and register your JSON schema in Synapse, please refer to the tutorial at https://python-docs.synapse.org/en/stable/tutorials/python/json_schema/.
Creating a record-based metadata curation task with a schema URI
In this example, we create a RecordSet and CurationTask for biospecimen metadata
curation using a schema URI. By default this will also bind the schema to the
RecordSet, however the bind_schema_to_record_set parameter can be set to
False to skip that step.
import synapseclient
from synapseclient.extensions.curator import create_record_based_metadata_task

syn = synapseclient.Synapse()
syn.login()

record_set, task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn12345678",
    folder_id="syn87654321",
    record_set_name="BiospecimenMetadata_RecordSet",
    record_set_description="RecordSet for biospecimen metadata curation",
    curation_task_name="BiospecimenMetadataTemplate",
    upsert_keys=["specimenID"],
    instructions="Please curate this metadata according to the schema requirements",
    schema_uri="schema-org-schema.name.schema-v1.0.0"
)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| project_id | str | The Synapse ID of the project where the folder exists. |
| folder_id | str | The Synapse ID of the folder to upload to. |
| record_set_name | str | Name for the RecordSet. |
| record_set_description | str | Description for the RecordSet. |
| curation_task_name | str | Name for the CurationTask (used as data_type field). Must be unique within the project, otherwise if it matches an existing CurationTask, that task will be updated with new data. |
| upsert_keys | List[str] | List of column names to use as upsert keys. |
| instructions | str | Instructions for the curation task. |
| schema_uri | str | JSON schema URI for the RecordSet schema. (e.g., 'sage.schemas.v2571-amp.Biospecimen.schema-0.0.1', 'sage.schemas.v2571-ad.Analysis.schema-0.0.0') |
| bind_schema_to_record_set | bool | Whether to bind the given schema to the RecordSet (default: True). |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |
| RETURNS | DESCRIPTION |
|---|---|
| Tuple[RecordSet, CurationTask, Grid] | Tuple containing the created RecordSet, CurationTask, and Grid objects. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If required parameters are missing or if schema_uri is not provided. |
| SynapseError | If there are issues with Synapse operations. |
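The upsert_keys parameter names the columns that identify a record. How Synapse applies these keys server-side is not described here, so the following is only a sketch of the general upsert idea under that assumption: a row whose key columns match an existing row updates it, and any other row is inserted. The `upsert_records` helper is hypothetical and works on plain dictionaries.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def upsert_records(
    existing: List[Record], incoming: List[Record], upsert_keys: List[str]
) -> List[Record]:
    """Merge incoming rows into existing rows, matching on the upsert key columns."""

    def key_of(row: Record) -> Tuple[str, ...]:
        return tuple(row[k] for k in upsert_keys)

    # Index existing rows by their key; dicts keep insertion order.
    merged: Dict[Tuple[str, ...], Record] = {key_of(r): dict(r) for r in existing}
    for row in incoming:
        # Same key: update the stored row. New key: insert a new row.
        merged.setdefault(key_of(row), {}).update(row)
    return list(merged.values())
```

With `upsert_keys=["specimenID"]`, as in the example above, two rows sharing a specimenID are treated as the same record.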
Source code in synapseclient/extensions/curator/record_based_metadata_task.py
generate_jsonld
generate_jsonld(schema: Any, data_model_labels: DisplayLabelType, output_jsonld: Optional[str], *, synapse_client: Optional[Synapse] = None) -> dict
Convert a CSV data model specification to JSON-LD format with validation and error checking.
This function parses your CSV data model (containing attributes, validation rules,
dependencies, and valid values), converts it to a graph-based JSON-LD representation,
validates the structure for common errors, and saves the result. The generated JSON-LD
file serves as input for generate_jsonschema() and other data model operations.
Data Model Requirements:
Your CSV should include columns defining:
- Attribute names: Property/attribute identifiers
- Display names: Human-readable labels (optional but recommended)
- Descriptions: Documentation for each attribute
- Valid values: Allowed enum values for attributes (comma-separated)
- Validation rules: Rules like list, regex, inRange, required, etc.
- Dependencies: Relationships between attributes using dependsOn
- Required status: Whether attributes are mandatory
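To make the requirements above concrete, here is a small sketch that builds a three-row data model CSV in memory. The column headers (`Attribute`, `Description`, `Valid Values`, `DependsOn`, `Required`, `Validation Rules`) and the rule strings are assumptions chosen for illustration; check your own data model template for the exact names it expects.

```python
import csv
import io
from typing import Dict, List

# Assumed column headers; the real template may name these differently.
COLUMNS = ["Attribute", "Description", "Valid Values", "DependsOn", "Required", "Validation Rules"]

ROWS: List[Dict[str, str]] = [
    {
        "Attribute": "Biospecimen",
        "Description": "A collected sample",
        "Valid Values": "",
        "DependsOn": "specimenID, tissue",
        "Required": "False",
        "Validation Rules": "",
    },
    {
        "Attribute": "specimenID",
        "Description": "Unique identifier of the specimen",
        "Valid Values": "",
        "DependsOn": "",
        "Required": "True",
        "Validation Rules": "",
    },
    {
        "Attribute": "tissue",
        "Description": "Tissue of origin",
        "Valid Values": "blood, brain, liver",
        "DependsOn": "",
        "Required": "True",
        "Validation Rules": "",
    },
]


def render_data_model_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows as CSV text; the csv module quotes comma-separated cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = render_data_model_csv(ROWS)
```

Writing `csv_text` to a file would give a path suitable for the schema argument, provided the headers match what the parser expects.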
Validation Checks Performed:
- Ensures all required fields (like displayName) are present
- Detects cycles in attribute dependencies (which would create invalid schemas)
- Checks for blacklisted characters in display names that Synapse doesn't allow
- Validates that attribute names don't conflict with reserved system names
- Verifies the graph structure is a valid directed acyclic graph (DAG)
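The cycle check in the list above can be illustrated with a depth-first search over the dependency graph. This is a generic sketch of the technique, not the library's implementation; `find_dependency_cycle` and the dictionary representation of dependencies are assumptions made for the example.

```python
from typing import Dict, List, Optional


def find_dependency_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of attribute names, or None if the graph is a DAG."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in depends_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for attribute in depends_on:
        if color.get(attribute, WHITE) == WHITE:
            cycle = visit(attribute)
            if cycle:
                return cycle
    return None
```

A result of None means the dependencies form a directed acyclic graph; any other result names the attributes that depend on each other in a loop.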
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| schema | Any | Path to your data model CSV file. This file should contain your complete data model specification with all attributes, validation rules, and relationships. |
| data_model_labels | DisplayLabelType | Label format for the JSON-LD output: |
| output_jsonld | Optional[str] | Path where the JSON-LD file will be saved. If None, saves alongside the input CSV with a |
| synapse_client | Optional[Synapse] | Optional Synapse client instance for logging. If None, creates a new client instance. Use |
Output:
The function logs validation errors and warnings to help you fix data model issues before generating JSON schemas. Errors indicate critical problems that must be fixed, while warnings suggest improvements but won't block schema generation.
| RETURNS | DESCRIPTION |
|---|---|
| dict | The generated data model as a dictionary in JSON-LD format. The same data is also saved to the file path specified in |
Using this function to generate JSONLD Schema files:
Basic usage with default output path:
from synapseclient import Synapse
from synapseclient.extensions.curator import generate_jsonld

syn = Synapse()
syn.login()

jsonld_model = generate_jsonld(
    schema="path/to/my_data_model.csv",
    data_model_labels="class_label",
    output_jsonld=None,  # Saves to my_data_model.jsonld
    synapse_client=syn
)
Specify custom output path:
jsonld_model = generate_jsonld(
    schema="models/patient_model.csv",
    data_model_labels="class_label",
    output_jsonld="~/output/patient_model_v1.jsonld",
    synapse_client=syn
)
Use display labels:
jsonld_model = generate_jsonld(
    schema="my_model.csv",
    data_model_labels="display_label",
    output_jsonld="my_model.jsonld",
    synapse_client=syn
)
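The first example's comment suggests that, when output_jsonld is None, the file lands next to the input CSV with a .jsonld extension. The sketch below shows one way to compute that path yourself so you know where to look afterwards. `resolve_jsonld_path` is a hypothetical helper, and whether the library expands "~" in a custom path is not stated here, so the expansion is done explicitly.

```python
from pathlib import Path
from typing import Optional


def resolve_jsonld_path(schema: str, output_jsonld: Optional[str]) -> Path:
    """Guess where the JSON-LD file will be written (assumed behavior, see above)."""
    if output_jsonld is None:
        # Same directory and stem as the CSV, with the extension swapped.
        return Path(schema).with_suffix(".jsonld")
    # Expand "~" ourselves rather than rely on the callee doing it.
    return Path(output_jsonld).expanduser()
```

Passing `str(resolve_jsonld_path(...))` as output_jsonld removes the guesswork, since the path is then explicit.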
Source code in synapseclient/extensions/curator/schema_generation.py
generate_jsonschema
generate_jsonschema(data_model_source: str, output_directory: str, data_type: Optional[list[str]], data_model_labels: DisplayLabelType, synapse_client: Synapse) -> tuple[list[dict[str, Any]], list[str]]
Generate JSON Schema validation files from a data model with validation rules.
This function creates JSON Schema files that enforce validation rules defined in your CSV data model. The generated schemas can validate manifests for required fields, data types, valid values (enums), ranges, regex patterns, conditional dependencies, and more.
Validation Rules Supported:
- Type validation: Enforces string, number, integer, or boolean types
- Valid values: Creates enum constraints from valid values in the data model
- Required fields: Marks attributes as required (can be component-specific)
- Range validation: Translates inRange rules to min/max constraints
- Pattern matching: Converts regex rules to JSON Schema patterns
- Format validation: Applies date (ISO date) and url (URI) format constraints
- Array validation: Handles list rules for array-type properties
- Conditional dependencies: Creates if/then schemas for dependent attributes
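To show what the rules listed above might look like once translated, here is a rough sketch that maps a few rule strings to JSON Schema fragments. The rule string syntax (for example `inRange 0 120`) and both helper names are assumptions for illustration; only the JSON Schema keywords themselves (`minimum`, `maximum`, `pattern`, `format`, `if`/`then`) are standard.

```python
from typing import Any, Dict


def rule_to_schema_fragment(rule: str) -> Dict[str, Any]:
    """Translate one rule string into a JSON Schema fragment (illustrative only)."""
    name, *args = rule.split()
    if name == "inRange":
        low, high = (float(a) for a in args[:2])
        return {"type": "number", "minimum": low, "maximum": high}
    if name == "regex":
        # Assumes the pattern is the last whitespace-separated token.
        return {"type": "string", "pattern": args[-1]}
    if name == "date":
        return {"type": "string", "format": "date"}
    if name == "url":
        return {"type": "string", "format": "uri"}
    if name == "list":
        return {"type": "array"}
    raise ValueError(f"Unsupported rule: {rule!r}")


def conditional_requirement(trigger: str, value: str, dependent: str) -> Dict[str, Any]:
    """Require `dependent` only when `trigger` equals `value`, using if/then."""
    return {
        "if": {"properties": {trigger: {"const": value}}, "required": [trigger]},
        "then": {"required": [dependent]},
    }
```

For instance, `conditional_requirement("diagnosis", "cancer", "tumorType")` yields a fragment that makes tumorType mandatory only for records whose diagnosis is "cancer"; the attribute names are made up.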
Component-Based Rules:
Rules can be applied selectively to specific components using the #Component syntax
in your validation rules. This allows different validation behavior per manifest type.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| data_model_source | str | Path to the data model file (CSV or JSONLD) or URL to the raw JSONLD. Can accept: |
| output_directory | str | Directory path where JSON Schema files will be saved. Each component will generate a separate |
| data_type | Optional[list[str]] | List of specific component names (data types) to generate schemas for. If None, generates schemas for all components in the data model. |
| data_model_labels | DisplayLabelType | Label format for properties in the generated schema: |
| synapse_client | Synapse | Synapse client instance for logging. Use |
| RETURNS | DESCRIPTION |
|---|---|
| tuple[list[dict[str, Any]], list[str]] | A tuple containing a list of JSON schema dictionaries, each corresponding to a component, and a list of file paths where the schemas were written. |
Using this function to generate JSON Schema files:
Generate schemas from a CSV data model:
from synapseclient import Synapse
from synapseclient.extensions.curator import generate_jsonschema

syn = Synapse()
syn.login()

schemas, file_paths = generate_jsonschema(
    data_model_source="path/to/model.csv",
    output_directory="./schemas",
    data_type=None,  # All components
    data_model_labels="class_label",
    synapse_client=syn
)
Generate schemas from a JSONLD data model:
schemas, file_paths = generate_jsonschema(
    data_model_source="path/to/model.jsonld",
    output_directory="./schemas",
    data_type=None,  # All components
    data_model_labels="class_label",
    synapse_client=syn
)
Generate schema for specific components:
schemas, file_paths = generate_jsonschema(
    data_model_source="https://example.com/model.jsonld",
    output_directory="./validation_schemas",
    data_type=["Patient", "Biospecimen"],
    data_model_labels="class_label",
    synapse_client=syn
)
Source code in synapseclient/extensions/curator/schema_generation.py
query_schema_registry
query_schema_registry(synapse_client: Optional[Synapse] = None, schema_registry_table_id: Optional[str] = None, column_config: Optional[SchemaRegistryColumnConfig] = None, return_latest_only: bool = True, **filters) -> Union[str, List[str], None]
Query the schema registry table to find schemas matching the provided filters.
This function searches the Synapse schema registry table for schemas that match the provided filter parameters. Results are sorted by version in descending order (newest first). The function supports any number of filter parameters as long as they are configured in the column_config.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| synapse_client | Optional[Synapse] | Optional authenticated Synapse client instance |
| schema_registry_table_id | Optional[str] | Optional Synapse ID of the schema registry table. If None, uses the default table ID. |
| column_config | Optional[SchemaRegistryColumnConfig] | Optional configuration for custom column names. If None, uses default configuration ('version' and 'uri' columns). |
| return_latest_only | bool | If True (default), returns only the latest URI as a string. If False, returns all matching URIs as a list of strings. |
| **filters | | Filter parameters to search for matching schemas. These work as follows: |
| RETURNS | DESCRIPTION |
|---|---|
| Union[str, List[str], None] | If return_latest_only is True: Single URI string of the latest version, or None if not found. If return_latest_only is False: List of URI strings sorted by version (highest version first). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no filter parameters are provided |
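The return values described above depend on ordering rows by version. A plain string sort would place "9.0.0" above "12.0.0", so the sketch below compares versions numerically. It assumes the version column holds dot-separated integers, and `version_key` and `select_uris` are hypothetical helpers rather than library functions.

```python
from typing import Dict, List, Tuple, Union


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "12.0.0" into (12, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def select_uris(
    rows: List[Dict[str, str]], return_latest_only: bool = True
) -> Union[str, List[str], None]:
    """Mimic the documented return shapes for a list of {'version', 'uri'} rows."""
    ordered = sorted(rows, key=lambda row: version_key(row["version"]), reverse=True)
    uris = [row["uri"] for row in ordered]
    if return_latest_only:
        return uris[0] if uris else None
    return uris
```

Callers that accept either shape should check for None before using the result, since an empty match with return_latest_only=True yields None rather than an empty list.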
Expected Table Structure
The schema registry table should contain columns for:
- Schema version for sorting (default: 'version')
- JSON schema URI (default: 'uri')
- Any filterable columns as configured in column_config
Additional columns may be present and will be included in results.
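One way to picture the filtering is as a table query whose conditions are joined with AND, with values containing "%" treated as wildcard patterns. The sketch below builds such a condition string. It is an assumption about the general approach, not the library's actual query code, and `build_where_clause` is a hypothetical name.

```python
from typing import Dict


def build_where_clause(filters: Dict[str, str]) -> str:
    """Combine column filters into a single AND-joined condition string."""
    if not filters:
        # Mirrors the documented behavior of rejecting an empty filter set.
        raise ValueError("At least one filter is required")
    conditions = []
    for column, value in filters.items():
        # Values containing "%" are treated as patterns, all others as exact matches.
        operator = "LIKE" if "%" in value else "="
        escaped = value.replace("'", "''")  # double single quotes inside literals
        conditions.append(f"{column} {operator} '{escaped}'")
    return " AND ".join(conditions)
```

Every filter narrows the result, which is why adding a column such as org to an existing dcc filter can only return the same rows or fewer.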
Comprehensive filter usage demonstrations
This includes several examples of how to use the filtering system.
Basic Filtering (using default filters):
from synapseclient import Synapse
from synapseclient.extensions.curator import query_schema_registry

syn = Synapse()
syn.login()

# 1. Get latest schema URI for a specific DCC and datatype
latest_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",  # Exact match for Alzheimer's Disease DCC
    datatype="Analysis"  # Exact datatype match
)
# Returns: "sage.schemas.v2571-ad.Analysis.schema-0.0.0"

# 2. Get all versions of matching schemas (not just latest)
all_versions = query_schema_registry(
    synapse_client=syn,
    dcc="mc2",
    datatype="Biospecimen",
    return_latest_only=False
)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
#           "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0"]

# 3. Filtering on a single column
# Find all "Biospecimen" schemas across all DCCs
biospecimen_schemas = query_schema_registry(
    synapse_client=syn,
    datatype="Biospecimen",  # Exact match for Biospecimen
    return_latest_only=False
)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
#           "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0",
#           "sage.schemas.v2571-veo.Biospecimen.schema-0.3.0",
#           "sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"]

# 4. Pattern matching with wildcards for DCC variations
mc2_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="%C2",  # Matches 'mc2' and 'MC2'
    return_latest_only=False
)
# Returns schemas from both 'mc2' and 'MC2' DCCs

# 5. Using additional columns for filtering (if they exist in your table)
specific_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="amp",  # Must be AMP DCC
    org="sage.schemas.v2571",  # Must match organization
    return_latest_only=False
)
# Returns schemas that match BOTH conditions
Direct Column Filtering (simplified approach):
# Any column in the schema registry table can be used for filtering
# Just use the column name directly as a keyword argument

# Basic filters using standard columns
query_schema_registry(dcc="ad", datatype="Analysis")
query_schema_registry(version="0.0.0")
query_schema_registry(uri="sage.schemas.v2571-ad.Analysis.schema-0.0.0")

# Additional columns (if they exist in your table)
query_schema_registry(org="sage.schemas.v2571")
query_schema_registry(name="ad.Analysis.schema")

# Multiple column filters (all must match)
query_schema_registry(
    dcc="mc2",
    datatype="Biospecimen",
    org="MultiConsortiaCoordinatingCenter"
)
Filter Value Examples with Real Data:
# Exact matching
query_schema_registry(dcc="ad")  # Returns schemas with dcc="ad"
query_schema_registry(datatype="Biospecimen")  # Returns schemas with datatype="Biospecimen"
query_schema_registry(dcc="MC2")  # Returns schemas with dcc="MC2" (case sensitive)

# Pattern matching with wildcards
query_schema_registry(dcc="%C2")  # Matches "mc2", "MC2"
query_schema_registry(datatype="%spec%")  # Matches "Biospecimen"

# Examples with expected results:
query_schema_registry(dcc="ad", datatype="Analysis")
# Returns: "sage.schemas.v2571-ad.Analysis.schema-0.0.0"
query_schema_registry(datatype="Biospecimen", return_latest_only=False)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
#           "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0", ...]

# Multiple conditions (all must be true)
query_schema_registry(
    dcc="amp",  # AND
    datatype="Biospecimen",  # AND
    org="sage.schemas.v2571"  # AND (if org column exists)
)
# Returns: "sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"
# (a single string, because return_latest_only defaults to True)
Source code in synapseclient/extensions/curator/schema_registry.py