How to Create Metadata Curation Workflows

This guide shows you how to set up metadata curation workflows in Synapse using the curator extension. You'll learn how to find an appropriate schema and create curation tasks for your research data.

What you'll accomplish

By following this guide, you will:

  • Find and select the right JSON schema for your data type
  • Create a metadata curation workflow with automatic validation
  • Set up either file-based or record-based metadata collection
  • Configure curation tasks that guide collaborators through metadata entry

Prerequisites

  • A Synapse account with project creation permissions
  • A Python environment with synapseclient and the curator extension installed (e.g., pip install --upgrade "synapseclient[curator]")
  • An existing Synapse project and folder where you want to manage metadata
  • A JSON Schema registered in Synapse (many schemas are already available for Sage-affiliated projects, or you can register your own by following the JSON Schema tutorial)

Step 1: Authenticate and import required functions

from synapseclient.extensions.curator import (
    create_record_based_metadata_task,
    create_file_based_metadata_task,
    query_schema_registry
)
from synapseclient import Synapse

syn = Synapse()
syn.login()
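
In non-interactive environments such as scheduled jobs, it can help to fail early with a clear message when no credentials are configured. The sketch below assumes you supply a personal access token through the SYNAPSE_AUTH_TOKEN environment variable; read_auth_token is a hypothetical helper for illustration, not part of synapseclient.

```python
import os
from typing import Mapping, Optional


def read_auth_token(
    env: Optional[Mapping[str, str]] = None,
    var: str = "SYNAPSE_AUTH_TOKEN",
) -> str:
    """Return the token stored in `var`, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not part of synapseclient.
    """
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var} to a Synapse personal access token before running."
        )
    return token


# Usage with the client created above (needs a real token, so it is commented out):
# syn.login(authToken=read_auth_token())
```

Passing the environment as an argument keeps the helper easy to test without touching real credentials.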

Step 2: Find the right schema for your data

Before creating a curation task, identify which JSON schema matches your data type. Many schemas are already registered in Synapse for Sage-affiliated projects. The schema registry contains validated schemas organized by data coordination center (DCC) and data type.

If you need to register your own schema, follow the JSON Schema tutorial to understand the registration process.

# Find the latest schema for your specific data type
schema_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",  # Your data coordination center code; see the `syn69735275` table if you do not know it
    datatype="IndividualAnimalMetadataTemplate"  # Your specific data type
)

print("Latest schema URI:", schema_uri)

When to use this approach: You know your DCC and data type, you want the most current schema version, and the schema is already registered in the table at https://www.synapse.org/Synapse:syn69735275/tables/.

Alternative - browse available schemas:

# Get all versions to see what's available
all_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",
    return_latest_only=False
)
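
To inspect what a query returned, a small helper can normalize the result before printing it. This is a sketch that assumes query_schema_registry returns a single URI string when only the latest version is requested, a list of URI strings otherwise, and None or an empty result when nothing matches; check the API documentation for the actual return type. The names normalize_schema_results and print_schema_results are hypothetical.

```python
from typing import List, Sequence, Union


def normalize_schema_results(
    result: Union[None, str, Sequence[str]],
) -> List[str]:
    """Coerce a registry query result into a list of URI strings (assumed shapes)."""
    if result is None:
        return []
    if isinstance(result, str):
        return [result]
    return list(result)


def print_schema_results(result: Union[None, str, Sequence[str]]) -> None:
    """Print one numbered line per schema URI, or a hint if nothing matched."""
    uris = normalize_schema_results(result)
    if not uris:
        print("No matching schemas found; check the dcc and datatype values.")
        return
    for position, uri in enumerate(uris, start=1):
        print(f"{position}. {uri}")


# Placeholder values for illustration only (not real schema URIs)
print_schema_results(["example-schema-uri-1", "example-schema-uri-2"])
```

With the query above, you would call print_schema_results(all_schemas) instead of passing placeholder values.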

Step 3: Choose your metadata workflow type

Option A: Record-based metadata

Use this when metadata is normalized in structured records to eliminate duplication and ensure consistency.

record_set, curation_task, data_grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn123456789",         # Your project ID
    folder_id="syn987654321",          # Folder where files are stored
    record_set_name="AnimalMetadata_Records",
    record_set_description="Centralized metadata for animal study data",
    curation_task_name="AnimalMetadata_Curation", # Must be unique within the project
    upsert_keys=["StudyKey"],          # Fields that uniquely identify records
    instructions="Complete all required fields according to the schema. Use StudyKey to link records to your data files.",
    schema_uri=schema_uri,             # Schema found in Step 2
    bind_schema_to_record_set=True
)

print(f"Created RecordSet: {record_set.id}")
print(f"Created CurationTask: {curation_task.task_id}")

What this creates:

  • A RecordSet where metadata is stored as structured records (like a spreadsheet)
  • A CurationTask that guides users through completing the metadata
  • Automatic schema binding for validation
  • A data grid interface for easy metadata entry
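
Because upsert_keys names the fields that uniquely identify records, it can be worth checking your metadata for missing or duplicate key values before entering it. The following is a minimal local sketch, assuming your records are available as a list of dictionaries; find_upsert_key_problems is a hypothetical helper and does not reflect how Synapse itself enforces the keys.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple


def find_upsert_key_problems(
    records: Sequence[Dict[str, Any]],
    upsert_keys: Sequence[str],
) -> Tuple[List[int], List[Tuple[Any, ...]]]:
    """Return (indexes of records missing a key value, key tuples used more than once)."""
    missing: List[int] = []
    seen: Counter = Counter()
    for index, record in enumerate(records):
        values = tuple(record.get(key) for key in upsert_keys)
        if any(value in (None, "") for value in values):
            missing.append(index)
        else:
            seen[values] += 1
    duplicates = [values for values, count in seen.items() if count > 1]
    return missing, duplicates


# Placeholder records for illustration only
records = [
    {"StudyKey": "S1", "species": "mouse"},
    {"StudyKey": "S2", "species": "mouse"},
    {"StudyKey": "S1", "species": "rat"},
    {"species": "rat"},
]
missing, duplicates = find_upsert_key_problems(records, ["StudyKey"])
print("Records missing a key value:", missing)
print("Duplicated key values:", duplicates)
```

Records flagged here would either collide on the same key or have no key to match on, so resolving them first keeps each record distinct.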

Option B: File-based metadata (for unique per-file metadata)

Use this when metadata describes individual data files and is stored as annotations directly on each file.

entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",          # Folder containing your data files
    curation_task_name="FileMetadata_Curation", # Must be unique within the project
    instructions="Annotate each file with metadata according to the schema requirements.",
    attach_wiki=True,                  # Creates a wiki in the folder with the entity view (Defaults to False)
    entity_view_name="Animal Study Files View",
    schema_uri=schema_uri              # Schema found in Step 2
)

print(f"Created EntityView: {entity_view_id}")
print(f"Created CurationTask: {task_id}")

What this creates:

  • An EntityView that displays all files in the folder
  • A CurationTask for guided metadata entry
  • Automatic schema binding to the folder for validation
  • Optional wiki attached to the folder
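
Both options take Synapse IDs such as folder_id, and the examples in this guide use placeholder values. A quick format check before calling either function can catch an empty or mistyped ID early. This sketch assumes IDs have the form "syn" followed by digits with no version suffix; require_synapse_id is a hypothetical helper, not part of synapseclient, and it does not confirm that the entity exists.

```python
import re

# Assumed ID shape: "syn" followed by one or more digits
SYNAPSE_ID_PATTERN = re.compile(r"syn\d+")


def require_synapse_id(value: str, name: str) -> str:
    """Return `value` unchanged if it looks like a Synapse ID, else raise ValueError."""
    if not isinstance(value, str) or not SYNAPSE_ID_PATTERN.fullmatch(value):
        raise ValueError(f"{name} must look like 'syn123456', got {value!r}")
    return value


# Example: validate a placeholder ID before passing it on
folder_id = require_synapse_id("syn987654321", "folder_id")
print("Folder ID looks valid:", folder_id)
```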

Complete example script

Here's the full script that demonstrates both workflow types:

from synapseclient.extensions.curator import (
    create_record_based_metadata_task,
    create_file_based_metadata_task,
    query_schema_registry
)
from synapseclient import Synapse

# Step 1: Authenticate
syn = Synapse()
syn.login()

# Step 2: Find schema
schema_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate"
)
print("Using schema:", schema_uri)

# Step 3A: Create record-based workflow
record_set, curation_task, data_grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn123456789",
    folder_id="syn987654321",
    record_set_name="AnimalMetadata_Records",
    record_set_description="Centralized animal study metadata",
    curation_task_name="AnimalMetadata_Curation",
    upsert_keys=["StudyKey"],
    instructions="Complete metadata for all study animals using StudyKey to link records to data files.",
    schema_uri=schema_uri,
    bind_schema_to_record_set=True
)

print("Record-based workflow created:")
print(f"  RecordSet: {record_set.id}")
print(f"  CurationTask: {curation_task.task_id}")

# Step 3B: Create file-based workflow
entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",
    curation_task_name="FileMetadata_Curation",
    instructions="Annotate each file with complete metadata according to schema.",
    attach_wiki=True,
    entity_view_name="Animal Study Files View",
    schema_uri=schema_uri
)

print("File-based workflow created:")
print(f"  EntityView: {entity_view_id}")
print(f"  CurationTask: {task_id}")

Additional utilities

Validate schema binding on folders

Use this script to validate the items in a folder against the JSON Schema bound to that folder:

from synapseclient import Synapse
from synapseclient.models import Folder

# The Synapse ID of the Folder whose contents you want to validate against its bound JSON Schema.
FOLDER_ID = ""

syn = Synapse()
syn.login()

folder = Folder(id=FOLDER_ID).get()
schema_validation = folder.validate_schema()

print(f"Schema validation result for folder {FOLDER_ID}: {schema_validation}")

List existing curation tasks

Use this script to see all curation tasks in a project:

from pprint import pprint
from synapseclient import Synapse
from synapseclient.models.curation import CurationTask

PROJECT_ID = ""  # The Synapse ID of the project to list tasks from

syn = Synapse()
syn.login()

for curation_task in CurationTask.list(
    project_id=PROJECT_ID
):
    pprint(curation_task)

References

API Documentation