Table

Tables in Synapse let you upload a CSV or TSV into a queryable interface that follows a user-defined schema and can be queried using a SQL-like syntax.

Tutorial Purpose

In this tutorial you will:

  1. Create a Table and load it with data
  2. Query your Table
  3. Update your Table
  4. Change Table structure
  5. Delete Table rows and Tables

Prerequisites

  • Make sure that you have completed the Project tutorial.
  • The tutorial assumes you have data in a DataFrame or CSV that you want to query through a SQL-like interface in Synapse

1. Creating a table

Initial setup

import synapseclient
from synapseclient.models import Column, Project, query, SchemaStorageStrategy, Table

syn = synapseclient.Synapse()
syn.login()

project = Project(name="My uniquely named project about Alzheimer's Disease").get()

Example data

First, let's load some data. Let's say we had a file, genes.csv:

Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False
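Before uploading, it can help to preview what types the raw data suggests. This pandas sketch (with genes.csv inlined so it runs standalone) shows local type inference, which is in the same spirit as the SchemaStorageStrategy.INFER_FROM_DATA option used below, though Synapse's exact type mapping may differ:

```python
import io

import pandas as pd

# genes.csv inlined so this sketch runs without the file on disk.
GENES_CSV = """Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False
"""

df = pd.read_csv(io.StringIO(GENES_CSV))
print(df.dtypes)
# Name/Strand are inferred as object (string-like), Start/End as int64,
# and TranscriptionFactor as bool. Chromosome comes back as int64, whereas
# the explicit schema below declares it STRING - a good reason to define
# columns yourself when inference would guess wrong.
```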

Loading Data

  • Creating a table from a CSV without specifying initial columns

    table = Table(
        name="My Favorite Genes",
        parent_id=project.id,
    )
    table = table.store()
    table.store_rows(values="/path/to/genes.csv", schema_storage_strategy=SchemaStorageStrategy.INFER_FROM_DATA)
    
  • Creating a table from a CSV with specified columns

    columns = [
        Column(name='Name', column_type='STRING', maximum_size=20),
        Column(name='Chromosome', column_type='STRING', maximum_size=20),
        Column(name='Start', column_type='INTEGER'),
        Column(name='End', column_type='INTEGER'),
        Column(name='Strand', column_type='STRING', enum_values=['+', '-'], maximum_size=1),
        Column(name='TranscriptionFactor', column_type='BOOLEAN')
    ]
    table = Table(
        name="My Favorite Genes",
        columns=columns,
        parent_id=project.id,
    )
    table = table.store()
    table.store_rows(values="/path/to/genes.csv")
    
  • Creating a table from a Pandas DataFrame - Pandas is a popular library for working with tabular data.

    import pandas as pd
    
    df = pd.read_csv("/path/to/genes.csv", index_col=False)
    table = Table(
        name="My Favorite Genes",
        columns=columns,
        parent_id=project.id,
    )
    table = table.store()
    table.store_rows(values=df)
    

2. Querying for data

The query language is quite similar to SQL SELECT statements, except that joins are not supported. The documentation for the Synapse API has many query examples. Let's try a query:

results = query(
    f"SELECT * FROM {table.id} WHERE Chromosome = '1' AND Start < 41000 AND \"End\" > 20000"
)
for _, row_info in results.iterrows():
    print(row_info)
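To see what that WHERE clause selects, here is a local pandas equivalent over the same genes.csv rows (inlined so no Synapse connection is needed). Chromosome is read as a string to match the STRING column type in the schema:

```python
import io

import pandas as pd

GENES_CSV = """Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False
"""

df = pd.read_csv(io.StringIO(GENES_CSV), dtype={"Chromosome": str})

# Equivalent of: WHERE Chromosome = '1' AND Start < 41000 AND "End" > 20000
hits = df[(df["Chromosome"] == "1") & (df["Start"] < 41000) & (df["End"] > 20000)]
print(hits["Name"].tolist())  # ['bah']
```

Only `bah` satisfies all three conditions: `foo` ends before 20000, and `bnk` and `xyz` start after 41000.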

3. Changing Data

Once the schema is settled, changes come in two flavors: appending new rows and updating existing ones.

Appending new rows

Appending new rows is fairly straightforward. To continue the previous example, we might add some new genes from another file. Let's take this new data in more_genes.csv:

Name,Chromosome,Start,End,Strand,TranscriptionFactor
Qux1,4,201001,202001,+,False
Qux2,4,203001,204001,+,False

  • Using a CSV

    table = Table(
        name="My Favorite Genes",
        parent_id=project.id,
    ).get()
    table = table.store_rows(values="/path/to/more_genes.csv")
    
  • Using Pandas

    new_rows_df = pd.DataFrame({
        "Name": ["Qux3", "Qux4"],
        "Chromosome": ["4", "4"],
        "Start": [201001, 203001],
        "End": [202001, 204001],
        "Strand": ["+", "+"],
        "TranscriptionFactor": [False, False]
    })
    table.store_rows(values=new_rows_df)
    
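Conceptually, calling store_rows on an existing Table appends the new rows to what is already there without touching existing rows. A local sketch of the end result, using names from the examples above (columns trimmed for brevity):

```python
import pandas as pd

existing = pd.DataFrame({
    "Name": ["Qux1", "Qux2"],
    "Chromosome": ["4", "4"],
})
new_rows = pd.DataFrame({
    "Name": ["Qux3", "Qux4"],
    "Chromosome": ["4", "4"],
})

# Appending never modifies existing rows; it only adds to the end.
combined = pd.concat([existing, new_rows], ignore_index=True)
print(combined["Name"].tolist())  # ['Qux1', 'Qux2', 'Qux3', 'Qux4']
```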

Updating existing rows

  • Updating the existing table - query the existing Table and update the data. Keeping the changeset to only the rows that actually change makes processing faster.

    results_df = query(f"select * from {table.id} where Chromosome='1'")
    results_df['Name'] = ['rzing', 'zing1', 'zing2', 'zing3']
    table.store_rows(values=results_df)
    
  • Upserting rows (update or insert) - If your data has a primary key, you can use the upsert functionality to update existing rows or insert new rows based on this primary key. This way you won't have to query for the existing data to update your table. This function does not do deletions.

    to_upsert_df = pd.DataFrame({
        "Name": ["Qux3", "Qux5"],
        "Chromosome": ["4", "4"],
        "Start": [201001, 203001],
        "End": [202001, 204001],
        "Strand": ["-", "+"],
        "TranscriptionFactor": [True, False]
    })
    # Qux3 will be updated, Qux5 will be inserted
    table.upsert_rows(values=to_upsert_df, primary_keys=['Name'])
    
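The matching behavior can be pictured locally: rows whose primary key already exists are replaced, new keys are appended, and existing rows absent from the incoming batch are left alone. A pandas sketch of that logic (illustrative only; upsert_rows performs this matching inside Synapse):

```python
import pandas as pd

existing = pd.DataFrame({
    "Name": ["Qux1", "Qux2", "Qux3"],
    "Strand": ["+", "+", "+"],
})
incoming = pd.DataFrame({
    "Name": ["Qux3", "Qux5"],  # Qux3 exists (update); Qux5 is new (insert)
    "Strand": ["-", "+"],
})

# Keep existing rows whose key is not in the incoming batch, then append it.
untouched = existing[~existing["Name"].isin(incoming["Name"])]
result = pd.concat([untouched, incoming], ignore_index=True)
print(result["Name"].tolist())  # ['Qux1', 'Qux2', 'Qux3', 'Qux5']
print(result.loc[result["Name"] == "Qux3", "Strand"].item())  # -
```

Note that nothing is deleted: Qux1 and Qux2 survive untouched even though they were not in the incoming batch.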

4. Changing Table Structure

  • Adding columns

    table.add_column(
        Column(name="Expression", column_type="STRING")
    )
    table.store()
    
  • Renaming or modifying a column:

    table.columns['Expression'].name = 'Expression2'
    table.columns['Expression'].column_type = 'INTEGER'
    table.store()
    
  • Removing a column

    table.delete_column(name="Expression2")
    table.store()
    

5. Deleting Table rows & Tables

  • Deleting specific rows - pass delete_rows a query that selects the rows you want to delete

    table.delete_rows(query=f"SELECT * FROM {table.id} WHERE Strand = '+'")
    
  • Or delete rows based on a DataFrame, where the ROW_ID and ROW_VERSION columns specify the rows to be deleted from the Table. In this example, rows 2 and 3 are deleted. The expected columns of the DataFrame are described here: https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/table/Row.html. Note: ROW_VERSION begins at 1 when a row is created and increases by one with every subsequent update.

    table.delete_rows(df = pd.DataFrame({"ROW_ID": [2, 3], "ROW_VERSION": [1, 1]}))
    
  • Deleting the Table removes the Table itself along with all of its rows

    table.delete()
    
