Table
Tables in Synapse allow you to upload a CSV or TSV into a queryable interface that follows a user-defined schema and can be queried using a SQL-like syntax.
Tutorial Purpose
In this tutorial you will:
- Create a Table and load it with data
- Query your Table
- Update your Table
- Change Table structure
- Delete Table rows and Tables
Prerequisites
- Make sure that you have completed the Project tutorial.
- This tutorial assumes you have data in a DataFrame or CSV that you want to query through a SQL-like interface in Synapse.
1. Creating a table
Initial setup
```python
import synapseclient
from synapseclient.models import Column, Project, query, SchemaStorageStrategy, Table

syn = synapseclient.Synapse()
syn.login()

project = Project(name="My uniquely named project about Alzheimer's Disease").get()
```
Example data
First, let's load some data. Suppose we have a file, genes.csv:
```csv
Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False
```
Loading Data
- Creating a table from a CSV without specifying initial columns:

    ```python
    table = Table(
        name="My Favorite Genes",
        parent_id=project.id,
    )
    table = table.store()
    table.store_rows(
        values="/path/to/genes.csv",
        schema_storage_strategy=SchemaStorageStrategy.INFER_FROM_DATA,
    )
    ```

- Creating a table from a CSV with specified columns:

    ```python
    columns = [
        Column(name='Name', column_type='STRING', maximum_size=20),
        Column(name='Chromosome', column_type='STRING', maximum_size=20),
        Column(name='Start', column_type='INTEGER'),
        Column(name='End', column_type='INTEGER'),
        Column(name='Strand', column_type='STRING', enum_values=['+', '-'], maximum_size=1),
        Column(name='TranscriptionFactor', column_type='BOOLEAN'),
    ]
    table = Table(
        name="My Favorite Genes",
        columns=columns,
        parent_id=project.id,
    )
    table = table.store()
    table.store_rows(values="/path/to/genes.csv")
    ```

- Creating a table from a Pandas DataFrame. Pandas is a popular library for working with tabular data:

    ```python
    import pandas as pd

    df = pd.read_csv("/path/to/genes.csv", index_col=False)
    table = Table(
        name="My Favorite Genes",
        columns=columns,
        parent_id=project.id,
    )
    table = table.store()
    table.store_rows(values=df)
    ```
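If you don't have the CSV on disk, the same example data can be built directly in memory. This is a sketch using only pandas; the `genes_df` name is illustrative, and the column names mirror the schema defined above so the DataFrame can be passed to `store_rows(values=genes_df)` exactly like the CSV path:

```python
import pandas as pd

# Build the genes.csv example data in memory; column names and types
# match the Table schema defined above.
genes_df = pd.DataFrame({
    "Name": ["foo", "arg", "zap", "bah", "bnk", "xyz"],
    "Chromosome": ["1", "2", "2", "1", "1", "1"],
    "Start": [12345, 20001, 30033, 40444, 51234, 61234],
    "End": [12600, 20200, 30999, 41444, 54567, 68686],
    "Strand": ["+", "+", "-", "-", "+", "+"],
    "TranscriptionFactor": [False, False, False, False, True, False],
})

print(genes_df.shape)  # (6, 6)
```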
2. Querying for data
The query language is quite similar to SQL SELECT statements, except that joins are not supported. The Synapse API documentation has many query examples. Let's try a query:
```python
results = query(
    f"SELECT * FROM {table.id} WHERE Chromosome = '1' AND Start < 41000 AND \"End\" > 20000"
)
for _, row_info in results.iterrows():
    print(row_info)
```
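Since `query` returns a pandas DataFrame, the same conditions can be expressed as a local filter. As a sketch of what the query above matches against the example data (the real filtering happens server-side in Synapse):

```python
import pandas as pd

genes_df = pd.DataFrame({
    "Name": ["foo", "arg", "zap", "bah", "bnk", "xyz"],
    "Chromosome": ["1", "2", "2", "1", "1", "1"],
    "Start": [12345, 20001, 30033, 40444, 51234, 61234],
    "End": [12600, 20200, 30999, 41444, 54567, 68686],
})

# Same conditions as the Synapse query:
# Chromosome = '1' AND Start < 41000 AND "End" > 20000
matches = genes_df[
    (genes_df["Chromosome"] == "1")
    & (genes_df["Start"] < 41000)
    & (genes_df["End"] > 20000)
]

print(list(matches["Name"]))  # ['bah']
```

Only `bah` satisfies all three conditions: `foo` ends before 20000, and `bnk` and `xyz` start after 41000.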
3. Changing Data
Once the schema is settled, changes come in two flavors: appending new rows and updating existing ones.
Appending new rows
Appending new rows is fairly straightforward. To continue the previous example, we might add some new genes from another file.
Let's take this new data in more_genes.csv:
```csv
Name,Chromosome,Start,End,Strand,TranscriptionFactor
Qux1,4,201001,202001,+,False
Qux2,4,203001,204001,+,False
```
- Using a CSV:

    ```python
    table = Table(
        name="My Favorite Genes",
        parent_id=project.id,
    ).get()
    table = table.store_rows(values="/path/to/more_genes.csv")
    ```

- Using Pandas:

    ```python
    new_rows_df = pd.DataFrame({
        "Name": ["Qux3", "Qux4"],
        "Chromosome": ["4", "4"],
        "Start": [201001, 203001],
        "End": [202001, 204001],
        "Strand": ["+", "+"],
        "TranscriptionFactor": [False, False],
    })
    table.store_rows(values=new_rows_df)
    ```
Updating existing rows
- Updating the existing table: query the existing table and update the data. Minimizing changesets to contain only rows that actually change makes processing faster.

    ```python
    results_df = query(f"SELECT * FROM {table.id} WHERE Chromosome = '1'")
    results_df["Name"] = ["rzing", "zing1", "zing2", "zing3"]
    table.store_rows(values=results_df)
    ```

- Upserting rows (update or insert): if your data has a primary key, you can use the upsert functionality to update existing rows or insert new rows based on that key. This way you won't have to query for the existing data to update your table. This function does not delete rows.

    ```python
    to_upsert_df = pd.DataFrame({
        "Name": ["Qux3", "Qux5"],
        "Chromosome": ["4", "4"],
        "Start": [201001, 203001],
        "End": [202001, 204001],
        "Strand": ["-", "+"],
        "TranscriptionFactor": [True, False],
    })

    # Qux3 will be updated, Qux5 will be inserted
    table.upsert_rows(values=to_upsert_df, primary_keys=["Name"])
    ```
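The upsert semantics can be sketched locally with pandas. This is only an illustration of the key-matching behavior, not how `upsert_rows` is implemented; rows whose primary key already exists are replaced, and rows with a new key are appended:

```python
import pandas as pd

# Existing table contents (trimmed to two columns for brevity).
existing = pd.DataFrame({
    "Name": ["Qux1", "Qux2", "Qux3"],
    "Strand": ["+", "+", "+"],
})
# Incoming upsert payload: Qux3 exists (update), Qux5 is new (insert).
incoming = pd.DataFrame({
    "Name": ["Qux3", "Qux5"],
    "Strand": ["-", "+"],
})

# Keep the last row seen per key: incoming rows win over existing ones.
merged = (
    pd.concat([existing, incoming])
    .drop_duplicates(subset=["Name"], keep="last")
    .reset_index(drop=True)
)

print(merged)
```

Note that nothing is deleted: `Qux1` and `Qux2` survive untouched, matching the "no deletions" behavior described above.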
4. Changing Table Structure
- Adding a column:

    ```python
    table.add_column(Column(name="Expression", column_type="STRING"))
    table.store()
    ```

- Renaming or modifying a column:

    ```python
    table.columns["Expression"].name = "Expression2"
    table.columns["Expression"].column_type = "INTEGER"
    table.store()
    ```

- Removing a column:

    ```python
    table.delete_column(name="Expression2")
    table.store()
    ```
5. Deleting Table rows & Tables
- Deleting specific rows: query for the rows you want to delete and call `delete_rows` with that query.

    ```python
    table.delete_rows(query=f"SELECT * FROM {table.id} WHERE Strand = '+'")
    ```

- Deleting rows based on a DataFrame, where the ROW_ID and ROW_VERSION columns specify the rows to be deleted from the table. In this example, rows 2 and 3 are deleted. The expected columns of the DataFrame are described at https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/table/Row.html. Note: ROW_VERSION begins at 1 upon row creation and increases by one with every subsequent update.

    ```python
    table.delete_rows(df=pd.DataFrame({"ROW_ID": [2, 3], "ROW_VERSION": [1, 1]}))
    ```

- Deleting the Table itself removes the Table and all of its rows.

    ```python
    table.delete()
    ```