Loading data

We provide a simple guide in how to load your own data and initialize the adata object for downstream analysis. This tutorial assumes that the files are either in txt or mtx format. After running this pipeline, please see the folder test_data for how exactly each data is stored.

The first step to use CoSpar is to construct an anndata object that stores all relevant data. The key annotations after initialization are:

  • adata.X: state count matrix, shape (n_cell, n_gene). This should not be log-normalized.

  • adata.var_names: list of gene names, shape (n_genes,).

  • adata.obs[‘time_info’]: time annotation (type: str) for each cell, shape (n_cell,).

  • adata.obs[‘state_info’]: state annotation for each cell, shape (n_cell, 1). [Optional. Can be generated in preprocessing.

  • adata.obsm[‘X_clone’]: clonal labels for each cell in the form of np.array or sparse matrix, shape (n_cell, n_clone).

  • adata.obsm[‘X_pca’]: PCA matrix, shape (n_cell, n_pcs). [Optional. Can be generated in preprocessing.

  • adata.obsm[‘X_emb’]: two-dimensional embedding, shape (n_cell, 2). [Optional. Can be generated in preprocessing.

cs.pp.initialize_adata_object assists this initialization.

[1]:
import cospar as cs
import pandas as pd
import scipy.io as sio
import numpy as np
import os
[2]:
cs.logging.print_version()

# Set the messaging level. At a given value, a running function will
# print information at or below its level.
cs.settings.verbosity = 2  # range: 0 (error),1 (warning),2 (info),3 (hint).

# Plot setting. If you want to control a particular plot,
# just change the setting here, and run that plotting function.
cs.settings.set_figure_params(
    format="png", figsize=[4, 3.5], dpi=75, fontsize=14, pointsize=3
)
Running cospar 0.3.0 (python 3.8.16) on 2023-07-14 12:10.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI no longer supports 'pip search' (or XML-RPC search). Please use https://pypi.org/search (via a browser) instead. See https://warehouse.pypa.io/api-reference/xml-rpc.html#deprecated-methods for more information.

Each dataset should have its folder to avoid conflicts.

[3]:
# set the directory for figures and data. If not existed yet, they will be created automtaically.
cs.settings.data_path = "test_data"  # This name should be exactly 'text_data'
cs.settings.figure_path = "fig_cospar"
cs.hf.set_up_folders()

cs.datasets.raw_data_for_import_exercise()  # download the data
try downloading from url
https://github.com/ShouWenWang-Lab/cospar/files/12036732/test_data.zip
... this may take a while but only happens once

Import method A

Assuming that the X_clone data is stored in mtx format

[4]:
df_cell_id = pd.read_csv(os.path.join(cs.settings.data_path, "cell_id.txt"))
df_cell_id.head()
[4]:
Cell_ID
0 cell_10
1 cell_13
2 cell_18
3 cell_32
4 cell_70
[5]:
df_gene = pd.read_csv(os.path.join(cs.settings.data_path, "gene_names.txt"))
df_gene.head()
[5]:
gene_names
0 0610009L18Rik
1 0610037L13Rik
2 1110012L19Rik
3 1110020A21Rik
4 1110028F11Rik
[6]:
df_time = pd.read_csv(os.path.join(cs.settings.data_path, "time_info.txt"))
df_time.head()
[6]:
time_info
0 6
1 6
2 6
3 6
4 6
[7]:
RNA_count_matrix = sio.mmread(
    os.path.join(cs.settings.data_path, "gene_expression_count_matrx.mtx")
)
RNA_count_matrix
[7]:
<781x2481 sparse matrix of type '<class 'numpy.float64'>'
        with 225001 stored elements in COOrdinate format>
[8]:
X_clone = sio.mmread(os.path.join(cs.settings.data_path, "cell_by_clone_matrx.mtx"))
X_clone
[8]:
<781x339 sparse matrix of type '<class 'numpy.int64'>'
        with 781 stored elements in COOrdinate format>
[9]:
df_state = pd.read_csv(os.path.join(cs.settings.data_path, "state_info.txt"))
df_state.head()
[9]:
state_info
0 Neutrophil
1 Baso
2 Monocyte
3 Monocyte
4 Baso
[10]:
df_X_emb = pd.read_csv(os.path.join(cs.settings.data_path, "embedding.txt"))
df_X_emb.head()
[10]:
x y
0 1165.708 -2077.047
1 -980.871 -2055.629
2 2952.953 281.797
3 1314.821 -560.760
4 -1067.502 -1275.488
[11]:
X_emb = np.array([df_X_emb["x"], df_X_emb["y"]]).T

Now, initialize the adata object

[12]:
adata_orig = cs.pp.initialize_adata_object(
    X_state=RNA_count_matrix,
    gene_names=df_gene["gene_names"],
    cell_names=df_cell_id["Cell_ID"],
    time_info=df_time["time_info"],
    X_clone=X_clone,
    state_info=df_state["state_info"],
    X_emb=X_emb,
    data_des="my_data",
)
Create new anndata object
WARNING: X_pca not provided. Downstream processing is needed to generate X_pca before computing the transition map.
Time points with clonal info: ['2' '4' '6']
WARNING: Default ascending order of time points are: ['2' '4' '6']. If not correct, run cs.hf.update_time_ordering for correction.
WARNING: Please make sure that the count matrix adata.X is NOT log-transformed.
/Users/shouwen/Documents/packages/cospar/cospar/preprocessing/_preprocessing.py:96: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  adata = sc.AnnData(ssp.csr_matrix(X_state))

Update the time ordering. A correct time ordering is assumed later.

[13]:
cs.hf.update_time_ordering(adata_orig, updated_ordering=["2", "4", "6"])
[14]:
adata_orig
[14]:
AnnData object with n_obs × n_vars = 781 × 2481
    obs: 'time_info', 'state_info'
    uns: 'data_des', 'time_ordering', 'clonal_time_points'
    obsm: 'X_clone', 'X_emb'

Alternatively, you can initialize the object by building on an existing adata object. This will keep existing annotations from the old adata. You can make modifications by specifying additional entries.

[15]:
adata_orig_new = cs.pp.initialize_adata_object(adata=adata_orig, X_clone=X_clone)
WARNING: X_pca not provided. Downstream processing is needed to generate X_pca before computing the transition map.
Time points with clonal info: ['2' '4' '6']
WARNING: Default ascending order of time points are: ['2' '4' '6']. If not correct, run cs.hf.update_time_ordering for correction.
WARNING: Please make sure that the count matrix adata.X is NOT log-transformed.
[16]:
cs.pl.embedding(adata_orig, color="state_info")
_images/20210602_loading_data_23_0.png

Import method B

Assuming that the X_clone data is stored in a (cell_id, barcode_id) format

First, initialize the adata without concerning the clonal data

[17]:
df_cell_id = pd.read_csv(os.path.join(cs.settings.data_path, "cell_id.txt"))
df_gene = pd.read_csv(os.path.join(cs.settings.data_path, "gene_names.txt"))
df_time = pd.read_csv(os.path.join(cs.settings.data_path, "time_info.txt"))
RNA_count_matrix = sio.mmread(
    os.path.join(cs.settings.data_path, "gene_expression_count_matrx.mtx")
)
df_state = pd.read_csv(os.path.join(cs.settings.data_path, "state_info.txt"))
df_X_emb = pd.read_csv(os.path.join(cs.settings.data_path, "embedding.txt"))
X_emb = np.array([df_X_emb["x"], df_X_emb["y"]]).T
[18]:
adata_orig = cs.pp.initialize_adata_object(
    X_state=RNA_count_matrix,
    gene_names=df_gene["gene_names"],
    cell_names=df_cell_id["Cell_ID"],
    time_info=df_time["time_info"],
    state_info=df_state["state_info"],
    X_emb=X_emb,
    data_des="my_data",
)
Create new anndata object
WARNING: X_pca not provided. Downstream processing is needed to generate X_pca before computing the transition map.
Time points with clonal info: []
WARNING: Default ascending order of time points are: ['2' '4' '6']. If not correct, run cs.hf.update_time_ordering for correction.
WARNING: Please make sure that the count matrix adata.X is NOT log-transformed.
/Users/shouwen/Documents/packages/cospar/cospar/preprocessing/_preprocessing.py:96: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  adata = sc.AnnData(ssp.csr_matrix(X_state))

Now, load the clonal data. Here, we start with a table of (cell_id, clone_id) pair. The cell_id does not need to be ranked. It should match the cell_id in the adata_orig.obs_names

[19]:
df_X_clone = pd.read_csv(
    os.path.join(cs.settings.data_path, "clonal_data_in_table_format.txt")
)
df_X_clone
[19]:
Cell_ID Clone_ID
0 cell_0 clone_275
1 cell_1 clone_329
2 cell_2 clone_56
3 cell_3 clone_236
4 cell_4 clone_213
... ... ...
776 cell_776 clone_196
777 cell_777 clone_239
778 cell_778 clone_259
779 cell_779 clone_217
780 cell_780 clone_259

781 rows × 2 columns

[20]:
adata_orig.obs_names
[20]:
Index(['cell_10', 'cell_13', 'cell_18', 'cell_32', 'cell_70', 'cell_80',
       'cell_90', 'cell_97', 'cell_108', 'cell_117',
       ...
       'cell_7389', 'cell_7392', 'cell_7402', 'cell_7407', 'cell_7409',
       'cell_7417', 'cell_7423', 'cell_7435', 'cell_7436', 'cell_7437'],
      dtype='object', length=781)
[21]:
cs.pp.get_X_clone(adata_orig, df_X_clone["Cell_ID"], df_X_clone["Clone_ID"])
100%|███████████████████████████████████████| 87/87 [00:00<00:00, 412321.41it/s]
reorder clones

[22]:
adata_orig
[22]:
AnnData object with n_obs × n_vars = 781 × 2481
    obs: 'time_info', 'state_info'
    uns: 'data_des', 'time_ordering', 'clonal_time_points', 'clone_id'
    obsm: 'X_clone', 'X_emb'
[23]:
cs.pl.embedding(adata_orig, color="state_info")
_images/20210602_loading_data_33_0.png
[ ]: