Loading data¶
We provide a simple guide in how to load your own data and initialize the adata object for downstream analysis. This tutorial assumes that the files are either in txt or mtx format. After running this pipeline, please see the folder test_data for how exactly each data is stored.
The first step to use CoSpar is to construct an anndata
object that stores all relevant data. The key annotations after initialization are:
adata.X: state count matrix, shape (n_cell, n_gene). This should not be log-normalized.
adata.var_names: list of gene names, shape (n_genes,).
adata.obs[‘time_info’]: time annotation (type:
str
) for each cell, shape (n_cell,).adata.obs[‘state_info’]: state annotation for each cell, shape (n_cell, 1). [Optional. Can be generated in preprocessing.
adata.obsm[‘X_clone’]: clonal labels for each cell in the form of np.array or sparse matrix, shape (n_cell, n_clone).
adata.obsm[‘X_pca’]: PCA matrix, shape (n_cell, n_pcs). [Optional. Can be generated in preprocessing.
adata.obsm[‘X_emb’]: two-dimensional embedding, shape (n_cell, 2). [Optional. Can be generated in preprocessing.
cs.pp.initialize_adata_object
assists this initialization.
[1]:
import cospar as cs
import pandas as pd
import scipy.io as sio
import numpy as np
import os
[2]:
cs.logging.print_version()
# Set the messaging level. At a given value, a running function will
# print information at or below its level.
cs.settings.verbosity = 2 # range: 0 (error),1 (warning),2 (info),3 (hint).
# Plot setting. If you want to control a particular plot,
# just change the setting here, and run that plotting function.
cs.settings.set_figure_params(
format="png", figsize=[4, 3.5], dpi=75, fontsize=14, pointsize=3
)
Running cospar 0.2.0 (python 3.8.12) on 2022-02-08 19:25.
Each dataset should have its folder to avoid conflicts.
[3]:
# set the directory for figures and data. If not existed yet, they will be created automtaically.
cs.settings.data_path = "test_data" # This name should be exactly 'text_data'
cs.settings.figure_path = "fig_cospar"
cs.hf.set_up_folders()
cs.datasets.raw_data_for_import_exercise() # download the data
Import method A¶
Assuming that the X_clone data is stored in mtx format
[4]:
df_cell_id = pd.read_csv(os.path.join(cs.settings.data_path, "cell_id.txt"))
df_cell_id.head()
[4]:
Cell_ID | |
---|---|
0 | cell_10 |
1 | cell_13 |
2 | cell_18 |
3 | cell_32 |
4 | cell_70 |
[5]:
df_gene = pd.read_csv(os.path.join(cs.settings.data_path, "gene_names.txt"))
df_gene.head()
[5]:
gene_names | |
---|---|
0 | 0610009L18Rik |
1 | 0610037L13Rik |
2 | 1110012L19Rik |
3 | 1110020A21Rik |
4 | 1110028F11Rik |
[6]:
df_time = pd.read_csv(os.path.join(cs.settings.data_path, "time_info.txt"))
df_time.head()
[6]:
time_info | |
---|---|
0 | 6 |
1 | 6 |
2 | 6 |
3 | 6 |
4 | 6 |
[7]:
RNA_count_matrix = sio.mmread(
os.path.join(cs.settings.data_path, "gene_expression_count_matrx.mtx")
)
RNA_count_matrix
[7]:
<781x2481 sparse matrix of type '<class 'numpy.float64'>'
with 225001 stored elements in COOrdinate format>
[8]:
X_clone = sio.mmread(os.path.join(cs.settings.data_path, "cell_by_clone_matrx.mtx"))
X_clone
[8]:
<781x339 sparse matrix of type '<class 'numpy.int64'>'
with 781 stored elements in COOrdinate format>
[9]:
df_state = pd.read_csv(os.path.join(cs.settings.data_path, "state_info.txt"))
df_state.head()
[9]:
state_info | |
---|---|
0 | Neutrophil |
1 | Baso |
2 | Monocyte |
3 | Monocyte |
4 | Baso |
[10]:
df_X_emb = pd.read_csv(os.path.join(cs.settings.data_path, "embedding.txt"))
df_X_emb.head()
[10]:
x | y | |
---|---|---|
0 | 1165.708 | -2077.047 |
1 | -980.871 | -2055.629 |
2 | 2952.953 | 281.797 |
3 | 1314.821 | -560.760 |
4 | -1067.502 | -1275.488 |
[11]:
X_emb = np.array([df_X_emb["x"], df_X_emb["y"]]).T
Now, initialize the adata object
[12]:
adata_orig = cs.pp.initialize_adata_object(
X_state=RNA_count_matrix,
gene_names=df_gene["gene_names"],
cell_names=df_cell_id["Cell_ID"],
time_info=df_time["time_info"],
X_clone=X_clone,
state_info=df_state["state_info"],
X_emb=X_emb,
data_des="my_data",
)
Create new anndata object
WARNING: X_pca not provided. Downstream processing is needed to generate X_pca before computing the transition map.
Time points with clonal info: ['2' '4' '6']
WARNING: Default ascending order of time points are: ['2' '4' '6']. If not correct, run cs.hf.update_time_ordering for correction.
WARNING: Please make sure that the count matrix adata.X is NOT log-transformed.
Update the time ordering. A correct time ordering is assumed later.
[13]:
cs.hf.update_time_ordering(adata_orig, updated_ordering=["2", "4", "6"])
[14]:
adata_orig
[14]:
AnnData object with n_obs × n_vars = 781 × 2481
obs: 'time_info', 'state_info'
uns: 'data_des', 'time_ordering', 'clonal_time_points'
obsm: 'X_clone', 'X_emb'
Alternatively, you can initialize the object by building on an existing adata object. This will keep existing annotations from the old adata. You can make modifications by specifying additional entries.
[15]:
adata_orig_new = cs.pp.initialize_adata_object(adata=adata_orig, X_clone=X_clone)
WARNING: X_pca not provided. Downstream processing is needed to generate X_pca before computing the transition map.
Time points with clonal info: ['2' '4' '6']
WARNING: Default ascending order of time points are: ['2' '4' '6']. If not correct, run cs.hf.update_time_ordering for correction.
WARNING: Please make sure that the count matrix adata.X is NOT log-transformed.
[16]:
cs.pl.embedding(adata_orig, color="state_info")
Import method B¶
Assuming that the X_clone data is stored in a (cell_id, barcode_id) format
First, initialize the adata without concerning the clonal data
[17]:
df_cell_id = pd.read_csv(os.path.join(cs.settings.data_path, "cell_id.txt"))
df_gene = pd.read_csv(os.path.join(cs.settings.data_path, "gene_names.txt"))
df_time = pd.read_csv(os.path.join(cs.settings.data_path, "time_info.txt"))
RNA_count_matrix = sio.mmread(
os.path.join(cs.settings.data_path, "gene_expression_count_matrx.mtx")
)
df_state = pd.read_csv(os.path.join(cs.settings.data_path, "state_info.txt"))
df_X_emb = pd.read_csv(os.path.join(cs.settings.data_path, "embedding.txt"))
X_emb = np.array([df_X_emb["x"], df_X_emb["y"]]).T
[18]:
adata_orig = cs.pp.initialize_adata_object(
X_state=RNA_count_matrix,
gene_names=df_gene["gene_names"],
cell_names=df_cell_id["Cell_ID"],
time_info=df_time["time_info"],
state_info=df_state["state_info"],
X_emb=X_emb,
data_des="my_data",
)
Create new anndata object
WARNING: X_pca not provided. Downstream processing is needed to generate X_pca before computing the transition map.
Time points with clonal info: []
WARNING: Default ascending order of time points are: ['2' '4' '6']. If not correct, run cs.hf.update_time_ordering for correction.
WARNING: Please make sure that the count matrix adata.X is NOT log-transformed.
Now, load the clonal data. Here, we start with a table of (cell_id, clone_id) pair. The cell_id does not need to be ranked. It should match the cell_id in the adata_orig.obs_names
[19]:
df_X_clone = pd.read_csv(
os.path.join(cs.settings.data_path, "clonal_data_in_table_format.txt")
)
df_X_clone
[19]:
Cell_ID | Clone_ID | |
---|---|---|
0 | cell_0 | clone_275 |
1 | cell_1 | clone_329 |
2 | cell_2 | clone_56 |
3 | cell_3 | clone_236 |
4 | cell_4 | clone_213 |
... | ... | ... |
776 | cell_776 | clone_196 |
777 | cell_777 | clone_239 |
778 | cell_778 | clone_259 |
779 | cell_779 | clone_217 |
780 | cell_780 | clone_259 |
781 rows × 2 columns
[20]:
adata_orig.obs_names
[20]:
Index(['cell_10', 'cell_13', 'cell_18', 'cell_32', 'cell_70', 'cell_80',
'cell_90', 'cell_97', 'cell_108', 'cell_117',
...
'cell_7389', 'cell_7392', 'cell_7402', 'cell_7407', 'cell_7409',
'cell_7417', 'cell_7423', 'cell_7435', 'cell_7436', 'cell_7437'],
dtype='object', length=781)
[21]:
cs.pp.get_X_clone(adata_orig, df_X_clone["Cell_ID"], df_X_clone["Clone_ID"])
100%|██████████| 781/781 [00:00<00:00, 27474.68it/s]
[22]:
adata_orig
[22]:
AnnData object with n_obs × n_vars = 781 × 2481
obs: 'time_info', 'state_info'
uns: 'data_des', 'time_ordering', 'clonal_time_points', 'clone_id'
obsm: 'X_clone', 'X_emb'
[23]:
cs.pl.embedding(adata_orig, color="state_info")