The radius of the hyperbolic space differed depending on the cell types


Variations in gene expression across samples taken from different clusters, which represent different cell types, tissues and disease states, show more complicated distributions and have attracted a great deal of attention. To study the geometric structure of expression space globally, we selected 100 samples randomly from the whole population instead of from local clusters, and performed the same embeddings as in Figure 8A. Surprisingly, we find that, as the number of probes increases, the convexity of the Shepard diagram increases from being approximately zero κ = 0.07 to κ = 0.64 in EMDS and from κ = 0.42 to κ = 0.05 in HMDS. These fitting results match the signatures expected for hyperbolic geometry in Figure 7B. This shows that the gene expression space has a hyperbolic structure that becomes increasingly apparent when a moderately large number of genes is included in the measurements. To test the robustness of this conclusion and make full use of the data, we repeat the sampling process 300 times, both for local sampling, where samples are taken from different single clusters, and for global sampling, where samples are taken broadly from the whole data. The samples are taken with replacement. As expected, for samples taken from local clusters, the median convexity is κ ≈ 0 in EMDS and κ < 0 in HMDS even when all genes are used.

These measurements indicate Euclidean structure. For samples taken across the whole population, the median of κ increases with the number of probes, becoming positive in EMDS and close to zero in HMDS; these signatures indicate that samples across the population have hyperbolic structure when represented by a moderately large number of genes. The microwell-seq data in this dataset are much sparser than the microarray data. Therefore, we first check whether changes in the sparseness of the measurement data could affect the geometry detection and its parameters. To this end, we re-analyze synthetic data in which, at the stage of the Euclidean high-dimensional embedding, all values in the smallest 5% are reset to zero. Even though the intermediate embeddings into higher-dimensional space then have more values set to zero, this does not change the estimated convexity values for low-dimensional embeddings using either the Euclidean or the hyperbolic metric, and the tests correctly identify the presence of a low-dimensional hyperbolic geometry. With these checks in hand, we proceeded to analyze the microwell-seq data from different mouse organs. Following previous studies, the data were pre-processed using the Seurat algorithm and projected onto the top 50 principal components. Next, we applied 5D HMDS to the processed data from four of the mouse organs: brain, kidney, lung and embryonic stem cells. We find that all these data have an underlying hyperbolic structure. It is worth noting that the hyperbolic radius necessary to describe these data is smaller than that for human samples.
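The sparsification check described above can be illustrated with a short sketch. This is not the authors' implementation: the function name `sparsify_smallest`, the list-of-lists data layout, ranking entries by raw value rather than by magnitude, and the handling of ties are all assumptions made for illustration. Only the Python standard library is used.

```python
def sparsify_smallest(matrix, fraction=0.05):
    """Return a copy of `matrix` with the smallest `fraction` of entries zeroed.

    Assumptions (not stated in the text): "smallest" is taken over all entries
    of the matrix by raw value, and entries tied with the cutoff are zeroed too.
    """
    flat = sorted(value for row in matrix for value in row)
    n_zero = int(fraction * len(flat))  # number of entries to zero out
    if n_zero == 0:
        return [list(row) for row in matrix]
    cutoff = flat[n_zero - 1]  # largest value that still gets zeroed
    return [[0.0 if value <= cutoff else value for value in row] for row in matrix]


# Toy example: a 10 x 10 matrix holding the values 1..100.
toy = [[float(10 * i + j + 1) for j in range(10)] for i in range(10)]
sparse = sparsify_smallest(toy, fraction=0.05)
n_zeroed = sum(value == 0.0 for row in sparse for value in row)
```

In a check of this kind, the sparsified matrix would then be passed through the same low-dimensional EMDS and HMDS fits as the original, and the resulting convexity estimates compared.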

Among the four mouse cell types, the largest radius is found for the mouse brain cells, with Rbrain = 2.03±0.02, followed by mouse kidney and lung, which have similar radii of Rdata = 1.78±0.02 and 1.77±0.02, respectively. Finally, the smallest radius is observed for mouse embryonic stem cells, with Rdata = 1.15±0.04. Because the hyperbolic radius indicates the depth of the underlying hierarchical tree, these findings indicate an interesting progression in complexity, with embryonic cells exhibiting the smallest degree of hierarchical organization and brain cells exhibiting the largest. We note that the HMDS method produces estimates of the hyperbolic radius that depend on the embedding dimension D. This happens because the density of points increases exponentially with exponent R according to Equation . Results in Figure 10A-E are obtained for a 5D hyperbolic space. In panel F, we show how the estimates of Rdata decrease with embedding dimension in different datasets. Importantly, the relative differences in Rdata across cell types are maintained across a range of embedding dimensions. The hyperbolic maps continue to have the smallest radius for mouse embryonic cells, larger values for mouse differentiated cells, and yet larger values for mouse brain and human cells. We also find that the minimal embedding dimension for all of these datasets is D = 3; smaller dimensions fail to properly embed the data. We also tested the robustness of the HMDS method to noise in the data. To that end, we add varying amounts of multiplicative Gaussian noise to the Lukk et al. data and fit the resulting Shepard diagrams. The fits produce stable convexity estimates for the Shepard diagrams over a broad range of noise values. This robustness is observed up to very large noise values, with ε = 0.5, at which point noise completely destroys the data structure.
The reason for this robustness is that noise does not systematically shift the shape of the Shepard diagram, yielding the same fitting exponent under varying noise amounts.
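As an illustration of the noise test, the sketch below perturbs a small expression matrix with multiplicative Gaussian noise. The exact noise model is not spelled out in the text, so the form x → x(1 + εz) with z drawn from a standard normal distribution is an assumption, as are the function name `add_multiplicative_noise` and the toy values.

```python
import random


def add_multiplicative_noise(matrix, eps, seed=0):
    """Scale each entry x to x * (1 + eps * z), with z drawn from N(0, 1).

    Assumed noise model; the source only says "multiplicative Gaussian noise".
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [[x * (1.0 + eps * rng.gauss(0.0, 1.0)) for x in row] for row in matrix]


# Hypothetical toy expression values, perturbed at several noise amplitudes.
expression = [[1.0, 2.0, 0.0], [4.0, 5.0, 6.0]]
noisy_by_eps = {eps: add_multiplicative_noise(expression, eps) for eps in (0.0, 0.1, 0.5)}
```

Each noisy copy would then be embedded and its Shepard diagram fitted, so that the convexity estimates can be compared across values of ε. Note that under this noise model zero entries stay zero, so sparsity is unaffected.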

While MDS embedding can be used to detect intrinsic geometry, it is not ideal for low-dimensional visualization. One of the primary reasons, common to all MDS-based algorithms, is that they are not designed to attract similar points together as t-SNE does. Consequently, MDS-based methods achieve poor clustering results. These limitations were addressed by non-linear methods like t-SNE and UMAP, which, however, operate only in Euclidean space. As a result, existing visualization methods may distort the global structure of data that has a global hyperbolic structure. Here we aim to adapt the t-SNE algorithm to work in hyperbolic space. To achieve this, we use the hyperbolic metric to evaluate global distances in the data while keeping the local clustering aspects of the algorithm. The standard t-SNE method effectively discards information about large distances between distant points. We recently proposed a variant of t-SNE, called global t-SNE (g-SNE), which aims to preserve global Euclidean structure in the data. The g-SNE method works by adding to the similarity measures present in t-SNE another term that focuses on large Euclidean distances. When applied to the Lukk et al. data, g-SNE preserves data distances very well. Despite the high quality of the embedding, g-SNE cannot reveal the hierarchical structure of the data, which is only visible in a hyperbolic embedding. Therefore, considering that human gene expression space is locally Euclidean and globally hyperbolic, we develop a hyperbolic t-SNE method (h-SNE) that applies the hyperbolic metric to global similarities as defined in g-SNE, while still using the Euclidean metric for the original local similarities. We find that h-SNE gives similar embedding accuracy to g-SNE, and both largely outperform PCA and UMAP, with distance correlation R = 0.841 for h-SNE compared to R = 0.744 for PCA and R = 0.627 for UMAP. The distance correlation of the Shepard diagram generally quantifies the quality of the embedding with respect to large distances, i.e.
the global inter-class structure preservation. To measure local structure preservation, we use the silhouette score, which measures the quality of clustering. Here we find that h-SNE achieves a higher silhouette score than g-SNE and a significantly higher score than the other algorithms. These quantitative improvements by h-SNE are also reflected in the improved local and global visualizations that the method provides. For local visualization, the clusters identified by h-SNE are well separated with respect to 15 different tissues and disease types. By comparison, the PCA representation does not separate the 15 clusters very well, mixing nervous system neoplasm cells with the breast cancer cells. The non-neoplastic cell line samples are also not separated in the PCA representation from the solid tissue neoplasm cell line samples. The UMAP method separates clusters better but generates too many disconnected components that are difficult to match to sample labels. In terms of global properties, the h-SNE visualization generates a clearer global hierarchical organization of clusters, which is not attainable in the g-SNE embedding: cells from nervous system neoplasm, breast cancer, non-neoplastic cell line and solid tissue neoplasm cell line are sequentially positioned at different branches in the disk; in addition, the two principal hematopoietic and malignancy axes can be clearly identified in h-SNE, but not in UMAP. Finally, it is particularly interesting to note the differences in hierarchical positioning that are assigned to breast cancer cells. Many of these cells occupy points with smaller radii. Positions that are closer to the center of the hyperbolic space typically correspond to more de-differentiated cells, as we have already seen in the comparison between mouse embryonic cells and differentiated cells.
Thus, the more central positions assigned to breast cancer cells are consistent with observations of them being close to de-differentiated cells.
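The two quality measures used above, the distance correlation of the Shepard diagram and the silhouette score, can be sketched as follows. This is a minimal standard-library illustration rather than the code behind the reported numbers: it assumes the distance correlation is the Pearson correlation between original and embedded pairwise distances, and all function names and toy coordinates are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(points, dist=euclidean):
    """Distances for all pairs i < j, always in the same order."""
    n = len(points)
    return [dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(u, v):
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def shepard_correlation(data, embedding, embed_dist=euclidean):
    """Pearson correlation between original and embedded pairwise distances."""
    return pearson(pairwise_distances(data), pairwise_distances(embedding, embed_dist))


def silhouette_score(points, labels, dist=euclidean):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups in 3D and a 2D embedding that rescales them.
data = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (10.0, 0.0, 0.0), (10.0, 1.0, 0.0)]
embedding = [(0.0, 0.0), (0.0, 2.0), (20.0, 0.0), (20.0, 2.0)]
labels = [0, 0, 1, 1]
r = shepard_correlation(data, embedding)
s = silhouette_score(embedding, labels)
```

The `embed_dist` argument is there so that a hyperbolic distance function could be substituted when the embedding lives in hyperbolic space; for larger datasets a vectorized library implementation would be preferable to these quadratic loops.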

The quality of the h-SNE visualization is also illustrated by the topography with respect to the expression gradients of three marker genes: NCAM1 for nervous system neoplasm, ASPN for breast cancer, and PLOD2 for non-neoplastic cell line. These marker genes are highly expressed in distinct but continuous branches in h-SNE; by comparison, the expression patterns of these three genes are more difficult to organize in g-SNE, to cluster in UMAP, or to separate in PCA. In addition to visualizing discrete data, hyperbolic embedding is especially useful for representing temporally continuous data and predicting lineage information. Klimovskaia et al. developed the Poincaré map method to visualize hierarchies in single-cell data. This method uses a similar idea to t-SNE but implements a hyperbolic metric in the representation space. This has led to improvements in the representation of cell trajectories. However, the Poincaré map method, being based on t-SNE, still largely discards large-distance information. This problem can be addressed by h-SNE, which is designed to capture global hyperbolic structure. For comparison with the Poincaré map method, we select the mouse hematopoiesis data of Moignard et al. We first apply the HMDS method to determine the intrinsic geometry of the data and find that the data space is hyperbolic with Rdata = 1.72. Then we apply h-SNE to the data and compare the results with the Poincaré map. The h-SNE method produces local clustering similar to that of the Poincaré map, but it generates a very distinct global pattern: the two differentiated branches 4SFG and 4SG extend around the disk with a clear division along the angular variable in the h-SNE visualization. The corresponding pattern is not as clear in the Poincaré map. The Shepard diagrams of the embeddings show that h-SNE preserves data distances much better than the Poincaré map, especially the large distances.
When predicting pseudotime, the h-SNE method produces a clear prediction with a much smaller variance than the Poincaré map. Finally, as another example, we show the normalized expression of two marker genes, Gfi1b and Cdh5, finding that these two genes are differentially expressed in different branches in h-SNE. This separation is not obvious in the Poincaré map. The clear hierarchical organization of cells in the h-SNE map may help us better understand the relationships between cells at different stages.

In this paper we developed a non-metric MDS in hyperbolic space and showed how it can be used to detect the hidden geometry of data starting from an initial Euclidean representation. By applying this method to several gene expression datasets, we found that gene expression data exhibit Euclidean geometry locally and hyperbolic geometry globally. The lowest values of the hyperbolic radius were observed for embryonic cells and the highest for brain cells in the mouse data. Given that hyperbolic geometry is indicative of hierarchically organized data, and the spanned radius represents the depth of the network hierarchy, it is perhaps intuitive that the largest value would be observed for highly differentiated and specialized brain cells and the smallest value for embryonic cells. The method that we used to detect the presence of hyperbolic geometry was based on non-metric MDS. One can also use methods from algebraic topology for this purpose, as has recently been demonstrated for metabolic networks underlying natural odor mixtures produced by plants and animals. The advantage of the topological method is that it is very sensitive to changes in the underlying geometry, including its dimensionality and hyperbolic radius. However, this method is computationally intensive and does not scale well to large datasets. In contrast, the non-metric MDS method is computationally much faster.
Therefore, we recommend using it as a first step in determining whether the underlying geometry is hyperbolic or Euclidean. If hyperbolic geometry is detected, then the radial positions of the embedding points can be used to arrange the data hierarchically.
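To make the last point concrete, the sketch below orders embedded points by their hyperbolic radius. It assumes the embedding is expressed in Poincaré-ball coordinates with curvature −1, which the text does not specify; if a different coordinate system is used, the radius formula changes. The function names and the toy points are hypothetical.

```python
import math


def hyperbolic_radius(point):
    """Hyperbolic distance from the origin of the Poincare ball (curvature -1)."""
    r = math.sqrt(sum(x * x for x in point))
    if r >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(r)


def poincare_distance(u, v):
    """Hyperbolic distance between two points of the Poincare ball."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq)))


def rank_by_radius(points, names):
    """Order sample names from the center outward (smallest radius first)."""
    radii = [hyperbolic_radius(p) for p in points]
    return [name for _, name in sorted(zip(radii, names))]


# Hypothetical 2D embedding of three samples.
names = ["a", "b", "c"]
points = [(0.1, 0.0), (0.0, 0.8), (0.5, 0.0)]
ordering = rank_by_radius(points, names)
```

Under the interpretation given in the text, samples early in `ordering` (small radius) would sit near the root of the hierarchy, and samples late in the list would sit deeper in it.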