Spatial transcriptomics (ST) technologies such as 10x Visium, Slide-seq, Stereo-seq, and PIXEL-seq have revolutionized the study of gene expression patterns in tissues. They provide multiple data modalities that together offer a comprehensive view of transcriptional dynamics and enable diverse downstream analysis tasks. However, using the multimodal data of ST in a unified way to accomplish multiple tasks remains a challenge. To this end, we developed MuST, an ST analysis tool based on manifold learning and deep neural networks that aligns multimodal data and multiple downstream tasks in a universal latent space. MuST uses a local ordering-based metric to measure the latent space, so that the learned space is applicable to multiple downstream tasks. It then fuses the modalities with multimodal structure-preserving loss functions built on deep neural networks. We assessed the effectiveness of MuST through both quantitative metrics and biological significance. MuST outperformed state-of-the-art methods, with clear advantages in structure retention and similar or superior results on clustering metrics, and it also yielded more precise organization and biomarkers validated from various perspectives. Overall, MuST presents a promising approach to spatial transcriptomics analysis, potentially advancing our understanding of complex biological systems.