👋 Welcome to Cuterwrite's Blog
A systematic introduction to the Triton tile-based GPU programming model and practical optimization techniques, from vector addition to FlashAttention.
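The tile-based model this post covers can be mimicked in plain Python with no GPU: each "program instance" handles one `BLOCK_SIZE`-wide tile of the output, and a mask guards the ragged tail. This is a conceptual sketch borrowing Triton's vocabulary (`pid`, `BLOCK_SIZE`, mask), not actual Triton code.

```python
# Conceptual sketch of Triton's tile-based decomposition in plain Python.
# Each "program instance" (identified by pid) processes one contiguous
# block of BLOCK_SIZE elements; a mask guards the out-of-range tail.

def vector_add(x, y, BLOCK_SIZE=4):
    n = len(x)
    out = [0.0] * n
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceil-div grid size
    for pid in range(num_programs):          # on a GPU these run in parallel
        offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
        mask = [off < n for off in offsets]  # predicate for the ragged tail
        for off, m in zip(offsets, mask):
            if m:
                out[off] = x[off] + y[off]
    return out

print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```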
A comprehensive guide to integrating Open WebUI with the MCP protocol through MCPO, leveraging Claw Cloud Run's free container resources for zero-cost deployment.
This article shows how to build an efficient, intuitive Retrieval-Augmented Generation (RAG) service locally by integrating Open WebUI, Ollama, and the Qwen2.5 model through Docker. The steps include deploying Open WebUI, configuring Ollama to vectorize documents with the bge-m3 embedding model, and answering user queries with the Qwen2.5 generation model. The result is a local system that retrieves documents and generates answers, simplifying operation while strengthening data privacy and broadening the practical reach of generative AI.
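The retrieval step described above can be sketched in plain Python: document chunks are embedded (here with toy hand-written vectors standing in for bge-m3 output), and the chunk nearest the query by cosine similarity becomes context for the generator. The chunks and vectors below are illustrative placeholders, not the actual Open WebUI internals.

```python
import math

# Toy stand-ins for bge-m3 embeddings; in the real pipeline Ollama
# produces these vectors for each document chunk and for the query.
doc_chunks = {
    "Docker installs Open WebUI as a container.": [0.9, 0.1, 0.0],
    "bge-m3 turns text into dense vectors.":      [0.1, 0.9, 0.1],
    "Qwen2.5 generates the final answer.":        [0.0, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, top_k=1):
    # Rank chunks by similarity to the query embedding; keep the best top_k.
    ranked = sorted(doc_chunks, key=lambda c: cosine(query_vec, doc_chunks[c]), reverse=True)
    return ranked[:top_k]

context = retrieve([0.05, 0.95, 0.05])  # query embedding closest to the bge-m3 chunk
print(context)  # ['bge-m3 turns text into dense vectors.']
```

In the full pipeline, the retrieved chunks are prepended to the user's question in the prompt sent to Qwen2.5, which is what lets the generator answer from the user's own documents.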
This article introduces the Scalable Matrix Extension (SME) in the Arm architecture, focusing on its efficient matrix computation in Streaming SVE mode and on the ZA array's mechanism for large-scale data storage and flexible access, which together provide powerful hardware acceleration for high-performance computing applications.
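The ZA-based compute pattern the SME post describes, accumulating outer products of vector operands into a two-dimensional accumulator tile (as the FMOPA/SMOPA instructions do), can be illustrated in plain Python. This is a conceptual model only; real SME operates on hardware ZA tiles in Streaming SVE mode.

```python
# Conceptual model of SME's outer-product-accumulate: C = A @ B computed
# as a sum of column-vector x row-vector outer products accumulated into
# a ZA-like 2-D tile (roughly what one FMOPA does per step on hardware).

def matmul_outer_product(A, B):
    rows, inner, cols = len(A), len(A[0]), len(B[0])
    za = [[0.0] * cols for _ in range(rows)]      # the ZA accumulator tile
    for k in range(inner):                        # one outer product per step
        col = [A[i][k] for i in range(rows)]      # k-th column of A
        row = B[k]                                # k-th row of B
        for i in range(rows):
            for j in range(cols):
                za[i][j] += col[i] * row[j]       # accumulate into the tile
    return za

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer_product(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```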
This article introduces Arm's Scalable Vector Extension (SVE) and its successor, SVE2. They significantly improve the performance of data-intensive applications (such as HPC and ML) through variable-length vector registers, flexible per-lane predication, and a rich instruction set, while software binary compatibility ensures portability across hardware platforms. SVE also offers the Arm C Language Extensions (ACLE) to help developers program it: SVE instructions can be used directly from C/C++ by calling the intrinsic functions declared in the arm_sve.h header for efficient vectorized operations.
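SVE's per-lane predication, which the ACLE exposes through intrinsics such as those declared in arm_sve.h, can be modeled in plain Python: a whilelt-style predicate disables the lanes past the end of the data, so no scalar tail loop is needed. Everything below is a conceptual model, not real SVE code; the fixed "vector length" is an arbitrary placeholder, whereas real SVE discovers it at run time.

```python
# Conceptual model of an SVE predicated loop: process VL lanes at a time,
# with a whilelt-style predicate masking off lanes past the end of the data.

VL = 4  # placeholder vector length; real SVE hardware determines this

def whilelt(i, n):
    # Governing predicate: lane l is active while i + l < n.
    return [i + lane < n for lane in range(VL)]

def predicated_add(x, y):
    out = [0.0] * len(x)
    i = 0
    while i < len(x):
        pg = whilelt(i, len(x))      # build the predicate for this iteration
        for lane in range(VL):
            if pg[lane]:             # inactive lanes do nothing
                out[i + lane] = x[i + lane] + y[i + lane]
        i += VL                      # advance by the vector length
    return out

print(predicated_add([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]))  # [2, 3, 4, 5, 6, 7]
```

The final iteration (elements 4 and 5 with VL = 4) is handled by the same loop body as the rest, which is exactly the tail-handling benefit predication gives vector-length-agnostic SVE code.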
Building a complete LLM application requires more than a powerful model: a thriving LLM ecosystem must cover everything from model training and optimization to deployment and application. This article walks through the layers of the LLM ecosystem and explores how to apply LLMs to real-world scenarios.
This article is reprinted from the Zhihu column: 14. RDMA Memory Window, by Savir. To control memory access permissions more flexibly and conveniently, the IB protocol defines the Memory Window (MW). This article covers what an MW does, its relationship to the MR, its interfaces, and its classification, and goes deeper into L_Key and R_Key than the earlier MR article.
This article is reprinted from the Zhihu column: 11. RDMA Shared Receive Queue, by Savir. The IB protocol's SRQ mechanism significantly reduces memory capacity requirements on the receiving end. This article explains how the SRQ works and how it compares with an ordinary RQ.
This article is reprinted from the Zhihu column: 10. RDMA and Completion Queue, by Savir. The CQ and QP are interdependent: the CQ is the medium through which hardware "reports task status" to software. This article analyzes and explains most of the protocol's CQ-related content.
This article is reprinted from the Zhihu column: 9. RDMA Queue Pair, by Savir. The QP is the most critical concept in RDMA, serving as the medium through which software "issues commands" to hardware. This article analyzes and explains most of the protocol's QP-related content.