Running the Llama 3 8B model and a chat client on CPU in a Docker environment

Create the container
docker run -d --name glm -v /dp/docker/file/glm:/dp/glm -p 8100:8200 --privileged=true centos:7 /usr/sbin/init
Log in to the container
docker exec -it -u root glm /bin/bash

Install basic tools
yum install -y gcc patch libffi-devel python-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel gcc-c++ gcc
yum install -y zlib*
yum install -y net-tools.x86_64
yum install -y openssh-server
yum install -y git
yum install -y wget
yum install -y python-setuptools
# git-lfs is needed for managing large files in git
curl -s http://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sh
yum install -y git-lfs
git lfs install
Python environment
Using conda [Installing Conda on CentOS]

# Switch to the Python 3.12.2 environment
conda activate PY3.12.2

# Create and activate a virtual environment
python -m venv venv
source ./venv/bin/activate
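
Before installing anything, it can help to confirm that the shell really is using the virtual environment and the expected interpreter. The following is a minimal sketch using only the standard library; the helper names in_virtualenv and python_version are made up for illustration.

```python
import sys


def in_virtualenv():
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def python_version():
    # (major, minor, micro) of the running interpreter, e.g. (3, 12, 2).
    return tuple(sys.version_info[:3])


print("virtualenv active:", in_virtualenv())
print("python version:", ".".join(str(part) for part in python_version()))
```

If the first line prints False, run source ./venv/bin/activate again in the current shell.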

Upgrade gcc

First switch to root: run su and enter the password.
Upgrade gcc to version 9.3.1 with the following commands:
yum -y install centos-release-scl
yum install -y devtoolset-9-gcc*
scl enable devtoolset-9 bash

Now check the gcc version: gcc -v
Thread model: posix
gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
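
The version check above can also be scripted, which is handy when the setup is repeated in a new container. This is a rough sketch: parse_gcc_version and installed_gcc_version are illustrative names, and the (9, 0, 0) threshold is only an assumption taken from the version these notes upgrade to, not an official requirement of llama-cpp-python.

```python
import re
import subprocess

# Assumed threshold, based on the gcc 9.3.1 upgrade in these notes.
MIN_GCC = (9, 0, 0)


def parse_gcc_version(text):
    """Extract (major, minor, patch) from `gcc -v` output, or None if absent."""
    match = re.search(r"gcc version (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_gcc_version():
    """Run `gcc -v` and parse its report; returns None if gcc is not on PATH."""
    try:
        # gcc -v writes its report to stderr, so look at both streams.
        result = subprocess.run(["gcc", "-v"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return parse_gcc_version(result.stderr + result.stdout)


version = installed_gcc_version()
if version is None:
    print("gcc not found or version not recognized")
else:
    print("gcc", version, "ok" if version >= MIN_GCC else "too old")
```

Note that scl enable devtoolset-9 bash only affects the shell it starts, so the check has to run inside that shell to see the new compiler.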

# Install the dependencies (use a mirror inside China; the downloads are large)
pip install llama-cpp-python -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install openai -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install uvicorn -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install starlette -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install fastapi -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install sse_starlette -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install starlette_context -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
pip install pydantic_settings -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
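
After the installs finish, a short check can confirm that every package is importable from the active environment. This is a sketch under the assumption that the import names below match the pip packages installed above (for example, llama-cpp-python is imported as llama_cpp); missing_modules is an illustrative helper, not part of any of these libraries.

```python
import importlib.util

# Import names assumed to correspond to the pip packages installed above.
REQUIRED_MODULES = [
    "llama_cpp",
    "openai",
    "uvicorn",
    "starlette",
    "fastapi",
    "sse_starlette",
    "starlette_context",
    "pydantic_settings",
]


def missing_modules(names):
    # find_spec returns None for a top-level module that cannot be found,
    # without actually importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("missing modules:", ", ".join(missing))
else:
    print("all required modules are importable")
```

Any name reported as missing can be installed again with the same pip command and mirror options used above.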

Download the Llama 3 8B model file into the mounted /dp/glm directory so that it can be reused later.
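Download the quantized model weights directly. The file is in GGUF format, which is designed for fast inference and efficient memory use.
Compared with the older GGML format, GGUF supports more complex tokenization and special-token handling, so it copes better with the needs of different language models.
Quantized GGUF files are what make it practical to run Llama 3 on a laptop, and because a GGUF model is a single file, exchanging and deploying models is simpler too.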
Mirror in China: http://hf-mirror.com/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main
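Q4, Q5 and so on indicate the number of bits used to quantize the model weights (Q is short for Quantization). Quantization is a model compression technique that reduces model size and lowers the demand on computing resources (memory in particular) while trying to preserve model quality. The digit 4 or 5 is the quantization bit width (Q4 is 4 bits, Q5 is 5 bits, and so on). The higher the precision, the larger the file and its memory footprint, but it is still far smaller than the unquantized baseline model.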
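
To make the size trade-off concrete, here is a back-of-the-envelope estimate of weight storage at different bit widths. It is a simplification: it assumes a nominal 8 billion parameters and a uniform bit width per weight. Real GGUF files come out somewhat larger, because variants such as Q4_K_M mix bit widths and store extra scaling data. The function name estimated_size_gb is illustrative.

```python
# Nominal parameter count for an "8B" model (an approximation).
N_PARAMS = 8e9


def estimated_size_gb(n_params, bits_per_weight):
    # bits -> bytes -> gigabytes (1 GB = 1e9 bytes here)
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("Q4", 4), ("Q5", 5), ("Q8", 8), ("FP16", 16)]:
    print(f"{label}: about {estimated_size_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions a 4-bit model needs roughly a quarter of the memory of the 16-bit baseline, which is why the quantized file fits in the RAM of an ordinary machine.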
Here we download Meta-Llama-3-8B-Instruct.Q4_K_M.gguf.
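
The download can be scripted so that the file lands in the mounted directory. This is a sketch, not a verified recipe: it assumes the mirror serves raw files under the resolve/main/ path in the same way the Hugging Face Hub does, and download_url, target_path and download are illustrative helpers. The actual transfer is left commented out because the file is several gigabytes.

```python
import urllib.request
from pathlib import PurePosixPath

MIRROR = "http://hf-mirror.com"
REPO = "QuantFactory/Meta-Llama-3-8B-Instruct-GGUF"
FILENAME = "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
TARGET_DIR = "/dp/glm"  # the directory mounted into the container


def download_url(mirror, repo, filename, revision="main"):
    # Assumption: raw files are exposed at <mirror>/<repo>/resolve/<revision>/<file>.
    return f"{mirror}/{repo}/resolve/{revision}/{filename}"


def target_path(directory, filename):
    return str(PurePosixPath(directory) / filename)


def download(url, path):
    # Streams the file to disk; this can take a long time for a multi-GB model.
    urllib.request.urlretrieve(url, path)


url = download_url(MIRROR, REPO, FILENAME)
path = target_path(TARGET_DIR, FILENAME)
print("would download", url)
print("to", path)
# download(url, path)  # uncomment to actually fetch the file
```

Because /dp/glm is a volume mapped from the host, the file survives even if the container is removed and recreated.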

[Deploying llama.cpp]
