Running Llama 2 Chat 7B on an M1 Max MacBook Pro

The end goal is to have an AI running locally summarize a git PR and flag possible improvements on its own, before the merge.

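For that to be usable, a response has to finish within about 1 minute, and the model has to handle replies of up to 1,000 characters (on second thought, 500 is probably enough).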
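A rough sketch of how those two limits could be checked around any generation call. The names `run_with_budget`, `TIME_BUDGET_S`, and `MAX_CHARS` are made up for illustration, and the lambda is a stand-in for the real model call.

```python
import time

TIME_BUDGET_S = 60.0  # from the text: a reply should finish within one minute
MAX_CHARS = 500       # from the text: 500 characters is probably enough


def run_with_budget(generate, prompt):
    """Call generate(prompt) and report elapsed time and whether both limits held."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    return {
        "text": text,
        "elapsed_s": elapsed,
        "within_time": elapsed <= TIME_BUDGET_S,
        "within_length": len(text) <= MAX_CHARS,
    }


# Stand-in generator (hypothetical) so the sketch runs without a model.
report = run_with_budget(lambda p: "summary of: " + p, "example diff")
print(report["within_time"], report["within_length"])
```

Swapping the lambda for a function that calls the real pipeline would give the same report for an actual run.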

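SaaS APIs like OpenAI or Gemini charge roughly 0.050 per 1M tokens, which felt like more of a waste than I expected. So I went looking for a model I could run locally, got access to the llama2 7b chat model on Hugging Face, and tried it out. I even asked GPT how to set the model up as I went; these days, if there's something you want to do, you can pretty much just ask and get it done, which is great. (Of course, that only covers a setup for personal use. Setting up an AI model for commercial use is a completely different story. In the old days you had to write the GPU code and the CUDA parallelism yourself, but I'm not sure that's still true.)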
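To put that rate against an actual PR volume, here is a back-of-the-envelope calculation. The 0.050 per 1M tokens figure is the rough one quoted above (no currency unit is given); the workload numbers and the `estimate_cost` helper are invented for illustration only.

```python
PRICE_PER_MILLION_TOKENS = 0.050  # rough figure from the text; currency unit not stated


def estimate_cost(tokens_per_pr, prs_per_month, price_per_million=PRICE_PER_MILLION_TOKENS):
    """Return the approximate monthly API cost for summarizing PRs."""
    total_tokens = tokens_per_pr * prs_per_month
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical workload: 4,000 tokens per PR, 200 PRs a month.
print(estimate_cost(4_000, 200))
```

Real pricing usually differs for input and output tokens and by model, so the single rate here is a simplification.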

In other words, from start to finish the AI did the setup, wrote the code, produced the diff, summarized it, and pointed out improvements. Strictly for personal use, of course!
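For the diff step specifically, a minimal sketch of pulling the diff from git programmatically and wrapping it in the Llama 2 chat `[INST]` markers. The base branch `main`, the function names, the English prompt wording, and the 6,000-character cap are all assumptions, not what was actually run here.

```python
import subprocess

MAX_DIFF_CHARS = 6000  # assumed cap so the prompt stays within the model's context


def get_diff(base="main"):
    """Return the diff between `base` and the working tree as text."""
    result = subprocess.run(
        ["git", "diff", base],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_prompt(diff_text, max_chars=MAX_DIFF_CHARS):
    """Wrap a (possibly truncated) diff in the Llama 2 chat [INST] markers."""
    body = diff_text[:max_chars]
    return (
        "[INST]\n"
        "Summarize the changes below, then list any concerns as bullet points.\n\n"
        "--- DIFF ---\n"
        f"{body}\n"
        "--- END ---\n"
        "[/INST]"
    )
```

Inside a repository, `build_prompt(get_diff())` would produce the string to hand to the pipeline. Truncating by characters is crude; a token-based cap would track the context window more closely.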

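To start, I dumped the git diff to a text file by hand, hardcoded it into the prompt, and ran it. It took more than 6 minutes… The prompt is written in Korean: it says the input is a combination of partial summaries, then asks for (1) a unified summary that shows all the changes at a glance and (2) bullet-point comments on any concerns about code quality, performance, or maintainability. Here is the script I ran.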

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from huggingface_hub import login
import torch

access_token = "hf_..."

login(access_token)

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    token=access_token
)

# The model above is already loaded in float16 with device_map="auto",
# so dtype and device placement are not repeated for the pipeline.
chat_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

prompt = """
[INST]
아래는 여러 부분 요약을 합친 것입니다.

1. 전체 변경사항을 한눈에 볼 수 있도록 통합 요약을 해주세요.
2. 코드 품질, 성능, 유지보수성 측면에서 우려사항이 있으면 bullet point로 코멘트 해주세요.

--- PARTIAL SUMMARIES ---
diff --git a/docs/Java/38.md b/docs/Java/38.md
index 060ffbd..0befc32 100644
--- a/docs/Java/38.md
+++ b/docs/Java/38.md
@@ -1,26 +1,61 @@
 ---
 layout: default
-title: 실수방지 셋업
-date: 2025-10-02
+title: llama2 chat 7b ai mac 64GB RAM 에서 단순히 돌려보기
+date: 2025-10-04
 parent: 📌 Server
 ---
 
-## Table of contents
+Table of contents
+
 {: .no_toc .text-delta }
 
-1. TOC
-{:toc}
+최종 목적은 git PR 요약 및 개선점을 머지 이전에 알아서 AI 가 작성하도록 하는게 목적이다. 
+
+이를 위해서는 그래도 1분이내 응답이 만료되어야하고, 응답이 1000자 까지 지원해줘야한다. input 은 길게도 들어갈 수 있도록.
+
+일단 다른 openai 나 gemini 등 SaaS API 는 대충 0.050/100만 개 토큰별로 비용발생하니까 생각보다 아까웠다 ㅜㅜ. 그래서 로컬에서 돌릴 수 있는 모델을 찾아보다가 허깅페이스에서 llama2 chat 7b chat 모델을 엑세스 받아서 사용해봄. 심지어 AI 모델 implement 를 위한 방법도 GPT 로 물어보면서 셋업하니까 요즘은 진짜 히고싶은거 있으면 물어보면 다 할 수 있어서 너무 좋다.(물론 개인적인 사용을 위한 용도까지 셋업해주고 상업용으로 AI 모델 셋업하는건 아예 다른얘기다. 옛날엔 GPU 코딩으로 CUDA 
...
--- END ---
[/INST]
"""

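# Sample up to 500 new tokens; EOS is reused as the pad token.
outputs = chat_pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=500,
    pad_token_id=tokenizer.eos_token_id
)

# By default generated_text holds the prompt followed by the model's reply.
print("Result:", outputs[0]["generated_text"])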

The result:

Based on the provided code, here are some potential issues and concerns:

1. Lack of documentation: The code lacks proper documentation, making it difficult to understand the purpose and functionality of each line. Adding comments and documentation to the code can help address this issue.
2. Unused variables: Several variables are defined in the code but are not used anywhere. Removing these unused variables can help simplify the code and reduce potential bugs.
3. Inconsistent naming conventions: The code uses both camelCase and snake_case naming conventions, which can make it difficult to read and understand. Consistently using one naming convention throughout the code can help improve readability.
4. Unnecessary imports: Several imports are unnecessary and can be removed, such as `from huggingface_hub import login`.
5. Unused functions: The `login` function is defined but not used anywhere in the code. Removing this unused function can help simplify the code.
6. Potential SQL injection vulnerabilities: The code uses raw SQL queries without proper escaping, which can lead to SQL injection vulnerabilities. Properly escaping SQL queries can help mitigate these vulnerabilities.
7. Inconsistent date and time formatting: The code uses both `YYYY-MM-DD HH:mm:ss` and `YYYY-MM-DD HH:mm:SS` formats for date and time, which can cause confusion. Using a consistent format can help improve readability.
8. Potential performance issues: The code uses `torch.float16` for data types, which can lead to performance issues. Using `torch.float32` or `torch.float64` can help improve performance.
9. Lack of error handling: The code does not handle errors properly, which can lead to unexpected behavior. Properly handling errors and exceptions can help improve the reliability of the code.
10. Inconsistent variable naming: The code uses both `model_id` and `tokenizer` for different variables, which can make it difficult to understand their purpose. Using consistent variable names can help improve readability.

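A bit disappointing: it seems to believe that the part wrapped in a python code block inside the markdown is the actual code under review. I also asked for bullet points and got a plain numbered list instead. And it took a full 6 minutes. If 7b is already like this, I have no idea how the 13b model would fare.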
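The formatting complaint, at least, can be patched after the fact. A small sketch that drops the echoed prompt from the pipeline output and rewrites numbered lines as bullets; `strip_prompt` and `to_bullets` are hypothetical helper names, and the regex only handles simple `1.` / `1)` prefixes.

```python
import re

# Matches a leading list number such as "1. " or "12) ".
NUMBERED = re.compile(r"^\s*\d+[.)]\s+")


def strip_prompt(generated, prompt):
    """Drop the echoed prompt if the output starts with it."""
    return generated[len(prompt):] if generated.startswith(prompt) else generated


def to_bullets(text):
    """Rewrite lines that start with '1.' or '1)' as '- ' bullets."""
    return "\n".join(NUMBERED.sub("- ", line) for line in text.splitlines())


print(to_bullets("1. Lack of documentation\n2. Unused variables"))
```

This does nothing about the model mistaking the quoted code for the code under review, and nothing about the 6 minutes.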

So, the conclusion?

For PR summaries and improvement suggestions, just dropping an instruction.md into Copilot is far faster and better.