Running Llama 2 Chat 7B on an M1 Max MacBook Pro

The end goal is to have an AI running locally summarize Git PRs and alert me to possible improvements before merge.

For that to be useful, a response needs to finish within 1 minute and support up to about 1000 characters. In hindsight, even 500 characters would probably be enough.
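The two requirements above can be checked mechanically. Below is a minimal sketch, assuming a 60-second budget and a 1000-character cap taken from the paragraph above; `run_with_budget` and the stand-in generator are hypothetical names for illustration, not part of any library.

```python
import time
from typing import Callable, Tuple

MAX_SECONDS = 60.0  # assumed budget: "within 1 minute"
MAX_CHARS = 1000    # assumed cap: "about 1000 characters"


def run_with_budget(generate: Callable[[str], str], prompt: str) -> Tuple[str, float, bool]:
    """Call generate(prompt), time it, and clip the reply to MAX_CHARS.

    Returns (reply, elapsed_seconds, within_time_budget).
    This only measures after the fact; it does not interrupt a slow call.
    """
    start = time.perf_counter()
    reply = generate(prompt)
    elapsed = time.perf_counter() - start
    return reply[:MAX_CHARS], elapsed, elapsed <= MAX_SECONDS


# Stand-in generator so the sketch runs without loading a model.
reply, elapsed, ok = run_with_budget(lambda p: "summary " * 200, "diff ...")
print(len(reply), ok)
```

In a real run, `generate` would wrap whatever call produces the model's text, so the same check works for a local model or a hosted API.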

SaaS APIs such as OpenAI or Gemini charge roughly 0.050 per 1 million tokens, which felt like more of a waste than I expected. So I looked for a model I could run locally, requested access to the Llama 2 Chat 7B model on Hugging Face, and tried it out. I even did the setup by asking GPT how to do it. These days, if there is something I want to do, I can just ask and usually get it done, which is really nice. (Of course, setting something up for personal use and deploying an AI model commercially are completely different problems. In the past you had to write the CUDA parallel-processing code yourself, but I am not sure that is still the case.)
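To put the quoted rate in perspective, here is a small back-of-the-envelope sketch. It assumes a single flat rate of 0.050 per 1 million tokens (the currency is not stated above, and real providers usually price input and output tokens separately); the workload numbers are hypothetical.

```python
PRICE_PER_MILLION_TOKENS = 0.050  # figure quoted above; currency unit not stated


def estimate_cost(tokens_per_pr: int, prs: int,
                  price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Total cost for `prs` pull requests of `tokens_per_pr` tokens each."""
    return tokens_per_pr * prs / 1_000_000 * price_per_million


# Hypothetical workload: 4,000 tokens per PR, 500 PRs.
print(estimate_cost(4_000, 500))
```

The total scales linearly with both PR size and PR count, so the actual figure depends heavily on volume and on the provider's real pricing.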

In short, from start to finish the AI sets things up, writes the code, produces the diff, summarizes it, and points out improvements. Personal use only, of course.
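As a rough illustration of the diff-to-prompt step in this workflow, here is a minimal sketch that reads a diff from git and wraps it in a Llama 2 chat `[INST]` block. The helper names, the `main` base branch, the prompt wording, and the 6000-character cap are all assumptions made for the example.

```python
import subprocess

MAX_DIFF_CHARS = 6000  # hypothetical cap to keep the prompt short


def get_diff(base: str = "main") -> str:
    """Return the diff between `base` and HEAD (must run inside a git repo)."""
    result = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_prompt(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Wrap a (possibly truncated) diff in a Llama 2 chat [INST] block."""
    clipped = diff_text[:max_chars]
    return (
        "[INST]\n"
        "Summarize the following change and list any concerns as bullet points.\n\n"
        "--- DIFF ---\n"
        f"{clipped}\n"
        "--- END ---\n"
        "[/INST]"
    )
```

Calling `build_prompt(get_diff())` from inside a repository would produce the prompt string; truncating by characters is crude and can cut a hunk in half, so splitting per file may work better for large diffs.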

First, I dumped a git diff to a text file, hardcoded its contents into the prompt, and ran it. It took more than 6 minutes. The script was as follows.

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from huggingface_hub import login
import torch

access_token = "hf_..."

login(access_token)

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    token=access_token
)

# The model is already loaded in float16 with device_map="auto" above,
# so the pipeline only needs the model and tokenizer objects.
chat_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

prompt = """
[INST]
Below is a combined summary of several parts.

1. Create an integrated summary so the whole change can be understood at a glance.
2. If there are concerns in terms of code quality, performance, or maintainability, comment on them as bullet points.

--- PARTIAL SUMMARIES ---
diff --git a/docs/Java/38.md b/docs/Java/38.md
index 060ffbd..0befc32 100644
--- a/docs/Java/38.md
+++ b/docs/Java/38.md
@@ -1,26 +1,61 @@
 ---
 layout: default
-title: Mistake-prevention setup
-date: 2025-10-02
+title: Running llama2 chat 7b ai on a Mac with 64GB RAM
+date: 2025-10-04
 parent: Server
 ---

...
--- END ---
[/INST]
"""

outputs = chat_pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=500,
    pad_token_id=tokenizer.eos_token_id
)

# Note: generated_text includes the prompt unless return_full_text=False is passed.
print("Result:", outputs[0]["generated_text"])

The result was as follows.

Based on the provided code, here are some potential issues and concerns:

Lack of documentation: The code lacks proper documentation, making it difficult to understand the purpose and functionality of each line. Adding comments and documentation to the code can help address this issue.
Unused variables: Several variables are defined in the code but are not used anywhere. Removing these unused variables can help simplify the code and reduce potential bugs.
Inconsistent naming conventions: The code uses both camelCase and snake_case naming conventions, which can make it difficult to read and understand. Consistently using one naming convention throughout the code can help improve readability.
Unnecessary imports: Several imports are unnecessary and can be removed, such as `from huggingface_hub import login`.
Unused functions: The `login` function is defined but not used anywhere in the code. Removing this unused function can help simplify the code.
Potential SQL injection vulnerabilities: The code uses raw SQL queries without proper escaping, which can lead to SQL injection vulnerabilities. Properly escaping SQL queries can help mitigate these vulnerabilities.
Inconsistent date and time formatting: The code uses both `YYYY-MM-DD HH:mm:ss` and `YYYY-MM-DD HH:mm:SS` formats for date and time, which can cause confusion. Using a consistent format can help improve readability.
Potential performance issues: The code uses `torch.float16` for data types, which can lead to performance issues. Using `torch.float32` or `torch.float64` can help improve performance.
Lack of error handling: The code does not handle errors properly, which can lead to unexpected behavior. Properly handling errors and exceptions can help improve the reliability of the code.
Inconsistent variable naming: The code uses both `model_id` and `tokenizer` for different variables, which can make it difficult to understand their purpose. Using consistent variable names can help improve readability.

One disappointing point is that the model seems to treat the Python code block embedded in the markdown diff as the actual code under review. I also asked for bullet points, but it returned a numbered list. And it took 6 minutes. If a 7B model is this slow, I am not sure how a 13B model would fare.
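The list-format issue, at least, can be patched after generation. Below is a minimal sketch that rewrites numbered lines as bullets; `to_bullets` is a hypothetical helper, and it assumes the model numbers items as `1.` or `1)` at the start of a line.

```python
import re

# Matches a leading "1." or "1)" followed by whitespace.
_NUMBERED = re.compile(r"^\s*\d+[.)]\s+")


def to_bullets(text: str) -> str:
    """Rewrite lines that start with '1.' or '1)' as '- ' bullet lines."""
    return "\n".join(_NUMBERED.sub("- ", line) for line in text.splitlines())


sample = "Concerns:\n1. Lack of documentation\n2) Unused variables"
print(to_bullets(sample))
```

This only fixes formatting. It does nothing about the speed or about the model misreading the diff, which are the bigger problems here.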

So the conclusion?

For PR summaries and improvement suggestions, it is much faster and better to just give Copilot an instruction.md and use that.