[Paper] Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

https://transformer-circuits.pub/2023/monosemantic-features/index.html


Authors Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Ta


Mechanistic interpretability tries to understand a neural network as a whole by decomposing it into small, understandable components.

Individual neurons, however, turn out not to be units that humans can easily understand.
- This is because neurons are polysemantic.
- A major cause of polysemanticity is superposition.
  - Superposition is the phenomenon where a model represents more independent "features" than it has dimensions by assigning each feature its own linear combination of neurons.

This paper uses a sparse autoencoder, a weak dictionary learning algorithm, to extract learned features from a trained model.

These features are more monosemantic than neurons, which makes them easier to understand.

The paper's results can be summarized as follows.

- Sparse autoencoders extract relatively monosemantic features.
- Sparse autoencoders produce interpretable features that cannot be seen in the neuron basis.
- Sparse autoencoder features can be used to change what the transformer generates.
- Sparse autoencoders produce relatively universal features.
- Features tend to "split" as the autoencoder size increases.
- Just 512 neurons can represent tens of thousands of features.
- Features connect in "finite-state automata"-like systems that implement complex behaviors.

Problem Setup

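The biggest difficulty in reverse engineering neural networks is the curse of dimensionality.

As a model grows, the volume of its latent space grows exponentially.

The model we want to understand is a one-layer transformer whose MLP layer uses a ReLU activation.

Our goal is to decompose the MLP activations into understandable "features".

Because we expect the MLP layer to use superposition, we decompose it into more features than it has neurons.

The experiments decompose the 512 neurons into dictionaries with 1x to 256x as many features.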

Features as a Decomposition

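Many studies have shown that neural networks have interpretable linear directions in activation space.

If linear directions are interpretable, it is natural to assume there is a meaningful "basic set" of directions from which more complex directions are built.

We call these directions features, and they are the basic units into which we want to decompose the model.
Individual neurons are sometimes such basic, understandable units, but in most cases they are not.

We therefore decompose the activation vector $x^j$ into more general features.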

$\mathbf{x}^j \approx \mathbf{b} + \sum\limits_{i} f_i(\mathbf{x}^j) \mathbf{d}_i$

$x^j$: activation vector of length $d_{MLP}$ for datapoint j
$f_i(x^j)$: activation of feature i
$d_i$: unit vector in activation space giving the direction of feature i
$b$: bias

This is a linear matrix factorization of the kind commonly used in dictionary learning.

In the sparse autoencoder setup used here, the feature activations are the output of the encoder.

$f_i(x) = \text{ReLU}(W_e(\mathbf{x} - \mathbf{b}_d) + \mathbf{b}_e)_i$

$W_e$: weight matrix of the encoder
$b_d, b_e$: pre-encoder bias and encoder bias
Feature directions: the columns of the decoder weight matrix $W_d$
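The two formulas above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the paper's code: the weights are hand-picked rather than learned, the helper names (`encode`, `decode`, `matvec`) are made up for this note, and the sketch assumes the decoder reuses $b_d$ as the bias $b$ of the decomposition.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def encode(x, W_e, b_e, b_d):
    """Feature activations f(x) = ReLU(W_e (x - b_d) + b_e)."""
    centered = [xi - bi for xi, bi in zip(x, b_d)]
    pre = matvec(W_e, centered)
    return relu([p + b for p, b in zip(pre, b_e)])

def decode(f, W_d, b_d):
    """Reconstruction x_hat = b_d + sum_i f_i * d_i, with d_i = column i of W_d.

    Assumption: the decoder bias is tied to the pre-encoder bias b_d.
    """
    out = list(b_d)
    for i, f_i in enumerate(f):
        for k in range(len(out)):
            out[k] += f_i * W_d[k][i]
    return out

# Toy overcomplete dictionary: 3 unit-norm feature directions in a 2-d space.
W_d = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 0.0]]                      # columns are d_0, d_1, d_2
W_e = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hand-picked, not learned
b_e = [0.0, 0.0, 0.0]
b_d = [0.0, 0.0]

x = [-3.0, 1.0]
f = encode(x, W_e, b_e, b_d)     # only features 1 and 2 are active
x_hat = decode(f, W_d, b_d)
print(f, x_hat)
```

In this toy case only two of the three features fire, yet the reconstruction matches the input, which is the sparse and overcomplete behavior the decomposition is after.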

Superposition Hypothesis

How does this decomposition relate to superposition?

The superposition hypothesis assumes that a layer represents more features than it has neurons.

As a result, we expect the feature directions to form an overcomplete basis.
That is, there are more directions $d_i$ than neurons.
We also assume that feature activations are sparse (a necessary condition for superposition to occur).

Under these assumptions, the problem is mathematically identical to dictionary learning.

What makes a good decomposition?

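Suppose the MLP activation of each datapoint has been expressed as a sparse weighted sum of features.

There are three criteria for judging whether such a decomposition is useful for understanding the neural network.

1) We can understand the conditions under which each feature is active.
2) We can understand the downstream effect of each feature.
3) The features explain most of what the MLP layer does.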

Why not use architectural approaches?

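The Toy Models of Superposition paper suggested building models that have no superposition in the first place, using a sufficiently large model.

The authors invested a lot of time and effort in this direction, because such a monosemantic model would be a great help to research even if its performance were much worse.

In the end, however, they found that for models trained with cross-entropy loss, using superposition yields a lower loss regardless of model size or sparsity, so superposition still arises.
Using mean squared error would avoid the problem, but language models are not trained with MSE, so the approach of changing the architecture is not pursued.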

Using Sparse Autoencoders To Find Good Decompositions

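There is a long-standing hypothesis that many latent variables in natural data are sparse.

Based on several examples, we assume this hypothesis also fits language models.

The goal is therefore to find a decomposition that is both sparse and overcomplete.

This is the same as the sparse dictionary learning problem.

One caveat is that making the problem overcomplete makes it very different from approaches that look for a sparse disentanglement.

We want to determine the feature activations $f_i(x)$ of a datapoint x given the feature directions $d_i$.
- This amounts to determining a high-dimensional vector from a low-dimensional projection.
What makes this seemingly impossible task possible is that the high-dimensional vector is sparse.

Among the many dictionary learning methods, approximate dictionary learning with a sparse autoencoder was chosen for two reasons.

1) A sparse autoencoder scales very easily to large datasets.
2) Existing iterative dictionary learning methods may be "too strong", in the sense that they could recover features from the activations that the model itself cannot access.
- In contrast, a sparse autoencoder has an architecture similar to the MLP layer, so its ability to recover features from superposition should be similar.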

Sparse Autoencoder Setup

This section describes the simple autoencoder setup.

Architecture
- A bias is applied to the input
- Encoder: linear layer with bias and ReLU
- Decoder: linear layer with bias
Training
- Adam optimizer, MSE loss, L1 penalty
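As a rough illustration of the training objective listed above (MSE reconstruction loss plus an L1 penalty on the feature activations), here is a minimal sketch for one batch. The function name `sae_loss`, the coefficient name `l1_coeff`, and the choice to average the squared error over dimensions and the whole loss over the batch are assumptions of this note, not details confirmed by the summary.

```python
def sae_loss(batch_x, batch_x_hat, batch_f, l1_coeff):
    """Mean over the batch of: MSE(x, x_hat) + l1_coeff * ||f||_1.

    batch_x     : list of input activation vectors
    batch_x_hat : list of reconstructions (same shapes as batch_x)
    batch_f     : list of feature activation vectors (non-negative after ReLU)
    l1_coeff    : hypothetical name for the sparsity coefficient
    """
    total = 0.0
    for x, x_hat, f in zip(batch_x, batch_x_hat, batch_f):
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(v) for v in f)   # L1 norm pushes activations toward zero
        total += mse + l1_coeff * l1
    return total / len(batch_x)

# One-example batch: squared error 4 over 2 dims, L1 norm 2.0.
loss = sae_loss([[1.0, 2.0]], [[1.0, 0.0]], [[0.5, 0.0, 1.5]], l1_coeff=0.1)
print(loss)
```

A larger `l1_coeff` trades reconstruction quality for sparser feature activations; in practice this loss would be minimized with Adam, as the setup above states.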

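Two important observations came out of training.

1) The amount of data matters a great deal. The more data was used, the more understandable and "sharper" the features became.
2) Some neurons of the autoencoder stop activating during training. "Resampling" these dead neurons improves the autoencoder's performance.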

How can we tell if the autoencoder is working?

In conventional machine learning there is a clear criterion for whether a model is working.

For this autoencoder there is no single metric for interpretability or activation sparsity, so the following criteria are used together.

1) Manual inspection: are the features understandable?
2) Feature density: the number of live features and the fraction of tokens on which each one fires turned out to be very informative.
3) Reconstruction loss: how well the autoencoder reconstructs the MLP activations.
4) Toy models: evaluating the autoencoder on toy models where the ground truth is known.
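The feature density criterion in the list above can be made concrete with a small sketch. It assumes a hypothetical table `activations[t][i]` holding the activation of feature i on token t, counts a feature as firing when its activation is positive, and reports which features never fire. The names are invented for this note.

```python
import math

def feature_densities(activations):
    """Fraction of tokens on which each feature fires (activation > 0).

    activations[t][i] is the activation of feature i on token t.
    """
    n_tokens = len(activations)
    n_features = len(activations[0])
    counts = [0] * n_features
    for row in activations:
        for i, a in enumerate(row):
            if a > 0:
                counts[i] += 1
    return [c / n_tokens for c in counts]

def dead_features(densities):
    """Indices of features that never fired on the sampled tokens."""
    return [i for i, d in enumerate(densities) if d == 0.0]

# Toy data: 4 tokens, 3 features.
activations = [
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]
densities = feature_densities(activations)
dead = dead_features(densities)
# Densities usually span orders of magnitude, so a log scale is convenient.
log_densities = [math.log10(d) for d in densities if d > 0]
print(densities, dead, log_densities)
```

Features in `dead` are the candidates for the resampling step mentioned earlier; how the resampling itself is done is not covered by this sketch.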

The (one-layer) model we're studying

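Using a one-layer transformer has the following advantages.

1) It likely contains fewer "true features", so a dictionary can plausibly cover all of them and is easier to train.
2) It can be trained with less data.
3) It is easier to understand the effect of each feature on the model's output logits.

The experiments train and use two one-layer transformers (A and B) that differ only in the random seed used for weight initialization.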

Notation for Features

"A/1/2357"

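A: the model the feature was extracted from (A or B)
1: the dictionary learning run
- Runs differ in the number of learned features and the L1 coefficient
2357: the specific feature within that run

A neuron of the transformer, as opposed to an autoencoder feature, is written as "A/neurons/32".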

Interface for Exploring Features

https://transformer-circuits.pub/2023/monosemantic-features/vis/index.html


Detailed Investigations of Individual Features

The paper's central claim is that dictionary learning can extract features that are far more monosemantic than neurons.

This section gives detailed examples supporting that claim.

The example features are the following.

Arabic script
DNA sequence
base64 string
Hebrew script

For each learned feature, the following claims are made.

1) The learned feature activates with high specificity for the hypothesized context.
2) The learned feature activates with high sensitivity for the hypothesized context.
3) The learned feature causes appropriate downstream behavior.
4) The learned feature does not correspond to any single neuron.
5) The learned feature is universal, meaning a similar feature is found when dictionary learning is applied to a different model.

To demonstrate claims 1-3, a computational proxy is devised for each context.

The proxy is a score that estimates how likely a token is to belong to a given context.
- The log-likelihood of a string under the hypothesis vs. its log-likelihood under the full data distribution
- $\log(P(s|\text{context})/P(s))$
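The log-likelihood ratio above can be sketched with simple unigram models. This is an assumption-heavy toy: the summary does not say which probability models are used, so the add-one-smoothed unigram estimate, the function names, and the placeholder tokens (`AR_al`, `AR_min` standing in for Arabic-script tokens) are all illustrative choices of this note.

```python
import math
from collections import Counter

def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def log_likelihood_ratio(s, p_context, p_all):
    """log(P(s | context) / P(s)) for a string s given as a list of tokens."""
    return sum(math.log(p_context[t] / p_all[t]) for t in s)

# Placeholder tokens: "AR_*" stand in for Arabic-script tokens.
all_tokens = ["the", "of", "AR_al", "the", "AR_min", "and", "the", "of"]
context_tokens = ["AR_al", "AR_min", "AR_al"]   # text known to be in the context
vocab = sorted(set(all_tokens))

p_all = unigram_probs(all_tokens, vocab)
p_context = unigram_probs(context_tokens, vocab)

score_in_context = log_likelihood_ratio(["AR_al"], p_context, p_all)
score_out_of_context = log_likelihood_ratio(["the"], p_context, p_all)
print(score_in_context, score_out_of_context)
```

A positive score means the token is more likely under the context hypothesis than under the overall distribution, which is the sense in which the proxy labels a token as belonging to the context.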

We believe the specificity of each feature is what matters for having removed polysemanticity.

If the proxy can show that a feature activates only in a rare, specific context, we can conclude that polysemanticity has been removed.

Arabic Script Feature

Activation specificity

First, we show that the extracted feature A/1/3450 activates only on Arabic script.

Each token is given a score based on the likelihood ratio.
- $\log(P(s|\text{Arabic Script})/P(s))$
Arabic text tokens make up only about 0.13% of the overall distribution, but 81% of the tokens on which the feature is active.
- They make up about 25% when the feature is weakly active and 98% when its activation is above 5.

For the part of the activation spectrum above 5, the feature shows high specificity for Arabic script.

For the lower portion, three hypotheses are offered.

1) The proxy is imperfect
- It produces false negatives because of commonly used characters that live in other Unicode blocks.
2) The model may be imperfect
- If the feature activates according to the model's "confidence" that something is present, weak activations can give wrong results.
3) The autoencoder may be imperfect
- The autoencoder may not be wide enough to hold the "true features".

One way to evaluate false positives at low activation levels is to use an "expected value plot".

Larger activations have a larger effect on the model's predictions, so the distribution of feature activations is shown weighted by activation.

Activation sensitivity

The examples above make it easy to see that A/1/3450 is not sensitive to all Arabic script.

For example, it does not respond to some prefixes.

Exactly at those positions, however, another feature that responds to Arabic script, A/1/3134, activates.

There are additional features that respond to Arabic script (A/1/1466, A/1/3134, A/1/3399).
This phenomenon is covered further in the Phenomenology section below.

Even so, the feature's activation and the Arabic script proxy have a Pearson correlation of 0.74.

Feature downstream effects

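This part shows that the learned features have understandable effects on the model's output.

That indicates the features do not merely describe properties of the underlying data but are tied to the functional role of the MLP.

First, consider a linear approximation of each feature's effect on the model's logits.

The logit weights are obtained with the path expansion approach from "A Mathematical Framework for Transformer Circuits".
- Each feature direction is multiplied by the MLP output weights, an approximation of the layer norm is applied, and the result is multiplied by the unembedding matrix.
- For easier visualization, the weights are shifted so that the median logit weight is 0.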
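The logit weight computation described above can be sketched as two matrix products followed by a median shift. This toy deliberately skips the layer norm approximation, and the matrix names `W_out` (MLP output weights) and `W_U` (unembedding) and their shapes are assumptions of this note.

```python
import statistics

def logit_weights(d, W_out, W_U):
    """Linear effect of one feature direction on every output logit.

    d     : feature direction in MLP activation space (length d_mlp)
    W_out : MLP output weights, shape d_mlp x d_model (assumed layout)
    W_U   : unembedding matrix, shape d_model x vocab (assumed layout)
    The layer norm approximation from the text is omitted in this sketch.
    """
    d_model = len(W_out[0])
    vocab = len(W_U[0])
    # Map the feature direction into the residual stream.
    resid = [sum(d[k] * W_out[k][m] for k in range(len(d))) for m in range(d_model)]
    # Map the residual-stream direction onto the vocabulary logits.
    logits = [sum(resid[m] * W_U[m][v] for m in range(d_model)) for v in range(vocab)]
    # Shift so that the median logit weight is zero, as in the text.
    med = statistics.median(logits)
    return [l - med for l in logits]

d = [1.0, 0.0]
W_out = [[1.0, 0.0],
         [0.0, 1.0]]
W_U = [[3.0, 1.0, -1.0],
       [0.0, 0.0, 0.0]]
weights = logit_weights(d, W_out, W_U)
print(weights)
```

Tokens with large positive weights are the ones the feature pushes up when it is active; tokens near zero are largely unaffected.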

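When a feature is active, it raises the probability of some output tokens and lowers the probability of others.

In the logit weight distribution below, the second mode on the right, whose tokens become more likely when the chosen feature is active, is associated with Arabic.

To visualize the effect on real data, the feature is ablated.

The model is run up to the MLP layer and the output activations are decoded into features. The activation of A/1/3450 is then subtracted off and the result is passed through the rest of the model.
In the visualization below, blue underlines mark tokens whose probability the ablation increased, and red underlines mark tokens whose probability it decreased.

As a second way to check downstream effects, the model is sampled with the feature's activity "pinned" to a high value.

Samples are first drawn from the prefix 1,2,3,4,5,6,7,8,9,10, for which the model is expected to have an obvious continuation. A/1/3450 is then pinned to its maximum observed value to see how the samples change.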
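Both interventions described above, ablating a feature and pinning it to a fixed value, amount to adding a multiple of the feature's decoder direction to the MLP activation. The sketch below assumes the feature activations `f` and decoder matrix `W_d` are already available; the function names and toy numbers are hypothetical.

```python
def ablate_feature(x, f, W_d, i):
    """Remove feature i's contribution: x' = x - f_i * d_i.

    x   : MLP activation vector
    f   : feature activations for x (encoder output)
    W_d : decoder matrix whose column i is the direction d_i
    """
    return [x_k - f[i] * W_d[k][i] for k, x_k in enumerate(x)]

def pin_feature(x, f, W_d, i, value):
    """Force feature i to `value`: x' = x + (value - f_i) * d_i."""
    return [x_k + (value - f[i]) * W_d[k][i] for k, x_k in enumerate(x)]

W_d = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 0.0]]        # columns are the feature directions
x = [-3.0, 1.0]
f = [0.0, 1.0, 3.0]            # toy feature activations for x

x_ablated = ablate_feature(x, f, W_d, i=2)          # drop feature 2
x_pinned = pin_feature(x, f, W_d, i=0, value=5.0)   # pin feature 0 to 5
print(x_ablated, x_pinned)
# The modified activation would then be fed through the rest of the model.
```

Editing the activation directly, rather than re-decoding from features, leaves whatever the autoencoder fails to reconstruct untouched, so only the chosen feature changes.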

The Feature Is Not a Neuron

Next, we check whether the feature found by dictionary learning is simply a monosemantic neuron in disguise.

First, the top 20 maximally activating examples of every neuron were examined. Only one neuron had any Arabic among them, and even for that neuron only one of its top 20 examples was Arabic.
Then, looking at the feature in the neuron basis (figure below), the three largest coefficients were all negative.

But could these neurons interact so that a single neuron ends up with a high activation?

To check, the neuron most correlated with the feature's activation (A/neurons/489) is identified and its activations are visualized.

As the figure above shows, Arabic accounts for only a very small fraction.

Logit weight analysis gives the same result.

The neuron responds to many different languages, and the weights of Arabic tokens are skewed only very slightly to the positive side.

Finally, the correlation between feature A/1/3450 and the neuron is checked with a scatter plot.

There is some correlation, but it is not large.
The logit weights are not well separated along the x-axis but are well separated along the y-axis.

In conclusion, the feature does not correspond to any single neuron.

Universality

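This part examines whether feature A/1/3450 is a universal feature that is consistently observed in other models.

Searching B/1 for the most similar feature yields B/1/1334, with a correlation of 0.91.

The activations and logit weights have strikingly similar distributions.

Feature ablation gives similar results as well.

For a more direct comparison, the activations and logit weights are visualized with scatter plots.

For the logit weights, the shared outlier mode appears to be the important part.

The other weights in the center would ideally all be 0, but we conjecture that superposition with other features prevents them from being exactly 0.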

---

Global Analysis

How Interpretable is the Typical Feature?

This section uses three different approaches to evaluate how understandable features are compared with neurons.

Human analysis
Two automated interpretability methods

Manual human analysis

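Interpretability is measured through subjective human evaluation.

The rubric covers confidence in the explanation, consistency of the explanation with the activations, consistency of the explanation with the logit weights, and specificity.
Samples were drawn uniformly across the spectrum of feature activations, and each activation interval was scored separately.
In total, 412 feature activation intervals were scored across 162 features and neurons.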

Automated interpretability - Activations

An LLM (Claude) is used to generate an explanation of each feature from example tokens on which it activates. The model is then asked to use that explanation to predict activations on unseen tokens.

Concretely, for each feature the Spearman correlation coefficient between the true and predicted activations is measured.
Each of 60 dataset examples consists of 9 tokens, giving 540 predictions per feature.
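The scoring step above, the Spearman correlation between true and predicted activations, can be sketched without external libraries as a Pearson correlation computed on ranks. Ties get average ranks, which matters here because sparse features are exactly zero on many tokens. The function names and toy numbers are illustrative only.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

true_acts = [0.0, 1.0, 2.0, 5.0]          # toy "real" activations
predicted = [0.0, 10.0, 20.0, 1000.0]     # toy predictions, same ordering
print(spearman(true_acts, predicted))
```

Because only the ordering matters, a prediction on a very different scale still scores perfectly as long as it ranks the tokens the same way.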

Automated interpretability - Logit weights

Using the explanation generated in the previous section, the model is asked to judge whether an unseen logit token is one that the feature would predict to come next.

The test set is a 50/50 mix of top positive logit tokens and randomly selected logit tokens.

Random guessing gives 50%, whereas for features the accuracy was 74%.

Activation interval analysis

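A feature interval is the set of examples whose activations are closest to each of a set of evenly spaced activation levels.

In other words, interpretability is analyzed as a function of activation strength.

The results show that the more strongly a feature activates, the more consistent it is with its interpretation.

This also suggests that the learned features may not be exactly right.
- If the feature that was actually learned differs slightly from the feature we hoped to learn, the difference can show up as inconsistent behavior in the lower activation intervals.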

Caveats

Feature activations are skewed toward the lower intervals.
The evaluated features were sampled uniformly from all features, and interpretability may be correlated with importance.
- Features with large activations tend to be more interpretable than randomly selected features.

These caveats were judged not to affect the experimental results.

How much of the model does our interpretation explain?

To what extent do the interpretable features explain the MLP?

One way to answer is to ask how much of the loss is accounted for by the features.

For A/1, 79% of the log-likelihood loss reduction provided by the MLP layer is recovered by the features.
In other words, replacing the MLP activations with the autoencoder's output incurs only 21% of the loss increase caused by removing the MLP entirely.
- The loss can be reduced further by using more features or a smaller L1 coefficient.
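The 79% / 21% relationship above follows from a simple ratio of three losses. The sketch below shows the arithmetic; the loss values are hypothetical numbers chosen only to reproduce the percentages, and the function name is made up for this note.

```python
def fraction_of_loss_recovered(loss_original, loss_zero_ablated, loss_with_sae):
    """Share of the MLP's loss reduction that survives when the MLP
    activations are replaced by the autoencoder's reconstruction.

    loss_original     : loss of the unmodified model
    loss_zero_ablated : loss with the MLP output removed entirely
    loss_with_sae     : loss with MLP activations replaced by the SAE output
    """
    gap = loss_zero_ablated - loss_original
    if gap <= 0:
        raise ValueError("ablating the MLP should increase the loss")
    return (loss_zero_ablated - loss_with_sae) / gap

# Hypothetical losses, picked so the result matches the 79% in the text.
recovered = fraction_of_loss_recovered(
    loss_original=3.00, loss_zero_ablated=4.00, loss_with_sae=3.21
)
incurred = 1.0 - recovered   # share of the ablation penalty still paid
print(round(recovered, 2), round(incurred, 2))
```

The two numbers are complements by construction, which is why the text can state the result either as 79% recovered or as 21% of the ablation penalty incurred.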

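Answering the question through the loss alone can be misleading.

Features appear to follow a long-tailed distribution, so as the explained loss grows, more and more features are needed to explain the remainder.
In addition, not all features are monosemantic or interpretable.

From an automated interpretability perspective, a better metric could be constructed.

One option is to replace the original activations with the activations predicted from the explanations.

How much of the model such feature-based interpretations explain still requires a lot of further research.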

---

Phenomenology

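This section analyzes the model from the standpoint of what actually happens inside the one-layer model.

It first discusses basic motifs among the features.

It then compares the features with those learned in other dictionary learning runs.

This suggests that features are universal and that dictionary learning is a feature-splitting procedure that reflects the geometry of superposition in the model.

Finally, it explores how features are connected to one another in the manner of "finite state automata".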

Feature Motifs

What kinds of features can be seen in the model?

Context features (DNA, base64) / token-in-context features (< in HTML)
- In A/4, hundreds of features respond to "the" in different contexts. This is covered in more detail under feature splitting.
"Trigram" features (19 in COVID-19)
- These could be implemented with attention alone, but the model also uses the MLP.
Every feature can be understood as both an "action feature" and an "input feature".
- For example, the base64 feature responds to base64 strings and at the same time raises the probability of generating base64 strings.

Feature Splitting

One surprising finding is that features appear in clusters.

As the number of learned features increases, more features form for a single concept. This is called feature splitting.
The figure below is a 2-D UMAP of A/0, A/1, and A/2.

Similar features have smaller angles between their dictionary vectors.

This can be understood as follows: similar features lead to similar behavior in the model, so they affect neuron activations in similar ways.

This matters for interpreting the model.

1) It means that choosing the "right number of features" is not critical.
- With fewer features, each feature can be read as a "summary" of the finer ones.
2) It helps explain unusual features observed in the model, such as "collapsed" or "split" features.
3) Finally, it means that existing theories of superposition have not explored features with related, shared "actions".

---

Features which seemed like bugs

"Bug" 1: Single-Token Features

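When dictionary learning is restricted to very few sparse features, some features respond to only a single token.
- Such behavior could be learned more efficiently from simple bigram statistics, with no need for the MLP.
In practice, however, these turn out not to be features that simply respond to the token P, but features that respond to P in different contexts.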

"Bug" 2: Multiple Features for a Single Context

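Conversely, multiple features sometimes seem to learn exactly the same context.
For example, A/1 has three features that respond to base64.
- This can be understood as feature splitting.

First, compare A/1/2357 and A/1/2364.
- Comparing their logit weights shows that the two predict similar tokens.
- One difference is that the feature that fires on digits does not predict a digit as the next token.

This is presumably due to tokenization.
- The tokens are never split as [Bq] [8] [9] [mp]. They are always split as [Bq] [89] [mp], which is why the feature that responds to digits does not predict a digit as the next token.
What, then, does the third feature, A/1/1544, represent?
- A/1/1544 tends to prefer base64 that encodes ASCII text.

Understanding the model's behavior coarsely through clusters of features first, and then examining subtle differences through finer features, is a useful approach that carries over directly to larger models.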

Universality

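Universality is an important question. It lets results from studying one model be generalized, and it makes the extracted features reproducible. This section addresses two questions.

1) Measure universality between the two one-layer models.
2) Explore a broader form of universality using features reported for other models.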

Comparing features between two one-layer transformers

Comparing features requires a model-independent way of representing them.

The first approach treats a feature as a function that assigns a value to each datapoint.
- The feature is represented as a vector whose entries correspond to a fixed set of data points.
- The correlation between such vectors is called activation similarity.
The second approach defines a feature by its downstream effects.
- This can be approximated by the logit weights, with each entry corresponding to a vocabulary token.
- The correlation between such vectors is called logit weight similarity.
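The activation similarity defined above can be sketched as a Pearson correlation between activation vectors recorded on the same fixed tokens, followed by a best-match search across the two dictionaries. The same routine would apply to logit weight vectors. Function names and the toy vectors are assumptions of this note.

```python
import math
import statistics

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def best_match_correlations(feats_a, feats_b):
    """For each feature of model A, the highest correlation with any feature of B.

    feats_a[i] / feats_b[j] are activation vectors over the same fixed tokens.
    """
    return [max(pearson(fa, fb) for fb in feats_b) for fa in feats_a]

# Toy activation vectors over 4 shared tokens.
feats_a = [[0.0, 1.0, 2.0, 0.0],
           [1.0, 0.0, 0.0, 1.0]]
feats_b = [[0.0, 2.0, 4.0, 0.0],
           [0.0, 0.0, 1.0, 1.0],
           [2.0, 0.0, 0.0, 3.0]]

best = best_match_correlations(feats_a, feats_b)
median_similarity = statistics.median(best)
print(best, median_similarity)
```

Taking the median over all features of one dictionary gives a single summary number for how well the two dictionaries line up.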

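The graph below is for the Arabic feature discussed earlier.

Beyond the Arabic feature, all other features are compared as well.

Pairing each feature in A/1 with its most correlated feature in B/1, the median correlation is 0.72.
For neurons, by contrast, the median is only 0.46.
- Some of the low activation correlations can be attributed to "feature splitting" or to different "true features" having been learned.

The next question is whether features that fire on the same tokens also have the same logit weights.

As the graph above shows, the important tokens receive high weights in both models, but the many tokens with small effects, which look like isotropic noise, pull the correlation down.

This tendency is widely observed for other features as well.

We next examine the effect features actually have on token probabilities.

An attribution vector is computed by scaling a feature's activation vector by the logit weights of the tokens. It takes into account both the feature's activity and its effect on the loss.
The correlation between these attribution vectors is defined as attribution similarity.
A high attribution similarity shows that features active in both models are useful for predicting the same tokens.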

Comparing features with the literature

This part checks whether the same features are also observed in models with a different architecture that were trained differently on the same data.

Many features resemble those found in a one-layer SoLU model.
- base64, hexadecimal, all caps neurons, etc.
They also resemble features found in the model studied by Smith.
- German detector, title case detector, etc.
At a more abstract level, many features resemble those reported for multimodal models (Goh et al.).
- Australia feature, Canada feature, Africa feature, Israel-Palestine feature, etc.
- In contrast, no features related to emotion neurons were found.

"Finite State Automata"

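One interesting property of the one-layer model's features is that they behave like "finite state automata".

For example, one feature raises the probability of a token, and that token in turn causes another feature to fire, repeatedly.

All features shown were selected from A/0. A/1 shows more complex patterns.

The simplest case is a feature that excites itself.

Another case is a two-node system. This is common in languages where a single character is split into two Unicode pieces.

HTML even shows a four-node system.

One interesting behavior is that features show a tendency to memorize sentences.

This can be seen as an example of a mechanistic theory of memorization.
It can also be seen as an example of anomaly detection, where the model behaves differently in very specific, fine-grained cases.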

Discussion

Theories of Superposition

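Most of what is understood about superposition came from toy models.

This study shows that the earlier toy models left out several important points.

Correlated features fire together and produce similar actions.
- Such features point in geometrically similar directions.
It is also not clear that features are one-dimensional.

Nevertheless, the experiments support the superposition hypothesis and the linear representation hypothesis.

Interpretable features could be found, and the activation level appears to correspond to the feature's "confidence".
Most logit weights are understandable, and many "interference weights" can be interpreted as a result of superposition.
The results also show that finding the right number of features is not a major problem.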

Are "Token in Context" Features Real?

A frequently observed motif is the "token-in-context" feature.

For some tokens this makes sense.
- die in German ("the"), die in English ("death" or "dice")
For many tokens it does not.
- "the" in Physics, "the" in mathematics, ...

Why do we observe a "local code" instead of a "compositional code"? There are two hypotheses.

1) The transformer uses a compositional code, but dictionary learning produces features that use a local code.
2) The transformer really does use a local code, and dictionary learning represents it accurately.

We find the latter more plausible. The model may separate contexts with a local code in order to make more accurate predictions.
