[논문] Deep Learning Recommendation Model for Personalization and Recommendation Systems (2019)

https://arxiv.org/abs/1906.00091

개인화와 추천 시스템에서 사용되는 deep learning networks들은 다른 deep learning networks와 다릅니다. 특히 범주형 특성을 다룰 수 있어야 하고 이는 아직 잘 연구되고 이해되지 않은 분야입니다. 본 논문에서는 범주형 특성을 활용한 state-of-the-art deep learning recommendation model을 소개합니다.

1 Introduction

개인화와 추천을 위한 모델의 구조에 영향을 준 두가지 관점

Veiw of recommendation systems
- 초기 추천 시스템은 전문가들 제품을 카테고리로 분류하고 고객이 원하는 카테고리를 선택하는 방식
- + Collaborative filtering: 고객의 과거 선택을 이용한 추천
- + Neighborhood methods: 고객과 제품을 그룹핑하는 방법
Predictive analytics
- 통계적 모델로 확률을 예측하는 방법
- Linear 또는 logistic regression 방법에서 deep networks를 사용하도록 진화해 옴
- 범주형 데이터를 사용하기 위해 embedding 기법을 사용
  - one-hot 또는 multi-hot vector를 latent space로 맵핑

본 논문에서는 위 두 관점을 융합한 개인화 모델을 제시한다.

범주형 데이터를 나타내는 sparse feature는 embedding을 사용
Dense feature는 multilayer perceptron (MLP)를 사용
최종적으로 또 다른 MLP를 이용해 통계적인 확률을 구함

본 논문에서는 이 모델을 DLRM이라 지칭한다.

2 Model Design and Architecture

2.1 Components of DLRM

2.1.1 Embeddings

범주형 데이터를 abstract space에 있는 dense representation으로 맵핑시키는 embedding 기법을 소개한다.

범주형 데이터는 sparse vector (one-hot vector 또는 multi-hot vector)로 나타내진다.

이 sparse vector는 embedding table lookup을 통해 dense vector로 mapping 된다.

Mapping one-hot vector $ e_i = [0,..., 1,..., 0] $
- $ W \in \mathbb{R}^{m \times d} $

$$ \mathbf{w}_i^T = \mathbf{e}_i^T W $$

Mapping multi-hot vector $ \mathbf{a}^T = [0, \ldots, \mathbf{a}_{i_1}, \ldots, \mathbf{a}_{i_k}, \ldots, 0] $
- $ A = [a_1, \ldots, a_t] $

$$ S = A^T W $$

2.1.2 Matrix Factorization

Terms

i-th product (1, ..., n): $ \mathbf{w}_i \in \mathbb{R}^d $
- i-th product in latent factor space
- 각 element는 유저의 선호에 영향을 미치는 제품의 특성을 나타냄 (ex: category, brand, ...)
j-th user (1, ..., m): $ \mathbf{v}_j \in \mathbb{R}^d $
- i-th user in latent factor space
- 각 element는 제품 특성에 대한 유저의 선호도를 나타냄
rating of i-th product by j-th user: $ r_{ij} \in \mathbb{R} $
Set S: 제품을 평가한 유저 집합
- Set S는 (i, j) tuples로 구성되어 있음

목표:

Set S를 이용해 set S에 존재하지 않는 rating pair $ r_{ij} $를 예측

Matrix factorization 접근 방법:

$$ \min \sum_{(i,j) \in S} \left( r_{ij} - \mathbf{w}_i^T \mathbf{v}_j \right) $$

그 후 $ W^T = [\mathbf{w}_1, \ldots, \mathbf{w}_m] $ 와 $ V^T = [\mathbf{v}_1, \ldots, \mathbf{v}_n] $ 를 사용해 full matrix $ R = [r_{ij}] $ 를 근사:

$$ R \approx W V^T $$

이때 W와 V는 각 행이 user/proejct latent factor space인 embedding table로 해석될 수 있다.

각 행의 내적은 유의미한 rating 예측값을 제시하고 이는 factorization machine과 DLRM에서 주요한 요소이다.

2.1.3 Factorization Machine

분류 문제에서는 예측 함수 $ \phi : \mathbb{R}^n \to T $ 를 구하는 것이 목표이다. ($ x \in \mathbb{R}^n $, $ y \in T $)

Factorization machine은 linear term과 second-order interaction term을 활용해 다음과 같은 예측 함수를 제시한다:

$$ \hat{y} = b + \mathbf{w}^T x + x^T \operatorname{upper}(VV^T)x $$

$ V \in \mathbb{R}^{n \times d}, \mathbf{w} \in \mathbb{R}^n, b \in \mathbb{R} $

Second-order interation이란 데이터가 pairwise로 결과에 영향을 주는 것을 뜻한다.

Factorization machine에서는 second-order interaction matrix를 latent factors로 분해해 Support vector machine (SVM) 보다 sparse 한 데이터를 더 잘 처리할 수 있다. 또 linear computational complexity를 가져 효율적인 계산이 가능하다.

2.1.4 Multilayer Perceptrons

Deep learning 모델의 가장 기본 모델인 multilayer perceptron (MLP)의 예측 함수는 다음과 같다:

$$ \hat{y} = W_k \sigma \big( W_{k-1} \sigma \big( \dots \sigma \big( W_1 x + b_1 \big) \dots \big) + b_{k-1} \big) + b_k $$

$ \text{matrix} \ W_l \in \mathbb{R}^{n_l \times n_{l-1}}, \quad \text{bias} \ b_l \in \mathbb{R}^{n_l} \quad \text{for layer} \ l = 1, \ldots, k. $

2.2 DLRM Architecture

위에서 설명된 모델들을 합쳐 개인화 모델을 만든다.

유저와 제품은 수많은 연속적인 특징과 범주 특징으로 나타내진다 가정

1. Input

범주 특징은 같은 차원의 embedding vector로 치환 (2.1.2)
연속적인 특징은 MLP를 이용해 embedding vector와 같은 길이를 가지는 dense feature로 치환 (2.1.4)

2. Second-order interactions

FM (2.1.3) 기법에서 영감을 받아 second-order interaction을 계산
- embedding vector와 processed dense feature의 모든 pair 사이에 dot product를 계산

3. Concatenation and post-processing

계산된 dot products는 original processed dense feature에 추가되어 또 다른 MLP를 통해 process 됨
- 최종적으로 sigmoid function을 통해 확률을 출력함

DLRM에 사용된 operators:

2.3 Comparison with Prior Models

DLRM vs. Wide and Deep, DeepFM, xDeepFM

DLRM에서는 second-order interaction 까지만 계산
- 보다 높은 order interaction은 computational/memory cost 만큼 가치 있지 않음
DLRM은 각 feature vector를 하나의 unit으로 취급
- Interaction이 서로 다른 feature 사이에만 계산됨
- Computational cost 감소

3 Parallelism

Complexity

DLRM는 CNN, RNN, GAN 보다도 많은 parameter를 사용
대부분의 parameter는 bottom MLP 인 embedding에 사용됨
Embedding table 하나에 수 GB

Solution

Model parallelism
- 여러 device에 model의 parameter를 나누어 저장
- 각 device는 embedding의 일부분만 저장하고 계산함
- Embedding 계산에 필요한 excessive memory 해결
Data parallelism
- Data를 mini-batches로 나누어 여러 device에 분배 후 동시에 process
- 계산적으로 무거운 MLP forward/backward computation 해결

Top MLP는 모든 embedding에 접근할 수 있어야 함.

Butterfly shuffle을 이용해 all-to-all communication 전략 사용

4 Data

데이터는 세가지 종류로 분류

Random
Synthetic
Public

Random과 Synthetic은 system perspective 실험을 위해 사용

Public은 실제 모델의 정확도를 계산하기 위해 사용

4.3 Public

Click logs for ad CTR prediction

Criteo AI Labs Ad Kaggle
Criteo Ad Terabyte

5 Experiments

GitHub: https://github.com/facebookresearch/dlrm

GitHub - facebookresearch/dlrm: An implementation of a deep learning recommendation model (DLRM)

An implementation of a deep learning recommendation model (DLRM) - facebookresearch/dlrm

github.com

5.1 Model Accuracy on Public Data Sets

Criteo Ad Kaggle 데이터 셋에 대해 DLRM 과 Deep and Cross network (DCN)의 정확도를 비교

Data set에 존재하는 feature에 맞게 모델 사이즈가 조절됨

DLRM
- Bottom MLP: Three hidden layers with 512, 256, 64 nodes.
- Top MLP: Two hidden layers with 512, 256 nodes
- Embedding dimension: 16
DCN
- Six corss layers
- Deep network with 512, 256 nodes

SGD와 Adagrad optimizer를 사용한 결과 DLRM이 약간 더 높은 정확도를 보임

5.2 Model Performance on a Single Socket/Device

간단한 sample 모델을 사용해 테스트를 진행

8 categorical features
- Processed thorugh an embedding table with 1M vectors
- 64 dimensions
512 continuous features
- Assembled into a vector of dimensions 512
Bottom MLP: Two layers
Top MLP: Four layers
2048K random samples with 1K mini-batches

결과

Caffe2로 구현한 모델 실행 결과 CPU에서 256초, GPU에서 62초 걸림
대부분의 시간이 embedding lookups와 fully connected layers에 사용

6 Conclusion

본 논문에서는 범주형 데이터를 사용한 새로운 deep learning-based 추천 모델을 제시한다.

'논문 > Knowledge Tracing' 카테고리의 다른 글

[논문] Structure-based Knowledge Tracing: An Influence Propagation View (1)	2025.01.13
[논문] Knowledge Tracing: A Survey (2023) (1)	2025.01.06
[논문] Context-Aware Attentive Knowledge Tracing (2020) (0)	2024.12.31
[논문] Leveraging Skill-to-Skill Supervision for Knowledge Tracing (2023) (2)	2024.12.27
[논문] Implementing Learning Principles with a Personal AI Tutor: A Case Study (2023) (1)	2024.12.23

guungyul 님의 블로그

[논문] Deep Learning Recommendation Model for Personalization and Recommendation Systems (2019)

1 Introduction