NLP Word2Vec/SGNS

NLP & DL

  • Builds a word embedding for general-purpose use, not for one specific task.
  • How the embedding works:

    • It captures the context (meaning) of each word within a sentence.
    • In other words, it is a semantic method.

  • Note:

    • Distributional hypothesis:
    • Words that appear in the same context, i.e. in similar positions, have similar meanings. Words occupying similar positions in a text are therefore judged to have high similarity to one another.

    | Frequency-based document vectorization | Learning-based document vectorization |
    | --- | --- |
    | Count-based methods | Prediction-based methods |
    | Fast | Can capture even the complex characteristics of words |
    | Bag of Words (BoW), SVD on TF-IDF, etc. | Word Embedding / Word2Vec |
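
    For reference, a minimal sketch of the count-based column above (an illustration only: the toy corpus is made up, and scikit-learn >= 1.0 is assumed for get_feature_names_out):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    
    corpus = ["alice follows the white rabbit", "the rabbit is late"]
    
    bow = CountVectorizer()
    print(bow.fit_transform(corpus).toarray())   # raw word counts per document
    print(bow.get_feature_names_out())           # the vocabulary behind the columns
    
    tfidf = TfidfVectorizer()
    print(tfidf.fit_transform(corpus).toarray()) # counts reweighted by rarity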


Word2Vec

| Word2Vec | Word Embedding |
| --- | --- |
| Used for general purposes rather than one specific task. A technique that trains on a huge volume of arbitrary documents (corpora) and vectorizes (numericizes) words so that they carry relationships to one another; the resulting word meanings are therefore general-purpose. | A scheme trained anew each time to achieve a specific goal such as classification; words are vectorized to fit that particular purpose. |
| Unlike a task-specific Word Embedding, which is determined after the fact, Word2Vec is trained in advance and vectorizes each word with reference to its context. It uses the distributional hypothesis: "a word's meaning is grasped through the distribution of the words around it". Each word is numericized by referring to its surrounding words (context), so it becomes related to its neighbors, and neighboring words end up with highly similar word vectors. | Embedding vectors are determined post hoc and are limited to the specific purpose they were trained for. |
| Methods: continuous bag of words (CBOW), Skip-gram | |

  • Representative Word2Vec models >

    | CBOW | Skip-Gram |
    | --- | --- |
    | Preprocess sentences into words. | Preprocess sentences into words. |
    | The network is built so that the word to be numericized comes out as the output and its surrounding words go in as the input. Order: (input) several context words, e.g. alic, bit, ... → hidden layer → (output) hurt. The hidden layer is therefore an intermediate output. | CBOW reversed: 1 input, several outputs. Order: (input) hurt → hidden layer → (output) alic, bit, and other context words. Just as with an AE, where the whole model is trained and then the autoencoding part is pulled out and reused for a given purpose, Skip-Gram is used the same way. |
    | In short: predicts one word from several. | In short: predicts several words from one. |
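
    Below is a hedged sketch, not from the original notes, of how the two models slice the same window into training pairs (toy, pre-stemmed tokens; window size 1 assumed):

    sentence = ["alic", "bit", "hurt", "arm", "badli"]  # toy, pre-stemmed tokens
    window = 1
    
    cbow_pairs, skipgram_pairs = [], []
    for i in range(window, len(sentence) - window):
        context = sentence[i - window : i] + sentence[i + 1 : i + 1 + window]
        center = sentence[i]
        cbow_pairs.append((context, center))              # many -> one
        skipgram_pairs += [(center, c) for c in context]  # one -> many
    
    print(cbow_pairs)      # [(['alic', 'hurt'], 'bit'), ...]
    print(skipgram_pairs)  # [('bit', 'alic'), ('bit', 'hurt'), ...]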

  • Principle >

    For example:

    • (input) one-hot encode hurt so that its position (index) is identified, then
    • (output) train the network so that, when hurt goes in, it finds the position (index) of the output word alic.
  • At prediction time, the more convenient of the two is Skip-Gram:
  • feed in just one word and the outputs come out (a small numpy sketch of the principle follows).
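
    A minimal numpy sketch of the one-hot mechanics above (toy sizes, made-up weights; not code from the original notes): the one-hot input simply selects one row of the input-to-hidden weight matrix W, so after training, row i of W is the word vector of word i.

    import numpy as np
    
    vocab_size, emb_dim = 5, 2
    W = np.random.rand(vocab_size, emb_dim)   # input -> hidden weights
    
    hurt_idx = 3                              # hypothetical index of 'hurt'
    one_hot = np.eye(vocab_size)[hurt_idx]    # [0., 0., 0., 1., 0.]
    
    hidden = one_hot @ W                      # equals W[hurt_idx]
    print(np.allclose(hidden, W[hurt_idx]))   # True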

  • Drawbacks >
  • Cannot properly distinguish homonyms.

    • Because word2vec actually computes a word's position as an average over the values of nearby words.
    • Solution: ELMo
    • When embedding, it builds vectors dynamically according to the surrounding text, i.e. contextualized word embeddings.
  • The output layer uses softmax, which is computationally expensive.

    • Why softmax is used: to produce the one-hot-style output.
    • But to express every word as a value between 0 and 1, softmax must compute over the entire vocabulary; with a large vocab (30,000+ words) that is a lot of computation.
    • Solution: Skip-Gram Negative Sampling (SGNS)

      • SGNS uses sigmoid instead.
  • OOV (Out Of Vocabulary)

    • Solution: FastText (see the sketch after this list)

      • Has a good chance of resolving the OOV problem even for low-frequency words.
  • Does not take the document as a whole into account.

    • Solution: GloVe

      • Mixes a frequency-based method (TF-IDF) with a learning-based method (embedding).
      • TF-IDF uses statistics over the whole document but cannot capture the meaning of individual words, while Word2Vec uses only surrounding words and so cannot account for the document as a whole; GloVe addresses both drawbacks.
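
    A hedged sketch of the FastText/OOV point (assuming gensim >= 4.x; common_texts is gensim's tiny built-in toy corpus): FastText composes vectors from character n-grams, so it can embed an unseen word, while plain Word2Vec cannot.

    from gensim.models import Word2Vec, FastText
    from gensim.test.utils import common_texts  # tiny built-in toy corpus
    
    w2v = Word2Vec(sentences=common_texts, vector_size=10, min_count=1)
    ft = FastText(sentences=common_texts, vector_size=10, min_count=1)
    
    print(ft.wv["computation"][:3])   # OOV word: still gets a vector via n-grams
    print("computation" in w2v.wv)    # False: plain Word2Vec cannot embed it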


  • CODE

    • SGNS + CNN
    • Vectorize the words used in the novel Alice in Wonderland into 2-dimensional features.
    • Pictures first:

    (figures: the trigram network diagram and the resulting 2D word-embedding plot)

    The network is fed the center word's value.

    • It is the network in which inputting the x value 7 produces 8 as the output y value.
    • After training the whole model, the latent layer squeezed down to 2 neurons is pulled out separately,
    • and when the resulting x and y coordinates are drawn on a 2D plt plot, words with contextually close meanings can be seen clustered together.

code

STEP 1

  • Load the packages

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    import matplotlib.pyplot as plt
    import nltk
    import numpy as np
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import string
    from nltk import pos_tag
    from nltk.stem import PorterStemmer
    import collections
    from tensorflow.keras.layers import Input, Dense, Dropout
    from tensorflow.keras.models import Model

  • Preprocessing

    def preprocessing(text):  # one line (sentence) comes in
        
        # step1. remove special characters
        text2 = "".join([" " if ch in string.punctuation else ch for ch in text]) 
        # for ch in text: look at each character of the sentence and turn
        # string.punctuation characters ([!@#$% etc.]) into spaces, i.e. remove them
        tokens = nltk.word_tokenize(text2)
        tokens = [word.lower() for word in tokens] # lowercase the tokens that survived the removal
    
        # step2. remove stopwords
        stopwds = stopwords.words('english')
        tokens = [token for token in tokens if token not in stopwds] # keep only tokens that are not stopwords
    
        # step3. keep only words with 3 or more letters
        tokens = [word for word in tokens if len(word) >= 3] 
    
        # step4. stemmer: extract the stem by stripping the suffix  ex: goes -> go / going -> go
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
    
        # step5. part-of-speech (POS) tagging
        tagged_corpus = pos_tag(tokens) # ex: (alic, NNP), (love, VB)
    
        Noun_tags = ['NN','NNP','NNPS','NNS']
        Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    
        # mark each word's base form (lemma)
        ## a lemma is the basic dictionary form of a word: lemmatization analyzes
        ## the inflected surface forms of verbs and adjectives
        ## see: https://wikidocs.net/21707
        ## think of it simply as turning adjectives/verbs into dictionary-form words
        # ex: belives -> (stemmer) believe / belives -> (lemmatizer) belief
        # (cooking, N) -> cooking / (cooking, V) -> cook
        ## Korean example:
        """
        A lemmatize function is easy to build: when a correctly spaced word is
        given, run morphological analysis with Komoran, then attach '-다' to
        words tagged VV or VA. Compound predicates such as '쉬고싶다' are also
        restored to '쉬다'.
        Source: https://lovit.github.io/nlp/2019/01/22/trained_kor_lemmatizer/
        """
        lemmatizer = WordNetLemmatizer()
        
        # the lemma depends on the word's part of speech
        # (cooking, N) -> cooking / (cooking, V) -> cook
        def prat_lemmatize(token,tag):
            if tag in Noun_tags:
                return lemmatizer.lemmatize(token,'n')
            elif tag in Verb_tags:
                return lemmatizer.lemmatize(token,'v')
            else:
                return lemmatizer.lemmatize(token,'n')
    
        pre_proc_text =  " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])      
        
        return pre_proc_text

  • Read in the novel Alice in Wonderland.

    lines = []
    fin = open("./dataset/alice_in_wonderland.txt", "r")
    for line in fin:
        if len(line.strip()) == 0:
            continue # skip blank lines in the novel's txt
        lines.append(preprocessing(line))
    fin.close()

  • ๋‹จ์–ด๋“ค์ด ์‚ฌ์šฉ๋œ ํšŸ์ˆ˜๋ฅผ ์นด์šดํŠธ ํ•œ๋‹ค.

    counter = collections.Counter()
    
    for line in lines:
        for word in nltk.word_tokenize(line):
            counter[word.lower()] += 1

  • ์‚ฌ์ „์„ ๊ตฌ์ถ•ํ•œ๋‹ค.

    • ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋œ ๋‹จ์–ด๋ฅผ 1๋ฒˆ์œผ๋กœ ์‹œ์ž‘ํ•ด์„œ ๋ฒˆํ˜ธ๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค.
    word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} # ex: [(apple:50), (cat: 43), ...]
    idx2word = {v:k for k,v in word2idx.items()} # ex: [(50: apple), (43: cat), ...]
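
    A quick sanity check of the two lookup tables (the counts and indices in the comments are illustrative, not actual output):

    print(counter.most_common(3))      # e.g. [('say', 476), ('alic', 396), ...]
    print(word2idx['alic'])            # e.g. 2: the frequency rank of 'alic'
    print(idx2word[word2idx['alic']])  # 'alic': the two tables invert each other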

  • Generate the training data with trigrams.

    xs = []     # input data
    ys = []     # output data
    for line in lines:
        # represent the words by their dictionary numbers
        ## tokenize each sentence, lowercase it, and convert with word2idx
        embedding = [word2idx[w.lower()] for w in nltk.word_tokenize(line)] # the word2idx lookup returns the index number (the value)
        
        
        # group neighboring words with trigrams
        ## cut into consecutive chunks of 3 with .trigrams, ex: triples = [(1,2,3), (3,5,3), ...]
        triples = list(nltk.trigrams(embedding))
        
        
        # split into left word, center word, and right word
        w_lefts = [x[0] for x in triples]   # [1, 2, ...8]
        w_centers = [x[1] for x in triples] # [2, 8, ...13]
        w_rights = [x[2] for x in triples]  # [8, 13, ...7]
        
        # input (xs)       output (ys)
        # -----------      -----------
        # 1. center word --> left word
        # 2. center word --> right word
        xs.extend(w_centers)
        ys.extend(w_lefts)
        xs.extend(w_centers)
        ys.extend(w_rights)

  • Convert the training data to one-hot form and split it into training and test sets.

    vocab_size = len(word2idx) + 1  # dictionary size # vocab_size = 1787 # the +1 lets the OneHotEncoder below cover every index up to the end of the vocab
    
    ohe = OneHotEncoder(categories = [range(vocab_size)]) # ohe = OneHotEncoder(categories=[range(0, 1787)])
    X = ohe.fit_transform(np.array(xs).reshape(-1, 1)).todense() # .todense(): return the result as a dense matrix (like .toarray())
    Y = ohe.fit_transform(np.array(ys).reshape(-1, 1)).todense()

    X.shape = (13868, 1787) / Y.shape = (13868, 1787)


STEP 2. Split into training/test data

Xtrain, Xtest, Ytrain, Ytest, xstr, xsts = train_test_split(X, Y, xs, test_size=0.2) 
# why pass xs as well? => it is used later when plotting which words sit close together

Shape reference >

np.array(xs).shape
Out[19]: (13868,)
np.array(xstr).shape
Out[20]: (11094,)
np.array(xsts).shape
Out[21]: (2774,)
np.array(Xtrain).shape
Out[22]: (11094, 1787)
np.array(Xtest).shape
Out[23]: (2774, 1787)
np.array(Ytrain).shape
Out[24]: (11094, 1787)
np.array(Ytest).shape
Out[25]: (2774, 1787)

  • Build the deep-learning model.

    BATCH_SIZE = 128
    NUM_EPOCHS = 20
    
    input_layer = Input(shape = (Xtrain.shape[1],), name="input") # shape: just the feature dimension, without the batch (None) axis
    first_layer = Dense(300, activation='relu', name = "first")(input_layer)
    first_dropout = Dropout(0.5, name="firstdout")(first_layer)
    second_layer = Dense(2, activation='relu', name="second")(first_dropout)
    third_layer = Dense(300,activation='relu', name="third")(second_layer)
    third_dropout = Dropout(0.5,name="thirdout")(third_layer)
    fourth_layer = Dense(Ytrain.shape[1], activation='softmax', name = "fourth")(third_dropout)
                      # Ytrain.shape[1] must match Xtrain's feature dimension
                      # activation='softmax': the output is one-hot, so it must be softmax
    model = Model(input_layer, fourth_layer)
    model.compile(optimizer = "rmsprop", loss="categorical_crossentropy") 

    loss="categorical_crossentropy": if the output were not one-hot but a number (a vocab index), you would use loss="sparse_categorical_crossentropy" instead, as sketched below.


  • Training

    hist = model.fit(Xtrain, Ytrain, 
                     batch_size=BATCH_SIZE,
                     epochs=NUM_EPOCHS,
                     validation_data = (Xtest, Ytest))

  • Plot the loss history

    plt.plot(hist.history['loss'], label='Train loss')
    plt.plot(hist.history['val_loss'], label = 'Test loss')
    plt.legend()
    plt.title("Loss history")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.show()

    (figure: train/test loss history)


STEP 3. Code that visualizes the distances between words

  • Inspect the Word2Vec values

    # Extract the encoder section of the model for prediction of latent variables
    # After training, inspect the output of the middle (hidden) layer: this is the Word2Vec layer.
    # (word2vec: expresses a word as a vector of numbers; these should match the values
    #  seen when inspecting w with '.get_weights()' in the earlier lesson)
    encoder = Model(input_layer, second_layer)
    
    # Predict the latent variables with the extracted encoder model
    reduced_X = encoder.predict(Xtest) # as with Xtest here, feeding any word in yields reduced_X = that word in Word2Vec form
    
    # Organize reduced_X, the 2-dimensional latent features (the word2vec layer)
    # of the test-data words, into a data frame (table).
    final_pdframe = pd.DataFrame(reduced_X)
    final_pdframe.columns = ["xaxis","yaxis"]
    final_pdframe["word_indx"] = xsts # test data, so use xsts, the test portion from the train/test split of xs
    final_pdframe["word"] = final_pdframe["word_indx"].map(idx2word) # convert index to word
    
    # Sample 100 rows from the data frame.
    rows = final_pdframe.sample(n = 100)
    labels = list(rows["word"])
    xvals = list(rows["xaxis"])
    yvals = list(rows["yaxis"])

    [final_pdframe] > Out[26]:
             xaxis     yaxis  word_indx    word
    0     0.301799  0.000000         25    take
    1     0.590210  0.810300        468    pick
    2     0.672298  0.000000          1     say
    3     0.408792  0.520896          9    know
    4     0.387678  0.605502         30    much
    ...        ...       ...        ...     ...
    2769  1.309759  0.851837         27    mock
    2770  0.000000  0.423953        622  master
    2771  0.196061  0.299570         83    good
    2772  0.000000  0.024289       1516  deserv
    2773  0.470771  0.550808        497    plan

    [2774 rows x 4 columns]

  • Place the 100 sampled words in a 2-dimensional space

    • Words at close distances are strongly related to each other
    plt.figure(figsize=(15, 15))  
    
    for i, label in enumerate(labels):
        x = xvals[i]
        y = yvals[i]
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                     ha='right', va='bottom', fontsize=15)
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    plt.show()

    (figure: 100 sampled words scattered in the 2D embedding space)



Skip-Gram Negative Sampling(SGNS)

  • Compensates for Skip-Gram's drawback of heavy computation (caused by softmax) by using sigmoid instead

    • Instead of a probability spread over values between 0 and 1 for the whole vocabulary, each word pair gets a binary classification: 0 or 1
    • Hence far less computation

  • Method:

    Skip-Gram Negative Sampling:

    1. Give label = 1 to word pairs selected by the n-gram and label = 0 to randomly selected word pairs, then run a binary classification

    Word pairs selected via the n-gram are recognized as related words

    1. Feed the input and target values into the 2 inputs respectively
    2. Compute each one's vector
    3. Combine the two values (concat, dot, or add)
    4. Run sigmoid on the result
    5. So that the label value (0 or 1) comes out
    6. After training, feeding a specific word into the left network below yields the word vector for that word (see the Keras sketch after the comparison table)

Differences between Skip-Gram and Skip-Gram Negative Sampling

| Skip-Gram | Skip-Gram Negative Sampling |
| --- | --- |
| 1 input. input: input data / output: target data | 2 inputs. input[1]: input data / input[2]: target data / output: label |
| No labels | Has labels, 1 or 0 (binary classification): word pairs selected by the n-gram get label = 1, randomly selected word pairs get label = 0 |
| Output layer: softmax giving values between 0 and 1, with loss='categorical_crossentropy'; hence an argmax() at the end | Output layer: sigmoid for binary classification, with loss="binary_crossentropy" |
| Does no distance computation (cosine, etc.); words with contextually similar meanings are found by, e.g., plotting the x, y coordinates produced by vector operations in the latent layer | Can do distance computation: when merging the vectors from the two inputs into one, using the dot function amounts to a distance computation (a cosine distance function can also be used); with concat or add, however, no distance is computed |


(figure: SGNS network architecture)
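
A hedged Keras sketch of the SGNS network in the figure (layer sizes, optimizer, and variable names are illustrative assumptions, not the exact class model):

    from tensorflow.keras.layers import Input, Embedding, Dot, Reshape, Dense
    from tensorflow.keras.models import Model
    
    VOCAB_SIZE = 1787   # assumption: the Alice vocab size from above
    EMB_DIM = 64        # illustrative embedding width
    
    word_in = Input(shape=(1,))   # index of the input word
    ctx_in = Input(shape=(1,))    # index of the target (context) word
    
    word_embedding = Embedding(VOCAB_SIZE, EMB_DIM)  # this layer's weights become the word vectors
    word_vec = word_embedding(word_in)               # (None, 1, EMB_DIM)
    ctx_vec = Embedding(VOCAB_SIZE, EMB_DIM)(ctx_in)
    
    # dot-merge of the two vectors: equivalent to a similarity (distance) measure
    score = Dot(axes=2)([word_vec, ctx_vec])         # (None, 1, 1)
    score = Reshape((1,))(score)
    output = Dense(1, activation="sigmoid")(score)   # binary: related pair or not
    
    sgns = Model([word_in, ctx_in], output)
    sgns.compile(optimizer="adam", loss="binary_crossentropy")
    
    # after training: We = word_embedding.get_weights()[0]  -> (VOCAB_SIZE, EMB_DIM)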


Using the SGNS Embedding

  1. Preprocess the raw data
  2. Generate the training data with trigrams
  3. Generate positive (1) and negative (0) data: the training data for SGNS

    rand_word = np.random.randint(1, len(word2idx), len(xs))  # random words for the negative pairs
    x_pos = np.vstack([xs, ys]).T        # real (center, context) pairs
    x_neg = np.vstack([xs, rand_word]).T # random (center, random word) pairs
    
    y_pos = np.ones(x_pos.shape[0]).reshape(-1,1)   # label 1: related pair
    y_neg = np.zeros(x_neg.shape[0]).reshape(-1,1)  # label 0: unrelated pair
    x_total = np.vstack([x_pos, x_neg])
    y_total = np.vstack([y_pos, y_neg])
    X = np.hstack([x_total, y_total])
    np.random.shuffle(X)
  4. Build the SGNS model
  5. embedding, dot, reshape, sigmoid, binary_crossentropy, etc.
  6. Train the SGNS model
  7. Build the SGNS Embedding sub-model and save just its weights (w) separately
  8. Up to here is the procedure for building the general-purpose SGNS Embedding.
  9. From here on, we try applying the SGNS-trained Embedding to arbitrary word data
  10. Split the raw data into train/test data
  11. Build the CNN model to apply it to (up to compile)
  12. Before training the CNN model, load the SGNS Embedding's weights (w)
  13. When fitting the CNN model, apply the W learned by SGNS: model.layers[1].set_weights(We) (a short sketch follows the code below)
  14. Plot with plt or check performance
  15. Check performance:

    from sklearn.metrics import accuracy_score  # needed for the accuracy check
    
    y_pred = model.predict(x_test)
    y_pred = np.where(y_pred > 0.5, 1, 0)
    print ("Test accuracy:", accuracy_score(y_test, y_pred))

  • Google's trained Word2Vec model:

    • SGNS-based
    • Pre-trained
    • Document → vectorize (numericize) → can be fed straight into an ordinary DL model for training (a loading sketch follows)
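
    A hedged sketch of loading those pretrained vectors with gensim (assuming gensim >= 4.x and that the GoogleNews binary has been downloaded separately):

    from gensim.models import KeyedVectors
    
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)
    
    print(wv["king"].shape)                 # (300,): a ready-made word vector
    print(wv.most_similar("king", topn=3))  # nearest neighbors by cosine similarity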


Open questions about applications and extensions

  • In the Word2Vec code, indices were assigned in order of frequency.

    word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} 
  • But right afterwards, when building the training data with trigrams,

    embedding = [word2idx[w.lower()] for w in nltk.word_tokenize(line)]

    word2idx was really used only as a lookup table for words.

    triples = list(nltk.trigrams(embedding))
  • What if the embedding above were sorted so the idx numbers are reordered, or the embedding step were folded into pre-processing?

    If a CNN or LSTM model were then run, words of similar frequency would end up grouped together.

    Distances would then be measured in order of word importance; so what if we took the word with index number 1, computed the distance between it and each other word, and set aside the ones falling outside a certain score range?

  • For example: measure the distances between the most frequent word, happy, and the other words, pull out the ones whose cosine similarity is 0.5 or below, and collect them in Another_vocab. The words left in the original vocab would be the core keywords, the protagonists, of that Document or Sentence (though in some cases they could be unnecessary words).

    Running the same process on another document produces Another_vocab2.

    Then compute the cosine similarity between Another_vocab and Another_vocab2 once more; the similarity when they are merged along an axis would quantify the overlapping words between the two documents (a small sketch of the idea follows).

  • If this were applied year by year to 100 years of newspaper data, then whenever the 1988 and 2020 papers score a high similarity, couldn't we say the public's interests in 1988 and 2020 coincided?
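
    A small sketch of the proposed idea (all vectors, words, and the threshold are made up): measure the cosine similarity between the top word's vector and every other word vector, and set aside those at or below 0.5 into Another_vocab.

    import numpy as np
    
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    vectors = {w: np.random.rand(2) for w in ["happi", "sad", "rabbit", "queen"]}
    top_word = "happi"  # hypothetical most frequent word
    
    Another_vocab = [w for w, v in vectors.items()
                     if w != top_word and cosine_sim(vectors[top_word], v) <= 0.5]
    print(Another_vocab)  # words weakly related to the top word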




  • References:

    Amateur Quant, blog.naver.com/chunjein

    Code source: Krishna Bhavsar et al. 2019.01.31. Natural Language Processing with Python Cookbook [60+ recipes for implementing NLP with Python]. Acorn Publishing.

ยฉ 2020 jynee