NLP Ask Me Anything

Advanced applications of deep learning in NLP

  • DMN
  • Ask Me Anything

    • attention score layer
    • story layer
    • episodic memory layer
    • answer layer


Automatic Text Generation

Example sentence: I love you very much

  • Build character-level time-series data

    • Generate it as below, the same way ordinary time-series batch data is generated:

      [figure: example x/y character-window batch data]

      Set the y values at the character level (e.g., a, b, c, ... z) rather than at the word level.
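      A minimal sketch of this windowing (the window length maxlen=10 and step=1 are assumptions, not fixed by the note):

      # slide a fixed-length character window over the text:
      # x = maxlen characters, y = the single character that follows
      text = "I love you very much"
      maxlen, step = 10, 1
      x_data, y_data = [], []
      for i in range(0, len(text) - maxlen, step):
          x_data.append(text[i:i + maxlen])
          y_data.append(text[i + maxlen])
      print(x_data[0], "->", repr(y_data[0]))  # 'I love you' -> ' '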

  • Train with an LSTM

    • Use the softmax function as the activation in the network, so that feeding in "I love you" makes "v" come out as the next character.
    • When compiling, use categorical_crossentropy as the loss function.
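    For illustration, a minimal sketch of such a model (the layer sizes, alphabet size, and window length here are assumptions):

      from tensorflow.keras.models import Sequential
      from tensorflow.keras.layers import LSTM, Dense

      n_chars = 26  # assumed alphabet size (a..z)
      maxlen = 10   # assumed character-window length
      model = Sequential([
          LSTM(128, input_shape=(maxlen, n_chars)),  # reads one-hot character windows
          Dense(n_chars, activation="softmax")       # probability over the next character
      ])
      model.compile(optimizer="adam", loss="categorical_crossentropy")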
  • Steps:

    1. Load the raw data, which consists of sentences
    2. Preprocess
    3. Split at the character level, not the word level
    4. Build the time-series x data (like the x in the example above)
    5. Converting indices into vectorized format

      • initialize X and Y with np.zeros
    6. Model Building
    7. softmax: if the output layer produces values such as [0.3, 0.4, 0.8], recompute them as a probability distribution whose total is 1. To widen the gaps between the resulting values, use the formula with a beta (temperature) term. This raises the probability that model.predict(x) produces the character you want (and, conversely, lowers the probability of unwanted characters).
    8. Skip-gram with a softmax has the drawback of heavy computation; SGNS (skip-gram with negative sampling) compensates for this.
    9. Create a function that takes the softmax probabilities from the prediction and applies the inverse operation (exp) to them again
    10. Sample with np.random.multinomial

      def pred_indices(preds, metric=1.0):
          # temperature sampling: rescale the log-probabilities by `metric`
          # (the temperature), renormalize, then draw one multinomial sample
          preds = np.asarray(preds).astype('float64')
          preds = np.log(preds) / metric
          exp_preds = np.exp(preds)
          preds = exp_preds / np.sum(exp_preds)
          probs = np.random.multinomial(1, preds, 1)
          return np.argmax(probs)
      > * Multinomial distribution:
      >
      >   The multinomial distribution is a probability distribution over independent random variables that can each take several values; it defines the probability that each value appears a particular number of times across several independent trials. When the number of categories is 2, it reduces to the binomial distribution.
      >
      >   > Source: [Wikipedia, Multinomial distribution (Korean)](https://ko.wikipedia.org/wiki/%EB%8B%A4%ED%95%AD_%EB%B6%84%ED%8F%AC)
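      A usage sketch of pred_indices (the probability vector here is made up):

      probs = np.array([0.1, 0.2, 0.3, 0.4])     # e.g., one row of model.predict(x)
      next_idx = pred_indices(probs, metric=0.7) # smaller metric -> sharper sampling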
    11. Train & Evaluate the Model
    12. batch

      1. Use randint so that each batch starts at a random position
    13. Set the sampling diversity values, then generate characters

      • e.g., loop over diversity values like [0.2, 0.7, 1.2]:

        for diversity in [0.2, 0.7, 1.2]:
            ...

      • The effect of the temperature b in e = exp(a/b) (the role diversity plays above):

        a = np.array([0.6, 0.2, 0.4])
        b = 1.0
        e = np.exp(a/b)
        print(e/np.sum(e))
        # [0.40175958 0.2693075  0.32893292]

        a = np.array([0.9, 0.2, 0.4])
        b = 1.0
        e = np.exp(a/b)
        print(e/np.sum(e))
        # [0.47548496 0.23611884 0.2883962 ]

        When a single value is notably large, like the 0.9 here, its gap from the other values grows even wider.
    14. model.predict(x):

      • prediction: model.predict(x) => [0.01, 0.005, 0.3, 0.8 ...]
    15. Extract the character:

      sys.stdout.write(pred_char)  # requires `import sys`
      sys.stdout.flush()
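      Putting steps 11-15 together, the generation loop looks roughly like this (a sketch; model, text, maxlen, n_chars, and the hypothetical mappings char2indx / indx2char are carried over from the earlier steps):

      import sys
      import random

      start = random.randint(0, len(text) - maxlen - 1)  # step 12-1: random starting point
      seed = text[start:start + maxlen]
      for diversity in [0.2, 0.7, 1.2]:                  # step 13: sampling temperatures
          sentence = seed
          for _ in range(100):                           # generate 100 characters
              x = np.zeros((1, maxlen, n_chars))         # one-hot encode the current window
              for t, ch in enumerate(sentence):
                  x[0, t, char2indx[ch]] = 1.0
              preds = model.predict(x)[0]                # step 14: softmax probabilities
              pred_char = indx2char[pred_indices(preds, diversity)]
              sentence = sentence[1:] + pred_char        # slide the window forward
              sys.stdout.write(pred_char)                # step 15: emit the character
              sys.stdout.flush()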


DMN

  • Dynamic Memory Networks

    • The model combines the following five networks:
    • Input Module
    • Question Module
    • Episodic Memory Module
    • Answer Module
    • attention score network (an FNN)

Ask Me Anything

  • Q → A: Question & Answering
  • Uses deep learning
  • The paper's authors use GRUs
  • Distinctive feature: it has a component that stores episodes, where an episode is a single unit of experience that remembers a Q&A.
  • Flow:

    1. Take the Input sentences (a text sequence) and
    2. a question that attention computation is applied to,
    3. attention computation: the attention score,
    4. build the episodic memory, and then
    5. a network composed so that it can give a general answer
    6. [figure: overall DMN architecture]

    Source: Ankit Kumar et al., 2016.05, Ask Me Anything: Dynamic Memory Networks for Natural Language Processing, with 'attention score' added.

  • attention process
  • attention score computation

    • attention score
    • When an episodic story spans several sentences and an answer has to be found in it, a computation scores each (stored) sentence by how strongly it relates to the question
    • That is, an algorithm that uses attention scores to compute which sentence must be attended to in order to produce the answer
  • Capable of machine translation, text classification, part-of-speech tagging, image captioning, and dialog systems (chatbots)
  • mission: the network must be composed so that it can grasp the meaning of the given Question.

    • 'grasp the meaning': interpreting referring expressions (anaphora resolution).

  • input module = story module

    1. Join the sentences going into the Input into a single line, delimited by <EOS> (see the sketch after this list)
    sentence 1 / sentence 2 / sentence 3:
    When I was young, I passed test <EOS> But, Now Test is so crazy <EOS> Because The test level pretty hard more and more.
    2. Feed into the Embedding layer
    3. Pass through the RNN
    4. The hidden-layer output comes out again as n sentence vectors (c1, c2, c3, ...)
    5. Feed into the episodic memory module
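    A small sketch of step 1 above (the exact delimiter handling is an assumption; the sentences are the example ones):

      sents = ["When I was young, I passed test",
               "But, Now Test is so crazy",
               "Because The test level pretty hard more and more."]
      story_line = " <EOS> ".join(sents)  # one line, sentences delimited by <EOS>
      print(story_line)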

  • Question module

    1. Feed in the Question sentence
    2. Feed into the Embedding layer
    3. Pass through the RNN
    4. Feed into the episodic memory module

  • episodic memory module

    • Repeats over the outputs of the input module (per sentence) + Question module + attention mechanism, updating its internal episodic memory each time
    • "How does it update?"
    • Compute the input module's embedding values with the attention scores and pass the result through the RNN layer

      • Here the attention score is the value obtained by combining g (the weight w coming out of the attention score layer's output layer) with the input module's outputs c1, c2, c3, ...
    • Pass the Question module's embedding values through the RNN layer

      • here, w = q
    • Pass (2) through the RNN layer of the Answer Module, which outputs the answer

      • here, w = m
    • Every time the episodic memory module's RNN layer repeats, the attention scores are recomputed
    • Among the resulting attention score values, the highest one (g) is found

      • attention mechanism
      • with the highest-scoring g found that way,
      • the network is formed again with g

        • giving a 2-layer structure
      • memory update mechanism

        • a weighted average using the attention scores

  • Terms:

    • c: output of the Input module
    • m: output of the episodic memory module and input to the attention score layer
    • q: output of the question layer
    • g: output of the attention score layer
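    For reference, in Kumar et al. (2016) these pieces fit together roughly as follows (a sketch in the paper's notation; G is the two-layer attention-score FNN, T is the number of sentences, and i indexes the memory-update passes):

      g_t^i = G(c_t, m^{i-1}, q)                                            % attention score of sentence t on pass i
      h_t^i = g_t^i \cdot GRU(c_t, h_{t-1}^i) + (1 - g_t^i) \cdot h_{t-1}^i % attention-gated episode RNN
      e^i = h_T^i                                                           % the episode is the final hidden state
      m^i = GRU(e^i, m^{i-1})                                               % memory update, repeated per pass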


code:


  • Steps

    1. Load the packages
    2. Preprocessing

      • 2-1. Document Data processing (raw data)
    3. Load the data

      • 3-1. Load the Raw Document Data
      • 3-2. Split into train/test data and bring each in
    4. Build the vocab

      • 4-1. Build the vocab from the Train & Test data pooled together

        • collections.Counter()
      • 4-2. Build word2indx / indx2word

        • padding
    5. Vectorization

      • 5-1. Set the vocab_size variable

        • len(word2indx)
      • 5-2. Set the max-len variables for the story and the question each

        • the max lens are set so the padding can be matched later
      • 5-3. Vectorize

        • feed the raw data, word2indx, and each module's (story, question) maxlen into a function that performs the padding, categorical encoding, etc.
    6. Model build

      • 6-1. train/test data split

        • Here Xstrain, Xqtrain, Ytrain = data_vectorization(data_train, word2indx, story_maxlen, question_maxlen), and data_vectorization's return values are

          • pad_sequences(Xs, maxlen=story_maxlen)
          • pad_sequences(Xq, maxlen=question_maxlen)
          • to_categorical(Y, num_classes=len(word2indx))
      • 6-2. Set the Model Parameters
      • 6-3. Inputs
      • 6-4. Story encoder embedding
      • 6-5. Question encoder embedding
      • 6-6. Build the modules

        • the Question module is reused as built above
        • attention score layer

          • built with Dot
        • story module

          • this layer starts from the story layer's input, skips the question layer, and passes through another embedding layer; it is built so it can later be added to the dot layer
        • episodic memory module

          • built by adding the dot layer and the story_encoder_c right above it
        • answer module

          • episodic memory layer (response) + question layer
    7. compile

      model = Model(inputs=[story_input, question_input], outputs=output)

      • note that the input takes the two of them, story and question!
    8. fit
    9. loss plot
    10. Measure accuracy (predict)
    11. Apply



  • Load the packages

    import collections
    import itertools
    import nltk
    import numpy as np
    import matplotlib.pyplot as plt
    import random
    from tensorflow.keras.layers import Input, Dense, Activation, Dropout
    from tensorflow.keras.layers import LSTM, Permute
    from tensorflow.keras.layers import Embedding
    from tensorflow.keras.layers import Add, Concatenate, Dot
    from tensorflow.keras.models import Model
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.utils import to_categorical

Preprocessing

  • Raw Document Data processing

    # Example document content: a 3-sentence story (an episodic story)
    # So far, NLP has largely stayed at the level of grasping the meanings of words within one sentence and analyzing that single sentence (step 1).
    # An episodic story, by contrast, matters because it requires grasping word relations not 'within one sentence' but 'across' sentences (= inter-sentence relations), which is much harder to analyze (the realm of poetry; step 2). Beyond that, relations between paragraphs must also be grasped (the realm of novels; step 3).  
    """
    data layout
    # 1 Mary moved to the bathroom.\n
    # 2 Daniel went to the garden.\n
    # 3 Where is Mary?\tbathroom\t1 
    """
    ## The Question and answer are separated by \t (a tab).
    # Return: returns 3 things (stories, questions, answers)
    # stories = ['Mary moved to the bathroom.\n', 'John went to the hallway.\n']
    # questions = 'Where is Mary? '
    # answers = 'bathroom'
    #----------------------------------------------------------------------------
    def get_data(infile):
      stories, questions, answers = [], [], []
      story_text = []
      fin = open(infile, "r") 
      for line in fin:
          lno, text = line.split(" ", 1)
          if "\t" in text: # the line that contains \t, i.e., line 3 in the >data layout< above
              question, answer, _ = text.split("\t") # split on \t into question and answer; the discarded piece is the supporting-fact number (e.g., 1)
              stories.append(story_text) 
              questions.append(question)
              answers.append(answer) # the answer that follows \t on line 3 of the >data layout<
              story_text = []
          else:
              story_text.append(text) # in effect, the function's work starts from this else branch
      fin.close()
      return stories, questions, answers

Load the data

  • Load the Raw Document Data

    Train_File = "./dataset/qa1_single-supporting-fact_train.txt"
    Test_File = "./dataset/qa1_single-supporting-fact_test.txt"

  • get the data

    data_train = get_data(Train_File) # returns: stories, questions, answers
    data_test = get_data(Test_File)
    print("\n\nTrain observations:",len(data_train[0]),"Test observations:", len(data_test[0]),"\n\n")

    Train observations: 10000 Test observations: 1000


Build the vocab

  • Building Vocab dictionary from Train & Test data

    • the vocab is built from the Train & Test data pooled together
    dictnry = collections.Counter() # collections.Counter() will be used to look up how often each word appears 
    for stories, questions, answers in [data_train, data_test]:
      for story in stories:
          for sent in story:
              for word in nltk.word_tokenize(sent):
                  dictnry[word.lower()] +=1
      for question in questions:
          for word in nltk.word_tokenize(question):
              dictnry[word.lower()]+=1
      for answer in answers:
          for word in nltk.word_tokenize(answer):
              dictnry[word.lower()]+=1
  • Build word2indx / indx2word

    # Same structure as collections.Counter(), but the word indices are changed to start from 1.  
    word2indx = {w:(i+1) for i,(w,_) in enumerate(dictnry.most_common())} 
    word2indx["PAD"] = 0 # padding
    indx2word = {v:k for k,v in word2indx.items()} 
    # Because of word2indx["PAD"] above, print(indx2word) shows '0: 'PAD'' at the very end. 

Vectorization

  • Set the vocab_size variable

    vocab_size = len(word2indx) # vocab_size = 22 -> i.e., only 21 distinct words are used (plus one for the padding)
    print("vocabulary size:",len(word2indx))
    print(word2indx)
    • vocabulary size: 22
    • {'to': 1, 'the': 2, '.': 3, 'where': 4, 'is': 5, '?': 6, 'went': 7, 'john': 8, 'sandra': 9, 'mary': 10, 'daniel': 11, 'bathroom': 12, 'office': 13, 'garden': 14, 'hallway': 15, 'kitchen': 16, 'bedroom': 17, 'journeyed': 18, 'travelled': 19, 'back': 20, 'moved': 21, 'PAD': 0}
  • Set the max-len variables for the story and the question each

    story_maxlen = 0
    question_maxlen = 0
    
    for stories, questions, answers in [data_train, data_test]:
      for story in stories:
          story_len = 0
          for sent in story:
              swords = nltk.word_tokenize(sent)
              story_len += len(swords)
          if story_len > story_maxlen:
              story_maxlen = story_len # find the longest story (= the one with the most words)
              
      for question in questions:
          question_len = len(nltk.word_tokenize(question))
          if question_len > question_maxlen: 
              question_maxlen = question_len # find the longest question 
              
    print ("Story maximum length:", story_maxlen, "Question maximum length:", question_maxlen)

    Story maximum length: 14 Question maximum length: 4

  • Converting data into Vectorized form

    • numericize the sentences above
    def data_vectorization(data, word2indx, story_maxlen, question_maxlen):  
      Xs, Xq, Y = [], [], []
      stories, questions, answers = data
      for story, question, answer in zip(stories, questions, answers):
          xs = [[word2indx[w.lower()] for w in nltk.word_tokenize(s)] for s in story] # represent each word by its vocab index (numericize)
          xs = list(itertools.chain.from_iterable(xs)) # chain.from_iterable(['ABC', 'DEF']) --> ['A', 'B', 'C', 'D', 'E', 'F']
          xq = [word2indx[w.lower()] for w in nltk.word_tokenize(question)]
          Xs.append(xs)
          Xq.append(xq)
          Y.append(word2indx[answer.lower()]) # Y = answer
      return pad_sequences(Xs, maxlen=story_maxlen), pad_sequences(Xq, maxlen=question_maxlen),\
             to_categorical(Y, num_classes=len(word2indx))
             # lengths are unified to the longest sentence (maxlen=story_maxlen); anything shorter is padded with 0
             # y: the answer, which here is a single word, i.e., one word = one number
             # skipping to_categorical and using sparse categorical instead is OK too 
    • Inside data_vectorization():
    • xs = [[word2indx[w.lower()] for w in nltk.word_tokenize(s)] for s in story]

      xs > Out[19]: [[8, 7, 20, 1, 2, 13, 3], [10, 19, 1, 2, 17, 3]]
      story > Out[20]: ['John went back to the office.\n', 'Mary travelled to the bedroom.\n']

    • xs = list(itertools.chain.from_iterable(xs))

      xs > Out[22]: [8, 7, 20, 1, 2, 13, 3, 10, 19, 1, 2, 17, 3]

    • Xs.append(xs) accumulates the outputs of the for loop into a list
    • padding: pad_sequences(Xs, maxlen=story_maxlen) # story_maxlen = 14

      Out[31]: array([[ 0, 8, 7, 20, 1, 2, 13, 3, 10, 19, 1, 2, 17, 3]])

    • pad_sequences(Xq, maxlen=question_maxlen)

      Out[32]: array([], shape=(0, 4), dtype=int32)

    • to_categorical(Y, num_classes=len(word2indx))

      Out[33]: array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

Model build

  • train/test data split

    Xstrain, Xqtrain, Ytrain = data_vectorization(data_train, word2indx, story_maxlen, question_maxlen)
    Xstest, Xqtest, Ytest = data_vectorization(data_test, word2indx, story_maxlen, question_maxlen)
    
    print("Train story",Xstrain.shape,"Train question", Xqtrain.shape,"Train answer", Ytrain.shape)
    print( "Test story",Xstest.shape, "Test question",Xqtest.shape, "Test answer",Ytest.shape)

    print > Train story (10000, 14) Train question (10000, 4) Train answer (10000, 22)
    print > Test story (1000, 14) Test question (1000, 4) Test answer (1000, 22)


  • Set the Model Parameters

    EMBEDDING_SIZE = 128
    LATENT_SIZE = 64
    BATCH_SIZE = 64
    NUM_EPOCHS = 40

  • Inputs

    story_input = Input(shape=(story_maxlen,)) # story_maxlen = 14
    question_input = Input(shape=(question_maxlen,))

  • Story encoder embedding

    story_encoder = Embedding(input_dim=vocab_size, # vocab_size: 22
                            output_dim=EMBEDDING_SIZE, # EMBEDDING_SIZE = 128 (each word is represented by a 128-dim vector; the embedding layer's column dimension)
                            input_length=story_maxlen)(story_input) # story_maxlen = 14
    story_encoder = Dropout(0.2)(story_encoder)

  • Question encoder embedding

    question_encoder = Embedding(input_dim=vocab_size,
                               output_dim=EMBEDDING_SIZE,
                               input_length=question_maxlen)(question_input)
    question_encoder = Dropout(0.3)(question_encoder)



attention score layer
  • attention score layer

    match = Dot(axes=[2, 2])([story_encoder, question_encoder]) 
    • Match between story and question: perform a dot operation between the story and the question.

      • the dot result here is used as the attention score
    • story_encoder = [None, 14, 128], question_encoder = [None, 4, 128]

      • match = [None, 14, 4]
    • axes=[2, 2]? Dot the story's axis 2 (=128, the embedding vector) with the question's axis 2 (=128, the embedding vector)

      • i.e., transpose one side so the operation runs as (x, 128) times (128, y)
    • First, the output of the story input's embedding layer is story_encoder = (None, max number of words in a story (=14), embedding vector (128)).
    • The output of the question input's embedding layer is question_encoder = (None, max number of words in a question (=4), embedding vector (128)).
    • dot -> leaving out the (None), the (row, column) blocks (14, 128) and (128, 4) are multiplied, giving (14, 4). A shape-check sketch follows below.
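    A quick standalone shape check of this Dot, using the same dimensions (a sketch):

      from tensorflow.keras.layers import Input, Dot
      from tensorflow.keras.models import Model

      s = Input(shape=(14, 128))    # story_encoder-shaped tensor
      q = Input(shape=(4, 128))     # question_encoder-shaped tensor
      m = Dot(axes=[2, 2])([s, q])  # contracts the two 128-dim embedding axes
      print(Model([s, q], m).output_shape)  # (None, 14, 4)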



story layer
  • story layer

    story_encoder_c = Embedding(input_dim=vocab_size, # vocab_size = 22
                              output_dim=question_maxlen, # question_maxlen = 4 
                              input_length=story_maxlen)(story_input) # story_maxlen = 14 
    
    story_encoder_c = Dropout(0.3)(story_encoder_c) # story_encoder_c.shape = (None, 14, 4)
    • This layer starts from the story layer's input, skips the question layer, and passes through another embedding layer; it is built so that it can later be added to the dot layer.


episodic memory layer
  • episodic memory layer

    response = Add()([match, story_encoder_c]) # add the dot layer (match) and the story_encoder_c right above => (14, 4)
    response = Permute((2, 1))(response) # resulting shape = (4, 14). Permute((2, 1)) transposes (D1, D2) into (D2, D1); Permute moves axes around more freely than a plain transpose. 


answer layer
  • episodic memory layer (response) + question layer

    answer = Concatenate()([response, question_encoder])
    answer = LSTM(LATENT_SIZE)(answer) # LATENT_SIZE = 64
    answer = Dropout(0.2)(answer)
    answer = Dense(vocab_size)(answer) # shape=(None, 22); the final Dense uses vocab_size=22 (the total number of words)!
    output = Activation("softmax")(answer) # shape=(None, 22)

compile
  • the last step of the model build

    model = Model(inputs=[story_input, question_input], outputs=output) # the inputs were combined, so they are passed as a list [] 
    model.compile(optimizer="adam", loss="categorical_crossentropy") # if to_categorical had not been applied earlier, this would need loss="sparse_categorical_crossentropy" 
    print (model.summary())

fit
  • Train the model

    # Model Training
    history = model.fit([Xstrain, Xqtrain], [Ytrain], # Ytrain: answer
                      batch_size = BATCH_SIZE, 
                      epochs = NUM_EPOCHS,
                      validation_data=([Xstest, Xqtest], [Ytest])) # ytest??? fit ํ•˜๋ฉด 0,0,0 ์ด๋˜ ๊ฒŒ 13,14 ๋“ฑ์ง€๋กœ ๋ฐ”๋€œ
    • Ytest.shape Out[78]: (1000, 22)
    • Ytest Out[79]: array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
    • ytest.shape Out[87]: (1000,)
    • ytest Out[92]: array([15, 12, 16, 15, 16, 15, 14 ... ])

loss plot

  • loss plot

    plt.title("Episodic Memory Q & A Loss")
    plt.plot(history.history["loss"], color="g", label="train")
    plt.plot(history.history["val_loss"], color="r", label="validation")
    plt.legend(loc="best")
    plt.show()

Measure accuracy

  • get predictions of labels

    ytest = np.argmax(Ytest, axis=1)          # true labels as word indices
    Ytest_ = model.predict([Xstest, Xqtest])  # softmax probabilities per question
    ytest_ = np.argmax(Ytest_, axis=1)        # predicted labels as word indices
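    With both index vectors in hand, the accuracy itself is one line (a sketch):

      accuracy = np.mean(ytest == ytest_)  # fraction of questions answered correctly
      print("accuracy:", accuracy)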

Apply

  • Apply

    • Select random questions and predict answers
    NUM_DISPLAY = 10
     
    for i in random.sample(range(Xstest.shape[0]),NUM_DISPLAY):
      story = " ".join([indx2word[x] for x in Xstest[i].tolist() if x != 0]) # x != 0 skips the PAD tokens
      question = " ".join([indx2word[x] for x in Xqtest[i].tolist()])
      label = indx2word[ytest[i]]
      prediction = indx2word[ytest_[i]]
      print(story, question, label, prediction)


Output layer

  • When the output layer is a single unit, 0 or 1

    • Binary classification, so use sigmoid with binary-crossentropy
    • y yHat
      0 0
      0 1
      1 1

    accuracy: 2/3

  • When the output layer has two or more units

    • Multi-class classification, so use softmax with categorical-crossentropy
    • one-hot structure
    • y ------ yHat
      0 1 0 | 0 1 0
      0 0 1 | 0 1 0
      1 0 0 | 1 0 0

    accuracy: 2/3

  • When the output layer contains several '1's, i.e., is not one-hot

    • Multi-label classification, so use sigmoid with binary-crossentropy
    • each output neuron must be binary-classified individually
    • y ------ yHat
      0 1 0 | 0 1 0
      0 0 1 | 0 1 0
      1 0 0 | 1 0 0

    accuracy: 7 of the 9 cells are correct, so 7/9

    • Unlike above, a row does not have to match in full to count as correct; each cell is judged individually across rows and columns. A numpy check of both rules follows below.
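    • Both counting rules can be checked directly against the tables above (a numpy sketch):

      import numpy as np

      y    = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
      yHat = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0]])
      # one-hot / multi-class rule: a row counts only if its argmax matches
      print(np.mean(y.argmax(axis=1) == yHat.argmax(axis=1)))  # 2/3
      # multi-label rule: every cell is judged individually
      print(np.mean(y == yHat))                                # 7/9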




  • References:

    • Amateur Quant, blog.naver.com/chunjein
    • Krishna Bhavsar et al., 2019.01.31, Natural Language Processing with Python Cookbook (60+ recipes for implementing NLP in Python). Acorn.
    • https://frhyme.github.io/python-libs/ML_multilabel_classfication/
ยฉ 2020 jynee