(NLP Basics) Document Information Extraction

NLP

  • Regular expressions
  • Chunking
  • Chinking



Document Information Extraction


Regular Expressions


re module functions

  • Further reading

devanix. "파이썬 – 정규식표현식(Regular Expression) 모듈" (Python: the Regular Expression module)
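As a quick illustration of the re module functions mentioned above, here is a minimal sketch that uses only Python's standard library. The sample text and the variable names are made up for this example.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped on 2020-08-16; order 67 ships on 2020-08-17."
date_pattern = r"\d{4}-\d{2}-\d{2}"

# re.search returns the first match found anywhere in the string (or None).
first_date = re.search(date_pattern, text)

# re.findall returns every non-overlapping match as a list of strings.
all_dates = re.findall(date_pattern, text)

# re.sub replaces every match; here each run of digits is masked with '#'.
masked = re.sub(r"\d+", "#", text)

# re.match only succeeds if the pattern matches at the start of the string.
starts_with_order = re.match(r"order", text, flags=re.IGNORECASE) is not None

print(first_date.group())
print(all_dates)
print(masked)
print(starts_with_order)
```

re.search and re.match return match objects, so the matched text has to be read with .group(), whereas re.findall and re.sub return plain strings.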



Chunking

  • Chunking is the process of grouping several part-of-speech (POS) tagged tokens into a phrase; each resulting phrase is called a chunk.
  • Tagging each word in a sentence with its part of speech and then grouping the tags into phrases by chunking makes the meaning of the sentence easier to grasp.
  • Sequences such as (DT + JJ + NN), (DT + JJ + JJ + NN), and (JJ + NN) are all treated as noun phrases (NP).
  • If a tag pattern matches at overlapping locations, the leftmost match takes precedence.



  • Steps

    1. Define the grammar.
    2. Build the chunk parser from it: cp = nltk.RegexpParser(grammar)
    3. Load the sentence data (or construct it by hand for testing).
    4. Parse the sentence with the parser:

    cp.parse(sentence)


  • Base code

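    import nltk

    # Raw string so that the backslash in PP\$ reaches the chunk grammar unchanged.
    grammar = r"""
    NP: {<DT|PP\$>?<JJ>*<NN>}   # rule 1: optional determiner/possessive, adjectives, noun
        {<NNP>+}                # rule 2: one or more proper nouns
    """

    cp = nltk.RegexpParser(grammar)

    # A hand-made POS-tagged sentence for testing.
    sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"),
                ("hair", "NN")]

    result = cp.parse(sentence)
    print(result)

    (S (NP Rapunzel/NNP) let/VBD down/RP (NP her/PP$ long/JJ golden/JJ hair/NN))

    result.draw()  # opens a window showing the parse tree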

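The leftmost-match rule noted in the chunking section can be seen without NLTK in a small sketch. This is not how nltk.RegexpParser is implemented; it assumes a hypothetical helper, chunk_by_tag_pattern, that writes the tags as one string such as "<NN><NN><NN>" and runs an ordinary regular expression over it. With three consecutive nouns and the pattern <NN><NN>, only the first two nouns are chunked, because the leftmost match wins and matches cannot overlap.

```python
import re

def chunk_by_tag_pattern(tagged, pattern, label="NP"):
    """Group tokens whose tag sequence matches `pattern`.

    `pattern` is a plain regex over a string of '<TAG>' items.
    Hypothetical helper for illustration only, not part of NLTK.
    """
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    result, next_token = [], 0
    # re.finditer scans left to right and never returns overlapping matches.
    for match in re.finditer(pattern, tag_string):
        # Counting '<' converts a character offset into a token index.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        if start == end:
            continue  # ignore empty matches
        result.extend(tagged[next_token:start])   # tokens outside any chunk
        result.append((label, tagged[start:end])) # the chunk itself
        next_token = end
    result.extend(tagged[next_token:])
    return result

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
chunked = chunk_by_tag_pattern(nouns, r"<NN><NN>")
print(chunked)
# [('NP', [('money', 'NN'), ('market', 'NN')]), ('fund', 'NN')]
```

To chunk all three nouns together, the pattern would have to allow a longer run, for example (?:<NN>)+ in this sketch.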



Chinking

  • Chinking is the process of removing a sequence of tokens from a chunk. A chink is the part of a sentence that is left outside every chunk.
  • If the whole sentence is first defined as one chunk and specific parts are then chinked, what remains becomes the chunks. In other words, chunking can also be done by chinking.
  • code:
    import nltk

    grammar = r"""
    NP:
      {<.*>+}          # Chunk everything
      }<VBD|IN>+{      # Chink sequences of VBD and IN (the part to take out)
    """

    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
                ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
                ("the", "DT"), ("cat", "NN")]

    cp = nltk.RegexpParser(grammar)
    print(cp.parse(sentence))

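The idea of "chunk everything, then chink" can also be sketched in plain Python, independent of NLTK. The function name chink and its chink_tags parameter are assumptions made for this illustration: the whole sentence starts out as one chunk, and every token whose tag is in chink_tags is taken out, splitting the chunk at that point.

```python
def chink(tagged, chink_tags):
    """Split a tagged sentence into chunks by removing tokens with the given tags.

    Hypothetical helper for illustration only, not part of NLTK.
    """
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            # A chinked token closes the chunk collected so far.
            if current:
                chunks.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        chunks.append(current)
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

np_chunks = chink(sentence, {"VBD", "IN"})
print(np_chunks)
```

For this sentence the two remaining chunks are "the little yellow dog" and "the cat"; "barked" and "at" are the chinks.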


Chunk structure - IOB tags

  • Each token is given a chunk tag according to its position relative to a chunk: B (Begin), I (Inside), or O (Outside).
  • B-NP marks the first token of an NP chunk, and I-NP marks a token inside an NP chunk.
  • A chunk structure can be represented either with IOB tags or as a tree.

    • NLTK uses the tree representation.



  • code:

    from nltk.corpus import conll2000  # requires the conll2000 corpus (nltk.download('conll2000'))
    conll2000.iob_sents('train.txt')[99]

    [('Over', 'IN', 'B-PP'), ('a', 'DT', 'B-NP'), ('cup', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('coffee', 'NN', 'B-NP'), (',', ',', 'O'), ('Mr.', 'NNP', 'B-NP'), ('Stone', 'NNP', 'I-NP'), ('told', 'VBD', 'B-VP'), ('his', 'PRP$', 'B-NP'), ('story', 'NN', 'I-NP'), ('.', '.', 'O')]
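To make the relation between IOB tags and chunks concrete, the sketch below groups the (word, POS, IOB) triples shown above back into labeled chunks using plain Python. The function name iob_to_chunks is an assumption for this example; NLTK provides its own conversion utilities, which are not used here.

```python
def iob_to_chunks(iob_sentence):
    """Group (word, pos, iob_tag) triples into (label, [words]) chunks.

    Hypothetical helper for illustration only. Tokens tagged 'O' are skipped.
    """
    chunks = []
    current_label = None
    for word, _pos, iob in iob_sentence:
        if iob == "O":
            current_label = None  # outside any chunk
            continue
        prefix, label = iob.split("-", 1)
        if prefix == "B" or label != current_label:
            chunks.append((label, [word]))  # B- starts a new chunk
        else:
            chunks[-1][1].append(word)      # I- continues the current chunk
        current_label = label
    return chunks

iob_sentence = [('Over', 'IN', 'B-PP'), ('a', 'DT', 'B-NP'), ('cup', 'NN', 'I-NP'),
                ('of', 'IN', 'B-PP'), ('coffee', 'NN', 'B-NP'), (',', ',', 'O'),
                ('Mr.', 'NNP', 'B-NP'), ('Stone', 'NNP', 'I-NP'), ('told', 'VBD', 'B-VP'),
                ('his', 'PRP$', 'B-NP'), ('story', 'NN', 'I-NP'), ('.', '.', 'O')]

chunks = iob_to_chunks(iob_sentence)
for label, words in chunks:
    print(label, " ".join(words))
```

Note that "coffee" starts its own NP because it carries a B-NP tag, even though the previous chunk is a PP.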


  • Clause

    • If clauses are defined in the grammar, a sentence can be chunked as shown below.
    • Recursion in Linguistic Structure
    grammar = r"""
    NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
    PP: {<IN><NP>} # Chunk prepositions followed by NP
    VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
    CLAUSE: {<NP><VP>} # Chunk NP, VP
    """
    cp = nltk.RegexpParser(grammar)
    sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
    ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
    print(cp.parse(sentence))

    (S (NP Mary/NN) saw/VBD (CLAUSE (NP the/DT cat/NN) (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))


    • Passing loop=2 to nltk.RegexpParser() makes the parser run through its rules twice, so a clause nested inside another clause is also recognized (recursion). Building a deeper tree in this way to fit the sentence is called cascaded chunking.
    sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"), ("saw", "VBD"), ("the", "DT"),
                ("cat", "NN"), ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
    cp = nltk.RegexpParser(grammar, loop=2)
    print(cp.parse(sentence))

    With loop set, the sentence is split into a clause nested inside another clause.

    (S (NP John/NNP) thinks/VBZ (CLAUSE (NP Mary/NN) (VP saw/VBD (CLAUSE (NP the/DT cat/NN) (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))



Named Entity Recognition (NER)

  • Once named entities are tagged, question answering becomes possible (for example, a chatbot that finds an answer and presents it).

    sent = nltk.corpus.treebank.tagged_sents()[22]
    print(nltk.ne_chunk(sent, binary=True))

    (S The/DT (NE U.S./NNP) is/VBZ one/CD of/IN ... according/VBG to/TO (NE Brooke/NNP) T./NNP ... the/DT (NE University/NNP) of/IN (NE Vermont/NNP College/NNP) of/IN (NE Medicine/NNP) ./.)

    • Without binary=True:
    print(nltk.ne_chunk(sent))

    (S The/DT (GPE U.S./NNP) is/VBZ one/CD of/IN ... according/VBG to/TO (PERSON Brooke/NNP T./NNP Mossman/NNP) ... the/DT (ORGANIZATION University/NNP) of/IN (PERSON Vermont/NNP College/NNP) of/IN (GPE Medicine/NNP) ./.)





  • reference:

    • ์•„๋งˆ์ถ”์–ด ํ€€ํŠธ, blog.naver.com/chunjein
    • ์ฝ”๋“œ ์ถœ์ฒ˜: ํฌ๋ฆฌ์Šˆ๋‚˜ ๋ฐ”๋ธŒ์‚ฌ ์™ธ. 2019.01.31. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์ฟก๋ถ with ํŒŒ์ด์ฌ [ํŒŒ์ด์ฌ์œผ๋กœ NLP๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” 60์—ฌ ๊ฐ€์ง€ ๋ ˆ์‹œํ”ผ]. ์—์ด์ฝ˜
© 2020 jynee