I ran a quick test using the Brown corpus.
The source is below. It is quite slow, so out of 10 sentences I built a seed HMM tagger
from just 1 sentence and ran unsupervised learning on the remaining 9.
The unsupervised learning runs for 5 iterations; it takes about 5 minutes on my Mac.
---- baum.py ----
import re

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

def cut_corpus(num_sents):
    # take the first num_sents tagged sentences from the Brown 'news' section
    sentences = brown.tagged_sents(categories='news')[:num_sents]
    return sentences

def calc_sets(sentences):
    # strip Brown tag modifiers (e.g. 'NN-TL' -> 'NN'); '*' and '--' are kept as-is
    tag_re = re.compile(r'[*]|--|[^+*-]+')
    tag_set = set()
    symbols = set()
    cleaned_sentences = []
    for sentence in sentences:
        for i in range(len(sentence)):
            word, tag = sentence[i]
            word = word.lower()        # normalize
            symbols.add(word)          # log this word
            # Clean up the tag.
            tag = tag_re.match(tag).group()
            tag_set.add(tag)
            sentence[i] = (word, tag)  # store cleaned-up tagged token
        cleaned_sentences.append(sentence)
    return cleaned_sentences, list(tag_set), list(symbols)

def baum_demo(sentences, learn_ratio, tagset, symbols):
    print "Baum-Welch demo"
    #print "tagset = " + str(tagset)
    #print "symbols = " + str(symbols)

    # the first learn_ratio of the sentences seed a supervised tagger,
    # the rest are used (with their tags hidden) for Baum-Welch
    edge = int(len(sentences) * learn_ratio)
    tagged = sentences[:edge]
    untagged = sentences[edge:]

    trainer = HiddenMarkovModelTrainer(tagset, symbols)
    hmm = trainer.train_supervised(tagged,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    print "generated tagged initial tagger"

    # hide the tags of the remaining sentences
    training = []
    for item in untagged:
        training.append([(word, None) for word, tag in item])
    #print training

    # refine the seed model with Baum-Welch on the untagged sentences
    trainer2 = HiddenMarkovModelTrainer(tagset, symbols)
    unsuperv = trainer2.train_unsupervised(training, model=hmm,
                                           max_iterations=5)
    print '\nunsupervised:\n'
    unsuperv.test(sentences[:10], verbose=False)

    # for comparison: a fully supervised tagger trained on all the sentences
    print '\nsupervised:\n'
    trainer3 = HiddenMarkovModelTrainer(tagset, symbols)
    hmm2 = trainer3.train_supervised(sentences,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    hmm2.test(sentences[:10], verbose=False)

if __name__ == '__main__':
    sentences = cut_corpus(10)
    cleaned, tagset, symbols = calc_sets(sentences)
    baum_demo(cleaned, 0.1, tagset, symbols)
---- end of baum.py ----
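To reproduce this, NLTK itself has to be installed and the Brown corpus data fetched beforehand;
I'm assuming a one-time setup along these lines before running python baum.py:

import nltk
nltk.download('brown')   # fetch the Brown corpus data used by brown.tagged_sents()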
And the results...
Baum-Welch demo
generated tagged initial tagger
iteration 0 logprob -1834.03048923
iteration 1 logprob -1608.92558988
iteration 2 logprob -1544.94196931
iteration 3 logprob -1432.04010957
iteration 4 logprob -1315.68316984
unsupervised:
accuracy over 284 tokens: 35.21
supervised:
accuracy over 284 tokens: 100.00
Tested against the original data, unsupervised comes out at 35% accuracy and supervised at 100%.
The 100% is not surprising, since the supervised tagger is tested on the very sentences it was trained on.
Increasing the number of sentences makes the run time grow quickly.
The logprob is still quite low, so I suspect the unsupervised training
hasn't really gotten to the point of learning yet.
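If I wanted to push the training further, NLTK's train_unsupervised also accepts, as far as I can tell,
a convergence_logprob threshold in addition to max_iterations, so the call inside baum_demo could be
changed to something like the sketch below (the particular values here are just guesses, not tuned settings):

# Sketch with untested values: let Baum-Welch keep iterating until the
# per-iteration improvement in log-probability falls below a threshold,
# instead of always stopping after 5 iterations.
unsuperv = trainer2.train_unsupervised(training, model=hmm,
                                       max_iterations=100,
                                       convergence_logprob=1e-3)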