Mecha Lady Shogi Blog: October 2012

Since yesterday, the Twitter API's behavior has changed: posts that do
not contain a Content-Length header are refused by the API server.

(As of today (10/28 0:00 JST), this behavior has been fixed and the workaround is no longer needed. (kimrin))

A workaround for this issue has already been submitted as a pull request by
jschauma, and the code works fine.
For details, see GitHub:
https://github.com/tweepy/tweepy/pull/214/files
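
HTTP status 411 means "Length Required". The following is a minimal sketch of the general idea, not the actual patch from the pull request: it assumes the fix boils down to always sending an explicit Content-Length header, even for a POST with an empty body. The helper name `ensure_content_length` is hypothetical and is not part of tweepy.

```python
def ensure_content_length(headers, body=None):
    """Return a copy of `headers` with an explicit Content-Length.

    A missing body counts as empty, so the header becomes "0"
    instead of being left out of the request.
    """
    if body is None:
        body = b""
    elif isinstance(body, str):
        # Count bytes, not characters.
        body = body.encode("utf-8")
    result = dict(headers)  # do not modify the caller's dict
    result["Content-Length"] = str(len(body))
    return result
```

A caller would pass the resulting headers to whatever HTTP client actually sends the request.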

Most people who installed tweepy did so with easy_install, so the actual tweepy/binder.py file
lives inside the egg file.
On my Mac, the egg is in /Library/Python/2.6/site-packages/ and its name begins with tweepy-.
So a silly, temporary workaround is:
1. Extract the egg file.
2. Modify tweepy/binder.py (only 3 lines are added).
3. Zip it up again and replace the old file (keep the old file under a different name).
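
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the egg is a zip archive rather than a directory, and the function name `patch_egg`, the `.orig` backup suffix, and the `patch` callback are all hypothetical.

```python
import os
import shutil
import tempfile
import zipfile


def patch_egg(egg_path, member, patch):
    """Rewrite one file inside a zipped egg, keeping the old egg as a backup.

    `patch` takes the file's text and returns the new text.
    Returns the path of the backup copy.
    """
    workdir = tempfile.mkdtemp()
    try:
        # 1. Extract the egg into a scratch directory.
        with zipfile.ZipFile(egg_path) as egg:
            egg.extractall(workdir)

        # 2. Modify the target file in the extracted tree.
        target = os.path.join(workdir, *member.split("/"))
        with open(target) as f:
            source = f.read()
        with open(target, "w") as f:
            f.write(patch(source))

        # 3. Rename the old egg, then zip the tree back up under the old name.
        backup = egg_path + ".orig"
        os.rename(egg_path, backup)
        with zipfile.ZipFile(egg_path, "w", zipfile.ZIP_DEFLATED) as egg:
            for root, _dirs, files in os.walk(workdir):
                for name in files:
                    full = os.path.join(root, name)
                    arcname = os.path.relpath(full, workdir).replace(os.sep, "/")
                    egg.write(full, arcname)
        return backup
    finally:
        shutil.rmtree(workdir)
```

Here `member` would be `tweepy/binder.py`, and `patch` would be a function that inserts the three lines from the pull request into that file's text.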

I hope this workaround will not be needed in the near future (once the fix is released,
we should be able to pick it up through easy_install).

kimrin

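I remembered that NLTK's nltk.tag.hmm supports unsupervised learning,
so I ran a simple test using the Brown corpus.

The source is below. Because it takes a very long time, I took 10 sentences, built a
seed HMM tagger from 1 of them, and ran unsupervised learning on the remaining 9.
The unsupervised learning runs for 5 iterations. It takes about 5 minutes on my Mac.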

---- baum.py ----

import nltk
import re
from nltk.corpus import brown
from nltk.probability import FreqDist, ConditionalFreqDist, \
    ConditionalProbDist, DictionaryProbDist, DictionaryConditionalProbDist, \
    LidstoneProbDist, MutableProbDist, MLEProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

def cut_corpus(num_sents):
    # Take the first num_sents tagged sentences of the Brown news category.
    sentences = brown.tagged_sents(categories='news')[:num_sents]
    return sentences

def calc_sets(sentences):
    tag_re = re.compile(r'[*]|--|[^+*-]+')
    tag_set = set()
    symbols = set()

    cleaned_sentences = []
    for sentence in sentences:
        for i in range(len(sentence)):
            word, tag = sentence[i]
            word = word.lower()  # normalize
            symbols.add(word)    # log this word
            # Clean up the tag.
            tag = tag_re.match(tag).group()
            tag_set.add(tag)
            sentence[i] = (word, tag)  # store cleaned-up tagged token
        cleaned_sentences += [sentence]

    return cleaned_sentences, list(tag_set), list(symbols)

def baum_demo(sentences, learn_ratio, tagset, symbols):
    print("Baum-Welch demo")
    #print("tagset = " + str(tagset))
    #print("symbols = " + str(symbols))

    # The first learn_ratio of the sentences keep their tags and seed the model;
    # the rest are used without tags.
    edge = int(len(sentences) * learn_ratio)
    tagged = sentences[:edge]
    untagged = sentences[edge:]

    trainer = HiddenMarkovModelTrainer(tagset, symbols)
    hmm = trainer.train_supervised(tagged,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    print("generated tagged initial tagger")

    # Strip the tags from the remaining sentences.
    training = []
    for item in untagged:
        training.append([(word, None) for word, tag in item])
    #print(training)

    # Refine the seed model with Baum-Welch on the untagged sentences.
    unsuperv = trainer.train_unsupervised(training, model=hmm,
        max_iterations=5)
    print('\nunsupervised:\n')
    unsuperv.test(sentences[:10], verbose=False)

    # For comparison: a fully supervised model trained on all sentences.
    print('\nsupervised:\n')
    hmm2 = trainer.train_supervised(sentences,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    hmm2.test(sentences[:10], verbose=False)

if __name__ == '__main__':
    sentences = cut_corpus(10)
    cleaned, tagset, symbols = calc_sets(sentences)
    baum_demo(cleaned, 0.1, tagset, symbols)

---- end of baum.py ----

As for the results...

Baum-Welch demo
generated tagged initial tagger
iteration 0 logprob -1834.03048923
iteration 1 logprob -1608.92558988
iteration 2 logprob -1544.94196931
iteration 3 logprob -1432.04010957
iteration 4 logprob -1315.68316984

unsupervised:

accuracy over 284 tokens: 35.21

supervised:

accuracy over 284 tokens: 100.00

Testing against the original data gave an accuracy of 35% for the unsupervised model and 100% for the supervised model.
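
To make the accuracy figures concrete, here is a minimal sketch that assumes the reported number is the percentage of tokens whose predicted tag matches the gold tag. The function `tag_accuracy` is hypothetical and is not NLTK's own code. Under that assumption, 35.21 over 284 tokens corresponds to 100 correct tags, since 100 / 284 is about 0.3521.

```python
def tag_accuracy(gold, predicted):
    """Percentage of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of sentences; each sentence is a list of
    (word, tag) pairs, aligned token by token.
    """
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    return 100.0 * correct / total
```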

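Increasing the number of sentences makes the run time grow longer and longer.
The logprob is still low, so I think the model has probably not
really learned anything yet.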
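
The logprob values are easier to read with a definition in hand. The sketch below assumes the printed value is the log probability of the untagged training sentences under the current model, summed over sentences. The toy model, its numbers, and the function names are made up for illustration, and the logarithm base may differ from the one NLTK uses.

```python
import math
from itertools import product

# Hypothetical toy HMM: two tags, two words. All numbers are made up.
STATES = ("N", "V")
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}


def forward_logprob(words):
    """Log probability of a non-empty word sequence, summed over all tag paths."""
    # alpha[s] = P(words seen so far, current tag = s)
    alpha = {s: START[s] * EMIT[s][words[0]] for s in STATES}
    for word in words[1:]:
        alpha = {
            s: EMIT[s][word] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return math.log(sum(alpha.values()))


def corpus_logprob(sentences):
    """Total log probability of a corpus: the per-sentence values add up."""
    return sum(forward_logprob(s) for s in sentences)
```

Every sentence has a probability below 1, so its log is negative, and a larger corpus pushes the total further below zero. Each Baum-Welch iteration is meant to raise this value, as in the run above (from about -1834 to about -1316). Note that it measures how well the model explains the word sequences, not how accurate the tags are.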


Sunday, October 28, 2012

tweepy failed to post by HTTP error 411

Sunday, October 21, 2012

About the change of the blog title and the shogi software's name

Saturday, October 20, 2012

unsupervised learning by using NLTK
