Sunday, September 27, 2009

[Ruby][Snippet] Splitting a string of English text into sentences

A monkey patch that splits a string of English text into sentences. It is still rough around the edges, though.

I posted it on Snipplr.
http://snipplr.com/view/20306/split-string-into-sentences/

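Features
  • Splits English text into sentences at . (period), ?, and !
  • Handles titles such as Mr. and Dr., as well as St. and Mt.
  • Handles abbreviations in proper nouns (personal names, company names, and so on)
  • Handles ." and .) when a sentence ends with " or )
Not handled yet
  • Decimal points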
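
One possible way to deal with the decimal point limitation is to hide the "." inside numbers before splitting and put it back afterwards. The following is only a sketch under assumptions: `mask_decimals`, `unmask_decimals`, and `DECIMAL_MARK` are hypothetical names that are not part of the snippet, and a plain `scan` stands in for the real splitter.

```ruby
# Hypothetical helpers, not part of the original snippet.
# Assumption: the placeholder character never appears in the input text.
DECIMAL_MARK = "\u0001"

# Treat a "." with a digit on both sides as a decimal point and hide it.
def mask_decimals(text)
  text.gsub(/(?<=\d)\.(?=\d)/, DECIMAL_MARK)
end

# Restore the hidden decimal points.
def unmask_decimals(text)
  text.gsub(DECIMAL_MARK, ".")
end

text = "Pi is about 3.14. The house is 7.5 meters away."

# Naive splitting breaks the numbers apart.
naive = text.scan(/[^.?!]+[.?!]/).map(&:strip)
p naive   # => ["Pi is about 3.", "14.", "The house is 7.", "5 meters away."]

# Masking first keeps each number inside its sentence.
masked = mask_decimals(text).scan(/[^.?!]+[.?!]/).map { |c| unmask_decimals(c.strip) }
p masked  # => ["Pi is about 3.14.", "The house is 7.5 meters away."]
```

This does not cover a sentence that ends with a number followed directly by a digit-initial sentence without a space, which seems rare enough in ordinary prose.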


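# split_sentence.rb

class String
  # Split English text into an array of sentences.
  def split_sentence
    # Turn newlines into spaces, then cut into chunks ending in ".", "?" or "!".
    ary = self.gsub(/\n/, " ").split(/([^\.\?\!]+[\.\?\!])/)
    ary.delete("")
    sentences = Array.new
    str = String.new
    ary.each_index do |i|
      next if ary[i].size == 0 || ary[i] =~ /^\s*$/
      str << ary[i]
      # A title, St. or Mt. at the end of the buffer is not a sentence end.
      # The alternatives are grouped so that \.$ applies to all of them.
      next if str =~ /\b(?:Mr|Mrs|Ms|Dr|Mt|St)\.$/
      if i < ary.size - 1
        # An initial such as "J." or a lowercase continuation is not a boundary.
        next if ary[i] =~ /[A-Z]\.$/
        next if ary[i + 1] =~ /^\s*[a-z]/
      end
      # Move a closing " or ) from the start of the next chunk onto this sentence.
      if ary[i + 1] =~ /^\"/
        str << '"'
        ary[i + 1].sub!(/^\"/, "")
      elsif ary[i + 1] =~ /^\)/
        str << ')'
        ary[i + 1].sub!(/^\)/, "")
      end
      sentences << str.sub(/^\s+/, "")
      str = String.new
    end
    # Keep whatever is left when the text ends on an abbreviation.
    sentences << str.sub(/^\s+/, "") unless str.empty?
    sentences
  end
end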
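
A title check of this kind depends on how `|` groups inside a regular expression: alternation binds more loosely than concatenation, so a trailing `\.$` attaches only to the last alternative unless the alternatives are wrapped in a group. A small sketch with made-up sample strings (the constant names `UNGROUPED` and `GROUPED` are just for illustration):

```ruby
# Without a group, this reads as: Mr  OR  Mrs  OR  ...  OR  St\.$
UNGROUPED = /Mr|Mrs|Ms|Dr|Mt|St\.$/
# With a group, every alternative must be followed by "." at the end.
GROUPED = /\b(?:Mr|Mrs|Ms|Dr|Mt|St)\.$/

samples = ["I met Mr.", "Mr. Smith went home.", "Turn left at Main St."]
samples.each do |s|
  puts "#{s.inspect}: ungrouped=#{s.match?(UNGROUPED)} grouped=#{s.match?(GROUPED)}"
end
# "I met Mr.": ungrouped=true grouped=true
# "Mr. Smith went home.": ungrouped=true grouped=false
# "Turn left at Main St.": ungrouped=true grouped=true
```

The second sample is the interesting one: the ungrouped pattern matches the "Mr" at the start, so a complete sentence would be treated as if it still ended in a title.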


Usage

require_relative 'split_sentence'
str = "
Sue looked out the window. What was there to count? There was only an empty yard and the blank side of the house seven meters away. An old ivy vine, going bad at the roots, climbed half way up the wall. The cold breath of autumn had stricken leaves from the plant until its branches, almost bare, hung on the bricks.
\"What is it, dear?\" asked Sue.
\"Six,\" said Johnsy, quietly. \"They're falling faster now. Three days ago there were almost a hundred. It made my head hurt to count them. But now it's easy. There goes another one. There are only five left now.\""

str.split_sentence.each do |s|
  puts s
end


(Output)

Sue looked out the window.
What was there to count?
There was only an empty yard and the blank side of the house seven meters away.
An old ivy vine, going bad at the roots, climbed half way up the wall.
The cold breath of autumn had stricken leaves from the plant until its branches, almost bare, hung on the bricks.
"What is it, dear?"
asked Sue.
"Six," said Johnsy, quietly.
"They're falling faster now.
Three days ago there were almost a hundred.
It made my head hurt to count them.
But now it's easy.
There goes another one.
There are only five left now."
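
The break between `"What is it, dear?"` and `asked Sue.` follows from the first chunking step: `String#split` with a capture group returns the matched chunks themselves, each chunk ends at the first ".", "?" or "!", and so a closing quote ends up at the start of the next chunk. A minimal sketch of just that step, using one line from the sample text:

```ruby
text = '"What is it, dear?" asked Sue.'

# The capture group makes split return the matched chunks;
# the empty strings between adjacent matches are dropped.
chunks = text.split(/([^.?!]+[.?!])/).reject(&:empty?)
p chunks
# => ["\"What is it, dear?", "\" asked Sue."]
```

The second chunk begins with the closing quote, which is why the method moves a leading " or ) back onto the sentence before it. Since what remains starts with a lowercase word after a quote rather than directly after the period, it is emitted as its own line.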
