Python - NLTK - Accessing Text Corpora and Lexical Resources (corpus module, word token count, len function, set function)

Development environment

macOS High Sierra - Apple
Emacs (Text Editor)
Python 3.6 (programming language)

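Working through exercise 2 of section 2.8 (Exercises) in chapter 2 (Accessing Text Corpora and Lexical Resources) of 入門自然言語処理 (the Japanese edition of Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper; translated by Masato Hagiwara, Takahiro Nakayama, and Takaaki Mizuno; O'Reilly Japan).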

Code (Emacs)

Python 3

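#!/usr/bin/env python3
# Requires the Gutenberg corpus data: run nltk.download('gutenberg') once.
import nltk

print('2.')

# Load "Emma" as a sequence of word tokens (words and punctuation marks).
words = nltk.corpus.gutenberg.words('austen-emma.txt')
print(words[:10])
print(words[-10:])
# len() counts every occurrence (tokens); set() keeps each distinct token once (types).
print(f'Number of word tokens: {len(words)}')
print(f'Number of word types: {len(set(words))}')

Input/output (Terminal, Jupyter (IPython))

$ ./sample2.py
2.
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']
['answered', 'in', 'the', 'perfect', 'happiness', 'of', 'the', 'union', '.', 'FINIS']
Number of word tokens: 192427
Number of word types: 7811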
$
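
The counts above treat punctuation marks as tokens and distinguish 'The' from 'the', because set() compares strings exactly. As a rough sketch of how the numbers change under other conventions, the helper below counts tokens and types for any list of strings. The function name count_tokens_and_types and its ignore_case and alpha_only options are hypothetical, introduced only for illustration; they are not part of NLTK. A small hand-written sample stands in for the corpus so the block runs without any corpus data.

```python
#!/usr/bin/env python3
"""Count word tokens and word types in a sequence of string tokens."""


def count_tokens_and_types(tokens, ignore_case=False, alpha_only=False):
    """Return (token_count, type_count) for an iterable of strings.

    ignore_case and alpha_only are illustrative options, not NLTK features.
    """
    tokens = list(tokens)
    if alpha_only:
        # Drop punctuation and numbers such as '[' or '1816'.
        tokens = [t for t in tokens if t.isalpha()]
    if ignore_case:
        # Fold case so that 'The' and 'the' count as one type.
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))


# Hand-written sample; not taken verbatim from the corpus.
sample = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']',
          'The', 'the', 'union', '.', 'the']

print(count_tokens_and_types(sample))  # (12, 11)
print(count_tokens_and_types(sample, ignore_case=True, alpha_only=True))  # (8, 6)
```

With the Gutenberg corpus installed, the same function should accept the words object from the script above, for example count_tokens_and_types(words, ignore_case=True); the resulting figures are not shown here because they were not measured for this post.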

Kamimura's blog


Wednesday, July 4, 2018
