Shylock Hg

My own blog powered by Hugo and Ivy.

Accessing web and local text


1.Handling plain web text

1.1.Accessing web text

Access web text as below:

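#Python 2 API: urllib.urlopen
import urllib
import nltk   #used by the following steps

url = ''

#read the whole response body into a string
raw = urllib.urlopen(url).read()

#the same request through a proxy
proxy = {'http':''}
raw = urllib.urlopen(url,proxies=proxy).read()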
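
For readers on Python 3, where `urllib.urlopen` no longer exists, the same download can be sketched with `urllib.request`. This is only a minimal sketch: the helper name `fetch_text`, the default UTF-8 encoding, and the `data:` URL used to keep the example offline are assumptions, not part of the original snippet.

```python
import urllib.request


def fetch_text(url, proxies=None, encoding="utf-8"):
    """Download `url` and return its body as a string.

    `proxies` is an optional dict such as {'http': '<proxy address>'};
    when it is None the system proxy settings are used.
    """
    handlers = []
    if proxies is not None:
        # An explicit ProxyHandler replaces the default one.
        handlers.append(urllib.request.ProxyHandler(proxies))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url) as response:
        return response.read().decode(encoding)


# A data: URL is used here only so the example runs without network access.
sample = fetch_text("data:text/plain;charset=utf-8,Hello%20NLTK")
print(sample)
```

Passing a real `http://` address and, if needed, a `proxies` dict gives the Python 3 counterpart of the two calls above.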

1.2.Tokenizing the text

Tokenize the text string to produce a list of tokens.

#same context as above

#tokenize the text string
tokens = nltk.word_tokenize(raw)
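
To make the idea of a token concrete, here is a small Python 3 sketch that splits a string into runs of word characters and runs of punctuation. It is a plain regular-expression stand-in, not the algorithm behind `nltk.word_tokenize`, and the name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and punctuation tokens.

    Regex stand-in for illustration only; nltk.word_tokenize applies
    more refined rules (for example around contractions).
    """
    return re.findall(r"\w+|[^\w\s]+", text)


tokens = simple_tokenize("Hello, world! It's 2024.")
print(tokens)
```

Note that this stand-in splits "It's" into three tokens, which is one reason to prefer a real tokenizer for serious work.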

1.3.Creating nltk.Text object

Once the tokens are wrapped in an nltk.Text object, we can analyze the text with NLTK.

#same context as above

#create an nltk.Text object from the tokens
text = nltk.Text(tokens)

We can then analyze the text through the methods of the nltk.Text object, such as:

#same context as above

#print the collocations found in the text
text.collocations()
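
As a rough illustration of what a collocation search looks for, the Python 3 sketch below counts adjacent word pairs and returns the most frequent ones. It uses raw frequency only, which is an assumption made for brevity; it is not the scoring NLTK applies, and `frequent_bigrams` is a hypothetical helper.

```python
from collections import Counter


def frequent_bigrams(tokens, top=3):
    """Return the `top` most frequent adjacent word pairs.

    Simplification: pairs are ranked by raw count, and non-alphabetic
    tokens are dropped before pairing.
    """
    words = [t.lower() for t in tokens if t.isalpha()]
    pairs = Counter(zip(words, words[1:]))
    return [pair for pair, _ in pairs.most_common(top)]


sample_tokens = "the red wine and the red wine glass and red wine".split()
print(frequent_bigrams(sample_tokens, top=2))
```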

2.Handling HTML document

2.1.Accessing HTML document

Access an HTML document as below:

import urllib
import nltk

#get the html document by url
url = ''
html_doc = urllib.urlopen(url).read()

#get the plain text content of the html document
raw = nltk.clean_html(html_doc)

#tokenize the text string
tokens = nltk.word_tokenize(raw)

#create nltk.Text object
text = nltk.Text(tokens)

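Note: the result still contains a lot of content we do not need. You can clean it up by hand or use a dedicated tool such as BeautifulSoup. Newer NLTK releases no longer implement clean_html and point to BeautifulSoup's get_text() instead.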
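
Where neither `clean_html` nor BeautifulSoup is at hand, tag stripping can be approximated with the Python 3 standard library. The sketch below is a stand-in under that assumption: `html_to_text` and `_TextExtractor` are hypothetical names, and the only clean-up it performs is dropping `<script>` and `<style>` content and collapsing whitespace.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html_doc):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_doc)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = ("<html><head><style>p {}</style><script>var x;</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
print(html_to_text(page))
```

Navigation menus and similar page furniture still come through as text, so the manual clean-up mentioned above remains necessary.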

3.Handling Search Engine Results

4.Handling RSS Feeds

Access RSS feeds as below:

import feedparser
import nltk

#get blog
llog = feedparser.parse('!feed=atom')

#overview of the blog: its title
llog['feed']['title']

#post count
len(llog.entries)

#get post
post = llog.entries[2]

#get content of post
content = post.content[0].value

#tokenize content
tokens = nltk.word_tokenize(nltk.clean_html(content))
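
To show what the parsed feed contains, here is a Python 3 sketch that reads an Atom document with the standard library instead of feedparser. Everything in it is an assumption for illustration: the inline `SAMPLE_FEED` is made up so the example runs offline, `read_atom` is a hypothetical helper, and only Atom (not RSS 2.0) is handled.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical two-entry feed, used so the example needs no network access.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello feed&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <content type="html">&lt;p&gt;More text&lt;/p&gt;</content>
  </entry>
</feed>"""


def read_atom(xml_text):
    """Return (feed title, list of entry contents) from an Atom document."""
    root = ET.fromstring(xml_text)
    title = root.findtext("atom:title", namespaces=ATOM_NS)
    contents = [
        entry.findtext("atom:content", default="", namespaces=ATOM_NS)
        for entry in root.findall("atom:entry", ATOM_NS)
    ]
    return title, contents


feed_title, posts = read_atom(SAMPLE_FEED)
print(feed_title, len(posts))
print(posts[0])
```

Each entry's content is still HTML, which is why the snippet above strips the markup before tokenizing.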

5.Handling Local Files

As below:

import nltk

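#read the whole file into one string
f = open('document.txt')
raw = f.read()

#read the file line by line
f = open('document.txt','rU')
for line in f:
    print(line.strip())

#access a file from the NLTK corpora
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path,'rU').read()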
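
A Python 3 version of the two reading styles might look like the sketch below. It assumes UTF-8 text and writes a small temporary `document.txt` first so that it runs anywhere; `read_raw` and `read_lines` are hypothetical helper names.

```python
import os
import tempfile


def read_raw(path, encoding="utf-8"):
    """Read the whole file into one string."""
    with open(path, encoding=encoding) as f:
        return f.read()


def read_lines(path, encoding="utf-8"):
    """Read the file line by line, stripping surrounding whitespace."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document.txt")
    # Sample content, created only for this demonstration.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Time flies like an arrow.\nFruit flies like a banana.\n")
    raw = read_raw(path)
    lines = read_lines(path)

print(lines)
```

The `with` blocks close the file automatically, which the bare `open` calls above leave to the interpreter.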

6.Handling PDF,MSWord & other Binary Formats

Use a third-party library or extract the text by hand.

7.Handling User Input

#read a line typed by the user (Python 2); avoid naming the variable str, which shadows the builtin
s = raw_input('Enter some text:')
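
In Python 3, `raw_input` is spelled `input`. The sketch below wraps the prompt in a function so the typed text can be processed right away; the name `count_typed_words`, the whitespace split, and the injectable `reader` parameter (there so the function can be exercised without a keyboard) are assumptions.

```python
def count_typed_words(prompt="Enter some text: ", reader=input):
    """Prompt for a line of text and return how many words it contains.

    `reader` defaults to the builtin input(); any callable that takes
    the prompt and returns a string can be substituted.
    """
    s = reader(prompt)
    return len(s.split())


# Simulated user input, so the example does not block waiting for a keyboard.
typed = count_typed_words(reader=lambda prompt: "On an exceptionally hot evening")
print("You typed", typed, "words.")
```

Calling `count_typed_words()` with no arguments prompts on the console as in the snippet above.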

8.The common NLP data conversions

As below:
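
One plausible chain of conversions, pieced together from the earlier sections, goes from a raw string to tokens, then to normalized words, then to a vocabulary. The Python 3 sketch below assumes that order and uses a regular-expression tokenizer as a stand-in for NLTK; all function names are hypothetical.

```python
import re


def to_tokens(raw):
    """String -> list of word and punctuation tokens (regex stand-in)."""
    return re.findall(r"\w+|[^\w\s]+", raw)


def to_words(tokens):
    """Tokens -> lowercased alphabetic words."""
    return [t.lower() for t in tokens if t.isalpha()]


def to_vocab(words):
    """Words -> sorted list of distinct words."""
    return sorted(set(words))


raw = "The cat sat. The cat ran!"
tokens = to_tokens(raw)
words = to_words(tokens)
vocab = to_vocab(words)
print(vocab)
```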