Abstract
Twitter has attracted millions of users to share and disseminate
most up-to-date information, resulting in large volumes of data produced
everyday. However, many applications in Information Retrieval (IR) and Natural
Language Processing (NLP) suffer severely from the noisy and short nature of
tweets. In this paper, we propose a novel framework for tweet segmentation in a
batch mode, called HybridSeg. By splitting tweets into meaningful segments, the
semantic or context information is well preserved and easily extracted by the
downstream applications. HybridSeg finds the optimal segmentation of a tweet by
maximizing the sum of the stickiness scores of its candidate segments. The
stickiness score considers the probability of a segment being a phrase in English
(i.e., global context) and the probability of a segment being a phrase within
the batch of tweets (i.e., local context). For the latter, we propose and
evaluate two models to derive local context by considering the linguistic
features and term-dependency in a batch of tweets, respectively. HybridSeg is
also designed to iteratively learn from confident segments as pseudo feedback.
Experiments on two tweet data sets show that tweet segmentation quality is
significantly improved by learning both global and local contexts compared with
using global context alone. Through analysis and comparison, we show that local
linguistic features are more reliable for learning local context compared with
term-dependency. As an application, we show that high accuracy is achieved in
named entity recognition by applying segment-based part-of-speech (POS)
tagging.
Aim
The main aim is to achieve high accuracy in named entity
recognition by the task of tweet segmentation.
Scope
The scope
is to split a tweet into a sequence of consecutive segments, so that the tweets
are preserved and easily extracted by downstream applications.
Existing System
Both tweet segmentation and named entity recognition are
considered important subtasks in NLP. Many existing NLP techniques heavily rely
on linguistic features, such as POS (Parts-of-speech) tags of the surrounding
words, word capitalization, trigger words and gazetteers. These linguistic
features, together with effective supervised learning algorithms and
conditional random field (CRF)), achieve very good performance on formal text
corpus. However, these techniques experience
severe performance deterioration on tweets because of the noisy and short
nature of the latter.
Disadvantages
Given the limited length of a tweet (i.e., 140 characters) and
no restrictions on its writing styles, tweets often contain grammatical errors,
misspellings, and informal abbreviations. The error-prone and short nature of
tweets often make the word-level language models for tweets less reliable.
Advantages
HybridSeg is designed to iteratively learn from confident
segments as pseudo feedback. Experiments on two tweet data sets show that tweet
segmentation quality is significantly improved by learning both global and
local contexts compared with using global context alone. Through analysis and
comparison, we show that local linguistic features are more reliable for
learning local context compared with term-dependency. As an application, we
show that high accuracy is achieved in named entity recognition by applying
segment-based part-of-speech (POS) tagging.
System Architecture
SYSTEM CONFIGURATION
HARDWARE REQUIREMENTS:-
· Processor - Pentium –III
· Speed - 1.1 Ghz
· RAM - 256 MB(min)
· Hard Disk - 20 GB
· Floppy Drive - 1.44 MB
· Key Board - Standard
Windows Keyboard
· Mouse - Two or Three Button Mouse
· Monitor - SVGA
SOFTWARE REQUIREMENTS:-
·
Operating
System : Windows 7
·
Front
End :
JSP AND SERVLET
·
Database :
MYSQL
References
Aixin
Sun ; Jianshu Weng ; Qi He “TWEET SEGMENTATION AND ITS APPLICATION TO NAMED
ENTITY RECOGNITION” Knowledge and Data Engineering, IEEE Transactions on (Volume:27 ,
Issue: 2 ) May 2014.
No comments:
Post a Comment