Abstract
Short-text messages such as tweets are being created and shared at an unprecedented rate.
Tweets, in their raw form, while being informative, can also be overwhelming.
For both end-users and data analysts, it is a nightmare to plow through
millions of tweets that contain an enormous amount of noise and redundancy. In
this paper, we propose a novel continuous summarization framework called Sumblr
to alleviate the problem. In contrast to traditional document summarization
methods, which focus on static and small-scale datasets, Sumblr is designed to
deal with dynamic, fast arriving, and large-scale tweet streams. Our proposed
framework consists of three major components. First, we propose an online tweet
stream clustering algorithm to cluster tweets and maintain distilled statistics
in a data structure called Tweet Cluster Vector (TCV). Second, we develop a
TCV-Rank summarization technique for generating online summaries and historical
summaries of arbitrary time durations. Third, we design an effective topic evolution
detection method, which monitors summary-based/volume-based variations to
produce timelines automatically from tweet streams. Our experiments on
large-scale real tweets demonstrate the efficiency and effectiveness of our
framework.
Aim
The main aim is to summarize continuous tweet streams using a framework called “Sumblr”,
and to generate timelines in the context of streams.
Scope
The scope is to propose an online tweet stream clustering algorithm, to develop a
TCV (Tweet Cluster Vector)-Rank summarization technique, and to design an effective
topic evolution detection method.
Existing System
Stream Data Clustering
Stream data clustering has been widely studied in the literature. BIRCH clusters the
data based on an in-memory structure called CF-tree instead of the original
large dataset. Bradley et al. proposed a scalable clustering framework which
selectively stores important portions of the data, and compresses or discards
other portions. CluStream is one of the classic stream clustering methods.
It consists of an online micro-clustering component and an offline
macro-clustering component. The pyramidal time frame was also proposed to
recall historical micro-clusters for different time durations. A variety of
services on the Web, such as news filtering, text crawling, and topic detection,
have posed requirements for text stream clustering.
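The in-memory summaries that BIRCH relies on can be illustrated with a clustering feature: a count, a linear sum, and a sum of squares, which can be updated and merged without keeping the original points. The following is a minimal sketch only; the class and method names are illustrative assumptions and it does not build the CF-tree itself.

```python
import math


class ClusteringFeature:
    """Additive cluster summary: count N, linear sum LS, sum of squares SS."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.square_sum = 0.0

    def add(self, point):
        # Absorb one point without storing it.
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

    def merge(self, other):
        # Summaries are additive, so two clusters merge component-wise.
        self.n += other.n
        self.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        self.square_sum += other.square_sum

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root-mean-square distance of the points from the centroid:
        # sqrt(SS / N - ||LS / N||^2), clamped against rounding error.
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))


cf = ClusteringFeature(dim=2)
cf.add([0.0, 0.0])
cf.add([2.0, 0.0])
print(cf.centroid(), cf.radius())
```

Because every field is a plain sum, the centroid and radius of a cluster can be recovered at any time from the summary alone, which is what makes this kind of structure suitable for streams.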
Document/Microblog Summarization
Document summarization can be categorized as
extractive and abstractive. The former selects sentences from the documents,
while the latter may generate phrases and sentences that do not appear in the
original documents. In this paper, we focus on extractive summarization.
Extractive document summarization has received a lot of recent attention. Most
methods assign salience scores to sentences of the documents, and select the
top-ranked sentences. Some works try to extract summaries without such salience
scores. Wang et al. used the Symmetric Non-negative Matrix Factorization (SNMF)
to cluster sentences and choose sentences in each cluster for summarization. He
et al. proposed to summarize documents from the perspective of data
reconstruction, and select sentences that can best reconstruct the original
documents. Xu et al. modeled documents (hotel reviews) as multi-attribute
uncertain data and optimized a probabilistic coverage problem of the summary.
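The score-and-select pattern used by most extractive methods can be sketched in a few lines. The scoring function below (average corpus frequency of a sentence's words) is an assumption chosen for brevity, not the scoring used by any of the cited works, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    word_freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Salience here is the mean corpus frequency of the sentence's words.
        words = tokenize(sentence)
        if not words:
            return 0.0
        return sum(word_freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


docs = ["storm hits coast", "storm hits city", "cats sleep"]
print(summarize(docs, k=2))
```

Sentences that share vocabulary with the rest of the input rank higher, so the off-topic sentence is dropped first.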
Disadvantages
Existing summarization methods do not meet three requirements that tweet streams impose:
(1) Efficiency - tweet streams are always very large in scale, hence the summarization algorithm should be highly efficient;
(2) Flexibility - it should provide tweet summaries of arbitrary time durations;
(3) Topic evolution - it should automatically detect sub-topic changes and the moments that they happen.
Existing techniques fail to provide effective analysis on clusters formed over different
time durations.
Proposed System
• We propose a continuous tweet stream summarization framework, namely Sumblr, to
generate summaries and timelines in the context of streams.
• We design a novel data structure called TCV for stream processing, and propose
the TCV-Rank algorithm for online and historical summarization.
• This project proposes a topic evolution detection algorithm which produces
timelines by monitoring three kinds of variations.
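To make the idea of clustering a tweet stream while keeping only distilled statistics more concrete, here is a simplified sketch. It is not the TCV definition from the paper: the fields kept per cluster (tweet count, summed term vector, timestamp sums), the plain term-frequency vectors, the similarity threshold of 0.5, and all names are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Unit-length term-frequency vector as a dict (a stand-in for TF-IDF)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class TweetClusterVector:
    """Simplified per-cluster statistics (hypothetical fields)."""

    def __init__(self):
        self.n = 0             # number of tweets absorbed
        self.sum_vec = {}      # sum of the tweets' term vectors
        self.ts_sum = 0.0      # sum of timestamps
        self.ts_sq_sum = 0.0   # sum of squared timestamps

    def add(self, vec, ts):
        self.n += 1
        for w, v in vec.items():
            self.sum_vec[w] = self.sum_vec.get(w, 0.0) + v
        self.ts_sum += ts
        self.ts_sq_sum += ts * ts

    def mean_time(self):
        return self.ts_sum / self.n


def cluster_stream(tweets, threshold=0.5):
    """Assign each (timestamp, text) pair to its most similar cluster,
    or open a new cluster when nothing is similar enough."""
    clusters = []
    for ts, text in tweets:
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for c in clusters:
            # Cosine is scale-invariant, so the summed vector acts as the centroid.
            sim = cosine(vec, c.sum_vec)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None or best_sim < threshold:
            best = TweetClusterVector()
            clusters.append(best)
        best.add(vec, ts)
    return clusters


stream = [(1, "apple launches new phone"),
          (2, "new apple phone launches today"),
          (3, "earthquake hits tokyo")]
for c in cluster_stream(stream):
    print(c.n, c.mean_time())
```

Each tweet is processed once and then discarded; only the per-cluster sums are retained, which is the property a streaming summarizer depends on.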
Advantages
· This project discovers the changing dates and generates timelines dynamically during
the process of continuous summarization. By contrast, ETS (Evolutionary Timeline
Summarization) does not focus on efficiency and scalability issues, which are very
important in our streaming context.
· This project detects topic evolution and produces summaries/timelines in an online
fashion.
· It provides an effective topic evolution detection method, which monitors
summary-based/volume-based variations to produce timelines automatically from tweet streams.
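One way to picture summary-based and volume-based monitoring is the sketch below. It flags a time step when the word distribution of the current summary diverges sharply from the previous one, or when the tweet volume jumps. The use of Jensen-Shannon divergence, the thresholds (0.5 and 2.0), and the function names are assumptions for illustration, not details taken from this synopsis.

```python
import math
from collections import Counter


def word_dist(text):
    """Relative word frequencies of a summary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def detect_changes(summaries, volumes, div_threshold=0.5, vol_ratio=2.0):
    """Return the indices of time steps flagged as topic-evolution points."""
    flagged = []
    for t in range(1, len(summaries)):
        divergence = js_divergence(word_dist(summaries[t - 1]),
                                   word_dist(summaries[t]))
        summary_shift = divergence > div_threshold
        volume_spike = volumes[t - 1] > 0 and volumes[t] / volumes[t - 1] > vol_ratio
        if summary_shift or volume_spike:
            flagged.append(t)
    return flagged


summaries = ["team wins match", "team wins match again", "fire breaks out downtown"]
volumes = [100, 110, 400]
print(detect_changes(summaries, volumes))
```

The flagged indices mark candidate timeline nodes; a real system would tune both thresholds on data rather than fix them in advance.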
System Architecture
SYSTEM CONFIGURATION
HARDWARE REQUIREMENTS:-
· Processor - Pentium III
· Speed - 1.1 GHz
· RAM - 256 MB (min)
· Hard Disk - 20 GB
· Floppy Drive - 1.44 MB
· Keyboard - Standard Windows Keyboard
· Mouse - Two or Three Button Mouse
· Monitor - SVGA
SOFTWARE REQUIREMENTS:-
· Operating System : Windows 7
· Front End : JSP and Servlet
· Database : MySQL
References:
Zhenhua Wang, Lidan Shou, Ke Chen, Gang Chen, Mehrotra S., “On Summarization and Timeline Generation for Evolutionary Tweet Streams,” IEEE Transactions on Knowledge and Data Engineering, Volume: 27, Issue: 5, August 2014.