Friday, 23 October 2015

On Summarization and Timeline Generation for Evolutionary Tweet Streams

Abstract

Short-text messages such as tweets are being created and shared at an unprecedented rate. Tweets, in their raw form, while being informative, can also be overwhelming. For both end-users and data analysts, it is a nightmare to plow through millions of tweets which contain an enormous amount of noise and redundancy. In this paper, we propose a novel continuous summarization framework called Sumblr to alleviate the problem. In contrast to traditional document summarization methods, which focus on static and small-scale datasets, Sumblr is designed to deal with dynamic, fast-arriving, and large-scale tweet streams. Our proposed framework consists of three major components. First, we propose an online tweet stream clustering algorithm to cluster tweets and maintain distilled statistics in a data structure called Tweet Cluster Vector (TCV). Second, we develop a TCV-Rank summarization technique for generating online summaries and historical summaries of arbitrary time durations. Third, we design an effective topic evolution detection method, which monitors summary-based/volume-based variations to produce timelines automatically from tweet streams. Our experiments on large-scale real tweets demonstrate the efficiency and effectiveness of our framework.

Aim

The main aim is to summarize continuous tweet streams using a framework called “Sumblr”, and to generate timelines in the context of streams.

Scope

The scope is to propose an online tweet stream clustering algorithm, develop a TCV (Tweet Cluster Vector)-Rank summarization technique, and design an effective topic evolution detection method.

Existing System

Stream Data Clustering

Stream data clustering has been widely studied in the literature. BIRCH clusters the data based on an in-memory structure called CF-tree instead of the original large dataset. Bradley et al. proposed a scalable clustering framework which selectively stores important portions of the data, and compresses or discards other portions. CluStream is one of the most classic stream clustering methods. It consists of an online micro-clustering component and an offline macro-clustering component. The pyramidal time frame was also proposed to recall historical micro-clusters for different time durations. A variety of Web services, such as news filtering, text crawling, and topic detection, have posed requirements for text stream clustering.
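
The idea of clustering from a compact in-memory summary rather than the raw data can be illustrated with a BIRCH-style clustering feature, which keeps only a count, a linear sum, and a sum of squares per cluster. The sketch below is a minimal illustration under that assumption; the class and method names are hypothetical and it does not build the CF-tree itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusteringFeature:
    """BIRCH-style cluster summary: point count, linear sum, sum of squares."""
    n: int = 0
    linear_sum: List[float] = field(default_factory=list)
    square_sum: float = 0.0

    def add(self, point: List[float]) -> None:
        """Absorb one point without storing the point itself."""
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(point)
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum += x * x

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        """Summaries are additive, so two clusters merge in O(dimensions)."""
        if not self.linear_sum:
            ls = list(other.linear_sum)
        elif not other.linear_sum:
            ls = list(self.linear_sum)
        else:
            ls = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        return ClusteringFeature(self.n + other.n, ls,
                                 self.square_sum + other.square_sum)

    def centroid(self) -> List[float]:
        """Centroid recovered from the summary alone."""
        return [s / self.n for s in self.linear_sum]


cf_a = ClusteringFeature()
cf_a.add([1.0, 2.0])
cf_a.add([3.0, 4.0])
cf_b = ClusteringFeature()
cf_b.add([5.0, 6.0])
merged = cf_a.merge(cf_b)
print(cf_a.centroid(), merged.n, merged.centroid())
```

Because the three statistics are additive, micro-clusters built online can later be combined offline without revisiting the original points.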

Document/Microblog Summarization

Document summarization can be categorized as extractive and abstractive. The former selects sentences from the documents, while the latter may generate phrases and sentences that do not appear in the original documents. In this paper, we focus on extractive summarization. Extractive document summarization has received a lot of recent attention. Most methods assign salience scores to the sentences of the documents and select the top-ranked sentences. Some works try to extract summaries without such salience scores. Wang et al. used Symmetric Non-negative Matrix Factorization (SNMF) to cluster sentences and choose sentences in each cluster for summarization. He et al. proposed to summarize documents from the perspective of data reconstruction, and select sentences that can best reconstruct the original documents. Xu et al. modeled documents (hotel reviews) as multi-attribute uncertain data and optimized a probabilistic coverage problem of the summary.
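
The score-and-select pattern described above can be sketched in a few lines. The scoring function here (average corpus term frequency of a sentence's words) is an assumption chosen for brevity, not the method of any of the cited authors, and the names `tokenize` and `summarize` are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def summarize(sentences: List[str], k: int) -> List[str]:
    """Pick the k highest-scoring sentences, returned in original order."""
    # Term frequency over the whole input serves as a crude salience signal.
    tf = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence: str) -> float:
        words = tokenize(sentence)
        return sum(tf[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


docs = ["storm hits coast", "storm floods coast roads", "cat naps"]
print(summarize(docs, 2))
```

An extractive summary built this way only ever contains sentences copied from the input, which is what distinguishes it from abstractive summarization.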

Disadvantages

Existing methods fall short of three requirements for tweet stream summarization:

(1) Efficiency - tweet streams are always very large in scale, hence the summarization algorithm should be highly efficient.

(2) Flexibility - it should provide tweet summaries of arbitrary time durations.

(3) Topic evolution - it should automatically detect sub-topic changes and the moments at which they happen.

Existing techniques fail to provide effective analysis on clusters formed over different time durations.

Proposed System

• We propose a continuous tweet stream summarization framework, namely Sumblr, to generate summaries and timelines in the context of streams.

• We design a novel data structure called TCV for stream processing, and propose the TCV-Rank algorithm for online and historical summarization.

• This project proposes a topic evolution detection algorithm which produces timelines by monitoring three kinds of variations.
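
To make the idea of clustering a stream while keeping only distilled statistics more concrete, the following is a rough sketch. The fields of `TweetClusterSummary`, the cosine-similarity assignment rule, and the `min_sim` threshold are all assumptions made for illustration; they are not the paper's definition of TCV or of its clustering algorithm.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class TweetClusterSummary:
    """Hypothetical stand-in for a TCV: additive statistics of one cluster."""
    n: int = 0                                         # tweets absorbed
    term_sum: Counter = field(default_factory=Counter) # summed term counts
    ts_sum: float = 0.0                                # sum of timestamps
    ts_sq_sum: float = 0.0                             # sum of squared timestamps

    def add(self, terms: Counter, timestamp: float) -> None:
        self.n += 1
        self.term_sum.update(terms)
        self.ts_sum += timestamp
        self.ts_sq_sum += timestamp * timestamp


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign(clusters: List[TweetClusterSummary], text: str,
           timestamp: float, min_sim: float = 0.3) -> int:
    """Add a tweet to its most similar cluster, or open a new one."""
    terms = Counter(text.lower().split())
    best, best_sim = None, 0.0
    for i, cluster in enumerate(clusters):
        sim = cosine(terms, cluster.term_sum)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is None or best_sim < min_sim:
        clusters.append(TweetClusterSummary())
        best = len(clusters) - 1
    clusters[best].add(terms, timestamp)
    return best


clusters: List[TweetClusterSummary] = []
ids = [assign(clusters, "storm hits the coast", 1.0),
       assign(clusters, "storm hits coast hard", 2.0),
       assign(clusters, "new phone launch today", 3.0)]
print(ids, [c.n for c in clusters])
```

Each tweet is processed once and then discarded; only the per-cluster statistics remain, which is what keeps the memory footprint bounded on a fast-arriving stream.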

Advantages

· This project discovers the dates on which topics change and generates timelines dynamically during continuous summarization. By contrast, ETS (Evolutionary Timeline Summarization) does not focus on efficiency and scalability issues, which are very important in our streaming context.

· This project detects topic evolution and produces summaries/timelines in an online fashion.

· The topic evolution detection method monitors summary-based/volume-based variations to produce timelines automatically from tweet streams.
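
As one possible reading of volume-based variation, the sketch below marks a time window as a timeline node when its tweet count jumps well above the recent average. The rule itself, the function name `volume_change_points`, and the `ratio` and `history` parameters are assumptions for illustration only, not the paper's formulation.

```python
from typing import List


def volume_change_points(volumes: List[int], ratio: float = 2.0,
                         history: int = 3) -> List[int]:
    """Return indices of windows whose volume is at least `ratio` times
    the mean volume of the preceding `history` windows."""
    points = []
    for i in range(1, len(volumes)):
        past = volumes[max(0, i - history):i]
        baseline = sum(past) / len(past)
        if baseline > 0 and volumes[i] >= ratio * baseline:
            points.append(i)
    return points


# Tweets per time window (made-up numbers for the example).
volumes = [10, 12, 11, 40, 38, 12, 70]
print(volume_change_points(volumes))
```

A summary-based check could follow the same pattern, comparing consecutive summaries with a text distance instead of comparing tweet counts.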

System Architecture

SYSTEM CONFIGURATION

HARDWARE REQUIREMENTS:-

· Processor - Pentium III

· Speed - 1.1 GHz

· RAM - 256 MB (min)

· Hard Disk - 20 GB

· Floppy Drive - 1.44 MB

· Keyboard - Standard Windows Keyboard

· Mouse - Two or Three Button Mouse

· Monitor - SVGA

SOFTWARE REQUIREMENTS:-

· Operating System : Windows 7

· Front End : JSP and Servlet

· Database : MySQL

References:

Zhenhua Wang, Lidan Shou, Ke Chen, Gang Chen, Mehrotra S., “On Summarization and Timeline Generation for Evolutionary Tweet Streams,” Knowledge and Data Engineering, IEEE Transactions on, Volume: 27, Issue: 5, August 2014.
