To secure a mobile ad hoc network (MANET) in adversarial environments, a particularly challenging problem is how to feasibly detect and defend against possible attacks on routing protocols, especially internal attacks such as a Byzantine attack. In this paper, we propose a novel algorithm that detects internal attacks by using both message and route redundancy during route discovery. The route-discovery messages are protected by pairwise secret keys between a source and destination and some intermediate nodes along a route established by using public key cryptographic mechanisms. We also propose an optimal routing algorithm with a routing metric that combines requirements on both a node's trustworthiness and its performance. A node builds up trust in its neighboring nodes based on its observations of their behavior. Both of the proposed algorithms can be integrated into existing routing protocols for MANETs, such as ad hoc on-demand distance vector routing (AODV) and dynamic source routing (DSR). As an example, we present such an integrated protocol called secure routing against collusion (SRAC), in which a node makes a routing decision based on its trust in its neighboring nodes and the performance provided by them. The simulation results demonstrate significant advantages of the proposed attack detection and routing algorithm over some known protocols.
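The idea of a routing decision that weighs both trust and performance can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from SRAC: the smoothed trust estimate, the `min_trust` threshold, the use of delay as the performance measure, and the function names.

```python
def trust_score(forwarded, observed):
    """Smoothed fraction of observed packets that a neighbor forwarded correctly.

    Hypothetical estimator: add-one smoothing keeps an unobserved
    neighbor at a neutral score of 0.5.
    """
    return (forwarded + 1) / (observed + 2)


def choose_next_hop(neighbors, min_trust=0.6):
    """Pick the lowest-delay neighbor whose trust is at least min_trust.

    neighbors maps a neighbor name to (forwarded, observed, delay_ms).
    Returns None when no neighbor is trusted enough.
    """
    eligible = [
        (delay, name)
        for name, (forwarded, observed, delay) in neighbors.items()
        if trust_score(forwarded, observed) >= min_trust
    ]
    return min(eligible)[1] if eligible else None


if __name__ == "__main__":
    table = {"A": (9, 10, 30.0), "B": (2, 10, 5.0), "C": (8, 10, 20.0)}
    # B is the fastest neighbor but falls below the trust threshold.
    print(choose_next_hop(table))
```

Filtering on trust first and then optimizing performance is only one way to combine the two requirements; a weighted sum of the two would be another.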
The old dream of a universal repository containing all human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct, collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information that is free to access and edit, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into a single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract, namely, textual features related to length, structure and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with a state-of-the-art solution and show significant improvements in terms of effective quality prediction.
Teachers usually have a personal understanding of what "good teaching" means, and as a result of their experience and educationally related domain knowledge, many of them create learning objects (LOs) and put them on the Web for study use. In practice, most students cannot find the most suitable LOs (e.g. learning materials, learning assets, or learning packages) on the Web. Consequently, many researchers have focused on developing e-learning systems with personalized learning mechanisms to assist online Web-based learning and to adaptively provide learning paths. However, most personalized learning systems neglect the relationship between learner attributes (e.g. learning style, domain knowledge) and LO attributes. Thus, it is not easy for a learner to find an adaptive learning object that reflects his or her own attributes in relation to learning object attributes. Therefore, in this paper, based on an ant colony optimization (ACO) algorithm, we propose an attributes-based ant colony system (AACS) to help learners find an adaptive learning object more effectively. Our paper makes four critical contributions: (1) it presents an attribute-based search mechanism to find adaptive learning objects effectively; (2) it proposes an attributes-ant algorithm; (3) it develops an adaptive learning rule to identify how learners with different attributes may locate learning objects that have a higher probability of being useful and suitable; (4) it provides a Web-based learning portal for learners to find learning objects more effectively.
This paper addresses the problem of monitoring the k nearest neighbors to a dynamically changing path in road networks. Given a destination to which a user is traveling, this new query returns the k-NN with respect to the shortest path connecting the destination and the user's current location, and thus provides a list of nearest candidates for reference by considering the whole upcoming journey. We name this query the k-Path Nearest Neighbor query (k-PNN). As the user is moving and may not always follow the shortest path, the query path keeps changing. The challenge of monitoring the k-PNN for an arbitrarily moving user is to dynamically determine the update locations and then refresh the k-PNN efficiently. We propose a three-phase Best-first Network Expansion (BNE) algorithm for monitoring the k-PNN and the corresponding shortest path. In the searching phase, the BNE finds the shortest path to the destination and, at the same time, generates a candidate set that is guaranteed to include the k-PNN. Then, in the verification phase, a heuristic algorithm examines the candidates' exact distances to the query path, achieving a significant reduction in the number of visited nodes. The monitoring phase deals with computing update locations as well as refreshing the k-PNN under different user movements. Since determining the network distance is a costly process, an expansion tree and the candidate set are carefully maintained by the BNE algorithm, which allows efficient updates of the shortest path and the k-PNN results. Finally, we conduct extensive experiments on real road networks and show that our methods achieve satisfactory performance.
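To make the query definition concrete, the following sketch computes a k-PNN answer by brute force. It is a baseline under stated assumptions and not the BNE algorithm: the road network is a connected, undirected weighted graph stored as nested dictionaries, objects sit exactly on nodes, and the function names `dijkstra`, `shortest_path`, and `k_pnn` are hypothetical.

```python
import heapq


def dijkstra(graph, sources):
    """Network distance from the nearest source to every reachable node."""
    dist = {s: 0.0 for s in sources}
    prev = {}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def shortest_path(graph, src, dst):
    """Node sequence of a shortest path; assumes dst is reachable from src."""
    _, prev = dijkstra(graph, [src])
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def k_pnn(graph, objects, src, dst, k):
    """Return the query path and the k objects closest to any node on it."""
    path = shortest_path(graph, src, dst)
    dist, _ = dijkstra(graph, path)  # multi-source expansion from the path
    ranked = sorted((dist[o], o) for o in objects if o in dist)
    return path, ranked[:k]


if __name__ == "__main__":
    roads = {
        "u": {"a": 1}, "a": {"u": 1, "b": 1, "p": 2},
        "b": {"a": 1, "d": 1, "q": 1}, "d": {"b": 1, "r": 5},
        "p": {"a": 2}, "q": {"b": 1}, "r": {"d": 5},
    }
    print(k_pnn(roads, ["p", "q", "r"], "u", "d", 2))
```

Rerunning this from scratch after every movement is exactly the cost that incremental monitoring aims to avoid.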
In this paper, the resource-constrained project scheduling problem with multiple execution modes for each activity is explored. The aim is to find a schedule of activities such that the makespan of the schedule is minimized subject to the precedence and resource constraints. We present a two-phase genetic local search algorithm that combines the genetic algorithm and the local search method to solve this problem. The first phase searches globally for promising areas, and the second phase searches more thoroughly in these promising areas. A set of elite solutions is collected during the first phase, and this set, which indicates the promising areas, is used to construct the initial population of the second phase. Through suitable application of mutation with a large mutation rate, restarts of the genetic local search algorithm, and the collection of good solutions in the elite set, the strength of intensification and diversification can be properly adapted and the search ability retained over the long term. Computational experiments were conducted on the standard sets of project instances, and the experimental results revealed that the proposed algorithm was effective for both short-term (with 5000 schedules being evaluated) and long-term (with 50 000 schedules being evaluated) search in solving this problem.
The peer-to-peer (P2P) paradigm has become very popular for storing and sharing information in a totally decentralized manner. At first, research focused on P2P systems that host one-dimensional data. Nowadays, the need for P2P applications with multidimensional data has emerged, motivating research on P2P systems that manage such data. The majority of the proposed techniques are based either on the distribution of centralized indexes or on the reduction of multidimensional data to one dimension. Our goal is to create from scratch a technique that is inherently distributed and also maintains the multidimensionality of data. Our focus is on structured P2P systems that share spatial information. We present SPATIALP2P, a totally decentralized indexing and searching framework that is suitable for spatial data. SPATIALP2P supports P2P applications in which spatial information of various sizes can be dynamically inserted or deleted, and peers can join or leave. The proposed technique preserves the locality and directionality of space well.
(IEEE Transactions on Knowledge and Data Engineering)
With the tremendous growth of information available to end users through the Web, search engines play an ever more critical role. Nevertheless, because of their general-purpose approach, it is increasingly common for the returned result sets to be burdened with useless pages. The next-generation Web architecture, represented by the Semantic Web, provides a layered architecture that may make it possible to overcome this limitation. Several search engines have been proposed that increase information retrieval accuracy by exploiting a key element of Semantic Web resources, namely relations. However, in order to rank results, most of the existing solutions need to work on the whole annotated knowledge base. In this paper we propose a relation-based page rank algorithm, to be used in conjunction with Semantic Web search engines, that relies only on information that can be extracted from the user query and the annotated resources. Relevance is measured as the probability that a retrieved resource actually contains those relations whose existence was assumed by the user at the time of query definition.
(IEEE Transactions on Services Computing)
Web services composition has been an active research area over the last few years. However, the technology is not yet mature and several research issues need to be addressed. In this paper, we describe the design of CCAP, a system that provides tools for adaptive service composition and provisioning. We introduce a composition model where service context and exceptions are configurable to accommodate the needs of different users. This allows a service to be reused in different contexts and achieves a level of adaptiveness and contextualization without recoding and recompiling the overall composed services. The execution semantics of the adaptive composite service is provided by an event-driven model. This execution model is based on Linda Tuple Spaces and supports real-time and asynchronous communication between services. Three core services (the coordination service, the context service, and the event service) are implemented to automatically schedule and execute the component services, and to adapt to user-configured exceptions and contexts at run time. The proposed system provides efficient and flexible support for specifying, deploying, and accessing adaptive composite services. We demonstrate the benefits of our system by conducting usability and performance studies.
(IEEE Transactions on Knowledge and Data Engineering)
In previous research on text categorization, a word is usually described by features that express whether the word appears in the document or how frequently it appears. Although these features are useful, they do not fully express the information contained in the document. In this paper, distributional features, which express the distribution of a word in a document, are used to describe a word. Specifically, the compactness of the appearances of the word and the position of the first appearance of the word are characterized as features. These features are exploited by a TFIDF-style equation. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency features alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved.
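The two kinds of distributional feature named above can be sketched for a tokenized document as follows. The concrete measures are assumptions chosen for illustration rather than the paper's definitions: first appearance is a relative position, and compactness is approximated both by how many equal-sized parts of the document contain the word and by the normalized span between its first and last occurrence. The function name and the `num_parts` parameter are hypothetical.

```python
def distributional_features(tokens, word, num_parts=4):
    """Describe where a word occurs in a document, not just how often.

    Returns None when the word does not occur.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    if not positions:
        return None
    n = len(tokens)
    part_size = max(1, -(-n // num_parts))  # ceiling division
    parts_hit = {p // part_size for p in positions}
    return {
        "frequency": len(positions),
        # 0.0 means the word opens the document.
        "first_appearance": positions[0] / n,
        # Fewer parts and a smaller span both indicate a more compact word.
        "parts_with_word": len(parts_hit),
        "span": (positions[-1] - positions[0]) / n,
    }


if __name__ == "__main__":
    doc = "a b c a d e f g".split()
    print(distributional_features(doc, "a"))
    print(distributional_features(doc, "g"))
```

Two words with the same frequency can differ sharply on these values, which is the extra information the frequency-only representation discards.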
(IEEE Transactions on Knowledge and Data Engineering)
With the growing demand for visual information of rich content, effective and efficient manipulation of large video databases is increasingly desired. Many investigations have been made into content-based video retrieval. However, despite its importance, video subsequence identification, which finds content similar to a short query clip within a long video sequence, has not been well addressed. This paper presents a graph transformation and matching approach to this problem, with an extension to identify occurrences with potentially different ordering or length due to content editing. With a novel batch query algorithm to retrieve similar frames, the mapping relationship between the query and database video is first represented by a bipartite graph. The densely matched parts along the long sequence are then extracted, followed by a filter-and-refine search strategy to prune irrelevant subsequences. During the filtering stage, maximum size matching is deployed for each subgraph constructed by the query and candidate subsequence to obtain a smaller set of candidates. During the refinement stage, sub-maximum similarity matching is devised to identify the subsequence with the highest aggregate score from all candidates, according to a robust video similarity model that incorporates visual content, temporal order, and frame alignment information. The performance studies conducted on a long video recording of 50 hours validate that our approach is promising in terms of both search accuracy and speed.
(IEEE Transactions on Knowledge and Data Engineering)
In this paper, we propose a novel, exact border-based approach that provides an optimal solution for the hiding of sensitive frequent itemsets by (i) minimally extending the original database by a synthetically generated database part, the database extension, (ii) formulating the creation of the database extension as a constraint satisfaction problem, (iii) mapping the constraint satisfaction problem to an equivalent binary integer programming problem, (iv) exploiting underutilized synthetic transactions to proportionally increase the support of non-sensitive itemsets, (v) minimally relaxing the constraint satisfaction problem to provide an approximate solution close to the optimal one when an ideal solution does not exist, and (vi) using a partitioning of the universe of items to increase the efficiency of the proposed hiding algorithm. Extending the original database for sensitive itemset hiding is proved to provide optimal solutions to an extended set of hiding problems compared to previous approaches and to provide solutions of higher quality. Moreover, the application of binary integer programming enables the simultaneous hiding of the sensitive itemsets and thus allows for the identification of globally optimal solutions.
Because of the "semantic gap" between machine-readable low-level features (e.g. visual features in terms of colors and textures) and high-level human concepts, it is inherently hard for a machine to automatically identify and retrieve events from videos according to their semantics by merely reading pixels and frames. This paper proposes a human-centered framework for mining and retrieving events and applies it to indoor surveillance video databases. The goal is to locate video sequences containing events of interest to the user of the surveillance video database. This framework starts by tracking objects. Since surveillance videos cannot be easily segmented, Common Appearance Intervals (CAIs), which have the flavor of shots in movies, are used to segment videos. The video segmentation provides an efficient indexing schema for the retrieval. The trajectories obtained are thus spatiotemporal in nature, and features are extracted from them for the construction of event models. In the retrieval phase, the database user interacts with the machine and provides feedback on the retrieval results. The proposed learning algorithm learns from the spatiotemporal data, the event model, and the feedback, and returns the refined results to the user. Specifically, the learning algorithm is a Coupled Hidden Markov Model (CHMM), which models the interactions of objects in CAIs and recognizes hidden patterns among them. This iterative learning and retrieval process contributes to bridging the "semantic gap", and the experimental results show the effectiveness of the proposed framework by demonstrating the increase in retrieval accuracy through iterations and by comparing it with other methods.
IEEE DATA MINING
This paper describes Olex, a novel method for the automatic induction of rule-based text classifiers. Olex supports a hypothesis language of the form "if T1 or … or Tn occurs in document d, and none of Tn+1, …, Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. The proposed method is simple and elegant. Despite this, the results of systematic experimentation performed on the REUTERS-21578, the OHSUMED, and the ODP data collections show that Olex provides classifiers that are accurate, compact, and comprehensible. A comparative analysis conducted against some of the most well-known learning algorithms (namely, Naive Bayes, Ripper, C4.5, SVM, and Linear Logistic Regression) demonstrates that it is more than competitive in terms of both predictive accuracy and efficiency.
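The hypothesis language quoted above can be evaluated with a few lines of code. This sketch covers only how a rule of that form is applied to a document, not how Olex induces the rule; the function names and the example terms are hypothetical.

```python
def conjunction_occurs(conjunction, doc_terms):
    """A conjunction of terms occurs in a document if every term is present."""
    return all(term in doc_terms for term in conjunction)


def classify(doc_terms, positive, negative):
    """Apply a rule of the form 'some positive conjunction and no negative one'.

    positive holds the conjunctions T1..Tn and negative holds Tn+1..Tn+m,
    each given as a tuple of terms. Returns True when the document is
    assigned to the category.
    """
    doc = set(doc_terms)
    has_positive = any(conjunction_occurs(c, doc) for c in positive)
    has_negative = any(conjunction_occurs(c, doc) for c in negative)
    return has_positive and not has_negative


if __name__ == "__main__":
    positive = [("wheat",), ("grain", "export")]
    negative = [("corn", "price")]
    print(classify(["grain", "export", "tonnes"], positive, negative))
    print(classify(["wheat", "corn", "price"], positive, negative))
```

Because a rule is just two lists of term conjunctions, a reader can inspect it directly, which is what makes classifiers of this form comprehensible.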
Border Gateway Protocol (BGP) is the de facto routing protocol of the Internet. Unfortunately, it is not a secure protocol, and as a result, several attacks have been successfully mounted against the Internet infrastructure. Among the security requirements of BGP is the ability to validate the actual source and path of a BGP update message. This is needed to help reduce the threat of attacks based on prefix hijacking and IP spoofing. A BGP route associates an address prefix with a set of autonomous systems (ASes) that identify the inter-domain path that the prefix has traversed in the form of BGP announcements. This set is represented as the AS_PATH attribute in BGP and starts with the AS that originated the prefix. Credible BGP (CBGP) proposes several extensions to the BGP protocol to validate the source and path of a BGP update message and to use the resulting validation score to influence the route selection algorithm. CBGP assigns credibility scores for AS prefix origination and AS_PATH. These credibility scores are used in the extended selection algorithm to prefer valid BGP routes. The new protocol can detect BGP attacks such as AS path injection and AS prefix hijacking.
While the concept of collaboration provides a natural defense against massive spam emails directed at large numbers of recipients, designing effective collaborative anti-spam systems raises several important research challenges. First and foremost, since emails may contain confidential information, any collaborative anti-spam approach has to guarantee strong privacy protection to the participating entities. Second, the continuously evolving nature of spam demands that collaborative techniques be resilient to various kinds of camouflage attacks. Third, the collaboration has to be lightweight, efficient, and scalable. To address these challenges, this paper presents ALPACAS, a privacy-aware framework for collaborative spam filtering. In designing the ALPACAS framework, we make two unique contributions. The first is a feature-preserving message transformation technique that is highly resilient against the latest kinds of spam attacks. The second is a privacy-preserving protocol that provides enhanced privacy guarantees to the participating entities. Our experimental results on a real email dataset show that the proposed framework provides a 10-fold improvement in the false negative rate over the Bayesian-based Bogofilter when faced with one of the recent kinds of spam attacks. Further, privacy breaches are extremely rare. This demonstrates the strong privacy protection provided by the ALPACAS system.
We present a new method to select an attribute subset (with little or no loss of information) for high-dimensional data clustering. Most existing clustering algorithms lose some of their efficiency on high-dimensional data sets. One possible solution is to use only a subset of the whole set of dimensions. However, the number of possible dimension subsets is too large to be searched exhaustively. We therefore use a heuristic search for optimal attribute subset selection. For this purpose we use the best cluster validity index, first to select the most appropriate number of clusters and then to evaluate the clustering performed on the attribute subset. The performance of our new attribute selection approach is evaluated on several high-dimensional data sets. Furthermore, as the number of dimensions used is low, it is possible to display the data sets in order to visually evaluate and interpret the obtained results.
A new method for the blind separation of linear image mixtures is presented in this paper. Such mixtures often occur when, for example, we photograph a scene through a semireflecting medium (a windshield or glass). The proposed method requires two mixtures of two scenes captured under different illumination conditions. We show that the boundary values of the ratio of the two mixtures can lead to an accurate estimation of the separation matrix. The technique is very simple, fast, and reliable, as it does not depend on iterative procedures. The method's effectiveness is tested on both artificially mixed images and real images.
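One way the boundary values of the mixture ratio can yield a separation is sketched below. The sketch rests on assumptions that go beyond the abstract: the mixing model is x1 = a11*s1 + a12*s2 and x2 = a21*s1 + a22*s2 with nonnegative coefficients and nonnegative sources, each source is zero at some pixel where the other is not, images are flattened to lists, and there is no noise. Under these assumptions the pixelwise ratio x1/x2 lies between a12/a22 and a11/a21 and reaches those bounds where one source vanishes. The function name `separate` is hypothetical.

```python
def separate(x1, x2):
    """Recover two sources, each up to a positive scale factor, from two mixtures.

    x1 and x2 are equal-length sequences of nonnegative pixel values.
    """
    # Pixelwise ratio, skipping pixels where the denominator is zero.
    ratios = [a / b for a, b in zip(x1, x2) if b > 0]
    r_min, r_max = min(ratios), max(ratios)
    # Subtracting r_min * x2 cancels the source whose column ratio is r_min.
    y1 = [a - r_min * b for a, b in zip(x1, x2)]
    # Subtracting x1 from r_max * x2 cancels the other source.
    y2 = [r_max * b - a for a, b in zip(x1, x2)]
    return y1, y2


if __name__ == "__main__":
    s1 = [0.0, 1.0, 2.0, 3.0]
    s2 = [4.0, 0.0, 1.0, 2.0]
    x1 = [0.8 * a + 0.3 * b for a, b in zip(s1, s2)]
    x2 = [0.2 * a + 0.7 * b for a, b in zip(s1, s2)]
    y1, y2 = separate(x1, x2)
    print(y1)  # proportional to s1
    print(y2)  # proportional to s2
```

The estimate needs only one pass for the minimum and maximum, with no iteration; with noisy real images the extreme ratios would need a more robust estimate than a plain minimum and maximum.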
A fuzzy time series data representation method based on the Japanese candlestick theory is proposed and used to assist financial prediction. The Japanese candlestick theory is an empirical model of investment decision-making. The theory assumes that candlestick patterns reflect the psychology of the market, and that investors can make their investment decisions based on the identified candlestick patterns. We model the imprecise and vague candlestick patterns with fuzzy linguistic variables and transform the financial time series data into fuzzy candlestick patterns for pattern recognition. A fuzzy candlestick pattern can bridge the gap between the investors and the system designer because it is visual, computable, and modifiable. The investors are able not only to understand the prediction process but also to improve the efficiency of the prediction results. The proposed approach is applied to a financial time series forecasting problem for demonstration. With the prototype system that has been built, investment expertise can be stored in the knowledge base, and fuzzy candlestick patterns can be identified automatically from a large amount of financial trading data.
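Modeling a candlestick with fuzzy linguistic variables can be sketched for one attribute, the body length. The linguistic terms ("short", "middle", "long"), the membership shapes, and the breakpoints of 0.5%, 2%, and 4% of the opening price are all assumptions chosen for illustration, not values from the paper; the function names are hypothetical.

```python
def left_shoulder(x, full, zero):
    """Membership 1 up to `full`, falling linearly to 0 at `zero`."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def right_shoulder(x, zero, full):
    """Membership 0 up to `zero`, rising linearly to 1 at `full`."""
    if x <= zero:
        return 0.0
    if x >= full:
        return 1.0
    return (x - zero) / (full - zero)


def triangle(x, left, peak, right):
    """Triangular membership that peaks at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzy_body(open_price, close_price):
    """Describe a candlestick body with a color and fuzzy length memberships."""
    body_pct = 100.0 * abs(close_price - open_price) / open_price
    return {
        # Simplification: a close equal to the open is treated as white.
        "color": "white" if close_price >= open_price else "black",
        "short": left_shoulder(body_pct, 0.5, 2.0),
        "middle": triangle(body_pct, 0.5, 2.0, 4.0),
        "long": right_shoulder(body_pct, 2.0, 4.0),
    }


if __name__ == "__main__":
    print(fuzzy_body(100.0, 98.75))
```

Because the breakpoints are plain numbers, an investor who disagrees with what counts as a "long" body can change them directly, which reflects the modifiability the abstract emphasizes.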
Email has become one of the fastest and most economical forms of communication. However, the increase in email users has resulted in a dramatic increase in spam emails during the past few years. As spammers always try to find a way to evade existing filters, new filters need to be developed to catch spam. Ontologies allow for machine-understandable semantics of data. Sharing information is important for more effective spam filtering. Thus, it is necessary to build an ontology and a framework for efficient email filtering. Using an ontology that is specially designed to filter spam, a large share of unsolicited bulk email could be filtered out by the system. Similar to other filters, the ontology evolves with user requests. Hence the ontology would be customized for the user. This paper proposes an efficient spam email filtering method using an adaptive ontology.
The evolution of web services is necessary because it is inevitable that services will evolve over time and clients will always ask for new features. Unfortunately, the current standards do not provide the mechanisms necessary to manage the behaviour evolution of web services. This problem becomes more complex for dynamic adaptation and for web service upgrades involving non-backwards-compatible changes. In this paper we present an approach to manage the behaviour evolution of web services and to adapt their clients dynamically by providing a set of change operators for services and a set of adaptation rules for clients. This approach is validated within a framework proposed at the end of this paper.
Link-based analysis of the Web provides the basis for many important applications, such as Web search, Web-based data mining, and Web page categorization, that bring order to the massive amount of distributed Web content. Due to the overwhelming reliance on these important applications, there is a rise in efforts to manipulate (or spam) the link structure of the Web. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the set of critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings in comparison with existing Web algorithms such as PageRank.