K-tree - a height balanced tree structured vector quantizer

Web Name: K-tree - a height balanced tree structured vector quantizer

WebSite: http://ktree.sourceforge.net

ID:10577

Keywords:

balanced, height, tree

Description:

October 2014

The ClueWeb09 and ClueWeb12 document collections are among the largest document collections used for research. They contain 500 million and 733 million English language documents respectively. The Streaming EM-tree algorithm, using binary document vectors produced by TopSig, has been used to cluster these collections into more than 500,000 clusters. This was done on a single 16 core machine in 15 hours using the LMW-tree C++ template library. The experiments were run in the QUT Big Data Lab. It is expected that a distributed version of the algorithm will be able to cluster the entire searchable web of around 50 billion documents into 1 million clusters.

The clusters produced by these algorithms have been made browsable online: ClueWeb09 clusters and ClueWeb12 clusters. The cluster mappings and software are also available.

Paper Accepted September 2014

The paper "Clustering and Labeling a web scale document collection using Wikipedia clusters", which applies clustering algorithms from the LMW-tree C++ template library to TopSig binary vectors, has been accepted to Web-scale Knowledge, Representation and Reasoning (Web-KR 2014). Congratulations and thanks go to all the authors. The experiments were run in the QUT Big Data Lab, which is supported by the Science and Engineering Faculty and the High Performance Computing centre.

September 2014

The PhD thesis introducing the EM-tree algorithm and document clustering with binary document vectors is available online. It is titled "Document Clustering Algorithms, Representations and Evaluation for Information Retrieval".

November 2013

Development has started on LMW-tree, a generic C++ template library implementing K-tree, EM-tree and related research algorithms using Boost and Intel Threading Building Blocks. It reduces memory usage and increases execution speed. Parallelized versions of the algorithms have been implemented.
Streaming and distributed implementations of the EM-tree are being developed for clustering extremely large collections containing billions of examples. Development is taking place on GitHub.

A brief introduction

K-tree is a tree structured clustering algorithm. It is also referred to as a Tree Structured Vector Quantizer (TSVQ). The goal of cluster analysis is to group objects based on similarity. Each object in a K-tree is represented by an n-dimensional vector, and all vectors in the tree must have the same number of dimensions.

The algorithm is a hybrid of the B+-tree and k-means algorithms. It uses a tree structure similar to the B+-tree and uses k-means to perform splits. The tree forms a nearest neighbour search tree. Unlike k-means, the number of clusters does not need to be specified upfront. However, a tree order must be specified, which restricts how many vectors can be stored in any node. Each level of the tree produces a different number of clusters.

The K-tree algorithm is useful for clustering large data sets with many features. Compared with traditional approaches, it scales best when there are many objects to cluster into a large number of clusters, so that each cluster contains relatively few objects. For example, a document collection of three million documents can be clustered into one hundred thousand clusters.

Future directions

Currently K-tree has implementations in C++, C, Java and Python. The Python version has recently been written by Ulf and focusses on rapid prototyping for research. Development has recently started on the C++ version.

Licensed under LGPL and GPL

To download K-tree see the software page. Please cite papers from the publications page when citing K-tree.
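
As a rough illustration of the brief introduction above, the following Python sketch shows the two building blocks it describes: choosing the nearest centroid, as is done when descending a nearest neighbour search tree, and splitting a node that exceeds the tree order using k-means with k = 2. This is not code from the K-tree distribution. The names ORDER, nearest and split_node, the deterministic seeding and the sample vectors are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest(centroids: List[Vector], v: Vector) -> int:
    """Index of the centroid closest to v (the branch a search would follow)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))


def split_node(vectors: List[Vector], iterations: int = 10
               ) -> Tuple[List[List[float]], List[List[Vector]]]:
    """Split the vectors of an overfull node into two groups with 2-means."""
    # Seeding is an assumption of this sketch: the first vector and the
    # vector farthest from it. Real implementations may seed differently.
    seed_a = vectors[0]
    seed_b = max(vectors, key=lambda v: sq_dist(seed_a, v))
    centroids = [list(seed_a), list(seed_b)]
    groups: List[List[Vector]] = [[], []]
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        groups = [[], []]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        if not groups[0] or not groups[1]:
            break  # degenerate input, e.g. all vectors identical
        # Update step: recompute each centroid as the mean of its group.
        new_centroids = [mean(g) for g in groups]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, groups


ORDER = 4  # hypothetical tree order: maximum number of vectors per node
node = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1)]

if len(node) > ORDER:
    # The node is overfull, so it is split into two child groups.
    centroids, (left, right) = split_node(node)
    print(len(left), len(right))
```

In a full tree, the two centroids returned by the split would replace the overfull node's single entry in its parent, which may in turn exceed the order and be split in the same way. That propagation step is left out of this sketch.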
The following people have contributed to the development of K-tree (sorted lexicographically by last name):

    Lance De Vine, QUT
    Chris De Vries, QUT
    Shlomo Geva, QUT
    Ulf Großekathöfer, Bielefeld University

The development of K-tree has been proudly supported by the QUT Faculty of Science and Technology.

TAGS:balanced height tree 
