Presentation slides are available here. (Click on the slide thumbnail to download the presentation.)
See all 2011 Eurocon Conference videos here.

 


DAY 1 TALKS

LIGHTNING TALKS


KEYNOTES


SEARCH + BIG DATA: IT'S (STILL) ALL ABOUT THE USER


        Presented by - Grant Ingersoll | Lucid Imagination, United States

Thanks in no small part to Lucene, quality keyword search is easily obtainable. Likewise, tools like Apache Hadoop and its ecosystem have made it easier to store and process large quantities of data. Besides being fun for engineers to geek out on and pundits to talk about, what does the big data movement mean for the real problem at hand: helping users find relevant content as quickly and cost-effectively as possible? In this talk, we'll look at the opportunities the Lucene ecosystem provides to offer better search, discovery and analytics capabilities to developers in order to better enable users.


ARCHITECTING THE FUTURE OF BIG DATA & SEARCH


        Presented by - Eric Baldeschwieler | Hortonworks, United States

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow's enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.



REALTIME SEARCH AT TWITTER


        Presented by - Michael Busch | Twitter, United States

At Twitter we serve more than 1.5 billion queries per day from Lucene indexes, while appending more than 200 million tweets per day in realtime. Additionally, we recently launched image, video and relevance search on the same engine.
This talk will explain the changes we made to Lucene to support this high load and the changes and improvements we made in the last year.





TRACK SESSIONS DAY 1


PORTABLE LUCENE INDEX FORMAT AND APPLICATIONS


        Presented by - Andrzej Bialecki | Lucid Imagination, Poland

This talk will present a design and implementation of a flexible, version-independent serialization format for Lucene indexes and its applications in index upgrades / downgrades, in distributed document analysis, in distributed indexing, and in integration with external indexing pipelines. This format enables submitting pre-analyzed documents to Lucene/Solr, and transferring parts of indexes between nodes in a distributed setup.




CONFIGURING MAHOUT CLUSTERING JOBS


        Presented by - Frank Scholten | JTeam, Netherlands

For more than a decade internet search engines have helped users find documents they are looking for. However, what if users aren't looking for anything specific but want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are clustered search results of Google News, or tag clouds which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large scale document clustering. This talk introduces clustering in general and shows you step-by-step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.

Topics include

  • Clustering introduction
  • Clustering in Mahout
  • Text pre-processing & analysis
  • Tag cloud demo
  • Tips & tricks

ADAPTING AJAX-SOLR TO COMPARE DIFFERENT SETS OF DOCUMENTS


        Presented by - Joan Codina | Barcelona Media – Centre d’Innovació, Spain

One of the main features of Solr is faceted search. Facets are the top terms present in the results of a query. But facets do not indicate the most statistically relevant terms of a query, that is, those terms that are more frequent in the documents selected by the query than in the rest of the collection. A critical factor in making such statistical insights broadly useful is to make them visual, i.e., using charts and graphs that display these quantitative relationships. We will present how to adapt Ajax-Solr to find the most prominent terms of a query compared to the full set or just another query. We will present an example of how this can be used to find current topics in the news and to extract that information into visually communicative charts and graphs.
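
As a rough illustration of the difference between plain facet counts and statistically prominent terms, the sketch below compares each term's share of the query results with its share of the whole collection. The function name, the smoothing constant and the sample counts are assumptions made for illustration; they are not taken from the talk or from Ajax-Solr.

```python
def prominence(query_counts, query_total, coll_counts, coll_total, smoothing=0.5):
    """Rank terms by how over-represented they are in the query results.

    query_counts / coll_counts map term -> number of documents containing it
    (for example, facet counts for a query and for a match-all query).
    """
    scores = {}
    for term, q_docs in query_counts.items():
        # Smoothed fraction of documents containing the term in each set.
        q_rate = (q_docs + smoothing) / (query_total + 2 * smoothing)
        c_rate = (coll_counts.get(term, 0) + smoothing) / (coll_total + 2 * smoothing)
        scores[term] = q_rate / c_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical counts: "said" is the top facet but is common everywhere.
query_counts = {"said": 90, "election": 60, "ballot": 30}
coll_counts = {"said": 9000, "election": 300, "ballot": 100}
ranked = prominence(query_counts, 100, coll_counts, 10000)
top_term = ranked[0][0]
```

With these made-up numbers the most frequent facet ("said") ends up last, because it is no more common in the query results than in the collection as a whole.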


SOLR 4 HIGHLIGHTS


        Presented by - Mark Miller | Lucid Imagination, United States

In this talk, Lucene/Solr committer Mark Miller will discuss some of the new features and advancements that users can look forward to in Solr 4. The list of topics will include: performance optimizations, further support for near-realtime search, SolrCloud, DirectSolrSpellChecker, and more.





IMPROVING SOLR'S UPDATECHAIN


        Presented by - Jan Høydahl | Cominvent AS, Norway

Solr features a little-known internal document processing pipeline called the UpdateRequestProcessorChain, or simply the UpdateChain.

In this talk we'll discuss the importance of document processing, when the UpdateChain works well and what limitations it has. We'll then go on to propose a range of possible improvements.

Topics include:

  • Examples of use with demo
  • How to write your own UpdateProcessor, best practices
  • Example: Tika as an UpdateProcessor
  • A vision for future improvements

ARCHIVE-IT FULL-TEXT SEARCH: 200 CUSTOMERS, 2000+ COLLECTIONS, 1.3+ BILLION ARCHIVAL WEB PAGES


        Presented by - Aaron Binns | Internet Archive, United States

This session describes Archive-It, the Internet Archive's subscription, self-serve web archiving service, focusing on its full-text search system. With nearly 200 partners and over 2000 collections, the custom Lucene-based system handles 3+ million index updates per day across an index that totals over 1.3 billion documents. The session will give a detailed description of the architecture and implementation of the Archive-It search system, highlighting many of the challenges due to the scale as well as complex use cases.


SEARCH ANALYTICS: BUSINESS VALUE & BIGDATA NOSQL BACKEND


        Presented by - Otis Gospodnetic | Sematext International, United States

Search is increasingly the primary information access mechanism, so knowing how your search is doing often has direct business impact. You’ve indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need? Regardless of whether you are using Solr, Lucene, Elastic Search or some other search solution, you should be paying attention to what your users are telling you through their queries and clicks.

In the first part of this presentation we’ll talk about what Search Analytics is, why it's valuable, and how it can be used to answer questions like:

  • Are too many users getting the dreaded “no matches” results?
  • How deep into search results do people dig?
  • Which hits are they clicking on, or what percentage of them don’t click on any hits?
  • How much do they use the “Did You Mean” or “Auto-Complete” suggestions?

We’ll explore what specific search analytics reports tell us and what specific actions you should take based on those reports.
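
To make the questions above concrete, here is a minimal sketch of how such ratios could be computed from a log of search events. The `SearchEvent` record, its field names and the sample data are hypothetical; they are not the schema or the reports described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    query: str
    num_hits: int                # results returned for the query
    clicked_rank: Optional[int]  # 1-based rank of the clicked hit, None if no click


def summarize(events):
    """Compute a few headline search-analytics ratios from logged events."""
    total = len(events)
    if total == 0:
        return {"zero_hit_rate": 0.0, "no_click_rate": 0.0, "mean_click_rank": None}
    zero_hits = sum(1 for e in events if e.num_hits == 0)
    clicks = [e.clicked_rank for e in events if e.clicked_rank is not None]
    return {
        "zero_hit_rate": zero_hits / total,              # "no matches" searches
        "no_click_rate": (total - len(clicks)) / total,  # searches with no click
        "mean_click_rank": sum(clicks) / len(clicks) if clicks else None,
    }


events = [
    SearchEvent("solr facets", 120, 1),
    SearchEvent("lucene 4 codec", 35, 4),
    SearchEvent("hadop", 0, None),
    SearchEvent("eurocon slides", 12, None),
]
report = summarize(events)
```

In this toy log a quarter of the searches return nothing and half end without a click, which is the kind of signal that would prompt a closer look at spelling suggestions or ranking.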

In the second part of the presentation we'll talk about how we've used Flume, Hadoop, MapReduce, and HBase to build a scalable Search Analytics service.


IT'S JUST THE JOB: EMPLOYING SOLR FOR POWERFUL RECRUITMENT SEARCH


        Presented by - Charlie Hull | Flax, United Kingdom

Using a case study on a major European executive recruitment company, we will show how we used Apache Lucene/Solr to build powerful, flexible, accurate and scalable search services over tens of millions of CVs and candidate records, allowing the company to completely restructure their IT provision for both local and national offices.





MULTILINGUAL SEARCH AND TEXT ANALYTICS WITH SOLR


        Presented by - Steve Kearns | Basis Technology, United States

The power of the Solr search engine has rapidly gained it acceptance as an alternative to commercial search solutions for many applications. Organizations require many features to serve their diverse communities; among these is the ability to deliver search excellence in users' native languages. Delivering quality multilingual search involves careful understanding of data and design of schemas, and selection of the best linguistic approaches for the supported languages.

This talk will explore the challenges of multilingual search, including language-specific issues (such as N-gram segmentation vs. morphological analysis, stemming vs. lemmatization, and language identification) and the various approaches to configuring your Solr schema. We will also discuss the integration strategies for common text analytics capabilities and the impact of multilingual content on application design.
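
For readers unfamiliar with N-gram segmentation, the sketch below shows the general idea: text is split into overlapping character sequences rather than dictionary words, which is useful for languages written without spaces. This is a generic illustration with a hypothetical helper name, not Solr's analyzers or the approach presented in the talk.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (no dictionary needed)."""
    # Whitespace is dropped, so n-grams may span word boundaries; that is
    # acceptable for this sketch, which targets text written without spaces.
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


bigrams = char_ngrams("全文検索")
```

Morphological analysis would instead try to recover the actual words, at the cost of needing language-specific dictionaries or models.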


ARCHITECTURAL LESSONS LEARNED FROM REFACTORING A SOLR-BASED API APPLICATION


        Presented by - Torsten Koester | smatch.com/ Shopping24, Germany

In this case study I'll discuss architectural lessons learned from refactoring an existing REST API backed by Apache Solr. The initial goal of the refactoring was to speed up data access while scaling from 5m documents to 20-50m documents stored in Solr. Under consideration were the hosting infrastructure, the REST API Java code and the Solr documents and configuration. In this talk I'll give a brief review of the results.

"Pimping" the Solr configuration, the client access and the document structure achieved better results. But the fundamental lesson learned was that a significant increase in data access speed can only be realized with a functional redesign and a simplification of the REST API. I'll explain how this led us directly to distinct Solr cores and why we dropped the introduction of Solr shards or a breathing cloud infrastructure.


SCALING SEARCH AT TROVIT WITH SOLR AND HADOOP


        Presented by - Marc Sturlese | Trovit, Spain

Trovit is a global classified advertising service covering real estate, jobs and more in 27 countries worldwide. Until recently, our distributed Lucene/Solr search indexes used a customized Data Import Handler to draw data out of MySQL, but it could no longer handle our volumes with acceptable performance. We have moved to building Lucene/Solr indexes with MapReduce and came up with a new way to build indexes, which has been in production for months. Here at Trovit, we deal with many countries and different business categories, each with its own index -- and not all of them have similar size or structure.

I'll present our experience as a combined use case/tutorial, beginning with a brief introduction about the main Solr features we use at Trovit, and then move to the more complex part:

  • Brief explanation of the data pipeline handled by Hadoop before our ads are indexed, with implementation details of the indexing process, deploying indexes from HDFS, etc.
  • Tuning performance parameters to improve indexing speed as much as possible and keep good search performance
  • Managing the effect of GC at search time as much as we can as we deal with shards
  • Moving indexing time Solr features like DeDuplication to MapReduce.
  • Using Solr analyzers to analyze large amounts of text outside of an indexing process

I'll also talk about how we used a phased indexing strategy to manage indexes across countries and verticals (jobs, autos, etc.) and how we worked around limitations in SOLR-1301.
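
One of the points above, moving deduplication from indexing time into MapReduce, can be sketched in plain Python as a map step that emits a content signature per document and a reduce step that keeps one document per signature. The field names, the MD5-based signature and the "keep the first document" rule are assumptions made for this sketch, not Trovit's implementation.

```python
import hashlib
from collections import defaultdict


def signature(doc, fields):
    """Hash the chosen fields into a fingerprint; equal fingerprints mean duplicates."""
    joined = "\x1f".join(str(doc.get(f, "")).strip().lower() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def map_phase(docs, fields):
    # Emit (signature, doc) pairs, as a mapper would.
    for doc in docs:
        yield signature(doc, fields), doc


def reduce_phase(pairs):
    # Group by signature and keep the first document seen for each key.
    # A real job would need an explicit tie-break, since arrival order
    # at a reducer is not guaranteed.
    grouped = defaultdict(list)
    for key, doc in pairs:
        grouped[key].append(doc)
    return [group[0] for group in grouped.values()]


ads = [
    {"id": 1, "title": "Flat in Barcelona", "price": 900},
    {"id": 2, "title": "flat in barcelona ", "price": 900},
    {"id": 3, "title": "House in Madrid", "price": 1500},
]
unique_ads = reduce_phase(map_phase(ads, ["title", "price"]))
```

Doing this before indexing means duplicates never reach the index, so the indexing step itself has less work to do.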


RELEVANCE IMPROVEMENTS AT CENGAGE FOR ENGLISH AND NON-ENGLISH CONTENT


        Presented by - Ivan Provalov | Cengage Learning, United States

In this session we describe relevance improvements we have implemented in our Lucene-based search system for English and Chinese content and the tests we have performed for Arabic and Spanish content based on TREC data. We will also describe our relevance feedback web app for the end-users to rank results of various queries. The presentation will have information about the usage data we analyze to improve the relevance. We will also touch upon our OCR data indexing challenges for English and non-English content.


SOLR + GREENPLUM = MPP SOLR


        Presented by - George Chitouras | EMC, United States

The overall topic is computation on structured and unstructured data with "Big Data Text Analytics" as the goal. In this session we will look at Solr integrated with an MPP database, a shared-nothing RDBMS architecture that scales out linearly. One feature of Greenplum (our MPP database) is "external tables", which allows the definition of RDBMS tables based on files, processes/pipes, or HTTP-accessible resources. These external tables feature parallel I/O utilizing the aggregate I/O (disk and network), memory, and CPU of the cluster. Apache Solr is a natural choice for such an integration due to its REST-like HTTP API, which is a clean fit for the framework. We have deployed Solr in a highly parallel manner to achieve high performance scalable text indexing and search, integrated with the MPP database to provide advanced analytics.


SECURING DOCUMENTS IN SOLR WITH MANIFOLD CF


        Presented by - Karl Wright | Nokia, Inc, United States

This talk combines a brief presentation with a panel Q&A session with a number of key experts in the field of open source content acquisition and security. We'll start by familiarizing Solr users with the capabilities of Apache Manifold Connector Framework, concentrating on how Manifold CF (MCF) can be used to project a repository's security into Solr search results through the use of Manifold CF's Authority Service and a custom Solr search component. We'll then transition to a panel discussion designed to explore case studies of how this security architecture has worked out when deployed in the field, and take questions from the audience. If you have questions in advance you would like us to consider for the panel discussion, we'd welcome them. You may submit questions ranging from 'how-to' to the MCF roadmap to kwright(at)apache.org.


NATURAL LANGUAGE SEARCH IN SOLR


        Presented by - Tommaso Teofili | Sourcesense, Italy

This presentation aims to showcase how to build and implement a search engine which is able to understand a query written in a way much nearer to spoken language than to keyword-based search using Apache Lucene/Solr and Apache UIMA. A system which can recognize semantics in natural language can be very handy for non-expert users, e-learning systems, customer care systems, etc. With such a system it's possible to submit queries such as "hotels near Rome" or "people working at Google" without having to manually transform a user-entered natural language query to a Lucene/Solr query.

The Solr - UIMA integration (since Solr 3.1.0) can help in building such intelligent systems using NLP / text mining algorithms on documents being indexed and on queries written by the user.

This module gives Solr the ability to call UIMA pipelines when documents are indexed to trigger automatic extraction of metadata (e.g. named entities like people, places, organizations, etc.) using existing and custom algorithms as UIMA analysis engines. The talk will cover:

  • The Solr - UIMA integration
  • Introducing UIMA to Lucene's analysis phase
  • Running existing open source NLP algorithms in Lucene/Solr
  • Orchestrating blocks to build a sample system able to understand natural language queries

We'll introduce these points using examples (architectures & code) and a sample demo system.


IMPROVED SEARCH WITH LUCENE 4


        Presented by - Robert Muir | Lucid Imagination, United States

This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.

The talk will give a brief description of these new features and walk through practical use cases that you can try yourself with the features now available in Lucene 4. We'll cover how you can configure Solr to:

  • Set up the schema to use Pulsing or Memory codec for a primary key field
  • Not use a separate spellcheck index, controlling character-level swaps from the query processor
  • Sort with a different locale
  • Use per-field similarity configurations, such as a non-vector-space algorithm

ENTERPRISE SEARCH: FAST ESP TO LUCENE SOLR


        Presented by - Michael McIntosh | TNR Global, LLC, United States

This presentation will discuss migration from FAST ESP to a Lucene Solr search platform. Illustrated through actual case studies, the presentation will include challenges and concerns, and present solutions and work-arounds to overcome migration issues. There are many reasons that an IT department with a large scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company's purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users.


PROVIDING A MORE POWERFUL SOLR SEARCH WITH SEMAPHORE


        Presented by - Jeremy Bentley | Smartlogic, United States

Metadata is widely understood to be a critical element of search, discovery and classification. But with the preponderance of unstructured data addressed by search technology, consistent native metadata is often in short supply. Organizations often find that the quality and depth of contextual metadata (what documents are about) can make or break search relevancy, precision and recall.

Semaphore is an enterprise semantic platform that uniquely captures an organization's subjects and topics into a taxonomy or ontology (model), in a manner that adds context for enhanced navigation and findability. Semaphore augments traditional information management systems like Solr search by adding advanced content classification, metadata and navigation capabilities to deliver a more complete, higher quality enterprise information management experience. This talk will focus on the following:

  • Deep dive into the technical integration of Semaphore with Apache Solr (including the connection points between Semaphore and Solr)
  • Discuss the Semaphore modules (Ontology Manager, Classification Server, Semantic Enhancement Server and Search Application Server) and how they provide better findability
  • Share a demonstration of Solr in action
  • Present a client case study (Nordyske).

STUMP THE CHUMP


        Presented by - Chris Hostetter | Lucid Imagination, United States

Do you have a tough problem with your Solr application? Facing challenges that you'd like some advice on?

Looking for new approaches to overcome a Solr issue? Not sure how to get the results you expected? Don't know where to get started? Then this session is for you. Get Chris Hostetter (aka Hoss) to come up with an immediate answer to your challenge or interesting problem.

During the session, Hoss will see the questions for the first time - and then will provide his approach to the problem. Our panel of judges will decide if he has provided an effective solution. Prizes will be awarded by the panel for the best question - and for those deemed to have "stumped the chump".
Submit your questions to info@lucene-eurocon.org



DAY 2 TALKS



TRACK SESSIONS DAY 2


LUCENE TODAY, TOMORROW AND BEYOND


        Presented by - Simon Willnauer | JTeam / Apache Lucene, Germany

Apache Lucene has grown into one of the most widely used Open Source search technologies. For more than a decade Lucene has been used to retrieve search results for millions of users from mobile phones to world scale applications with billions of queries every day. This talk introduces the current state of the Lucene eco-system from a technical perspective and tries to provide a future vision of the project even beyond the next revolutionary major release.


SOLR ON WINDOWS: DOES IT WORK? DOES IT SCALE?


        Presented by - Teun Duynstee | Funda, Netherlands

We will present a case study about running Solr on Windows and using Solr from Windows.

funda.nl is a Dutch household name. Our website is by far the largest real estate search engine in The Netherlands. Searching for homes for sale is the main functionality and used to be implemented as a home-grown SQL Server based solution. This worked fine performance-wise, but it was not very flexible in making changes to facets and searching/sorting. Over the last year we have migrated this solution to Solr.

funda serves 10M pageviews daily and most of these pages involve searching and faceting. Every month, 3M unique visitors visit the site (nearly 20% of the nationwide population of 16M). Our systems are built on the .NET platform in C# and this is also the skill set of the development and operations teams.

We'll discuss:

  • What kind of problems did we encounter when connecting to the Solr service from .NET?
  • About scalability: we will give many metrics about our solution: searches per second, indexing speed, effects of caching under load, indexer/searcher topology etc.
  • How caching of results in memcached compared to using the internal caching in Solr
  • Choices we made in running Solr on Windows. We use Tomcat and run it as a Windows system service. Results of stress and load tests we did
  • How we introduced Solr into the organization, taking away risks and uncertainties by doing a phased transition

SEARCH, REST AND PLAY! VIDEO DISCOVERY


        Presented by - James Alexander | Open University, United States

The Open University has been creating programmes in partnership with the BBC for over forty years. The resulting video archive contains over 9,000 programmes and 70,000 tapes of raw footage. The Access to Video Assets (AVA) project has been making this collection accessible to facilitate reuse and digitally preserving content: about 100TB of data so far.

Making video discoverable presents challenges that go beyond text-based search, and AVA has used Solr in unlocking this archive and bringing order to a collection that contains a plethora of physical and digital formats: these include over twenty types of tape, BBC holdings records, library catalogue and transmission data, subtitles, transcripts, rights contracts and other documents.

The resulting interface is designed to make video and metadata held in a repository accessible in innovative and intuitive ways for use by non-specialists. Its features include tag clouds hyperlinked to play out video content as well as other visualization and exploration tools including image storyboards, dynamic menus and facets.

This presentation outlines how Solr's REST API has been used to rapidly develop a Drupal interface and how indexing can be used to produce 'in-video' search.


SOLR @ ETSY


        Presented by - Giovanni Fernandez-Kincade | Etsy.com, United States

Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items and people trying to find them over 10 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges. These include:

  • Infrastructure approaches like using Thrift as our interface to Solr and writing our own load-balancer.
  • Writing custom code that inter-operates with Solr, from QParserPlugins to in-house Stemmers.
  • Internationalization efforts, including tailoring the search experience to user language and location, and on-the-fly query translation – all a natural result of our efforts to create a global marketplace.

Finally, no talk about Etsy would be complete without some exploration of our incremental development strategy and the tools that empower it. We'll describe in detail how we continuously deploy our search stack, instantly change server configuration, and measure the impact of algorithmic changes using side-by-side user tests and live A/B tests.


SOLR AT VIRGIN MONEY GIVING


        Presented by - Robin Bramley | Ixxus, United Kingdom

Virgin Money Giving is a UK-based not-for-profit business that was launched as a result of Virgin Money’s sponsorship of the London Marathon and raised over £10 million for charity in the first 6 months of operation. The aim of the business is to provide a better deal for charities than that provided by the for-profit companies that previously dominated the sector. Search is of critical importance to the business to help connect fundraisers with charities and fundraising events, as well as allowing donors to find charities to donate to or fundraisers to sponsor.

The architectural vision was to build a service-oriented architecture leveraging Open Source software for cost effectiveness and flexibility. Ixxus, a Lucid Imagination partner, helped Virgin Money Giving to realise their overall vision including designing and implementing a search architecture that met the following goals:

  • Not polluting business logic or tightly coupling it to a search engine API
  • Asynchronous ‘fire and forget’ indexing
  • Read-only replica search nodes for scale out
  • High Availability / Disaster Recovery

This presentation will describe how the combination of Solr, the Spring Framework and JMS was successfully used on Virgin Money Giving, a medal winning project in the British Computer Society 2010 Computing awards. This session is aimed at architects and will cover the event-driven approach employed, the Solr features utilised and some of the alternative solutions that might be considered now.

The application built for Virgin Money Giving was awarded a medal by the BCS (http://bcs.org) for an outstanding community IT project. See http://bit.ly/Virgin-Solr-BCS for details.


THE MANY FACETS OF APACHE SOLR


        Presented by - Yonik Seeley | Lucid Imagination, United States

Faceted Searching is a must-have feature for enhancing findability and user engagement in enterprise search UI. The Faceted Searching features of Apache Solr have been a major factor in its popularity, but many Solr users don't fully appreciate all of the capabilities that are available. In this session we will deep dive into the different types of data facets that Solr supports, discussing in detail the various options that can be used to explore them. We will also review some specific techniques for dealing with several complex use cases, and discuss some performance "gotchas" and how to avoid them.


USING SOLRCLOUD, FOR REAL!


        Presented by - Jon Gifford | Loggly, United States

Loggly is a cloud-based logging service. It helps you collect, index, and store all your log data and then makes it accessible through search for analysis and reporting. All this is done without having to download or install anything on your servers. We have hundreds of customers, each of whom may have dozens of shards, quickly growing to thousands of individual indexes. To manage this explosion of indexes, I'll describe how we're using SolrCloud to manage each and every index - from creation, through migration from box to box, and finally destruction. I'll describe some of the performance issues we had to deal with, especially with ZooKeeper.


UNDERSTANDING AND VISUALISING SOLR 'EXPLAIN' INFORMATION


        Presented by - Rafal Kuc | Solr.pl, Poland

This talk shows how to use, understand and visualize Solr 'explain' information, essential output from Solr that lets you better tune and debug your search application. I'll show free software, currently in development, that visualizes Solr 'explain' information, such as how a document's score was calculated, which components contributed to it, which tokens mattered the most, and so on.


RANDOMIZED CONTINUOUS TESTING: SOLR AND LUCENE USE CASE


        Presented by - Dawid Weiss | Carrot Search s.c, Poland

We have been taught that unit tests should be repeatable, and most people (including the author) for a long time considered this equivalent to "static", single-path execution. Solr and Lucene employ an interesting JUnit runner strategy where tests are randomized: run with various data, various implementations of allowed interfaces, various configurations. The number of combinations makes running them all impossible, but execution randomization proves very successful at pinpointing implementation and regression bugs. This talk will provide an overview of this approach and practical considerations on when and how to port it to your own projects. Not everything that stems from Lucene/Solr is directly connected to search/document retrieval, and much of it can be useful and reused in other projects. This session is probably best suited to developers/CTOs.


BETTER SEARCH ENGINE TESTING


        Presented by - Eric Pugh | OpenSource Connections LLC, United States

"I know it when I see it".

This phrase was coined by a Supreme Court Justice in reference to obscenity, but he might as well have been talking about relevancy and search engine results. Testing search engines is rarely a binary process of "it works, it doesn't work"; instead it draws on our human skills to design tests that capture the intangibles that make up a great search engine implementation! The behavior of a search engine changes as the data changes, so a search that returns one set of results today will return a different set tomorrow. Is that a bug? Or just a finely tuned search engine responding to changes in the data it searches? Search Engine testing often focuses on the very first layer of functionality, "Do I get results?", without digging deeper into "Do I get great relevant results?".

Search Engine implementation projects are typically less about writing new code, and more about integrating disparate existing data sets, turning knobs and levers to tune relevancy, and really understanding your data. Testing Search Engines really is a holistic activity.

You will leave this session armed with an overview of what search engines are, and how they work, and with real life techniques to apply, both to Exploratory Testing based search as well as Automated Testing. Users will also leave with a good grasp of using SolrMeter to quickly test the impact of configuration changes.

SOLR ON EC2


        Presented by - Erick Erickson | Lucid Imagination, United States

"Cloud computing" is all the rage recently, and Amazon's EC2 is one of the major players. The idea of spinning up a new instance of Solr in seconds to accommodate increased load is very attractive, especially as it can be done on demand, without heavy infrastructure investment. But how does that actually work?

This talk will (very) briefly outline creating a ready-to-deploy image containing a Solr instance. From there we'll discuss the various considerations to keep in mind when running Solr on EC2, including replication concerns, monitoring and integration with CloudWatch, indexing, and cost.

We'll also explore autoscaling: automatically increasing search capacity in response to the current load, and some of the Solr-specific issues that need to be considered when planning for it.

Finally, we'll consider the possibilities that EC2 offers for answering the persistently difficult question: "How many documents can I put on my server?"

TEXT ANALYTICS IN ENTERPRISE SEARCH


        Presented by - Daniel Ling | Findwise, United States

Text analytics is a large and interesting subject, covering a wide range of topics. In the world of enterprise search, however, the usual application of text analytics rarely ranges beyond extracting semi-structured information from the source data. Yet some of the more advanced concepts in text analytics, such as automatic text categorization, can easily be leveraged to turn a search installation from a search tool into a tool for discovery.

This talk will focus on how Findwise was recently able to combine entity extraction and categorization techniques with full text search in order to generate real business value for our customers. This was accomplished using a mix of technologies and readily available tools, from simple linguistic models to machine learning.

HOW TO IMPROVE THE QUALITY OF YOUR SOLR RESULTS BY IMPLEMENTING THE RIGHT WORD ANALYZERS


        Presented by - Holger Keibel, Alberto Mijares | Canoo Engineering AG, Switzerland

In our Solr-based projects, we often observe that many expected documents are missing from the results of a query. For instance, a query for the German word "fliegen" (to fly) will generally not find documents only containing the corresponding past participle "geflogen" (flown). Likewise, a query for "Konto" (account) will generally fail to retrieve documents only containing related compound words such as "Gehaltskonto" (payroll account). Solr provides several tools for stemming and word segmentation to solve these problems but they are not very reliable and in some cases make things even worse. We therefore turned to more reliable external tools instead.

In this talk I describe how we built our own analyzer classes around these tools and how we plugged them into Solr's analyzer chain. I also explain how this improved the quality of search results and the MoreLikeThis module.
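
The "Konto" / "Gehaltskonto" problem can be made concrete with a toy Python sketch of dictionary-based compound splitting. This is an illustration only: it is not the external tooling the abstract refers to, it does not use Solr's analyzer API, and the `LEXICON`, the `LINKING` elements and the function names are all assumptions. The idea is to index the parts of a compound alongside the full token so that a query for a part can match.

```python
# Hypothetical miniature lexicon of known word stems.
LEXICON = {"gehalt", "konto", "spar", "buch"}

# Linking elements that may join the parts of a German compound.
LINKING = ("", "s", "es", "n", "en")


def decompound(word, lexicon=LEXICON):
    """Split `word` into known parts; return [word] if no full split exists."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try the longest known prefix first.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head not in lexicon:
                continue
            tail = rest[end:]
            for link in LINKING:
                if tail.startswith(link):
                    parts = split(tail[len(link):])
                    if parts is not None:
                        return [head] + parts
        return None

    parts = split(word)
    return parts if parts else [word]


def index_terms(token):
    """Terms to index for `token`: the token itself plus its compound parts."""
    lowered = token.lower()
    return [lowered] + [p for p in decompound(token) if p != lowered]
```

With this sketch, `index_terms("Gehaltskonto")` yields the full token plus "gehalt" and "konto", so a query for "konto" would match. A real analyzer would need a far larger lexicon and morphological handling for cases such as "fliegen" / "geflogen".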

DESIGNING MOBILE SEARCH


        Presented by - Tyler Tate | TwigKit, United States

About 15% of searches in 2011 have been performed on mobile devices, with an estimated rise to one in every four by next year. And people aren't just searching Google: restaurants, cars, electronics, and even enterprise content are all being searched by people on the move.

How should we design Lucene- and Solr-powered search experiences for phones and tablets? To be sure, very different rules apply: small screens, slow connections, limited attention, and location awareness all call for user interfaces on mobile devices that differ substantially from those on the desktop.

This talk will examine design patterns for mobile search, including approaches to faceted navigation, autocomplete, sorting, breadcrumbs, recent history, and bookmarks, as well as how these design patterns fit together as a whole.

HEAVY COMMITTING: FLEXIBLE INDEXING IN LUCENE 4


        Presented by - Uwe Schindler | SD DataSolutions GmbH, Germany

Apache Lucene's next major release, 4, will introduce lots of flexibility into indexing, but also fundamental changes to the well-known APIs: It features a new and consistent, 4-dimensional iteration API on top of a low-level, pluggable codec API giving applications full control over the postings data. Terms are now arbitrary opaque bytes enabling users to store terms in any encoding, not necessarily UTF-8, natively in the index (e.g. numeric fields). Currently under development is a higher performance postings iteration API, enabling interesting codecs based on recent encoding algorithms to work effectively. Several codecs have already been created, including the default "standard" codec, which enables sizable RAM reduction for searchers, and a "pulsing" codec that inlines postings data directly into the terms dictionary, which provides a solid performance boost for primary key fields. A lot of new codecs are under development. In this talk, Uwe presents an overview of all of these exciting changes, as well as several concrete, real-world examples of how applications can tap into these new features.
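
To illustrate why opaque byte terms are useful for numeric fields, here is a small Python sketch of one well-known technique: encoding signed integers as fixed-width bytes whose bytewise order matches numeric order. It is an assumption-laden illustration, not Lucene's actual term encoding, and the function names are hypothetical.

```python
import struct

_OFFSET = 1 << 63  # shifts the signed 64-bit range onto the unsigned range


def int_to_sortable_bytes(value):
    """Encode a signed 64-bit int so that bytewise order equals numeric order."""
    if not -_OFFSET <= value < _OFFSET:
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Adding the offset flips the sign bit; big-endian packing then makes
    # lexicographic byte comparison agree with numeric comparison.
    return struct.pack(">Q", value + _OFFSET)


def sortable_bytes_to_int(data):
    """Decode bytes produced by int_to_sortable_bytes."""
    (unsigned,) = struct.unpack(">Q", data)
    return unsigned - _OFFSET
```

A term dictionary that sorts terms as raw bytes would then keep such values in numeric order, which a decimal string encoding such as "10" versus "9" would not.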
