
What is Apache Solr?

Apache Solr is essentially an open source search platform built on Apache Lucene, the Java library that provides its powerful indexing mechanism. Although it is a search engine, it offers far more than search, with features such as data replication and support for multiple cores. Solr also ships as a ready-to-use web application that lets you run queries over HTTP requests, so you can think of it as a packaged version of Lucene.

To run Solr, you first need a machine with 8 GB of RAM or more, or you can provision an instance on Azure or Amazon. Besides downloading Solr from the official site, Java 8 must be installed on your system.
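Assuming you have downloaded the archive from the official site, a minimal local setup looks roughly like the sketch below; the version number is only illustrative.

    # extract the downloaded archive (version shown is illustrative)
    tar xzf solr-8.11.2.tgz
    cd solr-8.11.2

    # start Solr in SolrCloud mode; the admin panel listens on port 8983 by default
    bin/solr start -c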
When the above command completes, the management panel of our Solr application will be served on localhost. We will come back to the management panel later in the article.

 Apache Solr management panel

 

Documents and Collections

Solr is a document-oriented database. An entity such as a person consists of fields such as name, address and email, and these documents are stored in collections. You can think of collections as the tables of a traditional database. The single most important difference is that, unlike in a conventional database, a Person entity can hold more than one address in the same document; in a traditional database you would need to open a separate Address table and establish a relationship with Person.
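As a sketch of what such a document looks like, the request below indexes a person with a multi-valued address field; the collection name people and the field names are hypothetical, and the address field is assumed to be declared multi-valued (or the collection to run in schemaless mode).

    # index one document whose address field holds two values
    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/people/update?commit=true' \
      --data-binary '[
        {
          "id": "42",
          "name": "Jane Doe",
          "email": "jane@example.com",
          "address": ["221B Baker Street, London", "5 Main Street, Boston"]
        }
      ]'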

What is Shard, Replica and Core?

Unlike many relational databases, data is automatically sharded and replicated through SolrCloud. When you add a record to a properly configured collection, the record is automatically routed to one of the Solr instances. Distributing records across instances as shards increases read performance. Each document is also copied to a different instance as a replica, so when a Solr instance crashes, only the overall performance of the cluster drops; data loss is prevented because the corresponding replica still exists on other instances.

Hierarchy between Shard, Replica and Core

 

A cluster is the structure made up of all the nodes, each of which is a JVM instance running Solr. A node can contain multiple cores, and a core is the index behind one replica of a shard. Cores are generally named by combining the collection name, shard number and replica number, for example ipps_shard1_replica_n1.
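One way to see this hierarchy on a running cluster is the Collections API's CLUSTERSTATUS call, which lists each collection's shards, their replicas, and the core names behind them:

    # show shards, replicas and the cores that host them
    curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS'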

Creating Collection

You can create and manage a collection through REST-style HTTP interfaces, or perform the same operation with the solr command. As sample data, we can download healthcare cost data from Data.gov and export it to CSV format for convenience.
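As a sketch, the collection could be created either with the solr CLI or with the Collections API; the shard and replica counts below are only illustrative.

    # create the ipps collection with 2 shards and 2 replicas per shard
    bin/solr create -c ipps -shards 2 -replicationFactor 2

    # the equivalent request through the Collections API
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=ipps&numShards=2&replicationFactor=2'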


Before we transfer the data, we need to create a schema, just as we would in a traditional database.
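A minimal sketch using the Schema API is shown below; the field names follow the IPPS CSV headers and, like the field types, are only illustrative.

    # add a few fields to the ipps schema through the Schema API
    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/ipps/schema' \
      --data-binary '{
        "add-field": [
          {"name": "Provider_Name",          "type": "text_general", "stored": true},
          {"name": "Provider_State",         "type": "string",       "stored": true},
          {"name": "Average_Total_Payments", "type": "pdouble",      "stored": true}
        ]
      }'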

When the request succeeds, you can see the fields you added under the fields element of the ipps schema.
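One way to check this outside the admin panel is to read the schema back over HTTP:

    # list the fields currently defined on the ipps collection
    curl 'http://localhost:8983/solr/ipps/schema/fields'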
Now that our schema has been created, it is time to send the data to Solr. Besides tools such as Postman, we can also use the bin/post CLI tool, which is currently available for Linux and macOS environments.
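A sketch of the bin/post invocation is below; the CSV file name is hypothetical.

    # post the exported CSV file into the ipps collection
    bin/post -c ipps ipps.csv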

Querying Data

Client support is available for many programming languages (including .NET, Java, and Python) to run queries against Solr. Alternatively, you can query directly from your browser.
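The same select handler the browser uses can also be called with curl; the field name here is only illustrative.

    # return the first 10 documents whose Provider_Name contains "hospital"
    curl 'http://localhost:8983/solr/ipps/select?q=Provider_Name:hospital&rows=10'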

Solr Management

If you don't want to work at the command prompt, Apache Solr also has a user interface. You can reach it at http://localhost:8983/solr. If you select the ipps collection on the left and click the Query section below it, there is also a nice interface where you can enter your query.

So far we have performed simple text queries. You can also restrict searches to specific ranges to make them more precise. If the relevance-based ordering Solr produces by default does not suit you, you can use more advanced query expressions to ensure that, just as in relational databases, only records that match the query are returned. You can sort on various fields and perform category-based (facet) filtering. If that is still not enough, you can let Solr learn the best ranking for you with its machine learning capability called "Learning to Rank".
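As a small example of combining these options, a single request can apply a filter query, a sort and a facet; the field names are again illustrative.

    # filter to one state, sort by payment amount, and facet on state
    curl 'http://localhost:8983/solr/ipps/select?q=Provider_Name:hospital&fq=Provider_State:TX&sort=Average_Total_Payments+desc&facet=true&facet.field=Provider_State'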

Why Should I Use Solr?

If you need a search engine, using Solr makes good sense. It also serves as a distributed document database and offers a SQL interface, so it can connect to data visualization tools such as Tableau. You can talk to Solr from many programming languages, or interact with it easily through JSON and XML documents.
If you are dealing with small data in a simple key-value structure, Solr may not be suitable for you, because Apache Solr is specialized for fast processing of large amounts of data and brings little benefit for small datasets.
If your search criteria are largely text-based, Solr is an indispensable option for you. It also offers very convenient query components for spatial search operations.

In this article, we tried to introduce Apache Solr, which you can use for search operations. You can share your thoughts about Solr with us in the comments below. See you in another article.
