Solr 7 beginner tutorial, part 2: delta-import, debugging tips.


In my previous tutorial on the basics of Solr, I didn't manage to fit everything I wanted to share in one post. Now I finally got around to wrapping up the subject with a subset of data import called delta-import and handy tips on debugging.

As a consequence, this part builds on the first one and assumes you've already imported the sample data from the AdventureWorks database through a full import and are somewhat familiar with Solr Admin (the Solr web interface).

With this short introduction out of the way now, let's kick it off with delta-import!

Delta-import

Data import is well described in the Lucene documentation, but that particular page quickly glosses over the subject of delta-import. Luckily for us, there's another page in the Apache wiki that gives an example of delta-import, but even that page, IMHO, doesn't show the full picture (the deletedPkQuery parameter is mentioned only fleetingly as 'something used only in delta-import'), especially the whys and whens of this approach to importing data into a Lucene index.

Solr indexes may contain millions of documents, but a collection of static documents can only get you so far. After the initial import of the whole collection, the records behind those documents may change, and those changes have to be propagated into the Solr index.

How do we do that? Well, it depends on how much data changes between subsequent imports:
  • Use case one: the index contains around 10 million documents, and 7 million of them (on average) will be obsolete by the time the next import takes place. Solution: do a full data import.
  • Use case two: the index contains 100 docs, and only one or two of them get outdated. Solution: even though only a small share of the index needs to be replaced, the overall index is so tiny that simply importing it again takes a fraction of a second.
  • Use case three: the index is huge, but only a small share of documents have to be added/removed/changed between index updates. Solution: this is where delta-import shines. Instead of ditching the whole index and importing it again through a full import, we can run queries that fetch ONLY the records that were added/changed/removed.
And that's exactly what delta-import does. Seeing how broad that definition is, you can guess that there isn't one single way to do delta-import. We'll talk about three ways to do it and try out two of them in practical examples.

First variety: keeping a separate delta table.

This approach is exactly what's shown in the title image: having a table with data for indexing, and alongside it a delta table that only stores changes to the data table.

We will start by checking the initial state of our index. After a full import it should be identical to the person.person table in the Adventureworks database:
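If you'd rather check from the command line than click through the Admin UI, comparing document counts is enough. A rough sanity check, assuming Solr runs on the default port 8983 and the core is named adventureworks_person as in part 1:

    curl "http://localhost:8983/solr/adventureworks_person/select?q=*:*&rows=0"

The numFound value in the response should match SELECT count(*) FROM person.person in Postgres.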


Since we're going to add documents to the index, let's also query the trailing document in adventureworks_person:
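If your import matches part 1, the highest BusinessEntityID in person.person should be 20777 (the delta example below adds 20778), so a direct lookup works as a quick check - adjust the ID if your data differs:

    curl "http://localhost:8983/solr/adventureworks_person/select?q=id:20777"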


Now that the original state of the index is clear, you will want to create a separate delta table in PostgreSQL. This can be done from the pure command-line interface, or you could install pgAdmin and do all the database-related stuff through a GUI:

pgAdmin has a web-based UI that's run on the local server. Quite handy when you don't exactly know what to do.

Whether you prefer windows or the console, both ways start at the command line: either open a psql session against the Adventureworks database or launch pgAdmin, and then set up the delta table via the graphical interface or with statements like the ones below:
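Here's a minimal sketch of such a delta table; I'm only listing the columns used in this tutorial, so assume the real thing mirrors the full person.person column list:

    CREATE TABLE person.persondelta (
        businessentityid integer     NOT NULL,
        firstname        varchar(50),
        middlename       varchar(50),
        lastname         varchar(50),
        deltaflag        char(1)     NOT NULL,              -- 'i', 'u' or 'd'
        modifieddate     timestamp   NOT NULL DEFAULT now()
    );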
Our mock delta table contains the same columns as the original table, plus:
  • Delta flag - we'll use it to determine what's happened to the record: 'i' stands for inserted, 'u' for updated and 'd' for deleted. Strictly speaking, the Lucene index doesn't even have to distinguish between an insert and an update - from its standpoint a document is either present (possibly changed) or removed.
  • Modified date - this column makes all the difference between our mock implementation and a real one. The delta-import query should not spew out every change that ever happened to the table. That would cause two problems: 1) every operation would be applied to documents on every import (run one more delta-import and the same document that was added last time gets added again); 2) eventually Solr would be swamped by changeset records. The modified date column kills two birds with one stone: the query will only fetch the changes made after the last import took place - either full or delta.
Trying out all possible actions: modifying, adding and removing a document from the index.
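A hypothetical changeset covering all three cases - the IDs match the ones you'll see in the import results later (1 updated, 20778 inserted, 2 deleted), and the names are made up:

    INSERT INTO person.persondelta (businessentityid, firstname, lastname, deltaflag)
    VALUES (1,     'Updated', 'Person',   'u'),  -- change an existing document
           (20778, 'John',    'Newcomer', 'i'),  -- add a brand new one
           (2,     NULL,      NULL,       'd');  -- mark one for deletion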

Note that for removed documents no fields except the ID have to be specified - the ID alone is enough to find the document liable for purging. Unfortunately, I haven't found a way to omit IDs for inserted records; it would be great if they could be computed on the fly.

After a little tweak to the data import configuration we should be able to get delta-import up and running.
Modify the file data-config.xml under $SOLRHOME/server/solr/adventureworks_person/conf:
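Here is a sketch of what the delta-enabled config could look like. The connection details and the firstname/lastname fields are placeholders carried over from part 1 - reuse whatever your full-import config already contains. I pull the changed fields straight from the delta table; joining back to person.person would work just as well:

    <dataConfig>
      <dataSource driver="org.postgresql.Driver"
                  url="jdbc:postgresql://localhost:5432/Adventureworks"
                  user="postgres" password="postgres"/>
      <document>
        <entity name="person" pk="id"
                query="SELECT businessentityid AS id, firstname, lastname FROM person.person"
                deltaQuery="SELECT businessentityid AS id FROM person.persondelta
                            WHERE deltaflag IN ('i', 'u')
                              AND modifieddate &gt; '${dih.last_index_time}'"
                deltaImportQuery="SELECT businessentityid AS id, firstname, lastname
                                  FROM person.persondelta
                                  WHERE businessentityid = '${dih.delta.id}'"
                deletedPkQuery="SELECT businessentityid AS id FROM person.persondelta
                                WHERE deltaflag = 'd'
                                  AND modifieddate &gt; '${dih.last_index_time}'">
          <field column="id" name="id"/>
          <field column="firstname" name="firstname"/>
          <field column="lastname" name="lastname"/>
        </entity>
      </document>
    </dataConfig>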

Our data import handler is ready to do delta-import now.

The queries required for delta-import are as follows:
  • deltaQuery - should return a list of IDs of new and modified documents in the index.
  • deltaImportQuery - after Solr is done executing the deltaQuery, it knows the IDs of the documents that have to be added/updated. It will then execute deltaImportQuery for each of those IDs to get all fields of the document.
  • deletedPkQuery - will be executed once to fetch all the records that got deleted. Documents with these IDs will be removed from the index - no further queries need to be executed.
Because of the additional modifieddate condition we added to deltaQuery and deletedPkQuery, you'll have to update the modifieddate column for any records you'd like to go into the delta. But first, restart Solr to apply the configuration changes:
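If you run Solr straight from the extracted distribution as in part 1, restarting is a single command (the restart command requires the port flag; 8983 is the default):

    bin/solr restart -p 8983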

So, to try out delta-import right now we'll do just that: bump modifieddate on the delta rows (a sketch follows below). After this update Solr will apply the changes we've included in the person.persondelta table, because they will definitely be fresher than the time the last indexing took place:
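A minimal way to bump the timestamps on every row of our mock delta table:

    UPDATE person.persondelta SET modifieddate = now();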

Uncheck 'Clean', otherwise current index contents will be thrown away.

What'll happen if delta-import works:
  1. When you do your first delta-import, 2 documents will be added/updated and one deleted, as can be seen in the screenshot above. To do that, Solr will execute deltaQuery once (fetching IDs 1 and 20778), then deletedPkQuery once (fetching ID 2), then deltaImportQuery for IDs 1 and 20778 - a total of 4 queries to the database. This can also be seen in the data import message.
  2. On every subsequent attempt to do a delta-import you won't see anything happening (see the picture below). That's because the changes were already imported on the first attempt. To import something again, UPDATE the modifieddate column for the records you want to be used in delta-import.

No matter how much you smash the Import button, the changes won't be applied twice.

Evidently, the index should reflect the changes from person.persondelta table now:

Changes from delta table were applied to index.
Solr config can be quite finicky and you'll definitely run into problems after configuring delta-import, just like I did. Refer to the Debugging Solr section below to see what I ran into - those tips may be helpful to you.

Pros and cons of Delta table approach:
Pros:
  • Easy from the programmer's point of view: every changeset is clearly marked out.
Cons:
  • Challenging to build a mechanism that will capture and store changes to a given table for an indefinite period of time - something akin to implementing version control inside an RDBMS.

Second variety: including the modification date in the original table.

Before we get to the other variety of delta-import, we'll have to get back to square one. Do a full import with "Clean" checked:

The index should be in mint condition after that:


Having a separate table to store all changes to the original one is handy, but can we do better? A delta table is bound to bloat eventually. Can we do without this table at all? Turns out we can (almost).

Merging the delta table into the original one would be a great solution - we'd only need the modifieddate column to tell that a record has been added or changed, and it turns out it's already in our person.person table:
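You can see it for yourself from any psql session, for example by peeking at the most recently modified rows:

    SELECT businessentityid, firstname, lastname, modifieddate
    FROM person.person
    ORDER BY modifieddate DESC
    LIMIT 5;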


But how do we determine that a record has been deleted? Here are some ideas:
  • Still keep a deltaflag column to mark records for deletion. This is really just the previous approach with the two tables merged into one, so I won't cover it.
  • Find a way to mark records for deletion with existing columns. E.g. set modifieddate to 1 January 1970 (the beginning of UNIX time) and exclude those records from all import queries except deletedPkQuery. Then run DELETE FROM person.person WHERE EXTRACT(YEAR FROM modifieddate) = 1970; after each delta-import.
  • Use the extended capabilities of your database. For instance, create a table for deleted records and a trigger that inserts a row for every record deleted from the person.person table. This table won't swell up as fast as an ordinary delta table from our first approach, since it only holds deleted records, but you still have to figure out how to clean it up periodically.

It's up to you to decide which one to use, or even come up with your own, more clever way. I'm going to show how to implement the last approach, because it involves the trickier parts: creating triggers and stored procedures.

First things first, we're going to lay the groundwork in Postgres: tables, triggers and everything.
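Here's a sketch of that groundwork with hypothetical names (persondeleted, log_person_delete): a small table for deleted IDs, a PL/pgSQL function, and a trigger that fires on every delete from person.person:

    -- Table that accumulates IDs of deleted persons.
    CREATE TABLE person.persondeleted (
        businessentityid integer   NOT NULL,
        deleteddate      timestamp NOT NULL DEFAULT now()
    );

    -- Trigger function: remember the ID of every deleted row.
    CREATE OR REPLACE FUNCTION person.log_person_delete() RETURNS trigger AS $$
    BEGIN
        INSERT INTO person.persondeleted (businessentityid)
        VALUES (OLD.businessentityid);
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    -- Fire it on every DELETE from person.person.
    CREATE TRIGGER person_delete_trigger
    AFTER DELETE ON person.person
    FOR EACH ROW EXECUTE PROCEDURE person.log_person_delete();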
And then, following our streamlined process, we'll have to modify the data-config.xml file in $SOLRHOME/server/solr/adventureworks_person/conf and restart Solr:
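Compared to the first approach only the three delta queries change: deltaQuery and deltaImportQuery now look at person.person itself, and deletedPkQuery reads from the trigger-populated person.persondeleted table. Again, a sketch with placeholder field names:

    <entity name="person" pk="id"
            query="SELECT businessentityid AS id, firstname, lastname FROM person.person"
            deltaQuery="SELECT businessentityid AS id FROM person.person
                        WHERE modifieddate &gt; '${dih.last_index_time}'"
            deltaImportQuery="SELECT businessentityid AS id, firstname, lastname
                              FROM person.person
                              WHERE businessentityid = '${dih.delta.id}'"
            deletedPkQuery="SELECT businessentityid AS id FROM person.persondeleted
                            WHERE deleteddate &gt; '${dih.last_index_time}'">
      <field column="id" name="id"/>
      <field column="firstname" name="firstname"/>
      <field column="lastname" name="lastname"/>
    </entity>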
Now, let's make some changes to the person.person table, taking care to set modifieddate as we do it. The trigger will take care of the rest. Some constraints will have to be dropped to allow us to insert new records.
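A hypothetical set of changes, matching the IDs used earlier. The insert's column list is trimmed, and some foreign keys may get in the way of the insert and the delete; drop or relax them as mentioned above:

    -- Update: no trigger involved, modifieddate does all the work.
    UPDATE person.person
       SET firstname = 'Updated', modifieddate = now()
     WHERE businessentityid = 1;

    -- Insert: a brand new person with a fresh modifieddate.
    INSERT INTO person.person (businessentityid, persontype, namestyle, firstname,
                               lastname, emailpromotion, rowguid, modifieddate)
    VALUES (20778, 'EM', false, 'John', 'Newcomer', 0,
            '00000000-0000-0000-0000-000000000001', now());

    -- Delete: the trigger records the ID in person.persondeleted for us.
    DELETE FROM person.person WHERE businessentityid = 2;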
We're going to check that all the changes got applied:
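Something along these lines will do, checking both the touched rows and the trigger table:

    SELECT businessentityid, firstname, lastname, modifieddate
    FROM person.person
    WHERE businessentityid IN (1, 2, 20778);

    SELECT * FROM person.persondeleted;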

Launch the delta-import and observe, hopefully, the same picture as in the previous approach.

Pros and cons of keeping changes in the original table:
Pros:
  • Easier than the first approach in that you don't have to maintain extra tables (sort of - it depends on your implementation). Just update the modifieddate field and you're good to go!
Cons:
  • Keeping track of deleted items is not a straightforward task.

Bonus third variety

This page describes a method of doing delta-import through a full import that may or may not be more efficient than the previous two, depending on your use case. I'm going to assume that after getting through the previous sections, you can easily try this one out on your own.

Debugging Solr

One thing I spent the most time trying to figure out is how to debug Solr when something doesn't work the way it should (which was often). For some reason I couldn't find a comprehensive guide on the topic, so here goes:
  • The first and easiest way to debug Solr dataimport is simply ticking the "Verbose" and "Debug" checkboxes in the UI:

Does it help? Not for me, but it's worth trying out as the first bulwark against bugs. The debug output will, at least, tell you whether the import failed or not.

  • The "Logs" section in Solr Admin:
This is about the most informative message I've seen in the Logs section of the UI. Mostly it's just "Hey, dataimport failed! That's all I know."

The same section can be used to configure logging levels - the dataimport package's log level should be switched to ALL if you're having problems with it:

  • The Silver Bullet - the Solr Jetty web server logs.
This is the most universal way to find out what's happening under the hood. Incidentally, that's also how I managed to resolve the most subtle problems with my configuration.

tail -f server/logs/solr.log. You don't have to go any further than that to solve all your problems.

Since Solr is Java through and through, for every fatal error that inevitably happens when things don't work out you'll see a familiar stack trace with every possible detail:


For example, when I was trying to make delta-import work, the issue turned out to be different timezones in PostgreSQL and Solr. Solr uses UTC by default, and Postgres was using UTC-4 for some reason. Therefore, even after updating the modification date, the records were deemed 'obsolete' and not imported at all.

The solution was to make Postgres use the same timezone as Solr - UTC:
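One way to do it, assuming the database is named Adventureworks as in part 1 (setting timezone in postgresql.conf and reloading works too); new connections will then report timestamps in UTC:

    ALTER DATABASE "Adventureworks" SET timezone TO 'UTC';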
Looking at the logs after switching Postgres to UTC: the timestamp in Postgres now corresponds to what can be seen in the Solr Jetty logs.

Points to keep in mind - common causes of failure:
  • XML tags in Solr config are case-sensitive. <dataConfig> will work, <dataconfig> won't. The same holds true for <dataSource>.
  • Did you restart Solr after the last change to the configuration? Try doing it right now.

Conclusion

I personally found the second approach to delta-import the most satisfying technically. It's hard to pick a favorite strategy for keeping track of deleted items between trigger-based deletes and simply setting modifieddate to the year 1970 - both have an indubitable elegance to them.

Whatever my preferences, this post's primary purpose is to give you a taste of the many ways of doing delta-import and to let you decide which one to use on your own.

Comments

1. Did you have to check the last_index_time in the dataimport.properties file before composing the deltaQuery?
   Mine includes escape characters,
   e.g. 2018-10-17 20\:13\:12
   Do I have to take that into account and do a
   WHERE TABLETIMETS > TIMESTAMP_FORMAT('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')
   Is dataimporter.last_index_time perceived as a string by the DBMS, and in turn do I have to convert it into a timestamp?

        Replies
   1. Methinks it will depend on which DBMS you use.
      I had to experiment with different ways to plug last_index_time into the query, including formatting it like you mentioned. In the end, a plain old variable worked, with just apostrophes around it:
      modifieddate > '${dih.last_index_time}'
      Any failed attempt surfaced in the solr.log file, as you can see in one of the screenshots on this page. That's how you'll know what the DBMS "sees" after Solr sets the date in the query.

   2. Thanks for your response. You answered my next question as to how I can see what the DBMS is actually getting. My full import is nicely logged with all the JDBC stuff (which connection is being used, making a connection) and I'm able to see the actual query, but I don't see such detail for the delta load.

          2018-10-18 18:48:53.905 INFO (qtp817348612-15) [ x:cus] o.a.s.c.S.Request [cus] webapp=/solr path=/dataimport params={indent=on&wt=json&command=status&_=1539778725489} status=0 QTime=0
          2018-10-18 18:48:54.019 INFO (qtp817348612-20) [ x:cus] o.a.s.h.d.DataImporter Loading DIH Configuration: solr-data-config.xml
          2018-10-18 18:48:54.020 INFO (qtp817348612-20) [ x:cus] o.a.s.h.d.c.DIHConfiguration id is a required field in SolrSchema . But not found in DataConfig
          2018-10-18 18:48:54.020 INFO (qtp817348612-20) [ x:cus] o.a.s.h.d.c.DIHConfiguration The field :CUSTID present in DataConfig does not have a counterpart in Solr Schema
          2018-10-18 18:48:54.020 INFO (qtp817348612-20) [ x:cus] o.a.s.h.d.DataImporter Data Configuration loaded successfully
          2018-10-18 18:48:54.021 INFO (qtp817348612-20) [ x:cus] o.a.s.c.S.Request [cus] webapp=/solr path=/dataimport params={core=cus&indent=on&commit=true&name=dataimport&clean=false&wt=json&command=delta-import&_=1539778725489&verbose=true} status=0 QTime=2
          2018-10-18 18:48:54.021 INFO (Thread-17) [ ] o.a.s.h.d.DataImporter Starting Delta Import
          2018-10-18 18:48:54.026 INFO (Thread-17) [ ] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties
          2018-10-18 18:48:54.027 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Starting delta collection.
          2018-10-18 18:48:54.029 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Running ModifiedRowKey() for Entity: CUST
          2018-10-18 18:48:54.029 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Completed ModifiedRowKey for Entity: CUST rows obtained : 0
          2018-10-18 18:48:54.030 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Completed DeletedRowKey for Entity: CUST rows obtained : 0
          2018-10-18 18:48:54.030 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Completed parentDeltaQuery for Entity: CUST
          2018-10-18 18:48:54.030 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Delta Import completed successfully
          2018-10-18 18:48:54.030 INFO (Thread-17) [ ] o.a.s.h.d.DocBuilder Time taken = 0:0:0.3
          2018-10-18 18:48:54.030 INFO (Thread-17) [ ] o.a.s.u.p.LogUpdateProcessorFactory [cus] webapp=/solr path=/dataimport params={core=cus&indent=on&commit=true&name=dataimport&clean=false&wt=json&command=delta-import&_=1539778725489&verbose=true} status=0 QTime=2{} 0 11

      Also, is there a difference between dih.last_index_time and dataimporter.last_index_time?

   3. ...meaning the query completed successfully - that's why you don't see it in the output. I seem to recall I had to intentionally add syntax errors to queries that didn't seem to elicit any response, so that they would pop up in the error log messages.

      Not sure about dih vs dataimporter; if you do find out the difference, please share here.

2. Thanks for explaining delta queries in detail.

3. When I used the above as a reference to create my data-config.xml, I get the error below when running a delta-import:
   org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query where ID= Processing Document

