Dedupe, or not dedupe – that is the question

Hi there,

It has been a little while again; I have been pretty busy recently. Anyway, have a nice summer weekend!

When you index millions and millions of documents, you will inevitably run into duplicates. Duplicate data doesn’t necessarily mean that two documents are identical; it can simply mean that they are essentially the same document for your business purposes.

You can certainly dedupe them before indexing into Solr. But that is not always easy, since you would need to keep track somewhere of the dedupe criteria for every document you have already indexed, and it gets harder when you have tons of documents. For this, Solr provides a handy way to dedupe at index time.

I will use a simple example to illustrate how dedupe can be configured, and then analyze how it works.

Consider the document below, which contains information about a person. The combination of the name, ssn4 and dob fields is assumed to uniquely identify a given person. The id field can be different, and the city field is not part of the duplication criteria because people move around the country.

{
        "name":"John Doe",
        "ssn4":9999,
        "dob":"1995-07-04T00:00:00Z",
        "id":"1",
        "city":"san francisco"
}

If we want to index each person only once in Solr (based on name/ssn4/dob), we can set Solr up like this.

In solrconfig.xml, add/enable the updateRequestProcessorChain.

     <updateRequestProcessorChain name="dedupe">
       <processor class="solr.processor.SignatureUpdateProcessorFactory">
         <bool name="enabled">true</bool>
         <str name="signatureField">signature1_s</str>
         <bool name="overwriteDupes">true</bool>
         <str name="fields">name,dob,ssn4</str>
         <str name="signatureClass">solr.processor.Lookup3Signature</str>
       </processor>
       <processor class="solr.LogUpdateProcessorFactory" />
       <processor class="solr.RunUpdateProcessorFactory" />
     </updateRequestProcessorChain>

The SignatureUpdateProcessorFactory request processor calculates a signature hash from the combination of the name, dob and ssn4 fields, and stores it in a new field called signature1_s. Make sure this new field signature1_s is defined in your schema. Lookup3Signature is the class that defines the algorithm used to generate the signature hash. You could use others, such as MD5 (MD5Signature).
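To make the idea concrete, here is a minimal Python sketch of a signature computed over selected fields only. It is an illustration under my own assumptions, not Solr's implementation: it uses MD5 from Python's hashlib as a stand-in (Lookup3 is not in the standard library), and the field ordering and byte layout are arbitrary choices, so the values will not match what Solr produces. The function name `signature` is hypothetical.

```python
import hashlib

def signature(doc, fields=("name", "dob", "ssn4")):
    """Hash only the configured fields, in a fixed order (illustrative, not Solr's exact layout)."""
    h = hashlib.md5()
    for field in sorted(fields):
        if field in doc:
            h.update(field.encode("utf-8"))
            h.update(str(doc[field]).encode("utf-8"))
    return h.hexdigest()

john_sf = {"name": "John Doe", "ssn4": 9999, "dob": "1995-07-04T00:00:00Z",
           "id": "1", "city": "san francisco"}
john_ny = {**john_sf, "id": "2", "city": "new york"}   # same person, new id and city
jane = {**john_sf, "name": "Jane Doe"}                 # a signature field differs

print(signature(john_sf) == signature(john_ny))  # True: id and city are ignored
print(signature(john_sf) == signature(jane))     # False: name is part of the signature
```

The point is simply that fields outside the configured list (id and city here) have no influence on the signature.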

Now you can index some data with curl. Note that update.chain=dedupe selects the processor chain by its name, dedupe. Without it, these processors won’t run. You could also make the dedupe chain the default, in which case you would not need the parameter.

curl -X POST -H "Content-type:application/json" --data-binary @people.json "http://localhost:8983/solr/people/update/json?update.chain=dedupe&commit=true"

people.json

[
{"name":"John Doe","ssn4":9999,"dob":"1995-07-04T00:00:00Z", "id":"1","city":"san francisco"}
]
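If you prefer to post from a script rather than curl, the same request can be sketched in Python with only the standard library. The helper name `build_update_request` is my own, and the host, core name and chain name are simply the ones used in this post. The sketch only builds the request; actually sending it needs a running Solr.

```python
import json
import urllib.parse
import urllib.request

def build_update_request(docs, base_url="http://localhost:8983/solr/people", chain="dedupe"):
    """Build (but do not send) a JSON update request that selects the given update chain."""
    params = urllib.parse.urlencode({"update.chain": chain, "commit": "true"})
    return urllib.request.Request(
        url=f"{base_url}/update/json?{params}",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

people = [{"name": "John Doe", "ssn4": 9999, "dob": "1995-07-04T00:00:00Z",
           "id": "1", "city": "san francisco"}]
req = build_update_request(people)
print(req.full_url)
# To send it against a running Solr: urllib.request.urlopen(req)
```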

Check the indexed doc in Solr and you will see this. Note the generated signature field signature1_s with the value 286004b0d7fd7de4.

{
        "name":"John Doe",
        "ssn4":9999,
        "dob":"1995-07-04T00:00:00Z",
        "id":"1",
        "city":"san francisco",
        "signature1_s":"286004b0d7fd7de4",
        "_version_":1703663998959878144
}

Now let’s index a slightly different people.json by changing the value of id (if we don’t change id, Solr will overwrite the document anyway) and city (assuming John Doe moved to New York). We will keep the name/ssn4/dob fields intact.

New people.json

[
{"name":"John Doe","ssn4":9999,"dob":"1995-07-04T00:00:00Z", "id":"2","city":"new york"}
]

Check the indexed doc in Solr. We only see the new document, which overwrote the previous one because the signature is the same. Note that signature1_s has the same value as in the previous document, but the non-signature fields have changed.

{
        "name":"John Doe",
        "ssn4":9999,
        "dob":"1995-07-04T00:00:00Z",
        "id":"2",
        "city":"new york",
        "signature1_s":"286004b0d7fd7de4",
        "_version_":1703664476960587776
}

Now let’s turn off deduplication by setting overwriteDupes to false. Don’t forget to reload the core for this change to take effect.

<bool name="overwriteDupes">false</bool>

Try the same experiment all over again – this time you will see both documents. Even though the signature is the same, Solr keeps the two documents.

{
        "name":"John Doe",
        "ssn4":9999,
        "dob":"1995-07-04T00:00:00Z",
        "id":"1",
        "city":"san francisco",
        "signature1_s":"286004b0d7fd7de4",
        "_version_":1703664823793876992
},
{
        "name":"John Doe",
        "ssn4":9999,
        "dob":"1995-07-04T00:00:00Z",
        "id":"2",
        "city":"new york",
        "signature1_s":"286004b0d7fd7de4",
        "_version_":1703664839278198784
}

Here is an interesting case to check. So far, we have looked at dedupe (or not) when a new document is indexed into Solr on top of an existing one. What about documents that are indexed in the same commit? Consider an input people.json like this, where both documents are in the same commit:

[
{"name":"John Doe","ssn4":9999,"dob":"1995-07-04T00:00:00Z", "id":"1","city":"san francisco"},
{"name":"John Doe","ssn4":9999,"dob":"1995-07-04T00:00:00Z", "id":"2","city":"new york"}
]

And the result is... exactly the same! Solr doesn’t care whether the duplicates arrive in one or multiple transactions. It behaves just as the update processor chain is configured. This is consistent and nice.
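The behavior observed in these experiments can be summarized with a toy in-memory model in Python. This is only a sketch of the outcomes described in this post, not of Solr internals: the `index_docs` function, the list-based "index" and the `_sig` key are all hypothetical, and a plain tuple of field values stands in for the hashed signature.

```python
def index_docs(index, docs, overwrite_dupes, sig_fields=("name", "dob", "ssn4")):
    """Toy model: `index` is a list of stored docs; returns the same list after indexing."""
    for doc in docs:
        sig = tuple(doc.get(f) for f in sig_fields)
        # A document with the same id is always replaced, dedupe or not.
        index[:] = [d for d in index if d["id"] != doc["id"]]
        if overwrite_dupes:
            # With overwriteDupes on, docs with the same signature are replaced too.
            index[:] = [d for d in index if d["_sig"] != sig]
        index.append({**doc, "_sig": sig})
    return index

john_sf = {"name": "John Doe", "ssn4": 9999, "dob": "1995-07-04T00:00:00Z",
           "id": "1", "city": "san francisco"}
john_ny = {**john_sf, "id": "2", "city": "new york"}

separate = []
index_docs(separate, [john_sf], overwrite_dupes=True)   # first request
index_docs(separate, [john_ny], overwrite_dupes=True)   # second request
print([d["id"] for d in separate])  # ['2']

batch = index_docs([], [john_sf, john_ny], overwrite_dupes=True)   # one request
print([d["id"] for d in batch])     # ['2']

kept = index_docs([], [john_sf, john_ny], overwrite_dupes=False)
print([d["id"] for d in kept])      # ['1', '2']
```

Whether the two documents arrive separately or in one batch, the last one with a given signature wins when overwriting is on, and both survive when it is off.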

This technique should conveniently help you dedupe documents based on your own criteria of uniqueness. Personally, I was interested in seeing whether there is any way to NOT index a document if a duplicate already exists in the index. There are some use cases for that. For example, avoiding re-indexing a duplicate document could save resources in Solr, and avoid unnecessary re-merging too. But it seems the default behavior of dedupe inside RunUpdateProcessorFactory is to overwrite instead of skipping. I think we can use a custom implementation to change this behavior, i.e. skip indexing if the calculated value of the signature field already exists in Solr.
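The "skip instead of overwrite" policy itself is easy to state in code. The Python sketch below shows only the policy on a toy in-memory index; it is not a Solr update processor, and the function name `index_skipping_dupes`, the dict-based index and the tuple used as a signature are all assumptions made for illustration. In Solr, the equivalent check would presumably live in a custom processor that runs before the document is actually written.

```python
def index_skipping_dupes(index, docs, sig_fields=("name", "dob", "ssn4")):
    """Toy model: `index` maps signature -> stored doc. Returns the ids that were skipped."""
    skipped = []
    for doc in docs:
        sig = tuple(doc.get(f) for f in sig_fields)
        if sig in index:
            # A doc with this signature already exists: leave the index untouched.
            skipped.append(doc["id"])
            continue
        index[sig] = doc
    return skipped

john_sf = {"name": "John Doe", "ssn4": 9999, "dob": "1995-07-04T00:00:00Z",
           "id": "1", "city": "san francisco"}
john_ny = {**john_sf, "id": "2", "city": "new york"}

index = {}
skipped = index_skipping_dupes(index, [john_sf, john_ny])
print(skipped)                                # ['2']
print([d["city"] for d in index.values()])    # ['san francisco']
```

Note the trade-off: with this policy the first document wins, so John Doe's move to New York would never make it into the index.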

For more info, see https://solr.apache.org/guide/8_4/de-duplication.html

Cheers!

~T
