In 2021, our crew gained third place within the second monitor of the iDASH workshop problem on healthcare knowledge privateness. Our answer labeled 2000 viruses in lower than 1 second with greater than 99% accuracy through the use of the IBM homomorphic encryption HElayers library.
On this weblog, we describe the iDASH competitors, our answer, and what makes it so efficient. As a motivating situation, consider a hospital that, after a lot analysis, has collected numerous virus DNA sequences which are labeled as one among 4 potential strains. The hospital desires to supply native clinics with a service that classifies the DNA sequences taken from their sufferers. Nevertheless, the hospital doesn’t wish to disclose the classification algorithm to the clinics for apparent enterprise causes.
A easy answer would encompass a consumer/server system by which the native clinics function the consumer, and the hospital is the server. In such an answer, the consumer would ship the DNA sequence to the server. The server would classify the sequence and ship the label again to the consumer. The issue is that each the consumer and the server on this relationship wish to keep away from disclosing affected person info, together with the DNA sequences of the viruses that sufferers have contracted as a result of doing so would require them to adjust to in depth and exhausting rules.
Particularly, we would like the server to have the ability to classify a virus with out realizing what its DNA sequence is. Till not too long ago, this appeared inconceivable. At the moment, this may be finished through the use of homomorphic encryption (HE) expertise. This encryption expertise is the main focus of the iDASH competitors.
What’s the iDASH competitors?
iDASH is an annual workshop on knowledge privateness in healthcare. As a part of the workshop, the organizers arrange a three-tiered safe genome evaluation competitors that’s open for competing groups from world wide. Annually the problem is totally different, and groups have three months to plan and implement an answer for the problem.
What was the iDASH problem this 12 months?
This 12 months, the second monitor of the competitors challenged contributors to categorise virus DNA sequences as one among 4 strains. Every competing crew acquired a coaching set consisting of 8,000 DNA sequences with their pressure labels (2,000 from every pressure). The DNA sequences had been supplied as a FASTA file (roughly 30,000 DNA letters for every virus). Every crew educated their very own mannequin utilizing the coaching set.
The options had been evaluated utilizing a brand new set of two,000 DNA sequences to be labeled, which had been unknown forward of the competitors’s deadline. The DNA sequences had been encoded and encrypted utilizing the consumer’s key, with every crew offering software program that simulated the consumer. The DNA sequences had been then labeled (whereas nonetheless encrypted) by one other software program that every crew supplied to simulate the server. The output of this classification was additionally encrypted with the consumer’s keys, and, due to this fact, couldn’t be learn by the server. Ultimately, the encrypted classification was decrypted by the software program simulating the consumer to disclose the output.
The foundations of the competitors included the next objects.
Privateness-related
- The server should not be taught something in regards to the DNA sequences that it classifies.
- The consumer should not be taught something besides the label that’s labeled by the server (for instance, it can’t be given the mannequin). Particularly, this additionally implies that the consumer should not carry out any sort of characteristic choice as a result of the chosen options reveal one thing in regards to the server’s mannequin.
Efficiency-related
- The consumer can use just one CPU with 1 GB RAM.
- The server can use 4 CPUs with 4 GB RAM.
What’s Homomorphic encryption?
Homomorphic encryption (HE) is a brand new encryption expertise that permits encrypted numbers to be added collectively or multiplied. With these two primitive operations, any algorithm will be computed on encrypted numbers, together with the classification of an encrypted DNA sequence.
HE is understood for being notoriously gradual and reminiscence consuming. That is primarily as a result of it solely helps addition and multiplication operations, which can be utilized to guage any polynomial over encrypted ciphertexts. This computation mannequin is totally different from conventional algorithms that may take a look at the values of variables and make choices primarily based on their values. Theoretically, each algorithm will be expressed as a polynomial. For instance, places the place the algorithm takes a department in its move primarily based on a situation C is changed by computing each branches and multiplying the output of 1 department by C and the output of the opposite by (1-C), after which calculating their sum. Right here, we assume that C is both 0 or 1, by which case one department is multiplied by 1 (that’s, taken as is), and the opposite department is multiplied by 0 (that’s, is nullified). The worth of C determines which department can be taken. This concept extends to instances the place C takes on different values as properly. However, the scale of the polynomial may develop exponentially with the variety of branches. Thus, crucial factor is to revamp the algorithm in order that it has as few branches as potential. An analogous effectivity downside arises when we have to evaluate values.
To enhance the working time, HE schemes use the underlying arithmetic to achieve a technical benefit that permits packing many values (often a number of thousand) into one ciphertext. A ciphertext then holds not only one message, however an array of messages, the place additions or multiplications are executed in a SIMD (Single Instruction A number of Information) method. This considerably boosts the working time. Nevertheless, packing introduces its personal problem (extra on this later). That mentioned, this enchancment doesn’t tackle the inherent downside of not with the ability to department, as mentioned within the earlier paragraph.
What does our mannequin do?
In a nutshell, in our method we encoded DNA sequences as k-mer units. We used the coaching set to discover a consultant for every pressure. For the classification, we computed a similarity rating with every consultant set, after which selected the one with the best rating.
We used the notion of k-mers, that are sequences of ok letters that seem consecutively in a DNA sequence. For instance, for ok=7 there are 47 = 8192 potential 7-mers as a result of every DNA letter has one among 4 choices. Given a DNA sequence, we encoded it because the set of all k-mers showing in its DNA sequence. A consultant set for a pressure is a set of k-mers which have excessive correlation with the coaching DNA sequences belonging to the pressure. Computing these representatives was finished offline as a part of the coaching course of.
To categorise a DNA sequence, the consumer computes its k-mer set, encrypts this set, and sends it to the server. The server computes similarity scores that signify how a lot the given DNA sequence is just like every of the pressure representatives. We used an ordinary similarity rating that measures similarity between units: 1 that means that it’s the similar set, and 0 that means that they don’t have anything in widespread. As acknowledged beforehand, as a result of the DNA sequence that must be labeled is encrypted, the ensuing similarity scores are additionally encrypted. As a technical final step, we normalized the similarity scores by dividing every of them by their common.
The encrypted normalized similarity scores had been then returned to the consumer, which decrypted them as outlined within the objectives of the competitors.
Challenges and options
Throughout our work on the iDASH competitors, we encountered a number of difficult conditions that we addressed utilizing quite a lot of approaches.
Making our mannequin HE-friendly
As talked about beforehand, lowering the variety of branches that an algorithm makes in HE is essential. In our case, the naïve algorithm branches when it computes the intersection of two k-mer units (as a part of the computation of the similarity rating). To keep away from all of those branches, we encoded (earlier than encryption) every k-mer set as a 4k-long binary vector. We created a world public map that assigned an index i to every 6-mer. The i-th coordinate within the vector was set to 1 or 0 relying on whether or not the i-th k-mer seems within the set. This encoding is HE-friendly.
Nontrivial encoding
To reap the benefits of the SIMD characteristic of HE, we used a nontrivial encoding by which a single ciphertext contained 4 k-mers of every of the 2000 DNA sequences to be labeled (the standard encoding would pack all k-mers of a DNA sequence in a single ciphertext). This nontrivial encoding helped us scale back the working time and the reminiscence requirement. The encoding was finished utilizing the IBM HElayers library, which simply helps such (and comparable nontrivial) encodings over HE schemes.
RAM
The competitors’s reminiscence limitation was so tight that it didn’t enable us to concurrently preserve all the encrypted DNA sequences in reminiscence. To satisfy these limitations, we selected a “pipeline” structure by which a ciphertext of the enter was learn, processed, and discarded earlier than studying the subsequent ciphertext of the enter.
No division
As talked about beforehand, HE schemes help solely additions and multiplications. They don’t help divisions (or computing the inverse 1/x). To compute and normalize the similarity scores, our algorithm wanted to compute two inverses. We used an ordinary low-degree polynomial approximation to the inverse operate.
What had been our outcomes?
The outcomes of our system had been:
- Consumer encryption: 10 seconds utilizing 280 MB of RAM and 1 CPU
- Classification on the server: 0.6 seconds utilizing 400 MB of RAM and 1 CPU
- Decryption on the consumer: 1 second
Be aware that though the competitors allowed the server to make use of 4 CPUs, our answer was so quick on 1 CPU that we didn’t have to parallelize it.
Conclusion
The world of HE is evolving and enhancing. Classification duties that took days to finish 10 years in the past and took hours then minutes to compute a number of years in the past, now take a second. It is a results of developments within the encryption schemes, within the algorithms, and in {hardware}. Our outcomes present that privateness preserving classification will be finished in a consumer/server mannequin in actual time.
We imagine HE will change into a typical methodology for offering privacy-preserving options. Our HElayers library makes HE accessible and simple to make use of by noncryptographers.
You possibly can obtain HElayers. Our code can be revealed quickly as a part of the demos that include HElayers.