The Most Unique Snowflake – Cloudera Blog


Okay, I admit, the title is a bit clickbaity, but it does hold some truth! I spent the holidays up in the mountains, and if you live in the northern hemisphere like me, you know that means I spent the holidays either celebrating or cursing the snow. When I was a kid, during this time of year we'd always do an art project making snowflakes. We'd bust out the scissors, glue, paper, string, and glitter, and go to work. At some point, the teacher would inevitably pull out the big guns and blow our minds with the fact that every snowflake in the entire world for all of time is different and unique (people just love to oversell unimpressive snowflake features).

Now that I'm a grown, mature adult who has everything figured out (pause for laughter), I've started to wonder about the uniqueness of snowflakes. We say they're all unique, but some must be more unique than others. Is there a way we could quantify the uniqueness of snowflakes and thus find the most unique snowflake?

Surely with modern ML technology, a task like this should be not only possible but, dare I say, trivial? It probably sounds like a novel idea to combine snowflakes with ML, but it's about time someone did. At Cloudera, we provide our customers with an extensive library of prebuilt data science projects (complete with out-of-the-box models and apps) called Applied ML Prototypes (AMPs) to help them move the starting point of their project closer to the finish line.

One of my favorite things about AMPs is that they're completely open source, meaning anyone can use any part of them to do whatever they want. Yes, they're complete ML solutions that are ready to deploy with a single click in Cloudera Machine Learning (CML), but they can also be repurposed for use in other projects. AMPs are developed by ML research engineers at Cloudera's Fast Forward Labs, and as a result they're a great source of ML best practices and code snippets. They're yet another tool in the data scientist's toolbox that can make life easier and help deliver projects faster.

Launch the AMP

In this blog we'll dig into how the Deep Learning for Image Analysis AMP can be reused to find snowflakes that are less similar to one another. If you are a Cloudera customer and have access to CML or Cloudera Data Science Workbench (CDSW), you can start by deploying the Deep Learning for Image Analysis AMP from the "AMPs" tab.

If you don't have access to CDSW or CML, the AMP GitHub repo has a README with instructions for getting up and running in any environment.

Data Acquisition

Once you have the AMP up and running, we can get started from there. For the most part, we will be able to reuse parts of the existing code. However, because we're only interested in comparing snowflakes, we need to bring our own dataset consisting solely of snowflakes, and lots of them.

It turns out that there aren't very many publicly available datasets of snowflake images. This wasn't a huge surprise, as taking pictures of individual snowflakes is a manually intensive process with a relatively minimal return. However, I did find one good dataset from Eastern Indiana University that we will use in this tutorial.

You could go through and download each image from the website individually or use some other tool, but I opted to put together a quick notebook to download and store the images in the project directory. You'll need to place it in the /notebooks subdirectory and run it. The code parses all of the image URLs from the linked web pages that contain pictures of snowflakes and downloads the images. It will create a new subdirectory called snowflakes in /notebooks/pictures and populate this new folder with the snowflake images.
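The downloader notebook itself is not reproduced in this post. As a rough sketch of the URL-parsing step it describes, the snippet below collects image URLs from a page using only the Python standard library. The class name ImageLinkParser, the function extract_image_urls, the sample HTML, and the example.org address are all hypothetical stand-ins, not the author's code or the real dataset site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag on a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_image_urls(html, base_url):
    """Return absolute URLs for all images referenced in the HTML text."""
    parser = ImageLinkParser()
    parser.feed(html)
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical page content and base URL, for illustration only.
sample_html = '<html><body><img src="flakes/a.jpg"><img src="/img/b.jpg"></body></html>'
urls = extract_image_urls(sample_html, "https://example.org/snow/index.html")
print(urls)
```

Each resulting URL could then be saved locally, for example with urllib.request.urlretrieve(url, local_path), where local_path points into the snowflakes folder.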

Like any good data scientist, we should take some time to explore the dataset. You'll notice that these images have a consistent format. They have very little color variation and a relatively constant background. A perfect playground for computer vision models.

Repurposing the AMP

Now that we have our data, and it looks well suited for image analysis, let's take a moment to restate our goal. We want to quantify the uniqueness of an individual snowflake. According to its description, Deep Learning for Image Analysis is an AMP that "demonstrates how to build a scalable semantic search solution on a dataset of images." Traditionally, semantic search is an NLP technique used to extract the contextual meaning of a search term, instead of just matching keywords. This AMP is unique in that it extends that concept to images instead of text, in order to find images that are similar to one another.

The goal of this AMP is largely focused on educating users on how deep learning and semantic search work. Within the AMP there is a notebook located in /notebooks titled Semantic Image Search Tutorial. It provides a practical implementation guide for two of the main techniques underlying the overall solution: feature extraction and semantic similarity search. This notebook will be the foundation for our snowflake analysis. Go ahead and open it and run the entire notebook (because it takes a little while), and then we'll take a look at what it contains.

The notebook is broken down into three main sections:

  1. A conceptual overview of semantic image search
  2. An explanation of extracting features with CNNs, with demonstration code
  3. An explanation of similarity search with Facebook's AI Similarity Search (FAISS), with demonstration code

Notebook Section 1

The first section contains background information on how the end-to-end process of semantic search works. There is no executable code in this section, so there is nothing for us to run or change, but if time permits and the topics are new to you, you should take the time to read it.

Notebook Section 2

Section 2 is where we will start to make our changes. In the first cell with executable code, we need to set the variable ICONIC_PATH equal to our new snowflake folder, so change

ICONIC_PATH = "../app/frontend/build/assets/semsearch/datasets/iconic200/"

to

ICONIC_PATH = "./pictures/snowflakes"

Now run this cell and the next one. You should see an image of a snowflake displayed where before there was an image of a car. The notebook will now use only our snowflake images to perform semantic search.

From here, we can actually run the rest of the cells in section 2 and leave the code as is up until section 3, Similarity Search with FAISS. If you have time, though, I would highly recommend reading the rest of the section to gain an understanding of what is happening. A pre-trained neural network is loaded, feature maps are saved at each layer of the neural network, and the feature maps are visualized for comparison.

Notebook Section 3

Section 3 is where we will make most of our changes. Usually with semantic search, you are trying to find things that are very similar to one another, but for our use case we are interested in the opposite: we want to find the snowflakes in this dataset that are the least similar to the others, aka the most unique.

The intro to this section in the notebook does a great job of explaining how FAISS works. In summary, FAISS is a library that allows us to store the feature vectors in a highly optimized database, and then query that database with other feature vectors to retrieve the vector (or vectors) that are most similar. If you want to dig deeper into FAISS, you should read this post from Facebook's engineering website.
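To make the "store vectors, then query for the most similar" idea concrete, here is a minimal pure-Python sketch of what an exhaustive L2 index does conceptually. This is not FAISS code: the function name flat_l2_search and the toy vectors are made up for illustration, and FAISS performs the same comparison in optimized native code. Like the flat L2 index used later in this post, the sketch reports squared L2 distances.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustively compare the query to every stored vector.

    Returns (distances, indices) for the k closest vectors, nearest first.
    Distances are squared L2 distances.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "feature vectors", made up for illustration.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
distances, indices = flat_l2_search(stored, stored[1], k=3)
print(distances, indices)
```

Note that when the query is itself one of the stored vectors, the first hit is always that same vector at distance zero, which is why the analysis below looks past the first result.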

One of the lessons that the original notebook focuses on is how the features output from the last convolutional layer are a much more abstract and generalized representation of what features the model deems important, especially when compared to the output of the first convolutional layer. In the spirit of KISS (keep it simple, stupid), we will apply this lesson to our analysis and focus only on the feature index of the last convolutional layer, b5c3, in order to find our most unique snowflake.

The code in the first 3 executable cells needs to be slightly altered. We still want to extract the features of each image and then create a FAISS index for the set of features, but we will only do this for the features from convolutional layer b5c3.

# Cell 1
def get_feature_maps(model, image_holder):
    # Convert to an array and preprocess to scale pixel values for VGG
    images = np.asarray(image_holder)
    images = preprocess_input(images)
    # Get feature maps
    feature_maps = model.predict(images)
    # Reshape to flatten each feature tensor into a feature vector
    feature_vector = feature_maps.reshape(feature_maps.shape[0], -1)
    return feature_vector

# Cell 2
all_b5c3_features = get_feature_maps(b5c3_model, iconic_imgs)

# Cell 3
import faiss
feature_dim = all_b5c3_features.shape[1]
b5c3_index = faiss.IndexFlatL2(feature_dim)
# Add the feature vectors to the index so they can be searched
b5c3_index.add(all_b5c3_features)



Here is where we will start deviating significantly from the source material. In the original notebook, the author created a function that allows users to select a specific image from each index; the function returns the most similar images from each index and displays those images. We are going to use parts of that code in order to achieve our new goal, finding the most unique snowflake, but for the purposes of this tutorial you can delete the rest of the cells and we'll go through what to add in their place.

First off, we will create a function that uses the index to retrieve the second most similar feature vector to the one that was selected (because the most similar would be the same image). There also happen to be a couple of duplicate images in the dataset, so if the second most similar feature vector is also an exact match, we'll use the third most similar.


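def get_most_similar(index, query_vec):
    # Search for the 2 nearest vectors; the nearest is the query image itself
    distances, indices = index.search(query_vec, 2)
    if distances[0][1] > 0:
        return distances[0][1], indices[0][1]
    else:
        # The second match is an exact duplicate, so fall back to the third
        distances, indices = index.search(query_vec, 3)
        return distances[0][2], indices[0][2]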
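The function above covers the case of a single duplicate. If an image happened to have more than one exact duplicate, the third nearest neighbor could still sit at distance zero. As a hedged variation (not part of the original notebook), the hypothetical helper below scans one row of search results and returns the first neighbor with a distance greater than zero. The name first_non_duplicate and the sample numbers are invented for illustration.

```python
def first_non_duplicate(distances, indices):
    """Return (distance, index) of the nearest neighbor with distance > 0.

    Both arguments are flat sequences sorted nearest first, i.e. the
    results for a single query. Returns None if every candidate is an
    exact match.
    """
    for dist, idx in zip(distances, indices):
        if dist > 0:
            return dist, idx
    return None


# Made-up results: the query itself, one duplicate, then two real neighbors.
result = first_non_duplicate([0.0, 0.0, 2.5, 7.0], [4, 9, 1, 3])
print(result)
```

With FAISS, this would mean requesting a somewhat larger k in a single search call and passing distances[0] and indices[0] to the helper.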


From there it's just a matter of iterating through each feature, searching for the most similar image that isn't the exact same image, and storing the results in a list:


distance_list = []
for x in range(b5c3_index.ntotal):
    dist, indic = get_most_similar(b5c3_index, all_b5c3_features[x:x+1])
    distance_list.append([x, dist, indic])

Now we will import pandas and convert the list to a dataframe. This gives us a dataframe for the b5c3 layer, containing a row for every feature vector in the original FAISS index, with the index of the feature vector, the index of the feature vector that is most similar to it, and the L2 distance between the two feature vectors. We are curious about the snowflakes that are most distant from their most similar snowflake, so we should end this cell by sorting the dataframe in descending order by the L2 distance.

import pandas as pd
df = pd.DataFrame(distance_list, columns=['index', 'L2', 'similar_index'])
df = df.sort_values('L2', ascending=False)

Let's take a look at the results by printing out the dataframe, as well as displaying the L2 values in a box-and-whisker plot.
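The printed dataframe and the plot are not reproduced here. As a rough, library-free sketch of what a box-and-whisker plot summarizes, the snippet below computes the quartiles and flags points beyond 1.5 times the interquartile range, a common convention for drawing outliers. The function name box_plot_summary and the toy distances are assumptions for illustration, not values from the snowflake dataset.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Compute quartiles and flag points outside the whisker range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up L2 distances, not results from the snowflake dataset.
toy_l2 = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 40.0]
summary = box_plot_summary(toy_l2)
print(summary)
```

In the notebook, the same summary could be computed from df['L2'].tolist(), and pandas can draw the plot itself with df.boxplot(column='L2').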



Awesome stuff. Not only did we find the indexes of the snowflakes that are the least similar to their most similar snowflake, but we also have a handful of outliers made evident in the box-and-whisker plot, one of which stands alone.

To finish things up, we should see what these super unique snowflakes actually look like, so let's display the top 3 most unique snowflakes in a column on the left, along with their most similar snowflake counterparts in the column on the right.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
i = 0
for row in df.head(3).itertuples():
    # column 1: the unique snowflake
    # (assumes iconic_imgs holds images that imshow can display)
    ax[i][0].imshow(iconic_imgs[row.index])
    ax[i][0].set_title('Unique Rank: %s' % (i+1), fontsize=12, loc='center')
    ax[i][0].text(0.5, -0.1, 'index = %s' % row.index, size=11, ha='center', transform=ax[i][0].transAxes)
    # column 2: its most similar snowflake
    ax[i][1].imshow(iconic_imgs[row.similar_index])
    ax[i][1].set_title('L2 Distance: %s' % (row.L2), fontsize=12, loc='center')
    ax[i][1].text(0.5, -0.1, 'index = %s' % row.similar_index, size=11, ha='center', transform=ax[i][1].transAxes)
    i += 1
fig.subplots_adjust(wspace=-.56, hspace=.5)


This is why ML techniques are so great. No one would ever look at that first snowflake and think, that's one super unique snowflake, but according to our analysis it is by far the most dissimilar to its next most similar snowflake.


Now, there are a multitude of tools that you could have used and ML methodologies that you could have leveraged to find a unique snowflake, including one of those overhyped ones. The nice thing about using Cloudera's Applied ML Prototypes is that we were able to leverage an existing, fully built, and functional solution, and alter it for our own purposes, resulting in a significantly faster time to insight than had we started from scratch. That, ladies and gentlemen, is what AMPs are all about!

For your convenience, I've made the final resulting notebook available on GitHub here. If you're interested in finishing projects faster (better question: who isn't?) you should also take the time to check out what code in the other AMPs could be used for your current projects. Just select the AMP you're interested in and you'll see a link to view the source code on GitHub. After all, who wouldn't be interested in, legally, starting a race closer to the finish line? Take a test drive to try AMPs for yourself.