Using AI and drones to combat elephant and rhino poaching


Neurala and the Lindbergh Foundation’s Air Shepherd program have announced a new collaboration that will use AI and drones to combat the poaching of elephants and rhinos in Africa. The project uses “intelligent” drones that can identify and differentiate objects of interest using Neurala’s technology, which can learn from any sensory stream in real time. “This is a terrific example of how AI technology can be a vital force for good,” said Max Versace, CEO at Neurala. The Lindbergh Foundation’s Air Shepherd aims to combat poaching of elephants and rhinos in southern…

This story continues at The Next Web

How Europe’s biggest economy is uniting its tech hubs to dethrone Silicon Valley


Silicon Valley is by far the most successful tech hub on the planet. However, European and Asian cities have started gaining on the startup mecca — so much so that you could even say Europe has become better than SV. But as we’ve pointed out, Europe isn’t one large monolithic block; each country and city has its own strengths and weaknesses when it comes to startups. If we truly want to understand European startup ecosystems better, we’ll need to take a more in-depth look at each country. That’s why TNW contacted Brigitte Zypries, Germany’s Minister for Economic Affairs and Energy, to get more…

This story continues at The Next Web

How to identify the gaps in your own leadership style


A leader’s strengths and weaknesses directly inform how they guide others to thrive at work. The good always comes with the bad, as every leader has flaws that correlate with the skills that have led them to succeed in their career. According to Lolly Daskal’s new book The Leadership Gap: What Gets Between You and Your Greatness, these shortcomings are called leadership gaps because they can either prevent you from reaching your full potential or be leveraged to help you rise to the occasion as a boss, supervisor, executive, or any other type of leader. As the CEO…

This story continues at The Next Web

How to Become a Power Elite Author on Envato: Insider Tips & Secrets

This is a guest article contributed by Henry Rise, a co-founder of ThemeRex.

The contemporary web has become highly competitive. The number of providers of ready-made WordPress themes is growing at a tremendous speed.

So surviving in the WordPress business is not easy, even for pros like the Power Elite Authors on ThemeForest. How do you keep your business alive and make it succeed? Let’s take a case study provided by Henry Rise, the owner of the ThemeRex Power Elite Author account.

The project was launched on ThemeForest 5 years ago. Building a brand presence on a marketplace doesn’t require any extra spending on brand promotion and advertising, which is what attracted the ThemeRex team and gave them the push to build their own brand. Since 2013, they have released more than 140 WordPress themes for different niches.

Instead of self-promotion, they had more time to focus on building pro-quality products and improving their support services. Now known as a reputable and reliable theme provider, the ThemeRex owner shares an extensive guide with rules on how to make your business succeed.

Step 1. Decide on Theme Exclusivity

Envato charges theme developers a commission on every product sold on the marketplace. All themes are divided into two large groups: exclusive and non-exclusive. Based on this and your activity on the marketplace, the fee can vary from 55% down to 12.5%. Clearly, beginner authors might have some difficulty entering the marketplace, and it takes time to establish high-level authority in the field. As the popularity of your themes grows and the number of sales increases, your earnings from every purchased template will also increase.
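To make that range concrete with a hypothetical example (the $59 list price is ours, not Envato’s): at the top 55% fee, an author keeps about $26.55 of a $59 sale, while an author paying the bottom 12.5% fee keeps about $51.63 from the same sale.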

It may be somewhat confusing for beginner theme developers to decide whether to be an exclusive or non-exclusive vendor. Envato gives higher priority to exclusive offers: when you decide to sell items exclusively, you cannot distribute them anywhere else except on the marketplace, but you earn more money from each sale.

ThemeRex has tried and tested a number of marketplaces and claims that Envato matters more than all the other marketplaces combined. Wrapping it up: if you are a beginner web developer and want wider exposure for your own brand, then exclusive themes are the best choice.

Step 2. Get Your Theme Approved

Getting a theme approved on Envato is one of the most difficult steps for beginner web developers. Envato is known to have a strict theme approval process. Special attention is paid to the quality of the code and the design of the template. It goes without saying that the design should be unique and feature a clear hierarchy, proper semantics, etc. When starting out with Envato, web developers can face dozens of rejections before their first project is approved. That is why a pro tip for beginners is to start off with something clear and simple. That way, you will get a better understanding of Envato’s requirements for theme code and design.

One more thing that may be difficult for beginner webmasters to get used to is the support system, which is not always prompt. That is understandable, since Envato hosts plenty of authors. So, if you are a new author searching for an answer to some question, be patient and prepare to wait at least a week.

As you start generating more sales, you climb the Envato Elite ladder, which also brings quicker response times from the customer support reps. The response time for Elite members is still up to 2-3 days, which can seem frustrating in the beginning, but this is their standard communication mode.

Step 3. Understand the pros and cons of Envato

Just like anything else in this world, cooperating with one of the world’s leading marketplaces for selling website templates has both positive and negative sides. One of the biggest advantages of selling themes on Envato is the exposure and traffic it brings, which a beginner webmaster will hardly ever achieve on a different platform. To attract the public’s attention to your products, you do not need to run any kind of promotional campaign. Just create a good product, upload it, and start selling.

One more impressive advantage is that by joining Envato you also join a huge community of web designers and developers like yourself. You can learn from them, share experiences, get inspired by creative ideas you have never seen before, etc. To be honest, Envato has turned into a large copying machine, where people come thirsty for new ideas, find them, and enhance them further. This has a reverse effect as well: the market has become flooded with authors searching for ways to deliver their products to a wider audience, so they slash the cost of their themes. The good news is that more quality website templates are available at a cheaper cost; however, this is hardly a benefit for other authors.

Selling themes on Envato, you attain more than just a steady flow of income: you get the opportunity to expose your products to the broadest relevant audience, meet real people, help them change something in their lives, and see the results of your work, which cannot be compared to anything else in the world.

If you are a new author on the marketplace planning to launch your first theme, Envato will provide the necessary help to promote your themes and make them more visible to the target audience.

However, joining the marketplace also has a flip side. As you become an author on Envato, you become part of their community, so it may be difficult to build your own brand name. On Envato, you are one of thousands of authors uploading new themes over and over again, and people will hardly distinguish your name from the others. In a word, if your goal is to build a brand around your name, then the marketplace is not your best option.

Envato lets you set the cost of your themes on your own. This is both good and bad for authors. The good is that you alone set the value of your products. The bad is that some authors opt for a strategy of selling their themes at the lowest cost, which potentially decreases your chances of reaching as many users as you wish. This can also decrease your sales, since people grow accustomed to buying cheaper products.

Additionally, Envato sets certain standards for its authors. While this may seem quite fair to online customers, it has had the opposite effect on the WordPress community. Although the “cold war” has never been that tough, theme providers have split down the middle on the issue. Today you can find authors who sell their themes on Envato, as well as those who distribute their products through WordPress.org as freemium models. This doesn’t mean that you will be hated for selling your themes on Envato, but there are still lots of hardcore WordPress fans claiming that WordPress themes from Envato are of moderate quality and only damage the reputation of fellow authors.

Step 4. Promote your themes

As we noted at the beginning of this guide, Envato gives your new themes a certain degree of promotion. Some time ago, the marketplace gave new themes a one-week boost. Now the number of authors has grown many times over, and the situation is somewhat different. So if you want your themes to gain better exposure, it’s better to think about some kind of additional promotion of your products.

There are thousands of theme vendors that people can choose from, so being just one of them won’t work, no matter how smart and professional your designs are. What you need to achieve is a closer connection with your customers while building a brand. Even if you decide to sell exclusive web designs on the marketplace, you still have the freedom to build your own site, run an official blog, and become an active social media user. You can also run a newsletter campaign so that your audience always stays aware of your latest news and fresh releases. If you are not limited by budget, think about promoting your themes via AdWords and banner ads.

Getting back to the ThemeRex case study, one of the first things they did was launch their own website. Next, they participated in community life. Their self-promotion efforts made them one of the Power Elite Authors on Envato, with more than 35,000 sales.

Pro Tips from ThemeRex for Envato

The ThemeRex brand took its first steps when there were already plenty of well-established players in the marketplace, so it was a really tough task to make a small project stand out in an industry over-saturated with theme developers. The chances for beginner webmasters to build strong, reputable brands around their names are low, but there is still a lot of room for small vendors to join the market and win the hearts of a client base. Everything depends on the strategy that you choose.

As a vendor, you can put all your time and effort into developing niche-specific or multipurpose solutions; you can produce one theme per month or release dozens of different products on a regular basis. The approach that will work for you is very individual, so unless you try several strategies you won’t find the one that works best. What ThemeRex did was focus on the development of micro-niche themes, an approach that still works pretty well today.

For beginner vendors, probably the most effective advice is to start your own online strategy with an in-depth analysis of every micro-niche. You can enhance existing solutions or develop new features for them. For example, you can build a theme on any of the most popular topics (like sports or business) and add a unique feature to it (like a universal calorie intake calculator). You can create a product for a specific micro-niche that will be ready to go live straight out of the box. To give the web audience exactly what they need, conduct in-depth market research first. That way, you will know for sure that you have chosen the right approach.

Moreover, think about the presentation of your product: its marketing strategy, feature overview, screenshots, etc. Unless you present your product in a way that assures the web audience it will bring them incomparable value, the effort you put into developing it will simply be wasted.

Final Thoughts

Although it’s not that easy to build a successful design shop on Envato today, it is still possible if you follow the instructions described in this guide. Creating a quality product alone is no longer enough. To reach success, you need to think about a strategy to promote it, develop a plan, and possibly find a micro-niche that will make your products stand out from the competition. Pay special attention to details. Think about developing a unique feature that will make your theme truly universal and ready to go live straight out of the box. Also, do not neglect building your own website and running an official blog. An active social media campaign will be a great benefit as well.

Whatever approach you opt for, remember that establishing a relationship with your customers is the way to success. We hope you found this guide useful and that it gave you a hint on how to make your brand popular on Envato. If you have ideas on how else to make your name more recognizable on the marketplace, you are welcome to share your thoughts in the comments section.


Tackling Tag Sprawl: Crawl Budget, Duplicate Content, and User-Generated Content

Posted by rjonesx.

Alright, so here's the situation. You have a million-product website. Your competitors have a lot of the same products. You need unique content. What do you do? The same thing everyone does — you turn to user-generated content. Problem solved, right?

User-generated content (UGC) can be an incredibly valuable source of content and organization, helping you build natural language descriptions and human-driven organization of site content. One common feature sites use to take advantage of user-created content is tags, found everywhere from e-commerce sites to blogs. Webmasters can leverage tags to power site search, create taxonomies and categories of products for browsing, and provide rich descriptions of site content.

This is a logical and practical approach, but can cause intractable SEO problems if left unchecked. For mega-sites, manually moderating millions of user-submitted tags can be cumbersome (if not wholly impossible). Leaving tags unchecked, though, can create massive problems with thin content, duplicate content, and general content sprawl. In our case study below, three technical SEOs from different companies joined forces to solve a massive tag sprawl problem. The project was led by Jacob Bohall, VP of Marketing at Hive Digital, while computational statistics services were provided by J.R. Oakes of Adapt Partners and Russ Jones of Moz. Let's dive in.

What is tag sprawl?

We define tag sprawl as the unchecked growth of unique, user-contributed tags resulting in a large number of near-duplicate pages and unnecessary crawl space. Tag sprawl generates URLs likely to be classified as doorway pages, pages appearing to exist only for the purpose of building an index across an exhaustive array of keywords. You’ve probably seen this in its most basic form in the tagging of posts across blogs, which is why most SEOs recommend a blanket “noindex, follow” across tag pages in WordPress sites. This simple approach can be an effective solution for small blog sites, but is not often the solution for major e-commerce sites that rely more heavily on tags for categorizing products.

The three following tag clouds represent a list of user-generated terms associated with different stock photos. Note: User behavior is generally to place as many tags as possible in an attempt to ensure maximum exposure for their products.

  1. USS Yorktown, Yorktown, cv, cvs-10, bonhomme richard, revolutionary war-ships, war-ships, naval ship, military ship, attack carriers, patriots point, landmarks, historic boats, essex class aircraft carrier, water, ocean
  2. ship, ships, Yorktown, war boats, Patriot pointe, old war ship, historic landmarks, aircraft carrier, war ship, naval ship, navy ship, see, ocean
  3. Yorktown ship, Warships and aircraft carriers, historic military vessels, the USS Yorktown aircraft carrier

As you can see, each user has generated valuable information for the photos, which we would want to use as a basis for creating indexable taxonomies for related stock images. However, at any type of scale, we have immediate threats of:

  • Thin content: Only a handful of products share the user-generated tag when a user creates a more specific/defining tag, e.g. "cvs-10"
  • Duplicate and similar content: Many of these tags will overlap, e.g. "USS Yorktown" vs. "Yorktown," "ship" vs. "ships," "cv" vs. "cvs-10," etc.
  • Bad content: Created by improper formatting, misspellings, verbose tags, hyphenation, and similar mistakes made by users.

Now that you understand what tag sprawl is and how it negatively affects your site, how can we address this issue at scale?

The proposed solution

In correcting tag sprawl, we have some seemingly basic problems to solve. We need to effectively review each tag in our database and place it in a group so further action can be taken. First, we determine the quality of a tag (how likely is someone to search for this tag, is it spelled correctly, is it commercial, is it used for many products), and second, we determine whether there is another, very similar tag that has higher quality.

  1. Identify good tags: We defined a good tag as a term capable of contributing meaning and easily justifiable as an indexed page in search results. This also entailed identifying a "master" tag to represent groups of similar terms.
  2. Identify bad tags: We wanted to isolate tags that should not appear in our database due to misspellings, duplicates, poor format, high ambiguity, or likely to cause a low-quality page.
  3. Relate bad tags to good tags: We assumed many of our initial "bad tags" could be a range of duplicates, i.e. plural/singular, technical/slang, hyphenated/non-hyphenated, conjugations, and other stems. There could also be two phrases which refer to the same thing, like "Yorktown ship" vs. "USS Yorktown." We need to identify these relationships for every "bad" tag.

For the project inspiring this post, our sample tag database comprised over 2,000,000 "unique" tags, making this a nearly impossible feat to accomplish manually. While theoretically we could have leveraged Mechanical Turk or a similar platform for "manual" review, early tests of this method proved unsuccessful. We would need a programmatic method (several methods, in fact) that we could later reproduce when adding new tags.

The methods

Keeping in mind the goal of identifying good tags, labeling bad tags, and relating bad tags to good tags, we employed more than a dozen methods, including: spell correction, bid value, tag search volume, unique visitors, tag count, Porter stemming, lemmatization, the Jaccard index, Jaro-Winkler distance, Keyword Planner grouping, Wikipedia disambiguation, and k-means clustering with word vectors. Each method helped us either determine whether the tag was valuable or, if it was not, identify an alternate tag that was.

Spell correction

  • Method: One of the obvious issues with user-generated content is the occurrence of misspellings. We would regularly find misspellings where a semicolon is typed in place of the letter “l” or words have unintended characters at the beginning or end. Luckily, Linux has an excellent built-in spell checker called Aspell, which we were able to use to fix a large volume of issues (sketched below).
  • Benefits: This offered a quick, early win in that it was fairly easy to identify bad tags when they were composed of words that weren’t included in the dictionary or included characters that were simply inexplicable (like a semicolon in the middle of a word). Moreover, if the corrected word or phrase occurred in the tag list, we could trust the corrected phrase as a potentially good tag, and relate the misspelled term to the good tag. Thus, this method helped us both filter bad tags (misspelled terms) and find good tags (the spell-corrected terms).
  • Limitations: The biggest limitation with this methodology was that combinations of correctly spelled words or phrases aren’t necessarily useful for users or the search engine. For example, many of the tags in the database were concatenations of multiple tags where the user space-delimited rather than comma-delimited their submitted tags. Thus, a tag might consist of correctly spelled terms but still be useless in terms of search value. Moreover, there were substantial dictionary limitations, especially with domain names, brand names, and Internet slang. To accommodate this, we added a personal dictionary that included a list of the top 10,000 domains according to Quantcast, several thousand brands, and a slang dictionary. While this was helpful, there were still several false recommendations that needed to be handled. For example, we saw "purfect" corrected to "perfect," despite the former being a pop-culture reference for cat images. We also noticed some users rendering this as "purrfect," "purrrfect," "purrrrfect," "purrfeck," etc. Ultimately, we had to rely on other metrics to determine whether we trusted the misspelling recommendations.
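For illustration, here is a minimal sketch of that workflow in Python using the pyenchant library (which can sit on top of Aspell-style dictionaries); the personal word list file is a hypothetical stand-in for the domain/brand/slang dictionary described above.

```python
import enchant  # pyenchant

# Base English dictionary plus a personal word list for domains, brands,
# and slang ("custom_words.txt" is a hypothetical file: one term per line).
d = enchant.DictWithPWL("en_US", "custom_words.txt")

def correct_tag(word):
    """Return (is_known, best_suggestion) for a single-word tag."""
    if d.check(word):
        return True, word
    suggestions = d.suggest(word)
    return False, suggestions[0] if suggestions else None

# Flagged and "corrected" unless "purfect" is in the personal word list,
# which is exactly the false-positive problem described above.
print(correct_tag("purfect"))
```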

Bid value

  • Method: While a tag might be good in the sense that it is descriptive, we wanted tags that were commercially relevant. Using the estimated cost-per-click of the tag or tag phrase proved useful in making sure that the term could attract buyers, not just visitors.
  • Benefits: One of the great features of this methodology is that it tends to have a high signal-to-noise ratio. Most tags that have high CPCs tend to be commercially relevant and searched frequently enough to warrant inclusion as "good tags." In many cases we could feel confident that a tag was good just on this metric alone.
  • Limitations: However, the bid value metric comes with some pretty big limitations, too. For starters, Google Keyword Planner’s disambiguation problem is readily apparent. Google combines related keywords together when reporting search volume and CPC data, which means a tag like “facbook” would return the same data as “facebook.” Obviously, we would prefer to map “facbook” to “facebook” rather than keep both tags, so in some cases the CPC metric wasn’t sufficient to identify good tags. A further limitation of the bid value was the difficulty of acquiring CPC data. Google now requires running active AdWords campaigns to get access to CPC values. It is no simple feat to look up 5,000,000 keywords in Google Keyword Planner, even if you have a sufficient account. Luckily, we felt comfortable that historical data would be trustworthy enough, so we didn't need to acquire fresh data.

Tag search volume

  • Method: Similar to CPC, we could use search volume to determine the potential value of a tag. We had to be careful not to rely on the tag itself, though, since the tag could be so generic that it earns traffic unrelated to the product itself. For example, the tag “USS Yorktown” might get a few hundred searches a month, but “USS Yorktown T-shirt” gets 0. For all of the tags in our index, we tracked down the search volume for the tag plus the product name, in order to make sure we had good estimates of potential product traffic.
  • Benefits: Like CPC, this metric did a very good job of consolidating our tag data set to just keywords that were likely to deliver traffic. In the vast majority of cases, if “tag + product” had search volume, we could feel confident that it is a good term.
  • Limitations: Unfortunately, this method fell victim to the same disambiguation problem that CPC presents. Because Google groups terms together, it is possible that on some occasions two tags will be given the same metrics. For example: “pontoons boat,” “pontoonboat,” “pontoon boats,” “pontoon boat,” “pontoon boating,” and “pontoons boats” were in the same traffic volume group which also included tags like “yacht” and “yachts.” Moreover, there is no accounting for keyword difficulty in this metric. Some tags, when combined with product types, produce keywords that receive substantial traffic but will always be out of reach for a templated tag page.

Unique visitors

  • Method: This method was a no-brainer: protect the tags that already receive traffic from Google. We exported all of the tags from Google Analytics that had received search traffic from Google in the last 12 months. Generally speaking, this should be a fairly safe list of terms.
  • Benefits: When doing experimental work with a client, it is always nice to be able to give them a scenario that almost guarantees improvement. Because we were able to protect tags that already receive traffic by labeling them as good (in the vast majority of cases), we could ensure that the client had a high probability of profiting from the changes we made and minimal risk of any traffic loss.
  • Limitations: Unfortunately, even this method wasn’t perfect. If a product (or set of products) with high enough authority included a poor variation of a tag, then the bad variant would rank and receive traffic. We had to use other strategies to verify our selections from this method and devise a method to encourage a tag swap in the index for the correct version of a term.

Tag count

  • Method: The frequency with which a tag was used on the site was often a strong signal that we could trust the tag, especially when compared with other similar tags. By counting the number of times each tag was used on the site, we could bias our final set of trusted tags in favor of these more popular terms.
  • Benefits: This was a great tie-breaker metric when we had two tags that were very similar but needed to choose just one. For example, sometimes two variants of a phrase were completely acceptable (such as a version with and without a hyphen). We could simply defer to the one with a higher tag count.
  • Limitations: The clear limitation of tag frequency is that many of the most frequent tags were too generic to be useful. The tag “blue” isn’t particularly useful when it just helps people find “blue t-shirts.” The term is too generic and too competitive to warrant inclusion. Additionally, the inclusion of too broad of a tag would simply create a very large crawl vs. traffic-potential ratio. A common tag will have hundreds if not thousands of matching products, creating many pages of products for the single tag. If a tag produces 50 paginated product listings, but only has the potential to drive 10 visitors a year, it might not be worth it.

Porter stemming

  • Method: Stemming is a method used to identify the root word from a tag by scanning the word right to left and using various pattern matching rules to remove characters (suffixes) until you arrive at the word’s stem. There are a couple of popular stemmers available, but we found Porter stemming to be more accurate as a tool for seeing alternative word forms. You can geek out by looking at the Porter stemming algorithm in Snowball here, or you can play with a JS version here.
  • Benefits: Plural and possessive terms can be grouped by their stem for further analysis. Running Porter stemming on the terms “pony” and “ponies” will return “poni” as the stem, which can then be used to group terms for further analysis. You can also run Porter stemming on phrases. For example, “boating accident,” “boat accidents,” “boating accidents,” etc. share the stem “boat accid.” This can be a crude and quick method for grouping variations. Porter stemming is also able to clean text more gently, where other stemmers can be too aggressive for our purposes; e.g., the Lancaster stemmer reduces "woman" to "wom," while the Porter stemmer leaves it as "woman."
  • Limitations: Stemming is intended for finding a common root for terms and phrases, and does not give any indication of the proper form of a term. The Porter stemming method applies a fixed set of rules to the English language, blanket-removing trailing “s,” “e,” “ance,” “ing,” and similar word endings to try and find the stem. For this to work well, you have to have all of the correct rules (and exceptions) in place to get the correct stems in all cases. This can be particularly problematic with words that end in S but are not plural, like "billiards" or "Brussels." Additionally, this method does not help with mapping related terms such as “boat crash,” “crashed boat,” “boat accident,” etc., which would stem to “boat crash,” “crash boat,” and “boat accid.” (Both behaviors are sketched below.)
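A quick sketch of these behaviors, assuming NLTK’s PorterStemmer and LancasterStemmer (the outputs in comments follow the examples above):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

print(porter.stem("pony"), porter.stem("ponies"))  # poni poni
print(porter.stem("woman"))     # woman  (Porter is gentler here)
print(lancaster.stem("woman"))  # wom    (Lancaster over-stems)

def stem_phrase(phrase):
    """Crudely group phrase variants by stemming each word."""
    return " ".join(porter.stem(w) for w in phrase.lower().split())

# Variants collapse to the same stem "boat accid"...
assert stem_phrase("boating accidents") == stem_phrase("boat accident")
# ...but word order is not resolved: "crash boat" != "boat crash".
print(stem_phrase("crashed boat"), "|", stem_phrase("boat crash"))
```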

Lemmatization

  • Method: Lemmatization works similarly to stemming. However, instead of using a rule set for editing words by removing letters to arrive at a stem, lemmatization attempts to map the term to its simplest dictionary form, using a lexical database such as WordNet, and return a canonical “lemma” of the word. A crude way to think about lemmatization is just simplifying a word. Here’s an API to check out.
  • Benefits: This method often works better than stemming. Terms like “ship,” “shipped,” and “ships” are all mapped to “ship” by this method, while “shipping” or “shipper,” which are terms that have distinct meanings despite the same stem, are retained. You can create an array of lemmas from a phrase and compare it to other phrases, resolving word order issues. This proved to be a more reliable method for grouping variations than stemming (see the sketch below).
  • Limitations: As with many of the methods, context for mapping related terms can be difficult. Lemmatization can provide better filters for context, but to do so it generally relies on identifying the word form (noun, adjective, etc.) to appropriately map to a root term. Given the inconsistency of the user-generated content, it is inaccurate to assume all words are in adjective form (describing a product) or noun form (the product itself). This inconsistency can produce wild results. For example, “strip socks” could be intended as a tag for socks with a strip of color on them, i.e. “striped socks,” or it could be “stripper socks” or some other leggings, a match only found if there were other products and tags to compare for context. Additionally, it doesn’t create associations between all related words, just textual derivatives, so you are still seeking out a canonical form between mailman, courier, shipper, etc.
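A minimal sketch with NLTK’s WordNet lemmatizer, including the part-of-speech sensitivity noted in the limitations (the pos argument defaults to noun):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # newer NLTK may also need "omw-1.4"
wnl = WordNetLemmatizer()

print(wnl.lemmatize("ships"))             # ship
print(wnl.lemmatize("shipped", pos="v"))  # ship (only when treated as a verb)
print(wnl.lemmatize("shipping"))          # shipping (distinct noun, retained)

def lemma_set(phrase, pos="n"):
    """Compare phrases order-insensitively via their lemma sets."""
    return {wnl.lemmatize(w, pos=pos) for w in phrase.lower().split()}

print(lemma_set("boat accidents") == lemma_set("accident boats"))  # True
```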

Jaccard index

  • Method: The Jaccard index is a similarity coefficient measured by Intersection over Union. Now, don’t run off just yet; it is actually quite straightforward.

    Imagine you had two piles with 3 marbles in each: Red, Green, and Blue in the first; Red, Green, and Yellow in the second. The “Intersection” of these two piles would be Red and Green, since both piles have those two colors. The “Union” would be Red, Green, Blue, and Yellow, since that is the complete list of all the colors. The Jaccard index would be 2 (Red and Green) divided by 4 (Red, Green, Blue, and Yellow). Thus, the Jaccard index of these two piles would be .5. The higher the Jaccard index, the more similar the two sets.
    So what does this have to do with tags? Well, imagine we have two tags: “ocean” and “sea.” We can get a list of all of the products that have the tag “ocean” and all that have the tag “sea.” Finally, we get the Jaccard index of those two sets. The higher the score, the more related they are. Perhaps we find that 70% of the products with the tag “ocean” also have the tag “sea”; we now know that the two are fairly well-related. However, when we run the same measurement to compare “basement” and “casement,” we find that they only have a Jaccard index of .02. Even though they are very similar in terms of characters, they mean quite different things. We can rule out mapping the two terms together.
  • Benefits: The greatest benefit of using the Jaccard index is that it allows us to find highly related tags which may have absolutely no textual characteristics in common, and are more likely to have an overly similar or duplicate results set. While most of the metrics we have considered so far help us find “good” or “bad” tags, the Jaccard index helps us find “related” tags without having to do any complex machine learning.
  • Limitations: While certainly useful, the Jaccard index methodology has its own problems. The biggest issue we ran into had to do with tags that were used together nearly all the time but weren’t substitutes of one another. For example, consider the tags “babe ruth” and his nickname, “sultan of swat.” The latter tag only occurred on products which also had the “babe ruth” tag (since this was one of his nicknames), so they had quite a high Jaccard index. However, Google doesn't map these two terms together in search, so we would prefer to keep the nickname and not simply redirect it to "babe ruth." We needed to dig deeper to determine when we should keep both tags and when we should redirect one to the other. As a standalone method, it also could not identify cases where a user consistently misspelled tags or used incorrect syntax, since their products would essentially be orphans, sharing no “intersection” with correctly tagged products.
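The computation itself is a few lines; here is a sketch with invented product IDs:

```python
def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical mapping of tags to the product IDs that carry them.
tag_products = {
    "ocean": {101, 102, 103, 104, 105, 106, 107},
    "sea":   {101, 102, 103, 104, 105, 208, 209, 210},
}

# 5 shared products out of 10 total -> 0.5: fairly well-related tags.
print(jaccard(tag_products["ocean"], tag_products["sea"]))
```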

Jaro-Winkler distance

  • Method: There are several edit distance and string similarity metrics that we used throughout this process. Edit Distance is simply some measurement of how difficult it is to change one word to another. For example, the most basic edit distance metric, Levenshtein distance, between "Russ Jones" and "Russell Jones" is 3 (you have to add "E","L", and "L" to transform Russ to Russell). This can be used to help us find similar words and phrases. In our case, we used a particular edit distance measure called "Jaro-Winkler distance" which gives higher precedence to words and phrases that are similar at the beginning. For example, "Baseball" would be closer to "Baseballer" than to “Basketball” because the differences are at the very end of the term.
  • Benefits: Edit distance metrics helped us find many very similar variants of tags, especially when the variants were not necessarily misspellings. This was particularly valuable when used in conjunction with the Jaccard index metrics, because we could apply a character-level metric on top of a character-agnostic metric (i.e. one that cares about the letters in the tag and one that doesn’t).
  • Limitations: Edit distance metrics can be kind of stupid. According to Jaro-Winkler distance, "Baseball” and "Basketball" are far more related to one another than "Baseball" and "Pitcher" or "Catcher." "Round" and "Circle" have a horrible edit distance metric, while "Round" and "Pound" look very similar. Edit distance simply cannot be used in isolation to find similar tags.
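A sketch of both metrics, assuming the jellyfish library (where Jaro-Winkler is exposed as jaro_winkler_similarity in current releases, jaro_winkler in older ones):

```python
import jellyfish

# Levenshtein distance: 3 edits to turn "Russ" into "Russell".
print(jellyfish.levenshtein_distance("Russ Jones", "Russell Jones"))  # 3

# Jaro-Winkler favors shared prefixes, so "baseballer" scores closer
# to "baseball" than "basketball" does...
print(jellyfish.jaro_winkler_similarity("baseball", "baseballer"))
print(jellyfish.jaro_winkler_similarity("baseball", "basketball"))

# ...but it is blind to meaning: "round" vs. "pound" looks very similar.
print(jellyfish.jaro_winkler_similarity("round", "pound"))
```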

Keyword Planner grouping

  • Method: While Google's choice to combine similar keywords in Keyword Planner has been problematic for predicting traffic, it has actually offered us a new method to identify highly related terms. Whenever two tags share identical metrics from Google Keyword Planner (average monthly traffic, historical traffic, CPC, and competition), we can conclude that there is an increased chance the two are related to one another.
  • Benefits: This method is extremely useful for acronyms (which are particularly difficult to detect). While Google groups together COO and Chief Operating Officer, you can imagine that standard methods like those outlined above might have problems detecting the relationship.
  • Limitations: The greatest drawback for this methodology was that it created numerous false positives among less popular terms. There are just too many keywords which have an annual search volume average of 10, are searched 10 times monthly, and have a CPC and competition of 0. Thus, we had to limit the use of this methodology to more popular terms where there were only a handful of matches.
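The detection itself is a simple group-by on the metric tuple; a sketch with hypothetical numbers:

```python
from collections import defaultdict

# Hypothetical (avg_monthly, historical, cpc, competition) per tag.
tag_metrics = {
    "coo": (12100, 12100, 1.20, 0.33),
    "chief operating officer": (12100, 12100, 1.20, 0.33),
    "pontoon boat": (40500, 40500, 0.85, 0.90),
    "obscure tag a": (10, 10, 0.0, 0.0),
    "obscure tag b": (10, 10, 0.0, 0.0),  # likely a false positive
}

groups = defaultdict(list)
for tag, metrics in tag_metrics.items():
    groups[metrics].append(tag)

# Restrict to popular terms to limit false positives, as noted above.
related = [tags for m, tags in groups.items() if len(tags) > 1 and m[0] >= 1000]
print(related)  # [['coo', 'chief operating officer']]
```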

Wikipedia disambiguation

  • Method: Many of the methods above are great for grouping similar/related terms, but do not provide a high-confidence method for determining the "master" term or phrase to represent a grouping of related/duplicate terms. While considerations can be made for testing all tags against an English language model, the lack of pop culture references and phrases makes it unreliable. To do this effectively, we found Wikipedia to be a trusted source for identifying the proper spelling, tense, formatting, and word order for any given tag. For example, if users tagged a product as "Lord of the Rings," "LOTR," and "The Lord of the Rings," it can be difficult to determine which tag should be preferred (certainly we don't need all 3). If you search Wikipedia for these terms, you will see that they redirect you to the page titled "The Lord of the Rings." In many cases, we can trust their canonical variant as the "good tag." Please note that we don’t encourage scraping any website or violating their terms of use. Wikipedia does offer an export of their entire database that can be used for research purposes.
  • Benefits: When a tag could be mapped to a Wikipedia entry, this method proved highly effective at providing validation that a tag had potential value, or creating a point of reference for related tags. If the Wikipedia community felt a tag or tag phrase was important enough to have an article dedicated to it, then the tag was more likely to be a valuable term vs. a random entry or keyword stuffing by the user. Further, the methodology allows for grouping related terms without any bias on word order. Doing a search on Wikipedia creates a search results page (“pontoon boats”), or redirects you to a correction of the article (“disneyworld” becomes “Walt Disney World”). Wikipedia also tends to have entries for some pop culture references, so things that would get flagged as a misspelling, such as "lolcats," can be vindicated by the existence of a matching Wikipedia article.
  • Limitations: While Wikipedia is effective at delivering a consistent formal tag for disambiguation, it can at times be more sterile than user-friendly. This can run counter to other signals such as CPC or traffic volume methods. For example, "pontoon boats" becomes "Pontoon (Boat)", or “Lily” becomes "lilium." All signals indicate the former case as the most popular, but Wikipedia disambiguation suggests the latter to be the correct usage. Wikipedia also contains entries for very broad terms, like each number, year, letter, etc. so simply applying a rule that any Wikipedia article is an allowed tag would continue to contribute to tag sprawl problems.
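For illustration, a minimal sketch that resolves a tag to Wikipedia’s canonical title through the official MediaWiki query API (a documented programmatic interface, not scraping; for two million tags you would work against the database export instead):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def canonical_title(term):
    """Follow Wikipedia redirects and return the canonical title, or None."""
    params = {"action": "query", "titles": term,
              "redirects": 1, "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return None if "missing" in page else page["title"]

print(canonical_title("LOTR"))  # "The Lord of the Rings"
```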

K-means clustering with word vectors

  • Method: Finally, we attempted to transform the tags into a subset of more meaningful tags using word embeddings and k-means clustering. Generally, the process involved transforming the tags into tokens (individual words), then refining by part-of-speech (noun, verb, adjective), and finally lemmatizing the tokens ("blue shirts" becomes "blue shirt"). From there, we represented each tag in a custom Word2Vec embedding model by adding the vectors of each resulting token array. We created a label array and a vector array of each tag in the dataset, then ran k-means with 10 percent of the total count of the tags as the number of centroids. At first we tested on 30,000 tags and obtained reasonable results.
    Once k-means had completed, we pulled all of the centroids and obtained their nearest relative from the custom Word2Vec model, then we assigned the tags to their centroid category in the main dataset.

    Tag Tokens                 | Tag POS                                     | Tag Lemmas                 | Categorization
    ['beach', 'photographs']   | [('beach', 'NN'), ('photographs', 'NN')]    | ['beach', 'photograph']    | beach photo
    ['seaside', 'photographs'] | [('seaside', 'NN'), ('photographs', 'NN')]  | ['seaside', 'photograph']  | beach photo
    ['coastal', 'photographs'] | [('coastal', 'JJ'), ('photographs', 'NN')]  | ['coastal', 'photograph']  | beach photo
    ['seaside', 'photographs'] | [('seaside', 'NN'), ('photographs', 'NN')]  | ['seaside', 'photograph']  | beach photo
    ['seaside', 'posters']     | [('seaside', 'NN'), ('posters', 'NNS')]     | ['seaside', 'poster']      | beach photo
    ['coast', 'photographs']   | [('coast', 'NN'), ('photographs', 'NN')]    | ['coast', 'photograph']    | beach photo
    ['beach', 'photos']        | [('beach', 'NN'), ('photos', 'NNS')]        | ['beach', 'photo']         | beach photo
    The Categorization column above was the centroid selected by k-means. Notice how it handled the matching of "seaside" to "beach" and "coastal" to "beach."
  • Benefits: This method seemed to do a good job of finding associations between the tags and their categories that were more semantic than character-driven. "Blue shirt" might be matched to "clothing." This was obviously not possible without the semantic relationships found within the vector space.
  • Limitations: Ultimately, the chief limitation that we encountered was trying to run k-means on the full two million tags while ending up with 200,000 categories (centroids). Sklearn for Python allows for multiple concurrent jobs, but only across the initialization of the centroids, which in this case was 11. This means that even if you ran on a 60-core processor, the number of concurrent jobs was limited by the number of initializations, which again was 11. We tried PCA (principal component analysis) to reduce the vector sizes (300 to 10), but the results were overall poor. Also, because embeddings are generally built on the probabilistic closeness of terms in the corpus on which they were trained, there were matches where you could understand why they matched but which were obviously not the correct category (e.g., "19th century art" was picked as a category for "18th century art"). Finally, context matters, and word embeddings obviously suffer from understanding the difference between "duck" (the animal) and "duck" (the action). (A toy version of the pipeline follows.)
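A toy sketch of the pipeline, assuming gensim 4.x (where the Word2Vec parameter is vector_size rather than the older size) and scikit-learn; the six-tag corpus is invented, and real embeddings need vastly more text to produce semantic clusters:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Invented, already-tokenized-and-lemmatized tags.
tag_tokens = [
    ["beach", "photograph"], ["seaside", "photograph"],
    ["coastal", "photograph"], ["blue", "shirt"],
    ["navy", "shirt"], ["cotton", "shirt"],
]

# Train a small custom embedding model on the tag corpus.
w2v = Word2Vec(tag_tokens, vector_size=25, min_count=1, epochs=200, seed=7)

def tag_vector(tokens):
    """Represent a tag by summing its token vectors."""
    return np.sum([w2v.wv[t] for t in tokens], axis=0)

X = np.array([tag_vector(t) for t in tag_tokens])

# The project used centroids = 10% of tag count; 2 suffices here.
km = KMeans(n_clusters=2, n_init=10, random_state=7).fit(X)
for tokens, label in zip(tag_tokens, km.labels_):
    print(label, " ".join(tokens))

# Name each category by the vocabulary token nearest its centroid.
for c in km.cluster_centers_:
    print(w2v.wv.similar_by_vector(c, topn=1))
```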

Bringing it all together

Using a combination of the methods above, we were able to develop a series of methodology confidence scores that could be applied to any tag in our dataset, generating a heuristic for how to treat each tag going forward. These case-level strategies determined the appropriate action for each tag (a toy classifier combining these signals is sketched after the list). We denoted them as follows:

  • Good Tags: This mostly started as our “do not touch” list of terms which already received traffic from Google. After some confirmation exercises, the list was expanded to include unique terms with ranking potential, commercial appeal, and unique product sets to deliver to customers. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry and
    2. Tag + product has estimated search traffic and
    3. Tag has CPC value then
    4. Mark as "Good Tag"
  • Okay Tags: This represents terms that we would like to retain in association with products and their descriptions, as they could be used within the site to add context to a page, but do not warrant their own indexable space. These tags were mapped to be redirected or canonicalized to a “master,” but still included on a page for topical relevancy, natural language queries, long-tail searches, etc. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry but
    2. Tag + product has no search volume
    3. Vector tag matches a "Good Tag"
    4. Mark as "Okay Tag" and redirect to "Good Tag"
  • Bad Tags to Remap: This grouping represents bad tags that were mapped to a replacement. These tags would literally be deleted and replaced with a corrected version. These were most often misspellings or terms discovered through stemming/lemmatization/etc. where a dominant replacement was identified. For example, a heuristic for this category might look like this:
    1. If tag is not identical to either Wikipedia or vector space and
    2. Tag + product has no search volume
    3. Tag has no volume
    4. Tag Wikipedia entry matches a "Good Tag"
    5. Mark as "Bad Tag to Remap"
  • Bad Tags to Remove: These are tags that were flagged as bad tags that could not be related to a good tag. Essentially, these needed to be removed from our database completely. This final group represented the worst of the worst in the sense that the existence of the tag would likely be considered a negative indicator of site quality. Considerations were made for character length of tags, lack of Wikipedia entries, inability to map to word vectors, no previous traffic, no predicted traffic or CPC value, etc. In many cases, these were nonsense phrases.
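As a rough illustration of how these buckets combine, here is a toy classifier over precomputed signals; every field name and threshold is hypothetical, a stand-in for the project's actual confidence scores rather than its real rules:

```python
def classify_tag(sig):
    """Toy decision cascade mirroring the four buckets above.

    `sig` is a dict of precomputed signals (all names hypothetical):
      wiki_exact - tag is identical to a Wikipedia entry
      volume     - estimated search volume for tag + product
      cpc        - bid value of the tag
      vec_good   - word-vector neighbor is an existing "Good Tag"
      wiki_good  - tag's Wikipedia redirect matches a "Good Tag"
    """
    if sig["wiki_exact"] and sig["volume"] > 0 and sig["cpc"] > 0:
        return "good"    # keep and index
    if sig["wiki_exact"] and sig["volume"] == 0 and sig["vec_good"]:
        return "okay"    # keep on page, redirect/canonicalize to master
    if (not sig["wiki_exact"] and not sig["vec_good"]
            and sig["volume"] == 0 and sig["wiki_good"]):
        return "remap"   # delete and replace with the corrected tag
    return "remove"      # unsalvageable; purge from the database

print(classify_tag({"wiki_exact": True, "volume": 320,
                    "cpc": 1.4, "vec_good": False, "wiki_good": False}))
```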

Altogether, we were able to reduce the number of tags by 87.5%, consolidating the site down to a reasonable, targeted, and useful set of tags which properly organized the corpus without either wasting crawl budget or limiting user engagement.

Conclusions: Advanced white hat SEO

It was nearly nine years ago that a well-known black hat SEO called out white hat SEO as being simple, stale, and bereft of innovation. He claimed that "advanced white hat SEO" was an oxymoron — it simply did not exist. I was proud at the time to respond to his claims with a technique Hive Digital was using which I called "Second Page Poaching." It was a great technique, but it paled in comparison to the sophistication of methods we now see today. I never envisioned either the depth or breadth of technical proficiency which would develop within the white hat SEO community for dealing with unique but persistent problems facing webmasters.

I sincerely doubt most of the readers here will have the specific tag sprawl problem described above. I'd be lucky if even a few of you have run into it. What I hope is that this post might disabuse us of any caricatures of white hat SEO as facile or stagnant and inspire those in our space to their best work.



Mark Zuckerberg returns to his Harvard dorm on Facebook Live


Mark Zuckerberg, CEO of Facebook, today returned to his old Harvard dorm room 13 years after dropping out — and he broadcast the whole tour on Facebook Live for several thousand viewers. Zuckerberg and his wife Priscilla Chan toured H-33 of Kirkland House, walking viewers through his activities there, including coding what was then known as “thefacebook.com.” The two reminisce at length about his roommates, two of whom — Dustin Moskovitz and Chris Hughes — would go on to found Facebook with Zuckerberg. Like all things Zuckerberg, it’s about 50 percent pleasant and 50 percent awkward. Zuckerberg says things like, “In this hallway,…

This story continues at The Next Web

Or just read more coverage about: Facebook

Uber ponies up cash to underpaid NYC drivers


Uber is making reparations to drivers in New York City after an accounting error left them underpaid for years. According to Quartz, Uber was taking its commission based on the gross fare, or the total amount paid by the passenger. However, the terms of service state that Uber takes its commission based on the net fare, or what the passenger pays before fees and taxes. In total, Uber collected 2.6 percent more from its drivers than it should have. It seems like Uber can’t go a single week without getting more bad press. Most recently, it was getting into a legal tussle with…

This story continues at The Next Web

Or just read more coverage about: Uber

Samsung S8’s iris scanner fooled by photograph of an eye


Just one month after the Samsung Galaxy S8’s release, German hackers have already figured out a way around the phone’s iris recognition software. The Chaos Computer Club, a European hacker group, published its account of hacking the S8’s biometrics via a few simple tools such as a camera and a contact lens. It also showed video footage of the successful workaround. Here are the ingredients in the Samsung Sensor Scramble, if you ever want to make it yourself:

  • One camera (ordinary point-and-shoot will do)
  • One laser printer, made by Samsung for the added spice of irony
  • One contact lens
  • One Samsung Galaxy…

This story continues at The Next Web

Or just read more coverage about: Samsung

Google is now selling its 4K digital whiteboard for $5000


Jamboard, Google’s 4K whiteboard-like touchscreen, is now available for sale — for an appropriately large sum.

“Jamboard, our cloud-based collaborative, digital whiteboard is now available: https://t.co/g8N9WWDxRO pic.twitter.com/U9YqB481lS” — G Suite (@gsuite) May 23, 2017

As my colleague Napier noted when the Jamboard was announced, it looks like a whiteboard, acts like a tablet, and comes with a few other collaboration tools. It lets multiple people, even remote employees, work together in G Suite from a central location. The big touchscreen is obviously designed with a business team in mind. It has its own movable stand (sold separately, natch), complete with a single cable to…

This story continues at The Next Web

Or just read more coverage about: Google

Snapchat launches custom Stories you can create with your friends


Snapchat’s latest feature lets users create custom Stories with their friends, making a communal Story based on a shared event or location. When you create a custom Story, you can add your friends as contributors. When they create a new Snap, they have the option to add it to your communal Story. You can also “Geofence” the Story to a particular location, which I assume means you can only see it if you’re in the right place. Snap says, “It’s perfect for a trip, a birthday party, or a new baby story just for the family.” Communal Stories disappear if…

This story continues at The Next Web

Or just read more coverage about: Snapchat