
Feed aggregator

scikit-learn: TF/IDF and cosine similarity for computer science papers

Mark Needham - Wed, 07/27/2016 - 04:45

A couple of months ago I downloaded the metadata for a few thousand computer science papers so that I could try to write a mini recommendation engine to tell me which paper I should read next.

Since I don’t have any data about who has read each paper, a collaborative filtering approach is ruled out; instead I thought I could try content-based filtering.

Let’s quickly check the Wikipedia definition of content based filtering:

In a content-based recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes.

In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present).

We’re going to focus on the “finding similar items” part of the algorithm, and we’ll start simple by calculating the similarity of items based on their titles. We’d probably get better results if we used the full text of the papers, or at least the abstracts, but that data isn’t as readily available.

We’re going to take the following approach to work out the similarity between any pair of papers:

for each paper:
  generate a TF/IDF vector of the terms in the paper's title
  calculate the cosine similarity of each paper's TF/IDF vector with every other paper's TF/IDF vector

This is very easy to do using the Python scikit-learn library and I’ve actually done the first part of the process while doing some exploratory analysis of interesting phrases in the TV show How I Met Your Mother.

Let’s get started.

We’ve got one file per paper which contains the title of the paper. We first need to iterate through that directory and build an array containing the papers:

import glob

corpus = []
for file in glob.glob("papers/*.txt"):
    with open(file, "r") as paper:
        corpus.append((file, paper.read()))

Next we’ll build a TF/IDF matrix from the titles of the papers:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform([content for file, content in corpus])

Next we’ll write a function that will find us the top n similar papers based on cosine similarity:

from sklearn.metrics.pairwise import linear_kernel

def find_similar(tfidf_matrix, index, top_n=5):
    cosine_similarities = linear_kernel(tfidf_matrix[index:index+1], tfidf_matrix).flatten()
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index]
    return [(i, cosine_similarities[i]) for i in related_docs_indices][:top_n]
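
One detail worth calling out: linear_kernel computes a plain dot product, which only equals cosine similarity here because TfidfVectorizer L2-normalizes each row by default (norm='l2'). A quick sketch with toy titles (not the real corpus) to confirm the equivalence:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy titles standing in for the real corpus (hypothetical examples)
titles = [
    "a reliable ordered delivery protocol for local area networks",
    "high speed switch scheduling for local area networks",
    "hypertext transfer protocol",
]
tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(titles)

# Because each tf-idf row has unit L2 norm, the dot product computed by
# linear_kernel is already the cosine similarity
dot_products = linear_kernel(tfidf_matrix, tfidf_matrix)
cosines = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(np.allclose(dot_products, cosines))  # True
```

linear_kernel is simply a bit faster than cosine_similarity since it skips the (redundant) normalisation step.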

Let’s try it out:

>>> corpus[1619]
('papers/221215.txt', 'TOTEM: a reliable ordered delivery protocol for interconnected local-area networks')
>>> for index, score in find_similar(tfidf_matrix, 1619):
       print score, corpus[index]
0.917540397202 ('papers/852338.txt', 'A reliable ordered delivery protocol for interconnected local area networks')
0.248736845733 ('papers/800897.txt', 'Interconnection of broadband local area networks')
0.207309089025 ('papers/103726.txt', 'High-speed local area networks and their performance: a survey')
0.204166719869 ('papers/161736.txt', 'High-speed switch scheduling for local-area networks')
0.198514433132 ('papers/627363.txt', 'Algorithms for Distributed Query Processing in Broadcast Local Area Networks')

It’s pretty good for finding duplicate papers!

>>> corpus[1599]
('papers/217470.txt', 'A reliable multicast framework for light-weight sessions and application level framing')
>>> for index, score in find_similar(tfidf_matrix, 1599):
       print score, corpus[index]
1.0            ('papers/270863.txt', 'A reliable multicast framework for light-weight sessions and application level framing')
0.139643354066 ('papers/218325.txt', 'The KryptoKnight family of light-weight protocols for authentication and key distribution')
0.134763799612 ('papers/1251445.txt', 'ALMI: an application level multicast infrastructure')
0.117630311817 ('papers/125160.txt', 'Ordered and reliable multicast communication')
0.117630311817 ('papers/128741.txt', 'Ordered and reliable multicast communication')

But sometimes it reports papers as exact duplicates (similarity 1.0) when they aren’t actually identical:

>>> corpus[5784]
('papers/RFC2616.txt', 'Hypertext Transfer Protocol -- HTTP/1.1')
>>> for index, score in find_similar(tfidf_matrix, 5784):
       print score, corpus[index]
1.0 ('papers/RFC1945.txt', 'Hypertext Transfer Protocol -- HTTP/1.0')
1.0 ('papers/RFC2068.txt', 'Hypertext Transfer Protocol -- HTTP/1.1')
0.232865694216 ('papers/131844.txt', 'XTP: the Xpress Transfer Protocol')
0.138876842331 ('papers/RFC1866.txt', 'Hypertext Markup Language - 2.0')
0.104775586915 ('papers/760249.txt', 'On the transfer of control between contexts')

Having said that, if you were reading and liked the HTTP 1.0 RFC the HTTP 1.1 RFC probably isn’t a bad recommendation.

There are, of course, also some papers that get identified as similar which aren’t. I created a CSV file containing the 5 most similar papers for each paper, as long as the similarity is greater than 0.5. You can see the script that generates that file on GitHub as well.
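
The CSV-generating script might look roughly like this (the 0.5 threshold and top-5 limit come from above; the output file name and the toy corpus here are assumptions for illustration):

```python
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the real corpus of (file, title) pairs built earlier
corpus = [
    ("papers/1.txt", "reliable ordered delivery protocol for local area networks"),
    ("papers/2.txt", "a reliable ordered delivery protocol for interconnected local area networks"),
    ("papers/3.txt", "high speed switch scheduling"),
]
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
tfidf_matrix = tf.fit_transform([content for file, content in corpus])

with open("similarities.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["file", "similar_file", "score"])
    for index, (file, _) in enumerate(corpus):
        scores = linear_kernel(tfidf_matrix[index:index + 1], tfidf_matrix).flatten()
        # Top 5 other papers, keeping only scores above the 0.5 threshold
        top = [i for i in scores.argsort()[::-1] if i != index][:5]
        for other in top:
            if scores[other] > 0.5:
                writer.writerow([file, corpus[other][0], scores[other]])
```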

That’s as far as I’ve got for now but there are a couple of things I’m going to explore next:

  • How do we know if the similarity suggestions are any good? How do we measure good? Would using a term counting vector work better than TF/IDF?
  • Similarity based on abstracts as well as/instead of titles

All the code from this post for calculating similarities and writing them to CSV is on GitHub as well, so feel free to play around with it.

Categories: Blogs

Agile 2016 Video Podcasts!

Leading Agile - Mike Cottmeyer - Tue, 07/26/2016 - 22:01

This week Atlanta is hosting the biggest Agile event ever put together. There are 2,500 people attending Agile 2016! If you weren’t able to make it, you don’t have to miss out. LeadingAgile is doing video podcast interviews with the speakers and thought leaders who are helping to reshape the way we work.

You can watch the videos on Vimeo or Facebook.

And keep checking back… we’ll be posting new videos all week long.


The post Agile 2016 Video Podcasts! appeared first on LeadingAgile.

Categories: Blogs

The Executives (Step-by-Step) Guide to Leading Large-Scale Agile Transformation #agile2016

Leading Agile - Mike Cottmeyer - Tue, 07/26/2016 - 21:55

In case anyone is interested, here is my talk from Monday’s session at Agile 2016. Thanks to everyone that made it out. Had a great time.

The Executives Guide from Mike Cottmeyer

The post The Executives (Step-by-Step) Guide to Leading Large-Scale Agile Transformation #agile2016 appeared first on LeadingAgile.

Categories: Blogs

Incorporating Lean and Kanban with Leanban

Learn about Leanban at LeanKit: a project management approach that uses Lean thinking to incorporate Scrum and Kanban into Agile software development.

The post Incorporating Lean and Kanban with Leanban appeared first on Blog | LeanKit.

Categories: Companies

Extensions and Simplifications for Calculated Custom Fields

TargetProcess - Edge of Chaos Blog - Tue, 07/26/2016 - 15:52

We’ve made a few extensions to Calculated Custom Fields with some of our recent releases (3.8.9 and 3.9.0). We also simplified the formulas required for Boolean expressions and calculations with potentially null values. I’ve defined these terms below in case you’re not sure what they are.

Calculated Custom Fields: Used to create your own metrics for custom fields in Targetprocess.

Null value: A value which is unknown, e.g. hours of effort remaining for a project which has not been started. Please note that null does not equal zero; it’s simply an unknown variable. A nullable expression is any calculation with a potentially null result.

Boolean expression: This is basically just a true / false statement. If a field generates a yes or no answer, it’s probably using a Boolean expression.


Nullable Expressions:

We’ve extended Calculated Custom Fields to work much better with nullable expressions by adding an IFNONE operator.  This has simplified the formulas needed for such calculations by removing some of the developer terms which were formerly required, such as ternary operators.

So, when you need a default value for a nullable expression, you can use IFNONE(a, 0) instead of a.HasValue ? a.Value : 0.

Old: Bugs.SUM(TimeSpent).HasValue ? Bugs.SUM(TimeSpent).Value : 0

New: IFNONE(Bugs.SUM(TimeSpent), 0).
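
For readers who think in code, IFNONE behaves like a null-coalescing default. A rough Python analogue (not Targetprocess syntax, just an illustration of the semantics):

```python
def ifnone(value, default):
    """Return value, unless it is null (None), in which case return default."""
    return default if value is None else value

# Old-style ternary: Bugs.SUM(TimeSpent).HasValue ? Bugs.SUM(TimeSpent).Value : 0
# New-style:         IFNONE(Bugs.SUM(TimeSpent), 0)
time_spent_sum = None  # a nullable aggregate: no time logged yet
print(ifnone(time_spent_sum, 0))  # 0
print(ifnone(12.5, 0))            # 12.5
```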


Decimal Values

Decimal values are now automatically converted to double (nullable double) values in case a double value and a decimal value must be used in one expression. Basically, this means you don’t have to use the phrase Convert.ToDouble(it.Decimal) when working with decimal values. This also works for IFNONE/IIF operators: you can simply write IFNONE(it.NullableDouble, it.NullableDecimal) or IIF(it.NullableDecimal.HasValue, it.NullableDecimal.Value, it.Double).

Old: Convert.ToDouble(TotalPoints.HasValue?TotalPoints.Value:0) / (EndDate-StartDate).TotalDays

New: (TotalPoints.HasValue?TotalPoints.Value:0) / (EndDate-StartDate).TotalDays


Nullable Bool:

Nullable bool is now automatically converted to bool (with “false” used for null values). So, if nullable bool is used in a place where bool should be used, it will be converted from it.NullableBool to it.NullableBool == True.

Old: IIF(EntityState.IsFinal == True,0,Effort)

New: IIF(EntityState.IsFinal,0,Effort).

If you have any questions about these changes, please feel free to ask us in the comments section below or contact our support team. Have a good day! 

Categories: Companies

The Fastest Way For A Coder To Get Fired

Derick Bailey - new ThoughtStream - Tue, 07/26/2016 - 15:30

What do you mean, “why are the websites down?” – I asked as I turned to look at my boss, in confusion.

“They’re not down.“

“See, they’re …”


… WHAT?!

It was the early 2000’s and I was nervous about what I had built – a new and ambitious application with a new language runtime and a new development platform.

Initially, there was a sigh of relief when I saw my work on the production website. But that was quickly overshadowed by the horror of what I now saw.

An error – right there on screen, where my website should be – was telling me that the app couldn’t connect to the database.

I restarted the app, and it connected.

WHEW! That was a close one.

A moment later, the site was down, again.


I checked the database management console, and saw that all available connections had been used, and none of them were releasing.

My connections being eaten up?

They aren’t closing, or recycling in the connection pool?!

I didn’t understand. I had my connection “.close()” method right there – just like always – right after my return statement…

It took almost 2 hours to figure out the mistake I had made

… and it was a mistake you can probably guess, from my description above.

return someData;


The “return” statement in VB.NET exits the function immediately. The database connection would never close.
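
The same trap exists in any language whenever cleanup code is written after a return: it simply never runs. A minimal Python sketch (with a hypothetical connection class) of the broken pattern next to the fix:

```python
class Connection:
    """Hypothetical stand-in for a pooled database connection."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def broken_query(conn):
    # Bug: close() sits after the return, so it is dead code and the
    # connection goes back to the pool unclosed
    return "some data"
    conn.close()

def fixed_query(conn):
    # Fix: a finally block runs even when the function returns early
    try:
        return "some data"
    finally:
        conn.close()

leaked, released = Connection(), Connection()
broken_query(leaked)
fixed_query(released)
print(leaked.closed, released.closed)  # False True
```

In Python you would normally reach for a context manager (`with`) instead, but the try/finally version makes the ordering problem explicit.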

How stupid is that?!” I thought.

VB6 didn’t do it that way. Why would they change that in VB.NET?!

Fix It. NOW!

No time to figure out why VB.NET is different – too much pressure…

Boss breathing down my neck; customer support holding back the horde of angry distributors; C-level execs asking why the sites are down!

I found every database connection “.close()” and moved them one line up.

I deployed. It worked.

The sites were up, stayed up, and I slowly started to breathe.

Then my boss called me into his office.

Looking back, I know my mistake was one of assumption.

I assumed the language I used for that app would behave the same as the previous language in which I had worked.

They shared a similar syntax, after all. I thought they were the same.

It was the assumption that killed me – a mistake that many developers make with JavaScript.

JavaScript has a familiar syntax

It is familiar in the same way VB.NET was familiar to me, back in those days, because I had worked in VB6, previously.

JavaScript does look a lot like Java, C++, or C#, yes. But, the differences can be staggering.

C#, for example, only allows if statements to evaluate strict boolean values. But, JavaScript will coerce any value into a boolean, implicitly.

C# has a strict syntax for encapsulating code. JavaScript is a bit iffy on structure for encapsulation. It allows code to be encapsulated, though.

In C#, “this” always points to the object on which a function is defined.

But, in JavaScript? Nope.

JavaScript’s “this” may be the most notorious keyword in the language. But the headache of “this” is only a symptom of the real problem:

Misunderstanding the language fundamentals.

A lack of knowledge in the fundamentals of any language is dangerous, at best.

You may end up with code that looks like it works, but won’t stand up to the pressures of a production deployment.

I found that out the hard way, with VB.NET, all those years ago.

Ultimately, learning the fundamentals of any language is important.

It doesn’t matter what the language is – and JavaScript is no exception.

In fact, JavaScript may be the ultimate example of why you need to study and learn the fundamentals.

With language features that do not work as one would expect, coming from C# or other places, JavaScript is easily misunderstood and full of pitfalls and pain.

And JavaScript’s “this” is a prime example.

With behavior that looks like C# in some circumstances, but behaves in what look like unexpected and unpredictable ways in other circumstances, “this” is easily the most notorious feature of the web’s darling language.

Clear the air of obscurity and uncertainty in JavaScript’s fundamentals with my email course on the 6 rules of mastering JavaScript’s “this”. It’s completely free and the sign up is just below.

Categories: Blogs

Do Agile Teams ‘Storm’ In Different Ways?

Learn more about transforming people, process and culture with the Real Agility Program

Team Discussion

Agile transformation coaches promise their clients the positive outcome of “high-performance teams.”

According to the well-cited psychologist B. W. Tuckman, teams go through four stages on their way to high performance. The end result seems to be a self-organizing team which effectively delivers to clients or customers with increasing satisfaction and continuous development and growth.

However, agile teams are different than regular teams. Aren’t they?

What I mean is that, right from the outset, individuals in an agile culture expect to confront change and take it in stride. They are expected to adapt quickly, even in uncertain environments. Their experience of team development is therefore different, right from the outset.

Consider what Debbie Madden has to say in her article The Increasing Fluidity of Agile Practices Across Teams. She writes that, “most companies either claim they are Agile, are trying to become Agile, or have tried Agile. In truth, what I see today is a lot of customized Agile. In fact, the term “Traditional Agile” has come to mean the pure, original implementation of Agile. And, most companies are not following “Traditional Agile”. Instead, teams are customizing Agile to fit their needs, making the fluidity of Agile more prominent now than ever before.”

What this says to me is that since “Traditional Agile” has been around long enough now, teams have internalized the principles and values enough to understand change is to be expected and they have strategies in place to adapt well.

It says to me that teams are now taking Agile to a whole new level. They are making it their own. Adapting. Shaping. Moulding. Sculpting. The fluid nature of Agile gives teams permission to do this.

If we take Tuckman’s four-stage model and insert some agile thinking, what we might come out with is an awareness that agile teams do what Debbie said they do. They make things up as they go along and they get the job done.

In this way, what might have been called “storming” by the old standards and definitions of team development can really also be called “high-performance” when the team is agile.

Perhaps some agile teams can create their own team development model and one of the stages is “high-performing storming” and maybe that is not even the final outcome but maybe it is the starting point on Day One!

Wouldn’t that be something?

Learn more about our Scrum and Agile training sessions on WorldMindware.com. Please share!

The post Do Agile Teams ‘Storm’ In Different Ways? appeared first on Agile Advice.

Categories: Blogs

Product SAFeTY

Agile Management Blog - VersionOne - Tue, 07/26/2016 - 14:30

About seven years ago, I started a deep dive into all things product. I found myself coaching teams that could consistently produce, but not consistently produce the right product. I shifted DevJam coaching to be a blend of product discovery … Continue reading →

The post Product SAFeTY appeared first on The Agile Management Blog.

Categories: Companies

Why Scaling Agile Does Not Work

Scrum Expert - Tue, 07/26/2016 - 14:00
There are now several frameworks designed for scaling agile. This talk explains the flaws in such frameworks, why they so often fail to produce the desired effects, and what we should do instead. It also addresses some common organizational obstacles to moving fast at scale: governance, budgeting, and the project paradigm – and discusses how to address them. Warning: this talk includes liberal use of real, statistically sound data. Video producer:
Categories: Communities

Manifesto for Agile Change

Ben Linders - Tue, 07/26/2016 - 10:47

The manifesto for agile change helps organizations to increase their agility. It provides lasting improvement of results, satisfied customers, and happy employees. This first article about Agile Change describes the starting points and values with the help of the manifesto for agile change.

Agile software development is based on the Manifesto for Agile Software Development. That manifesto contains four values and twelve principles. The manifesto for agile change is built up in the same way. It describes my vision and way of working in organizational change, summarized in four values.

My change “values”



These are the values of my Manifesto for Agile Change:

  • Involving professionals and giving room for ideas over standardizing and prescribing work processes
  • Stepwise, evolutionary improvement from within over top-down imposition of changes
  • Result-oriented, intensive collaboration over directive goals with “command & control” management
  • Prioritizing and flexibly responding to opportunities over budgeting and executing change plans

The values on the right-hand side of the statements above are and remain important, but I prefer to give more attention to the values on the left-hand side. That is why, for example, I prefer mapping out the existing work processes together with the employees and jointly working on improvement using retrospectives, rather than an organization-wide rollout of Scrum with standard trainings. And I would rather work with a change backlog, in which priorities are easy to adjust, than with a fixed plan.

Change itself changes too

Unlike the Agile Manifesto, which has stayed the same for 15 years, I expect that this manifesto will change. The first evolution can already be seen if you compare it with the change manifesto of veranderproject, a collaboration from a few years ago. For example, words like “verbinden” (connecting) have been elaborated into “result-oriented, intensive collaboration”, and the manifesto for agile change names the role of the professional and a bottom-up approach to change.

In the near future I will publish several articles that go deeper into the values of this manifesto. Among other things, I will give examples of involving professionals in change initiatives, top-down versus bottom-up change, evolutionary versus revolutionary change, and result-oriented change.

Categories: Blogs

Sponsorship Information

Agile Ottawa - Tue, 07/26/2016 - 00:11
We are looking for sponsors for GOAT 2016. There are 5 categories with various spots per category. As a sponsor you will have access to a specialized and influential audience to:

  • Increase your brand awareness
  • Create product marketing opportunities
  • …

Continue reading →
Categories: Communities

Scrum Day Europe 2016

Xebia Blog - Mon, 07/25/2016 - 11:50
During the 5th edition of Scrum Day Europe, Laurens and I facilitated a workshop on how to “Add Visual Flavor to Your Organization Transformation with Videoscribe.” The theme of the conference, “The Next Iteration,”  was all about the future of Scrum. We wanted to tie our workshop into the theme of the conference, so we had
Categories: Companies

Links for 2016-07-24

Zachariah Young - Mon, 07/25/2016 - 09:00
Categories: Blogs

Targetprocess v.3.9.0: TP2 removal, minor improvements

TargetProcess - Edge of Chaos Blog - Sat, 07/23/2016 - 12:59
TP2 removal

As we described in our Phasing out Targetprocess v.2 blog post, we've stopped supporting Targetprocess v.2. This means you'll no longer have access to most features which were exclusive to v.2; only Custom reports, Time sheets, and Global Settings will be preserved. However, all of your data should still be available in Targetprocess 3.

If you have any questions about this change, please contact us.

Minor improvements:
  • Improved board performance
  • Quick Add buttons are now always visible in List mode
Fixed Bugs:
  • Fixed Feedback popup that appeared too often
  • Fixed a problem where Custom Rule ‘Close a User Story when all its Tasks are done’ only took the first 25 tasks into consideration
  • Fixed issue with displayed password in Git plugin log when timeout occurs
  • Fixed high CPU usage by Cache context
  • Fixed units in List that were incorrectly displayed as non-sortable
  • Fixed Search icon alignment in Safari
  • Fixed CKeditor table alignment
  • Fixed filter font size in Lookup bubble
Categories: Companies

What’s the Difference Between a Skilled Agilist and a Great Agile Team Coach?

BigVisible Solutions :: An Agile Company - Fri, 07/22/2016 - 19:00

Isn’t it enough to have great Agilists tell your people how to do Agile? Well, no.

We know that training alone is not enough to start great teams. Starting a team with just training is like feeding corn to chickens. Most of it goes right through, and pretty soon they’re eating whatever they can scrounge up.

The most reliable approach for a team start or re-start is to follow initial training with onsite coaching. This helps the concepts of Agile settle in and become a way of living. Having coaches available for new teams, leaders, and managers and for ongoing support is critical to the success of the Agile initiative.

I’m in my tenth year of working in Agile, and one thing that’s become very clear to me from studying great coaches is this: it’s not enough to know a lot about Agile. Many highly skilled Agile practitioners are not very good trainers, much less coaches. Coaching is a professional capability that not only includes domain knowledge but also the skills of teaching and much more. A skilled Agilist may be able to clearly explain the Agile values and principles; may be up to the minute on the latest Scrum Guide details; may know Lean and Kanban and Theory of Constraints; may understand ceremonies, roles, artifacts, waste reduction, team process, and scaling frameworks; may be skilled with XP engineering practices. These are all important—but all of them together are not enough to make a great coach.

As Lyssa Adkins explained in her groundbreaking book, Coaching Agile Teams, a coach needs to be able to engage in a variety of ways based on the needs of the situation. These include modalities commonly recognized as teaching and coaching, as well as facilitation, mentoring, problem-solving, and working with conflict and collaboration.

Lyssa Adkins along with Michael Spayd established the Agile Coaching Institute (ACI) to help Agilists develop “competence and confidence in the profession of Agile coaching.” ACI has since identified the stances and techniques that a coach of Agile teams must be capable of performing, creating this diagram to illustrate the range of skills required in a top-flight Agile coach:


A successful coach needs a great deal of self-awareness and self-mastery. Development as a coach includes challenges of maturation, not just skill acquisition. A coach needs to show up in a way that manifests the values and qualities that are important for Agile to succeed within a team and in the team’s organizational environment. These include a willingness to be vulnerable, an attitude of inquiry, and a genuine belief in the value of collaboration and creativity rather than a reliance on expertise or control. The successful organization in today’s world is a learning organization, and to really become one requires that team members, leaders, managers, and coaches all be committed to ongoing learning at a personal level, as well as in terms of the organization itself.

Recognizing the range and depth required of a serious coach can be the first step on an important developmental journey. I know it was for me. After several experiences of watching admired coaches ask powerful questions—or even remain silent—instead of simply giving answers, I realized there was territory here that I wanted to master. And this kind of understanding has informed SolutionsIQ’s position on the importance of coach development more broadly.

Lyssa and Michael and their partners, our long-time friends at the Agile Coaching Institute (ACI), have developed what we believe to be the world’s leading curriculum for developing Agile coaches. We recently announced that SolutionsIQ will be offering ACI team coaching courses—The Agile Facilitator, Coaching Agile Teams, and the Agile Coach Bootcamp—and that a group of SolutionsIQ facilitators are undergoing intensive preparation to offer these workshops to our clients, the general public, and our own consultants. These workshops are also accredited by the International Consortium for Agile (ICAgile), so we will be able to prepare students for these prestigious certifications. Very few programs in the world meet the ICAgile Learning Objectives in Team Coaching. We believe the ACI program that we will offer is by far the best of them.

Read the full press release about SolutionsIQ and ACI’s partnership.

The post What’s the Difference Between a Skilled Agilist and a Great Agile Team Coach? appeared first on SolutionsIQ.

Categories: Companies

Automated pipeline support for Consumer Driven Contracts

Putting the tea into team - Ivan Moore - Fri, 07/22/2016 - 18:38
When developing a system composed of services (maybe microservices) some services will depend on other services in order to work. In this article I use the terminology "consumer"[1] to mean a service which depends on another service, and "provider" to mean a service being depended upon. This article only addresses consumers and providers developed within the same company. I'm not considering external consumers here.

This article is about what we did at Springer Nature to make it easy to run CDCs - there is more written about CDCs elsewhere.
CDCs - the basics

CDCs (Consumer Driven Contracts) can show the developers of a provider service that they haven't broken any of their consumer services.

Consider a provider service called Users which has two consumers, called Accounting and Tea. Accounting sends bills to users, and Tea delivers cups of tea to users.

The Users service provides various endpoints, and over time the requirements of its consumers change. Each consumer team writes tests (called CDCs) which check whether the provider service (Users in this case) understands the message sent to it and responds with a message the consumer understands (e.g. using JSON over HTTP).
How we used to run CDCs

We used to have the consumer team send the provider team an executable (e.g. an executable jar) containing the consumer's CDCs. It was then up to the provider team to run those CDCs as necessary, e.g. by adding a stage to their CD (Continuous Delivery) pipeline. A problem with this was that it required manual effort from the provider team to set up such a stage in the CD pipeline, and further effort each time a consumer team wanted to update its CDCs.
How we run them now

Our automated pipeline system allows consumers to define CDCs in their own repository and declare which providers they depend upon in their pipeline metadata file. Using this information, the automated pipeline system adds a stage to the consumer's pipeline to run its CDCs against its providers, and also a stage in the provider's pipeline to run its consumers' CDCs against itself. In our simple example earlier this means the pipelines for Users, Accounting and Tea will be something like this:




i.e. Users runs Accounting and Tea CDCs against itself (in parallel) after it has been deployed. Accounting and Tea run their CDCs against Users before they deploy.

This means that:
  • when a change is made to a consumer (e.g. Tea), its pipeline checks that its providers are still providing what is needed. This is quite standard and easy to arrange.
  • when a change is made to a provider (e.g. Users), its pipeline checks that it still provides what its consumers require. This is the clever bit that is harder to arrange. This is the point of CDCs.
Benefits of automation

By automating this setup, providers don't need to do anything in order to incorporate their consumers' CDCs into their pipeline. The providers also don't have to do anything in order to get updated versions of their consumers' CDCs.

The effort of setting up CDCs rests with the teams who have the dependency, i.e. the consumers. The consumers need to declare their provider (dependency) in their metadata file and define and maintain their CDCs.
Subtleties

There are a few subtleties involved in this system as it is currently implemented.
  • the consumer runs its CDCs against the provider in the same environment it is about to deploy into. There may be different versions of providers in different environments and this approach checks that the provider works for the version of the consumer that is about to be deployed, and will prevent the deployment if it is incompatible.
  • the provider runs the version of each consumer's CDCs corresponding to the version of the consumer in the same environment that the provider has just deployed into. There may be different versions of consumers in different environments and this approach checks that the provider works for the versions of consumers that are using the provider.
  • the system deploys the provider before running the consumer CDCs because the consumer CDCs need to run against it. It would be better for the system to deploy a new version of the provider without replacing the current version, run its consumers' CDCs and then only switch over to the new version if the CDCs all pass.
  • because the consumer's CDCs need to run against the provider in the appropriate environment, the system sets an environment variable with the host name of the provider in that environment. Because we only have one executable per consumer for all its CDCs, if a consumer has multiple providers, it needs to use those environment variables in order to determine which of its CDCs to execute.
Implementation notes

The implementation of a consumer running its CDCs against its provider is relatively straightforward. The difficulties arise when a provider runs its consumers' CDCs against itself.

In order for a provider to run its consumers' CDCs the system clones each consumer's repository at the appropriate commit and then runs the appropriate executable in a Docker container. (The implementation doesn't clone every time, just if the repository hasn't been cloned on that build agent before.) Using Docker for running the CDCs means that consumers can implement their CDCs using whatever technology they want, as long as it runs in Docker.

In our system, all services are required to implement an endpoint which returns the git hash of the commit they are built from. This is used to work out which version of a consumer's CDCs to run when they are run in a provider's pipeline.

Automating the running of CDCs in our automated pipelines required either making providers know who their consumers are, or consumers know who their providers are. If a provider doesn't provide what a consumer requires, it causes more problems for the consumer than for the provider. Therefore we made it the responsibility of the consumer to define that it depends on the provider rather than the other way around.

1 Other terminology in use for consumer is "downstream" and for provider is "upstream". A consumer is a dependant of a provider. A provider is a dependency of a consumer. I sometimes use the word producer instead of provider.

Copyright © 2016 Ivan Moore
Categories: Blogs

Mahout/Hadoop: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

Mark Needham - Fri, 07/22/2016 - 15:55

I’ve been working my way through Dragan Milcevski’s mini tutorial on using Mahout to do content based filtering on documents and reached the final step where I needed to read in the generated item-similarity files.

I got the example compiling by using the following Maven dependency:
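The dependency snippet didn't survive in this copy of the post; given the mahout-core 0.9.0 version mentioned below, it was presumably along these lines:

```xml
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
</dependency>
```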


Unfortunately when I ran the code I ran into a version incompatibility problem:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
	at org.apache.hadoop.ipc.RPC.checkVersion(
	at org.apache.hadoop.hdfs.DFSClient.createNamenode(
	at org.apache.hadoop.hdfs.DFSClient.<init>(
	at org.apache.hadoop.hdfs.DFSClient.<init>(
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
	at org.apache.hadoop.fs.FileSystem.createFileSystem(
	at org.apache.hadoop.fs.FileSystem.access$200(
	at org.apache.hadoop.fs.FileSystem$Cache.get(
	at org.apache.hadoop.fs.FileSystem.get(
	at org.apache.hadoop.fs.FileSystem.get(
	at com.markhneedham.mahout.Similarity.getDocIndex(
	at com.markhneedham.mahout.Similarity.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at com.intellij.rt.execution.application.AppMain.main(

Version 0.9.0 of mahout-core was published in early 2014 so I expect it was built against an earlier version of Hadoop than I’m using (2.7.2).

I tried updating the Hadoop dependencies referenced in the stack trace, to no avail.


When stepping through the stack trace I noticed that my program was still using an old version of hadoop-core, so with one last throw of the dice I decided to try explicitly excluding that:
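The exclusion snippet was lost in this copy of the post; excluding the old hadoop-core transitive dependency from mahout-core would look something like this (version as above):

```xml
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.9</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```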


And amazingly it worked. Now, finally, I can see how similar my documents are!

Categories: Blogs

Hadoop: DataNode not starting

Mark Needham - Fri, 07/22/2016 - 15:31

In my continued playing with Mahout I eventually decided to give up using my local file system and use a local Hadoop instead since that seems to have much less friction when following any examples.

Unfortunately all my attempts to upload any files from my local file system to HDFS were being met with the following exception:

File /user/markneedham/book2.txt could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.ipc.WritableRpcEngine$
at org.apache.hadoop.ipc.Server$Handler$
at org.apache.hadoop.ipc.Server$Handler$
at Method)
at org.apache.hadoop.ipc.Server$
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(
at org.apache.hadoop.hdfs.DFSOutputStream$

I eventually realised, from looking at the output of jps, that the DataNode wasn’t actually starting up which explains the error message I was seeing.

A quick look at the log files showed what was going wrong:


2016-07-21 18:58:00,496 WARN org.apache.hadoop.hdfs.server.common.Storage: Incompatible clusterIDs in /usr/local/Cellar/hadoop/hdfs/tmp/dfs/data: namenode clusterID = CID-c2e0b896-34a6-4dde-b6cd-99f36d613e6a; datanode clusterID = CID-403dde8b-bdc8-41d9-8a30-fe2dc951575c
2016-07-21 18:58:00,496 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to / Exiting. All specified directories are failed to load.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(
2016-07-21 18:58:00,497 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to /
2016-07-21 18:58:00,602 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2016-07-21 18:58:02,607 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2016-07-21 18:58:02,608 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2016-07-21 18:58:02,610 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

I’m not sure how my clusterIDs got out of sync, although I expect it’s because I reformatted HDFS at some stage without realising. There are other ways of solving this problem, but the quickest for me was to just nuke the DataNode’s data directory, which the log file told me sits here:

sudo rm -r /usr/local/Cellar/hadoop/hdfs/tmp/dfs/data/current

I then re-ran the hstart script that I stole from this tutorial and everything, including the DataNode this time, started up correctly:

$ jps
26736 NodeManager
26392 DataNode
26297 NameNode
26635 ResourceManager
26510 SecondaryNameNode

And now I can upload local files to HDFS again. #win!

Categories: Blogs

Minimum Valuable Problem

Tyner Blain - Scott Sehlhorst - Fri, 07/22/2016 - 13:38

[Image: redacted use-case dependency thumbnail]

Defining and building a good minimum viable product is much harder than it sounds.  Finding that “one thing” you can do, which people want, is really about a lot more than picking one thing.  It is a combination of solving the minimum valuable problem and all of the other things that go with it.  Solving for both the outside-in needs and the inside-out goals is critical.

Starting with Icebergs

[Image: iceberg showing the massive hidden parts]

Rich Mironov’s great article, the DIY Illusion, talks about the importance of focusing your team on building what is important to build (and not building something more easily acquired in other ways).  Imagine your team is building a mobile app.  Now imagine your team is building – from scratch – a CRM system to allow you to track all of the users who install the app.  Or imagine they are building a ticketing system – from scratch – to allow you to track development team progress on feature requests and bug fixes.

[Image: context of framing]

I introduced Andy Polaine’s concept of designing in different contexts in an article about roadmaps and feature-lists last year.  The same pattern / concept applies here.

Rich’s article describes a micro-version of the classic buy, build, partner decision. When it is your team making decisions about dev-ops or other infrastructure that they need, this is exactly what it feels like and looks like.

Pop up to the broader organization-level context, and now it is the classic MBA version – do we build a new product to complete our portfolio?  Or do we partner with someone else to include their product?  Or maybe acquiring that partner (or just the product) makes the most sense.

Both of those decisions are firmly in the inside-out side of thinking about product.  What about the outside-in framing?  Your customers are making  buy, build, partner decisions about your product.  How do you make sure the right answer for them is “buy?”

[Image: another iceberg, emphasizing what is hidden]

An important point in Rich’s article is that the work you need to do (to roll your own <insert system here>) is much larger than a shallow analysis would lead you to believe.  The same is true about defining a minimum viable product.  Your customers will need to solve more than the single problem on which you begin your focus.

Minimum Valuable Problem

I’m going to spend the next couple weeks talking only about minimum valuable problems, and not minimum viable products, as an experiment to see if it accelerates a change in thinking with my team.  [I dropped the term first in a meeting with executives yesterday (as of when I’m typing) explaining that our product is focused on completely addressing the minimum valuable problem, and got some head nods but no direct commentary.]  If you want to know the results, ask in the comments on this post.

In my mind, I remember reading Steve Johnson quoting Saeed Khan as saying that a minimum viable product is, literally, “the least you could do.”  I hope it’s true; I love that quote.  I don’t know if that’s actually where I heard it, but let’s make it quotable, and see if some tweets cause the original author to magically appear.  An MVP is literally the least you could do with your #product.

[Image: US quarter featuring the state of Texas]

Why make the distinction between product and problem?  Two reasons – one philosophical and one practical.

Philosophical Soapbox

One thing my clients regularly value is that I join their team with a “fresh set of eyes,” bringing an external perspective on what they are doing and plan to do.  It affords me an opportunity to help shift the perspective of the team from inside-out to outside-in; in other words, to being driven by the needs of the market.  At the product level of context, this usually means being driven by the problems a particular set of users are trying to solve.  Introducing the problem as a totem in many conversations helps reinforce and subtly shift those conversations over time.  The longer I work with a particular team, the more I see the results of this.

When people talk about the product they are usually talking about “this thing we (will) build.”  That’s not nearly as valuable for me in assuring product-market fit as when people talk about the problem we are solving, especially when I’m on a team in the early discovery and definition phases.

We get more value from conversations about why someone will use the product than discussions around how the product will work.  We get more value from conversations around how the product will be used than from discussions around how much it costs to make the product.

Practical Thinking

A huge challenge in communication is one best described by a sketch of Jeff Patton’s from his book User Story Mapping.

[Image: three people discovering they don’t actually agree; for a larger version, buy Jeff’s book]

When people talk about “the product,” in my experience, everyone in the room will happily carry the conversation forward, each referring to “the product” with no one clarifying precisely what they mean.

When people talk about “the problem” we intend the product to be used to help solve, it is common for the conversation to reiterate, refine, or at least reference which problem we’re speaking about.

I don’t know why these play out in different ways, but they seem to do so.  Perhaps we’ve got a cognitive psychologist in the audience who can shed some light?

Regardless, the minimum valuable problem seems to be something people are comfortable clarifying in conversation.

Solving the Problem

I get to stand on the shoulders of another giant, Gojko Adzic, and his book, Impact Mapping, as my personal product-management jiu jitsu.  Gojko’s approach helps me very quickly define what it means to my user to solve his or her problem.

By focusing on the outcomes (there are, in fact, many ways to get to this – I just happen to find Gojko’s to be compelling), you discover that solving the problem you originally intended to solve may not be sufficient.

Your minimum viable product may be solving half of a problem.  Solving half of a problem is creating half of a product.  There may be cases where this makes sense – splitting stories, incremental delivery, etc.  But it doesn’t make sense for very long.

How often are you interested in purchasing half a solution to a problem you’re facing?  When the brake lights on your car go out, would you ask the mechanic to just fix one of them right now, and schedule a follow-up visit next month to repair the other one?

Defining the minimum valuable problem is defining the minimum viable product.

The minimum valuable problem is one you completely solve.  You may only solve it in a very narrow context or scope.  You may only solve it for a small group of users or a subset of your ultimate market.  You may only solve it adequately, without creating a truly competitive solution.  But for that one customer, in that one situation, you have completely solved it.

Remember – you grow revenue one customer at a time.  This sounds like a platitude, but reverse the perspective.  That one customer is considering multiple vendors for that one sale.  Will the customer pick the vendor who is mediocre (and also mediocre for other customers), or will the customer pick the vendor who is perfect for them (even if imperfect for other customers)?

The Problems Behind the Problem

[Image: dependency map of user stories]

The above diagram is a real view of the dependencies of an ecosystem for a product.  It is blurred out because it is real.  What it shows, in the upper left corner, is the “target problem” to be solved.  This target is a candidate for the minimum valuable problem.

Each connection in red says “requires” because for a given user to solve the problem in their blurred out box requires assistance from another user.  That other user then has to solve the problem of “help the first user.”  Or it could be that there is an operational context like “monitor performance of the [first user group] solving their problem, so we can fine tune the solution.”  When you’re doing service work, or designing whole-products, you see (or should see) this on every engagement.

In the ecosystem of a complex problem-space, we discover that there are multiple parties associated with adequately solving the user’s problem.  Each different color of user reflects a different user involved in the solution of the focus problem for the focus user.  This web of interdependent problems is the rest of the iceberg.

[Image: onion diagram]

An onion diagram for this same problem space allows us to also very quickly see (even with this redacted version) that there are three systems (or system interfaces) through which different users directly or indirectly use our product to solve their problems.

Bridging the Process Gap

These views of the problem space help us assure that we are solving a valuable problem – which is my preferred definition of a viable product.  As a bonus, they help bridge the gap between the abstract thinking of a product management team and the concrete thinking of the engineering team who will create the solution and the executive team who wants to “know what it is.”

Categories: Blogs

Mahout: Exception in thread “main” java.lang.IllegalArgumentException: Wrong FS: file:/… expected: hdfs://

Mark Needham - Thu, 07/21/2016 - 19:57

I’ve been playing around with Mahout over the last couple of days to see how well it works for content based filtering.

I started following a mini tutorial from Stack Overflow but ran into trouble on the first step:

bin/mahout seqdirectory \
--input file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo \
--output file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out \
-c UTF-8 \
-chunk 64 \
-prefix mah
16/07/21 21:19:20 INFO AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo], --keyPrefix=[mah], --method=[mapreduce], --output=[file:///Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo-out], --startPhase=[0], --tempDir=[temp]}
16/07/21 21:19:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/21 21:19:20 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/07/21 21:19:20 INFO deprecation: is deprecated. Instead, use
16/07/21 21:19:20 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/Users/markneedham/Downloads/apache-mahout-distribution-0.12.2/foo, expected: hdfs://localhost:8020
	at org.apache.hadoop.fs.FileSystem.checkPath(
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(
	at org.apache.mahout.text.SequenceFilesFromDirectory.runMapReduce(
	at org.apache.mahout.text.SequenceFilesFromDirectory.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(
	at org.apache.hadoop.util.ProgramDriver.driver(
	at org.apache.mahout.driver.MahoutDriver.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at org.apache.hadoop.util.RunJar.main(

I was trying to run the command against the local file system on my laptop, which should have been possible according to the instructions. I couldn’t find any flag I could pass to Mahout to tell it not to use HDFS, but I eventually stumbled upon someone else experiencing a similar problem.

It turns out that the last time I was playing around with Hadoop, in late 2015, I’d configured HDFS as the default file system and had completely forgotten. I needed to comment out the following config:
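The snippet was lost in this copy of the post; given the “expected: hdfs://localhost:8020” error above, the property in core-site.xml was presumably along these lines:

```xml
<!-- core-site.xml: makes HDFS the default file system -->
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
</property>
```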



I commented that property out and all was happy with the (Hadoop) world again.

Categories: Blogs
