
My slides from the PythonBrasil conference are available

Monday, November 26, 2012



Hi all,

This weekend I attended PythonBrasil, the annual Brazilian conference for Python developers. It was a great event, meeting old friends and keeping in touch with the best Python developers around Brazil. I had the opportunity to give two talks and one tutorial there.

My first presentation was about how Python is used nowadays as the main platform in several online e-learning initiatives worldwide, such as Coursera, Udacity, Codecademy, and Khan Academy. I also presented my experiences with Python in Brazilian projects such as Atepassar and PyCursos. Python is also a good first language to learn, and I showed how our team has taught it to more than 500 students around Brazil.


Slides available here.




My second presentation was about how I used Python, Hadoop, and MapReduce to scale up our recommender system currently running at Atepassar, an educational Brazilian social network with more than 150 thousand students. It covers examples, tips, and guides for anyone who wants to explore the power of distributed computing at Amazon, as well as the current effort in the Crab framework to include those features.



Slides available here.





The tutorial I gave at FGV about machine learning with Python is being updated, and I will release it soon here in this post.

Congratulations to the Rio staff for this exciting conference! Next year it will be in Brasília, Brazil.

See you there,

Marcel Caraciolo

Keynotes and Tutorial at PythonBrasil 8 about education, data mining and big data!

Saturday, November 17, 2012

Hi all,

Next week I will be at PythonBrasil, the annual meeting that brings together Python developers from all around Brazil to discuss programming and projects, and an opportunity to see some old friends, make new ones, and even do some business.


It will be a great event, with more than 50 talks about Python and related applications and projects. I will give some talks and a hands-on tutorial during the event.

On Thursday the 22nd, 2:00 PM - 6:00 PM


          I will present some machine learning techniques implemented with Python and scikit-learn, such as linear regression, naive Bayes, and several others. It is aimed at anyone who knows Python and is familiar with basic statistics.

On Friday the 23rd, 10:40 AM - 11:10 AM

          I will show how Python is used nowadays as the main platform in several online e-learning initiatives worldwide, such as Coursera, Udacity, Codecademy, and Khan Academy. I will also present my experiences with Python in Brazilian projects such as Atepassar and PyCursos. Python is also a good first language to learn, and I will show how our team has taught it to more than 400 students around Brazil.

On Saturday the 24th, 4:10 PM - 4:40 PM
          In this presentation I will show how I used Python, Hadoop, and MapReduce to scale up our recommender system currently running at Atepassar, an educational Brazilian social network with more than 150 thousand students. It covers examples, tips, and guides for anyone who wants to explore the power of distributed computing at Amazon, as well as the current effort in the Crab framework to include those features.

Those are my talks and tutorial. There are several other quite interesting talks that I want to watch; I strongly recommend attending this conference, it will be awesome.

If you want to follow the event, follow @PythonBrasil on Twitter and like the PythonBrasil8 fan page on Facebook.

Look for me at the event if you want to talk about education, Python, entrepreneurship, data mining, and big data ;D

See you at the conference!

Marcel Caraciolo

Atepassar Recommendations: Recommending friends with MapReduce and Python

Sunday, October 28, 2012

Hi all,

In this post I will present one of the techniques used at Atépassar, a Brazilian social network that helps students all over Brazil pass the exams for civil service jobs: our recommender system.



I will describe some of the data models that we use and discuss our approach to algorithmic innovation, which combines offline machine learning with online testing. For this task we use distributed computing, since we deal with over 140 thousand users. MapReduce is a powerful technique, and we use it by writing Python code with the mrjob framework. I recommend reading more about it in my last post here.

One of our recommender techniques is the simple 'people you might know' recommender algorithm. Indeed, there are several components behind the algorithm, since at Atépassar users can follow other people as well as be followed. In this post I will talk about the basic idea of the algorithm, which can be adapted for those other components. The idea is that if person A and person B do not know each other but have a lot of mutual friends, then the system should recommend that they connect with each other.


Person A and Person B have common friends, so the system should recommend them to each other

We will implement this algorithm using the MapReduce architecture in order to use Hadoop, which is open-source software for highly reliable, scalable distributed computing. I assume that you are already familiar with those concepts; if not, please take a look at those posts to see what map and reduce jobs are.


But before introducing that algorithm, let's present a simpler one for the case of bidirectional connections, that is: if I am your friend, you are also my friend. In order to recommend such friends, we first need to count the number of mutual friends that each pair of users has in the network. For this, we implement a map-reduce job that works much like the classic job that counts the frequency of words in a file. For every pair of friends in a user's list, we output the tuple {friend1, friend2}, 1:



In addition to this, we also output a tuple with -1 for every pair of users who are already direct friends: {user_id, friend}, -1.
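The code originally embedded in this post is no longer available here, so below is a minimal mrjob-style sketch of that mapper, assuming input lines in the same user;friend1,friend2,... format as the sample data shown later in this post (class and method names are mine; the steps are wired together further below):

from mrjob.job import MRJob


class FriendRecommender(MRJob):

    def mapper_pairs(self, _, line):
        parts = line.strip().split(';')
        user = parts[0]
        friends = parts[1].split(',') if len(parts) > 1 and parts[1] else []
        # Every two people on this user's friend list share the user
        # as a mutual friend, so each such pair gets a count of 1.
        for i in range(len(friends)):
            for j in range(i + 1, len(friends)):
                yield sorted([friends[i], friends[j]]), 1
        # Mark pairs that are already directly connected with -1.
        for friend in friends:
            yield sorted([user, friend]), -1


if __name__ == '__main__':
    FriendRecommender.run()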


Now the reducer gets an input key denoting a pair of friends, along with the list of values emitted for that pair.

In the reduce function, we check whether any record has a -1 value. If there is such a record, we ignore that pair of friends, because they already have a direct connection. Finally, we aggregate the values for each key and output the tuple for every pair of friends whose values do not include a -1.
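Continuing the sketch, a matching reducer (again hedged: this mirrors the description above, not the original source):

    # inside the FriendRecommender class sketched above
    def reducer_count(self, pair, values):
        values = list(values)
        # A -1 means the two users are already directly connected,
        # so the pair is dropped from the candidates.
        if -1 not in values:
            # Otherwise, summing the 1s gives the number of mutual
            # friends shared by the pair.
            yield pair, sum(values)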


After the first map-reduce job, we obtain a list of pairs of people along with the number of common friends they have. Our final map-reduce job looks at this list {[friend1, friend2], numberOfMutualFriends} and outputs, for each person, the people they have the most common friends with. Our map job outputs {friend1, [numberOfMutualFriends, friend2]} and {friend2, [numberOfMutualFriends, friend1]}.

The reducer looks at the person in the key, and our comparator sorts by numberOfMutualFriends. This ensures that the tuples for the same person go to the same reducer, in order sorted by the number of common friends. Our reducer then just needs to take the top 5 values and output the list (TOP_N).
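A sketch of that final pass and of how the two passes might be wired together with today's mrjob API (in plain mrjob it is simpler to sort inside the reducer than to plug in a custom Hadoop comparator):

    # inside the FriendRecommender class sketched above
    def mapper_top(self, pair, num_mutual):
        friend1, friend2 = pair
        # Each member of the pair is a candidate friend for the
        # other, scored by their number of mutual friends.
        yield friend1, (num_mutual, friend2)
        yield friend2, (num_mutual, friend1)

    def reducer_top(self, person, candidates):
        # Sort candidates by mutual-friend count, descending, and
        # keep only the TOP_N = 5 best.
        ranked = sorted(candidates, reverse=True)
        yield person, [[name, n] for n, name in ranked[:5]]

    def steps(self):
        from mrjob.step import MRStep
        return [MRStep(mapper=self.mapper_pairs, reducer=self.reducer_count),
                MRStep(mapper=self.mapper_top, reducer=self.reducer_top)]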

Now let's run it over our social network data.


marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
carla


Let's see some interesting stuff. The recommended friends for each user are:


"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]
"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]
"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]
"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]
"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]
"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]
"fabio" [["amanda", 1], ["marcel", 1]]
"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]
"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]
"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]



As I expected, the person with the most friends in common with me is carol, through maria and jose.

There are still plenty of things that could be done in this implementation to reach our final recommender; for example, we are not yet considering followers in common, or whether users live in the same state. The results and scalability can also be improved.


Facebook Data


And what about recommending friends on Facebook? I decided to mine some data based on the connections between my connections.

$ python friends_recommender.py -r emr --num-ec2-instances 5 facebook_data.csv > output.dat

Let's see some results:

Let's pick my friend Rafael Carício. The top suggestions for him would be:

"Rafael_Caricio_584129827" [["Alex_Sandro_Gomes_625299988", 28], ["Oportunidadetirecife_Otir_100002020544265", 26], ["Thiago_Diniz_784848380", 20], ["Guilherme_Barreto_735956697", 18], ["Andre_Ferraz_100001464967635", 17], ["Sofia_Galvao_Lima_1527232153", 16], ["Edmilson_Rodrigues_1003183323", 15], ["Pericles_Miranda_100001613052998", 14], ["Andre_Santos_100002368908054", 14], ["Edemilson_Dantas_100000732193812", 13]]
Rafael is an old partner of mine and studied with me at UFPE (university). All of the people recommended are entrepreneurs or former students from CIN/UFPE.

And what about my colleague Osvaldo Santana, one of the most famous Python developers in Brazil?

"Osvaldo_Santana_Neto_649598880" [["Hugo_Lopes_Tavares_100000635436030", 14], ["Francisco_Souza_100000560629656", 12], ["Daker_Fernandes_Pinheiro_100000315704652", 8], ["Flavio_Ribeiro_100000349831236", 6], ["Jinmi_Lee_1260075333", 5], ["Alex_Sandro_Gomes_625299988", 5], ["Romulo_Jales_100001734547813", 5], ["Felipe_Andrade_750803015", 5], ["Adones_Cunha_707904846", 5], ["Flavio_Junior_100000544023443", 4]]

Interesting! It recommended other Python developers as well as some entrepreneurs! :)

We could play with it more, but if you want to test it with your own data, you can download the data using this app.

Twitter Friends

Great, but how can I use this algorithm on Twitter, for example?! It's a little different: in this scenario we don't assume a bidirectional link. Just because I follow you on Twitter does not mean that you also follow me. We now have two entities: followers and friends.

The basic idea is to count the number of directed paths between users; see the sketch below.
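The full code originally embedded here is missing from this copy, so here is a hedged sketch under an assumed input format of user;followed1,followed2,... (listing the accounts each user follows); the two-step structure and all names are mine:

from mrjob.job import MRJob
from mrjob.step import MRStep


class TwitterFriendRecommender(MRJob):

    def mapper_edges(self, _, line):
        parts = line.strip().split(';')
        user = parts[0]
        followed = parts[1].split(',') if len(parts) > 1 and parts[1] else []
        for other in followed:
            # Key each directed edge user -> other by both endpoints,
            # so the reducer for a middle node b sees who follows b
            # ('in') and whom b follows ('out').
            yield other, ('in', user)
            yield user, ('out', other)

    def reducer_paths(self, middle, edges):
        followers, friends = [], []
        for direction, other in edges:
            (followers if direction == 'in' else friends).append(other)
        for c in friends:
            # The edge middle -> c already exists; mark it so c is
            # never recommended to middle.
            yield [middle, c], 'direct'
        for a in followers:
            for c in friends:
                if a != c:
                    # One directed path a -> middle -> c.
                    yield [a, c], 1

    def reducer_count(self, pair, values):
        values = list(values)
        if 'direct' not in values:
            # Number of directed paths from pair[0] to pair[1]; a
            # top-N step like the one above would rank these.
            yield pair[0], [sum(values), pair[1]]

    def steps(self):
        return [MRStep(mapper=self.mapper_edges, reducer=self.reducer_paths),
                MRStep(reducer=self.reducer_count)]


if __name__ == '__main__':
    TwitterFriendRecommender.run()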
Let's see how I can get some new friends to recommend, based on a list I created of users who post about #recsys.

Running it over the Twitter data, we get:
   
    "@marcelcaraciolo" [["@alansaid"  24],   ["@zennogantner"  21], ["@neal_lathia"  21] ]


Great, I didn't know them! :D I liked the recommendations; by the way, they are great references in recsys. But let's go further... what about explanations? Why were those users recommended to me? Explanations are an important feature for improving the acceptance of a recommendation.

Making some changes to my code, it gives us the output below:

    "@marcelcaraciolo" [["@alansaid"  24, ["@pankaj", "@recsyschallenge", "@kdnuggets",
    "@mitultiwari", "@sidooms", "@recsys2012", "@khmcnally", "@McWillemsem", "@LensKitRS",
   "@pcastellis",  "@dennisparra", "@filmaster",  "@BamshadMobasher", "@sadrewge",
   "@totopampin",  "@recsyshackday", "@plamere", "@usabart", "@mymedialite', "@reclabs",
   "@elehack","@omdb", "@osbourke", "@siah"]],    ["@zenogantner"  21, ["@sandrewge",
    "@nokiabetalabs", "@foursquareAPI",  "@mitultiwari", "@kiwitobes", "@directededge",
    "@plista", "@twitterapi", "@recsys2012",  "@xamat", "@tlalek", "@namp",
  "@lenskitRS",  "@siah", "@ocelma",  "@abellogin", "@mymedialite', "@totopampin",  "@RecsysWiki","@ScipyTip", "@ogrisel"]], ["@neal_lathia"  21,  ["@googleresearch",
    "@totopampin", "@ushahidi",  "@kaythaney", "@gj_uk", "@hmason",
    "@jure", "@ahousley", "@peteskomoroch",  "@xamat", "@tnhh", "@elizabethmdaly",
  "@recsys2012",  "@sandrewge", "@matboehmer",  "@abellogin", "@pankaj', "@jerepick",  "@alsothings","@edchi", "@zenogantner"]]

Recommendations with explanations and also distributed :D

Atépassar Extension


OK, I haven't talked about Atépassar yet. How do we recommend new friends at Atepassar? Atepassar is designed like Twitter, so we also have the entities friends and followers. Besides common friends, we add other attributes to the final score. For instance, let's now consider the state where each user lives, taken from their Atepassar profile. The idea is to recommend people with mutual friends, weighted by whether they live in the same state. For the purpose of illustration, let's start with a simple scoring approach, choosing our function to be a linear combination of state and mutual-friend similarity. This gives an equation of the form
                                         fsim(u,i) = w1 f1(u,i) + w2 f2(u,i) + b, 

where u is the user, i is the candidate new friend, f1 is the friendship similarity, and f2 is the state similarity. This equation defines a two-dimensional feature space.


This is a sample of my input:

murilodumps;SC;marcoscampelo
dan_sampaio;RJ
marvinslap;PE;jwalker;marcoscampelo;juliocesarfort;guilhermepaiva;hugofsantiago;bruno;naevio;Mila21Sousa;dansampaio;thaisleao;lucianaamancio;pedro;marciomarques83;x5;narah;pamelaresende;carolcani;itl;roberta_ferreira;sexxyalice;annelouiseadv;bribarbosa;espertinha3;fzanchin;claudiasm;cauecamacho;lucianac;bicalhoferreira;dougui;monnyke;mariaaugusta38;germanabarros;professor1;rosimartome;klauklau;lugentil;rodrigo_miranda;portoalegre;mczmendonca;itsfabio;CarolFollador;ricardofay;lorenameimei;josi_patricia;analaurafonseca;daiana_ugulino;narelle_moraes;kamyu;wallacevidal;falcaoblanco;julianabraggio;tiagoop;giselevix10;natanaelsilva;giovannafc;vivoquinha;alinne_silva_oliveira;amanda_cutrim;gabrielagrisa;bruna_estudando;valquiria_pereira_alves;deniseharue;daysianef;Dayanne_F;italo;Orlando;kauanny;marcelo_;bruna_jacob;adonescunha;fahbiuzinha;Barbabela;fale_rodrigo;michelle_rva;rlsleal12345;creuzamoura;tuliocfp;aeltonf;Cintia_Evelane;hellheize;dhelly;murilodumps;Indianara;thamy_ls;christiane_freire;JulianaGarbim;matheuslino;LMC;Gorrpo;guilhermevilela;gabi;dalekrause;vanessaformigosa;SCANDALL;elaine_regina;rafs_gomes;larylmacedo;erico;spencer;hitalos;daianefarias;rldourado;veronicacordeiro;carmemsmrocha;falcettijr;evertonera;nessa;vtcc;ricardoapjustino;leonardo;lopes21;marcelosantos;Verallucia;paolaseveroo
hugofsantiago;PE;daianefarias
kauanny;PE
Dayanne_F;PE;jwalker
guilhermepaiva;PE;marvinslap;jwalker;veronicacordeiro
italo;PE;marcoscampelo;jwalker;romero_britto;Syl;mariana_mbs
maria;PE
bruno;PE;marcoscampelo;jwalker;marvinslap;rldourado;anagloriaflor;dayannef;flmendes;adm_nathaly2010;Andersonpublicitario;diemesleno;misterobsom;jessica_soares_ribeiro;Orlando;cauecamacho;itl;kk_u;lucianaamancio;josi_patricia;mczmendonca;adonescunha;x5;lugentil;rafaelsantana;katarinebaf;aquista;analaurafonseca;auberio;naevio;mz;flaviooliveira;eduardocruz;robson_ribeiro;gabrielagrisa;fahbiuzinha;nattysilveira;petsabino;eduardovg88;bibc;ninecarvalho10;bjmm;marcossouza;masdesouza;espertinha3;valquiria_pereira_alves;narelle_moraes;rodrigo3n


And the output sample:

"marcelcaraciolo" [["Gileno", 0.44411764705882351], ["raphaelalex", 0.43999999999999995], ["marcossouza", 0.43611111111111112], ["roberta_gama", 0.42352941176470588], ["anagloriaflor", 0.40937499999999999], ["rodrigo3n", 0.40769230769230769], ["alissonpontes", 0.40459770114942528], ["andreza_cristina", 0.40370370370370368], ["naevio", 0.40370370370370368], ["adonescunha", 0.40327868852459015]]


Interestingly, among the top 10, 5 worked with me at Atepassar (which means lots of common friends). Let's see the code:
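The embedded code is also missing from this copy, so here is a minimal, non-distributed sketch of the scoring step only, assuming the user;STATE;friend1;friend2;... input shown above. The weights w1, w2, b and the normalization of the mutual-friend count are illustrative guesses, not the production values:

from collections import defaultdict

W1, W2, B = 0.7, 0.3, 0.0  # illustrative weights only


def parse(lines):
    # Lines look like 'user;STATE;friend1;friend2;...'.
    state, friends = {}, {}
    for line in lines:
        parts = line.strip().split(';')
        state[parts[0]] = parts[1] if len(parts) > 1 else None
        friends[parts[0]] = set(parts[2:])
    return state, friends


def recommend(user, state, friends, top_n=10):
    scores = defaultdict(float)
    for candidate in friends:
        if candidate == user or candidate in friends[user]:
            continue
        mutual = len(friends[user] & friends[candidate])
        if not mutual:
            continue
        # f1: mutual friends, normalized here by the size of the
        # user's own friend list (an assumption of this sketch).
        f1 = mutual / float(len(friends[user]) or 1)
        # f2: 1 if both users live in the same state, else 0.
        f2 = 1.0 if state.get(user) == state.get(candidate) else 0.0
        # fsim(u, i) = w1*f1(u, i) + w2*f2(u, i) + b
        scores[candidate] = W1 * f1 + W2 * f2 + B
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]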


Conclusions

That's a simple algorithm used at Atépassar for recommending friends, using some basic graph analysis concepts. You can extend this code or use it freely on your own social network, go ahead :) In the next post I will present the map-reduce jobs for course recommendation analysis at PyCursos.

I hope you enjoyed this article,

Best regards,
Marcel Caraciolo

Introduction to Recommendations with Map-Reduce and mrjob

Thursday, August 23, 2012


Hi all,

In this post I will present how we can use the map-reduce programming model to make recommendations. Recommender systems are quite popular among shopping sites and social networks these days. How do they do it? Generally, the user interaction data available for items and products on shopping sites and social networks is enough information to build a recommendation engine using classic techniques such as collaborative filtering.

Why Map-Reduce?

MapReduce is a framework originally developed at Google that enables easy, large-scale distributed computing across a number of domains, and Apache Hadoop is an open-source implementation of it. It scales well to many thousands of nodes and can handle petabytes of data. For recommendations, where we have to find products similar to a product you are interested in, we must calculate how similar pairs of items are. For instance, if someone watches the movie The Matrix, the recommender might suggest the film Blade Runner, so we need to compute the similarity between the two movies. One way is to find the correlation between pairs of items. But if you own a shopping site with 500,000 products, that is potentially over 250 billion pair computations. Besides the computation, the correlation data will be sparse, because it's unlikely that every pair of items has some user interested in both; so we have a large and sparse dataset. We also have to deal with the temporal aspect, since user interest in products changes with time, so the correlation calculation must be done periodically to keep the results up to date. For these reasons, the best way to handle this scenario is a divide-and-conquer pattern, and MapReduce is a powerful framework that can be used to implement data mining algorithms. You can take a look at this post about MapReduce or go to these video classes about Hadoop.

Map-Reduce Architecture



Meeting mrjob


mrjob is a Python package that helps you write and run Hadoop Streaming jobs. It supports Amazon's Elastic MapReduce (EMR) and also works with your own Hadoop cluster. It was released as an open-source framework by Yelp, and we will use it as our interface to Hadoop because of its readability and the ease with which it handles MapReduce tasks. Check this link to see how to download and use it.
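To give a feel for the package, here is the classic word-frequency job in mrjob, essentially the example from its documentation:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # The framework groups the 1s by word; summing them gives
        # the word's frequency.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

You can run it locally with "python word_count.py input.txt", or on Elastic MapReduce by adding "-r emr".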


Movie Similarities


Imagine that you own an online movie business and you want to offer your clients movie recommendations. Your system has a rating feature: people can rate movies from 1 to 5 stars, and we will assume for simplicity that all of the ratings are stored in a CSV file somewhere.
Our goal is to calculate how similar pairs of movies are, so that we can recommend movies similar to the movies you liked. Using correlation, we can:

  • For every pair of movies A and B, find all the people who rated both A and B.
  • Use these ratings to form a Movie A vector and a Movie B vector.
  • Calculate the correlation between those two vectors.
  • When someone watches a movie, recommend the movies most correlated with it.

So the first step is to get our ratings file, which has three columns: (user, movie, rating). For this task we will use the MovieLens dataset of movie ratings, with 100,000 ratings from 1,000 users on 1,700 movies (you can download it at this link).

Here is a sample of the dataset file after normalization.




So let's start by reading the ratings into the MovieSimilarities job.
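The embedded source is missing from this copy of the post, so here is a hedged skeleton of how the job can start, assuming one user,movie,rating triple per line (names are mine; later steps are sketched below):

from mrjob.job import MRJob


class MovieSimilarities(MRJob):

    def mapper_parse_input(self, _, line):
        # Each input line is 'user,movie,rating'.
        user, movie, rating = line.split(',')
        yield user, (movie, float(rating))


if __name__ == '__main__':
    MovieSimilarities.run()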


We want to compute how similar pairs of movies are, so that if someone watches The Matrix, we can recommend movies like Blade Runner. So how should we define the similarity between two movies?

One possibility is to compute their correlation. The basic idea: for every pair of movies A and B, find all the people who rated both A and B, and use these ratings to form a Movie A vector and a Movie B vector. Then calculate the correlation between these two vectors. When someone watches a movie, you can then recommend the movies most correlated with it.

So let's divide and conquer. Our first task: for each user, emit a row containing their 'postings' (item, rating); the reducer also emits the user's rating sum and count for use in later steps.
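A sketch of that step, continuing the MovieSimilarities skeleton above; the pairing step that feeds the similarity calculation is included as well:

    # inside the MovieSimilarities class sketched above
    def reducer_ratings_by_user(self, user, values):
        # Group this user's (movie, rating) postings and keep the
        # rating sum and count for later steps.
        ratings = list(values)
        rating_sum = sum(r for _, r in ratings)
        yield user, (len(ratings), rating_sum, ratings)

    def mapper_create_pairs(self, user, user_info):
        from itertools import combinations
        count, rating_sum, ratings = user_info
        # Every pair of movies this user co-rated contributes one
        # co-rating to that movie pair.
        for (m1, r1), (m2, r2) in combinations(ratings, 2):
            yield [m1, m2], [r1, r2]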




Before using these rating pairs to calculate correlation, let's see how we can compute it. The rating pairs can be formed into vectors of ratings, so we can use linear algebra to compute norms and dot products, as well as the length of each vector and the sum over all elements in each vector. By representing the ratings as vectors, we can perform several operations on those movies.
To summarize, each row in calculate_similarity computes the number of people who rated both movie1 and movie2, the sum over all elements of each rating vector (sum_x, sum_y), and the sum of squares of each vector (sum_xx, sum_yy). With these aggregates we can calculate the correlation between the movies, which can be expressed as:
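The formula image is missing from this copy; in terms of the aggregates just named (with n the number of co-raters and sum_xy the dot product of the two rating vectors), it is the standard computational form of the Pearson correlation:

from math import sqrt


def correlation(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_xx - sum_x ** 2) * sqrt(n * sum_yy - sum_y ** 2)
    # Pairs with zero variance on either side get correlation 0.
    return numerator / denominator if denominator else 0.0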



So that's it! The last step of the job sorts the top-correlated items for each item and prints them to the output.
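Sketched, again inside the same job (the steps would be wired together via steps(), as in the friends job):

    # inside the MovieSimilarities class sketched above
    def mapper_sort(self, pair, correlation):
        movie1, movie2 = pair
        yield movie1, (correlation, movie2)

    def reducer_sort(self, movie, correlated):
        # Emit each movie's correlated movies, strongest first.
        for corr, other in sorted(correlated, reverse=True):
            yield movie, (other, corr)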


So let's see the output. Here's a sample of the top output I got:


MovieA MovieB Correlation
Return of the Jedi (1983) Empire Strikes Back, The (1980) 0.787655
Star Trek: The Motion Picture (1979) Star Trek III: The Search for Spock (1984) 0.758751
Star Trek: Generations (1994) Star Trek V: The Final Frontier (1989) 0.72042
Star Wars (1977) Return of the Jedi (1983) 0.687749
Star Trek VI: The Undiscovered Country (1991) Star Trek III: The Search for Spock (1984) 0.635803
Star Trek V: The Final Frontier (1989) Star Trek III: The Search for Spock (1984) 0.632764
Star Trek: Generations (1994) Star Trek: First Contact (1996) 0.602729
Star Trek: The Motion Picture (1979) Star Trek: First Contact (1996) 0.593454
Star Trek: First Contact (1996) Star Trek VI: The Undiscovered Country (1991) 0.546233
Star Trek V: The Final Frontier (1989) Star Trek: Generations (1994) 0.4693
Star Trek: Generations (1994) Star Trek: The Wrath of Khan (1982) 0.424847
Star Trek IV: The Voyage Home (1986) Empire Strikes Back, The (1980) 0.38947
Star Trek III: The Search for Spock (1984) Empire Strikes Back, The (1980) 0.371294
Star Trek IV: The Voyage Home (1986) Star Trek VI: The Undiscovered Country (1991) 0.360103
Star Trek: The Wrath of Khan (1982) Empire Strikes Back, The (1980) 0.35366
Stargate (1994) Star Trek: Generations (1994) 0.347169
Star Trek VI: The Undiscovered Country (1991) Empire Strikes Back, The (1980) 0.340193
Star Trek V: The Final Frontier (1989) Stargate (1994) 0.315828
Star Trek: The Wrath of Khan (1982) Star Trek VI: The Undiscovered Country (1991) 0.222516
Star Wars (1977) Star Trek: Generations (1994) 0.219273
Star Trek V: The Final Frontier (1989) Star Trek: The Wrath of Khan (1982) 0.180544
Stargate (1994) Star Wars (1977) 0.153285
Star Trek V: The Final Frontier (1989) Empire Strikes Back, The (1980) 0.084117



As we would expect, we can notice that:
  • Star Trek movies are similar to other Star Trek movies;
  • People who like Star Trek movies are not such big fans of Star Wars, and vice versa;
  • Star Wars fans will always be fans! :D
  • Sci-fi movies are quite similar to each other;
  • Star Trek III: The Search for Spock (1984) is one of the best-connected Star Trek movies (it has several positive correlations).

To see the full code, check out the GitHub repository here.


Book Similarities


Let's look at another dataset. What about book ratings? Let's use this dataset of one million book ratings. Here's again a sample of it:




But now we want to compute other similarity measures besides correlation. Let's take a look at them.

Cosine Similarity

Another common vector-based similarity measure.
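In terms of the same aggregates as before, the cosine similarity is the dot product of the two rating vectors divided by the product of their norms:

from math import sqrt


def cosine(sum_xx, sum_yy, sum_xy):
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    return sum_xy / denominator if denominator else 0.0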


Regularized Correlation

We can regularize the correlation by adding N virtual movie pairs that have zero correlation. This helps avoid noise when some movie pairs have very few raters in common.
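A sketch of that shrinkage (the virtual-pair count of 10 is an illustrative choice, not a value from the original post):

def regularized_correlation(n, raw_correlation, virtual_count=10,
                            prior_correlation=0.0):
    # Blend the observed correlation with virtual_count virtual
    # pairs at zero correlation: pairs with few common raters get
    # shrunk heavily, while well-supported pairs barely move.
    return ((n * raw_correlation + virtual_count * prior_correlation)
            / (n + virtual_count))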

Jaccard 
Implicit data can be useful: in some cases, just because you rated the movie Toy Story, even if you rated it quite horribly, you may still be interested in similar animated movies. So we can ignore the value of each rating and use a set-based similarity measure such as the Jaccard similarity.
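Counting raters instead of rating values, the Jaccard similarity of two movies is the size of the intersection of their rater sets over the size of their union:

def jaccard(common_raters, raters_a, raters_b):
    union = raters_a + raters_b - common_raters
    return common_raters / float(union) if union else 0.0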


Now let's add all those similarities to our map-reduce job and make some adjustments, creating a new job that counts the number of raters for each movie; this is required for computing the Jaccard similarity.
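A hedged sketch of the adjusted similarity step, using the helper functions defined above (n_x and n_y, the per-movie rater totals needed for Jaccard, would come from the new counting job; that join is elided here):

    # inside the similarity job sketched above
    def reducer_similarities(self, pair, co_ratings):
        n, sum_x, sum_y = 0, 0.0, 0.0
        sum_xx, sum_yy, sum_xy = 0.0, 0.0, 0.0
        for x, y in co_ratings:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_yy += y * y
            sum_xy += x * y
        corr = correlation(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy)
        cos = cosine(sum_xx, sum_yy, sum_xy)
        reg = regularized_correlation(n, corr)
        # jac = jaccard(n, n_x, n_y), once the rater totals for each
        # movie are joined in from the counting job.
        yield pair, (corr, cos, reg, n)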

Ok,  let's take a look at the book similarities now with those new fields.



BookA BookB Correlation Cosine Reg Corr Jaccard Mutual Raters
The Return of the King (The Lord of The Rings, Part 3) The Voyage of the Dawn Treader (rack) (Narnia) 0 0.998274 0 0.068966 2
The Return of the King (The Lord of the Rings, Part 3) The Man in the Black Suit : 4 Dark Tales 0 1 0 0.058824 6
The Fellowship of the Ring (The Lord of the Rings, Part 1) The Hobbit : The Enchanting Prelude to The Lord of the Rings 0.796478 0.997001 0.49014 0.045714 16
The Two Towers (The Lord of the Rings, Part 2) Harry Potter and the Prisoner of Azkaban (Book 3) -0.184302 0.992536 -0.087301 0.022277 9
Disney's 101 Dalmatians (Golden Look-Look Books) Walt Disney's Lady and the Tramp (Little Golden Book) 0.88383 1 0.45999 0.166667 5
Disney's 101 Dalmatians (Golden Look-Look Books) Disney's Beauty and the Beast (Golden Look-Look Book) 0.76444 1 0.2339 0.166667 7
Disney's Pocahontas (Little Golden Book) Disney's the Lion King (Little Golden Book) 0.54595 1 0.6777 0.1 4
Disney's the Lion King (Disney Classic Series) Walt Disney Pictures presents The rescuers downunder (A Little golden book) 0.34949 1 0.83833 0.142857 3
Harry Potter and the Order of the Phoenix (Book 5) Harry Potter and the Goblet of Fire (Book 4) 0.673429 0.994688 0.559288 0.119804 49
Harry Potter and the Chamber of Secrets (Book 2) Harry Potter and the Goblet of Fire (Book 4) 0.555423 0.993299 0.496957 0.17418 85
The Return of the King (The Lord of The Rings, Part 3) Harry Potter and the Goblet of Fire (Book 4) -0.2343 0.02022 -0.08383 0.015444 4




  • Lord of the Rings books are similar to other Lord of the Rings books;
  • Walt Disney books are similar to other Walt Disney books;
  • Lord of the Rings books do not stick together with Harry Potter books.

The possibilities are endless.

But is it possible to generalize our input and make our code generate similarities for different inputs? Yes, it is. Let's abstract our input: we will create a VectorSimilarities class that represents the input data in a common format:
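The embedded definition is missing from this copy, so here is a hedged sketch of the idea. The input method name comes from the text; the generic (item, user, value) posting format and everything else are my assumptions:

from mrjob.job import MRJob


class VectorSimilarities(MRJob):
    # Generic similarity job: subclasses only define how one raw
    # input line becomes (item, user, value) postings.

    def input(self, key, line):
        raise NotImplementedError

    def mapper_parse(self, key, line):
        for item, user, value in self.input(key, line):
            yield user, (item, float(value))

    # ... the grouping, pairing and similarity steps sketched in
    # the movie example above would follow here ...


class MovieSimilarities(VectorSimilarities):

    def input(self, key, line):
        # MovieLens-style line: user,movie,rating
        user, movie, rating = line.split(',')
        yield movie, user, rating


class BookSimilarities(VectorSimilarities):

    def input(self, key, line):
        # Book-ratings line, assumed here to be 'user;book;rating'.
        user, book, rating = line.split(';')
        yield book, user, rating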


So if we want to define a new input format, we just subclass the VectorSimilarities class and implement the input method, as the two subclasses in the sketch above show.

BookSimilarities, above, is the class for the book recommendations using our new VectorSimilarities.

And MovieSimilarities is the class for the movie recommendations: it simply reads from a data file and lets the VectorSimilarities superclass do the work.


Conclusions

As you have noticed, map-reduce is a powerful technique for numerical computation, especially when you have to process large datasets. There are several optimizations that could be made in these scripts, such as numpy vectorization for computing the similarities. I will explore these features in the next posts: one dealing with recommender systems and popular social networks, and another on how you can use the Amazon EMR infrastructure to run your jobs!

I'd like to thank Edwin Chen, whose post presenting these examples in Scala inspired me to explore them in Python.

All code for the examples above can be downloaded from my GitHub repository.

Stay tuned,

I hope you enjoyed this article,

Best regards,

Marcel Caraciolo

Recommendations and how to measure their ROI with metrics

Sunday, July 8, 2012

Hi all,

We have talked a lot about recommender systems, especially the techniques and algorithms used to build and evaluate them algorithmically. But let's now discuss how a social network or an online store can measure, in quantitative terms, the return on investment (ROI) of a given recommendation.

The metrics used in recommender systems


We talk a lot about F1-measure, accuracy, precision, recall, and AUC, buzzwords widely known by machine learning researchers and data mining specialists. But do you know what CTR, LOC, CER, or TPR are? Let's look at those metrics and how they can evaluate the quantitative benefits of a given recommendation.

First, it is important to understand what a metric is. A metric is a measure that quantifies a trend, dynamic, or characteristic. Metrics are commonly used to explain phenomena, identify causes, share discoveries, or project the results of future events. Defining and monitoring metrics is important for evaluating the return on investment (ROI) of specific actions and demands, and for testing hypotheses.

For recommender systems, we can use metrics to evaluate performance on conversion, interaction, or impact. In Figure 1 we can see those groups and how the metrics are distributed among them:

Figure 1: Metric groups for evaluating recommender systems


The impact measures include the places where recommendations are presented, for example the e-commerce home page, the product list page, or the shopping cart page, and the number of recommendation lists, which is the total number of recommendation lists shown inside the store in a given period of time. They signal the coverage, or amplitude, of the recommendation service on the website.

The most important measure group is interaction. The CTR (click-through rate) is one of the most used metrics in this group nowadays for evaluating the engagement of users with the recommendations; it quantifies the level of interest in the recommended products. It is calculated by dividing the number of clicks on recommended items by the total number of recommendations presented.

The third group, and the most relevant, contains the metrics that measure the conversion of the recommendation service. Among those, the most popular are: 1) the rate of orders with recommendations, that is, the number of orders containing recommendations divided by the total number of orders; 2) the rate of recommended items per order created by recommendation, that is, the proportion of recommended items to the total number of items in an order; 3) the increase in the average ticket, which corresponds to the average ticket of orders containing recommended items minus the store's overall average ticket, divided by the store's average ticket; and finally, the revenue increase rate, which corresponds to the revenue generated by recommendations divided by the difference between total revenue and the revenue from recommendations.

Note that these metrics are percentage values measured over a specified time period, so the divisions above must be multiplied by 100% to express the proposed rates correctly as percentages. Let's review the presented metrics and their abbreviations:

- REC: number of recommendations presented in a list.
- LOC: places where the recommendation lists are placed.
- CER: total number of clicks on recommendations.
- CTR (%): rate of clicks on recommendations.
- TPR (%): proportion of orders with recommendations.
- TIR (%): proportion of recommended items per order with recommendations.
- IAT (%): increase in the average ticket.
- IR (%): increase in revenue.


Understanding the metrics

To better understand the metrics illustrated above, let's use a real-world scenario and show how to calculate each of them. Consider the artificial data presented in Table 1, where ORDERS is the total number of orders in the store, ORDERS_REC is the total number of orders with recommendations, NIP is the average number of items per order, NIRP is the average number of recommended items per order with recommendations, AT is the average ticket, and ATR is the average ticket of orders with recommendations.

Table 1 shows that the store presented 150,000 recommended items on 01/06/2012 and closed 1,400 orders.

Table 1:  Historical Data of Sales at E-Commerce WebSite


Using this data, we can calculate the following metrics:

CTR = (CER / REC) * 100% = (18,000 / 150,000) * 100% = 12%
TPR = (ORDERS_REC / ORDERS) * 100% = (250 / 1,400) * 100% = 17.9%
TIR = (NIRP / NIP) * 100% = (1.7 / 4.5) * 100% = 37.8%
IAT = ((ATR - AT) / AT) * 100% = ((315 - 268) / 268) * 100% = 17.5%

The last metric refers to the percentage increase in revenue. Considering the data available in Table 1, the store took in a total of R$ 375,200.00 on 01/06/2012. With total sales from recommendations of R$ 67,000.00, the revenue increase is:

IR = (67,000 / (375,200 - 67,000)) * 100% = 21.7%
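A quick script that reproduces these numbers from the Table 1 values quoted in the text:

def pct(numerator, denominator):
    return 100.0 * numerator / denominator

# Values from Table 1, as quoted in the text.
REC, CER = 150000, 18000
ORDERS, ORDERS_REC = 1400, 250
NIP, NIRP = 4.5, 1.7
AT, ATR = 268.0, 315.0
REVENUE, REVENUE_REC = 375200.0, 67000.0

print('CTR = %.1f%%' % pct(CER, REC))                            # 12.0%
print('TPR = %.1f%%' % pct(ORDERS_REC, ORDERS))                  # 17.9%
print('TIR = %.1f%%' % pct(NIRP, NIP))                           # 37.8%
print('IAT = %.1f%%' % pct(ATR - AT, AT))                        # 17.5%
print('IR  = %.1f%%' % pct(REVENUE_REC, REVENUE - REVENUE_REC))  # 21.7%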

The results show that 12% of the time recommendations are presented, one is clicked; of all orders purchased at the store, 17.9% have at least one recommended item; of the items in the shopping carts, 37.8% were recommended; and the recommendations increase the store's average ticket by 17.5%. Regarding revenue, the recommendations produced an increase of 21.7%.

So far we have presented the metrics and some numbers showing how to calculate them. But let's go further: suppose there are three recommendation approaches that we want to compare at our web store, using an A/B test over a specified period (you don't know what an A/B test is? Read about it here).

So let's consider three approaches for example:


- Technique 1:  Content Based Filtering



- Technique 2: Most Popular Ranking only


- Technique 3:  Collaborative Filtering



Table 2 presents the performance of the recommendation system under the three approaches, with the average values of the following recommendation metrics for one period: CTR, TPR, TIR, IAT, and IR.


Table 2: Performance at our e-commerce store with each recommendation approach
                         

In Table 2 we can see that the average interaction with the recommendation approaches implemented at the store is between 6.9% and 14.8%. This means that at least 6.9% of the recommended items presented were clicked. Regarding the conversion metrics, we can observe that the recommendations are driving new sales which would not exist without them: the conversion rates are between 4.5% and 13.5%. The best approach tested was the third one, with a 14.4% increase in revenue. It is important to note that this rate is an average, so during the analyzed period there were peaks in the revenue increase, sometimes 30%, and dips, sometimes 4%.

At the same time, the numbers indicate that the recommendations drive significant up-sell in the orders containing them, since nearly half of the items in those shopping carts came from recommendations (the minimum average TIR was 48.6%).

One result that caught our attention was the difference in sales performance between the store with technique 1 and the one with technique 3 when compared to the store with technique 2. The stores with collaborative filtering and content-based filtering achieve a better impact with recommendations than technique 2, since in those stores personalized recommendations appear in several places on the website, while in approach 2 only a small number of pages have recommendations. So if the recommendations are assertive and there are several places where people can see them, the expected result is even better, as the numbers in the table above show.


What we can do with those metrics

Evaluating the conversion results of the recommender system on your website is critical. We generally don't focus on these metrics, giving more importance to accuracy and coverage, but what really matters is the improvement in sales or in user acceptance (clicks, etc.). The personalization of a social network or an e-commerce site using recommender systems must be evaluated periodically. Some tips if you plan to do this:

- Define a metrics plan: which metrics are the most important to measure on your website? CTR? TPR? TIR?

- Establish goals or reference target values: for instance, we want a revenue increase of 10% (IR = 10%) and an increase in the average ticket from recommendations of 15% (IAT = 15%).

- Monitor the metrics using the right tools: it is important to have a web analytics dashboard to analyze the results, and to obtain the metrics described above along with other indicators relevant to your business.

Recommender systems are more than just algorithms; it is important to understand how to apply them and to measure them closely, to see whether they are effective or need to be redesigned or improved. With all those steps and metrics you will be able to find the best configuration for your website and the most effective recommendation strategy to present to your clients.


I hope you enjoyed this article,

Best regards,

Marcel Caraciolo


PS: This article is based on my article in the June/2012 edition of the E-commerceBrazil Magazine, written in Portuguese. If you are Brazilian, I highly recommend reading it as well.








Data mining through education

Sunday, May 27, 2012

Hi all,

It has been a while since my last post. I had to be offline for some months to work hard on a bigger project in the educational field. I love teaching, and especially sharing content on machine learning and data mining topics. But I decided to go further, and with my partner Gileno Filho I co-founded an educational platform to teach Python around Brazil: http://www.pycursos.com




It is an e-learning platform with one main goal: spreading the Python programming language around Brazil by teaching the language and its applications, with courses on regular expressions, scientific computing with Python, and now web development with Django. A lot of work and a lot of learning as well. The result in four months was amazing: more than 100 students have already studied with us, with a projection of more than 500 by July! I know it isn't an incredible number, but we accomplished these results without any marketing effort, only word of mouth.

But more news will come, especially in machine learning! It's no news that I've been working for about three years in the educational field, where I serve as chief scientist at the Brazilian social network atepassar.com.

With all this background, I will start to post more about data mining applied to education, such as how to rank courses or how machine learning can assess student mastery.

So join me on this journey, and see how all this data can be used to tailor the student experience.

Cheers,

Marcel Caraciolo


Some data and machine learning talks videos from PyCon Us 2012

Monday, March 12, 2012



Unfortunately, this year I couldn't participate in PyCon, the world meeting of Python developers. I had a poster accepted, but some problems prevented me from going. Even so, the best part is that I could still watch talks, tutorials, and keynotes from the conference.

Thanks to the PyVideo team, all the videos from PyCon were uploaded quickly! I will share the best ones that I enjoyed (related, of course, to data mining, natural language processing, and machine learning):







  • IPython: Python at your fingertips




Those are some of the talks and tutorials that I enjoyed a lot. I recommend taking a look at all the videos available from PyCon at this link.

I really hope I can go to this meeting next year :)

Cheers,

Marcel

Guide to Recommender Systems Book Online

Friday, February 24, 2012



Hi all,

This year one of my goals is to write a book: a guide that teaches recommender systems to programmers. I know there are several textbooks that focus on providing a theoretical foundation for recommender systems and, as a result, may seem difficult to understand. For programmers who want to learn how to start using, or to understand the components of, a recommender system, this book is what they are looking for.

This guide follows a learn-by-doing approach: I will use theory and apply it through exercises and experiments with Python code. I hope that when you complete the book you will understand how to build a recommender system, and that it gives you the first steps toward applying one to your own systems. The textbook is laid out as a series of small steps that will guide you to understanding recommender system techniques.

This book is available for download for free under a Creative Commons license. The project is also led by my colleague Ricardo Caspirro, who will review it and translate it into Portuguese.

Below I provide the table of contents of the book.


Guide to Recommender Systems


The link for the online guide is available here.


http://muricoca.github.com/recommendation-lectures/index.html


Table of Contents


Chapter 01: Introduction to Recommender Systems

Find out what a recommender system is and what problems it solves, plus a quick review of what you will be able to do when you finish this book.

Chapter 02: Collaborative Filtering

This chapter focuses on how you can use state-of-the-art collaborative filtering techniques, which make automatic predictions (filtering) about a user's interests by collecting preference or taste information from similar users (user-based) or similar items (item-based).


Chapter 03: Content Based Filtering
Recommender systems that suggest an item to a user based upon a description of the item and a profile of the user's interests. Although the details of various systems differ, content-based recommendation systems share a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend.


Chapter 04: Hybrid Based Filtering
This chapter focuses on how to pick the best features of collaborative and content-based filtering and mix them to build hybrid recommender systems. It presents the current work in this field, an example of how it works, and how you can decide which strategy to select.

Chapter 05:  Model - Based Recommenders
Techniques including memory-based approaches and data mining techniques such as association analysis, symbolic data analysis, and classification/clustering will be covered in this chapter.

Chapter 06:  Evaluation of Recommender Systems
This chapter starts with a short description of how to evaluate recommender systems and the metrics commonly used to compare recommendation algorithms in the development and deployment stages.

Chapter 07:  Recommender Systems and Distributed-Computing
Recommender systems suffer from sparse matrices: the user x item preferences are sparse (lots of missing preference values), resulting in large datasets with millions of items, users, and preferences. For this task, distributed computing techniques such as map-reduce are used to distribute the recommendation computation. This chapter will cover those topics.

Chapter 08:  Study Case
This chapter presents a case study of a mobile recommender system for recommending users to other users, using several of the techniques shown above, and describes how we tested and deployed it.

Chapter 09:  Recommender Systems the Next Generation
This chapter brings the next generation of recommender systems, describing where research is heading in several areas such as ubiquity, semantics, etc.

Chapter 10:  Meeting Python-RecSys Framework
This chapter presents the Python-RecSys framework for building recommender systems with Python in an easy way. It describes how to use or test the techniques already implemented, develop new ones, and deploy them with web and REST frameworks.


This book is under development; please let me know if you have any suggestions or corrections for any of these chapters. If you see a topic that needs an extra chapter, or one that I am missing, please also let me know in the comments.


I hope you enjoy this work, especially the developers!


Regards,

Marcel Caraciolo