Tuesday, 4 February 2014

Clustering Udacity forum users

One of the questions I wanted to answer is whether I can cluster the users into groups. For the clustering I wanted to use k-means.
First I had to prepare a simple export.
The mapper takes the forum and user files and selects the relevant data from them:

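import sys
import csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        # Skip empty lines and the header line of either input file.
        if not line or line[0] == "id" or line[0] == "user_ptr_id":
            continue
        if len(line) == 5:
            # User file row: user id followed by the four values that the
            # reducer reads as reputation, gold, silver and bronze.
            # 'A' marks the user record so that it sorts before the posts.
            writer.writerow((line[0], 'A', line[1], line[2], line[3], line[4]))
        else:
            # Forum file row: the fourth column (index 3) is used as the
            # user id of the post's author; 'B' marks one post.
            writer.writerow((line[3], 'B'))

def main():
    mapper()

if __name__ == "__main__":
    main()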

The reducer outputs each user id along with the user's reputation (karma), badges and post count:

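#!/usr/bin/python
import sys
import csv

def reducer():
    # Input must be sorted by user id, with each user's 'A' record
    # (reputation and badges) arriving before that user's 'B' records (posts).
    oldKey = None
    rep = gold = silver = bronze = '0'
    count = 0
    reader = csv.reader(sys.stdin, delimiter='\t')
    for line in reader:
        if len(line) < 2:
            continue
        if line[1] == 'A':
            # A new user starts: emit the previous one first.
            if oldKey:
                print('\t'.join([oldKey, rep, gold, silver, bronze, str(count)]))
            oldKey, rep, gold, silver, bronze = line[0], line[2], line[3], line[4], line[5]
            count = 0
        elif line[0] == oldKey:
            # 'B' record: one forum post written by the current user.
            count += 1
    # Emit the last user.
    if oldKey:
        print('\t'.join([oldKey, rep, gold, silver, bronze, str(count)]))

def main():
    reducer()

if __name__ == "__main__":
    main()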
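
To check how the two scripts fit together without a Hadoop cluster, the map, sort and reduce steps can be simulated in memory. The following is only a sketch in Python 3: it re-implements the tagging and counting as plain functions (with a guard so that posts whose author has no user row are not counted), `sorted()` stands in for Hadoop's shuffle/sort, and the sample rows and the forum column names are made up for illustration.

```python
import csv
import io

def map_rows(rows):
    """Tag user rows with 'A' and forum rows with 'B', keyed by user id."""
    for row in rows:
        if not row or row[0] in ("id", "user_ptr_id"):  # empty or header line
            continue
        if len(row) == 5:
            yield (row[0], "A", row[1], row[2], row[3], row[4])
        else:
            yield (row[3], "B")

def reduce_rows(sorted_rows):
    """Emit one row per user: id, reputation, gold, silver, bronze, post count."""
    current, count = None, 0
    for row in sorted_rows:
        if row[1] == "A":
            if current:
                yield current + [str(count)]
            current, count = [row[0]] + list(row[2:6]), 0
        elif current and row[0] == current[0]:
            # Count a post only if it belongs to the current user.
            count += 1
    if current:
        yield current + [str(count)]

def read_tsv(text):
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

# Hypothetical sample data; only the column positions matter here.
USERS = ("user_ptr_id\treputation\tgold\tsilver\tbronze\n"
         "100\t50\t0\t1\t2\n"
         "200\t7\t0\t0\t0\n")
FORUM = ("id\ttitle\ttagnames\tauthor_id\tbody\tnode_type\n"
         "1\tt1\tx\t100\thello\tquestion\n"
         "2\tt2\tx\t100\thi\tanswer\n"
         "3\tt3\tx\t200\they\tquestion\n")

mapped = list(map_rows(read_tsv(USERS) + read_tsv(FORUM)))
shuffled = sorted(mapped)  # 'A' < 'B', so each user row precedes that user's posts
result = list(reduce_rows(shuffled))

for row in result:
    print("\t".join(row))
```

With this sample, user 100 ends up with a post count of 2 and user 200 with a post count of 1. Note that the mapper tells the two files apart only by the number of columns, so a forum row with exactly five columns would be mistaken for a user row.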

I used Java Modelling Tools (http://jmt.sourceforge.net/) to visualize the k-means clustering, and it looks like we can split our users into 3 clusters:
Cluster 1, 17432 users (99%), red:
Info         Center       Std. Dev.    Kurt.         Skew.
Reputation   111.457E0    273.317E0    330.457E-1    518.127E-2
Gold         270.996E-3   930.424E-3   868.928E-1    717.735E-2
Silver       878.499E-3   244.692E-2   874.737E-1    683.432E-2
Bronze       421.489E-2   613.437E-2   843.112E-1    584.215E-2
Count        823.078E-2   199.274E-1   820.611E-1    723.129E-2

Cluster 2, 157 users, blue:
Info         Center       Std. Dev.    Kurt.         Skew.
Reputation   555.198E1    267.607E1    284.950E-2    166.334E-2
Gold         712.739E-2   919.289E-2   112.543E-1    272.010E-2
Silver       211.210E-1   204.831E-1   512.309E-2    191.835E-2
Bronze       511.529E-1   324.686E-1   260.647E-2    133.968E-2
Count        302.185E0    238.706E0    122.783E-1    252.269E-2

Cluster 3, 18 users, pink:
Info         Center       Std. Dev.    Kurt.         Skew.
Reputation   267.654E2    105.582E2    123.350E-2    143.433E-2
Gold         242.222E-1   307.142E-1   145.243E-2    154.269E-2
Silver       846.111E-1   768.483E-1   -537.560E-3   863.259E-3
Bronze       134.889E0    103.588E0    -104.684E-2   678.600E-3
Count        760.833E0    622.366E0    -159.373E-2   331.745E-3
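
For each variable the statistics above give the cluster center (mean), standard deviation, kurtosis and skewness. As a rough illustration of what these four numbers measure, here is a small Python sketch based on population moments. Which exact estimators JMT uses is an assumption on my side; the negative kurtosis values in cluster 3 suggest that it reports excess kurtosis (kurtosis minus 3), so the sketch does the same. The function name `describe` and the sample values are made up.

```python
import math

def describe(values):
    """Return (mean, std_dev, excess_kurtosis, skewness) from population moments.

    Assumes at least two distinct values, otherwise the variance is zero
    and skewness and kurtosis are undefined.
    """
    n = float(len(values))
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std_dev = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, std_dev, excess_kurtosis, skewness

# A symmetric sample: skewness is 0 and the excess kurtosis is negative.
symmetric = describe([1, 2, 3, 4, 5])

# A made-up sample of post counts with one very active user:
# the long right tail gives a positive skewness.
long_tail = describe([0, 0, 1, 1, 2, 3, 40])

print(symmetric)
print(long_tail)
```

Large positive skewness and kurtosis, as in cluster 1, mean that most values sit close to the low end while a few users lie far out in the right tail.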

Plotting these 3 clusters against the two main variables gives this image:
y-axis: number of posts
x-axis: reputation


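I can see that most users are barely active: the 17432 users in cluster 1 have on average about 8 posts and a reputation of about 111. There is also a very small group that helps a lot: the 18 users in cluster 3 have on average about 761 posts and a reputation of about 26765.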
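
The clustering in this post was done in JMT, not in code. For readers who want to see what k-means does, here is a minimal pure-Python sketch of the standard algorithm (Lloyd's iteration): assign every point to its nearest center, move each center to the mean of its points, and repeat until the assignment stops changing. The function names, the starting centers and the (reputation, posts) sample points are all made up for illustration; they are not the Udacity data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iterations=100):
    """Run Lloyd's algorithm and return (centers, labels)."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed its cluster
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / float(len(members)) for col in zip(*members)]
    return centers, labels

# Hypothetical (reputation, posts) pairs in three well separated groups.
points = [(10, 1), (20, 0), (30, 2),
          (5000, 300), (6000, 310),
          (26000, 750), (27000, 770)]

# Start with one point from each group as the initial center.
centers, labels = kmeans(points, [points[0], points[3], points[5]])

print(labels)
print(centers)
```

The result of k-means depends on the starting centers, so real implementations usually pick them at random and run several restarts. The features are also on very different scales here: without normalisation, reputation dominates the distance. Whether JMT rescales the variables before clustering is something I have not checked.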
