One of the questions I wanted to ask is whether I can cluster users into groups. For clustering I wanted to use k-means.
First I had to prepare a simple export.
The mapper takes the forum and user files and selects the relevant data from them:
import sys
import csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for line in reader:
        # Skip blank lines and the header row of either input file.
        if not line or line[0] == "id" or line[0] == "user_ptr_id":
            continue
        if len(line) == 5:
            # User record: user id, tag 'A', then the four remaining user columns.
            writer.writerow((line[0], 'A', line[1], line[2], line[3], line[4]))
        else:
            # Forum record: the user id is taken from the fourth column, tag 'B'.
            writer.writerow((line[3], 'B'))

def main():
    mapper()

if __name__ == "__main__":
    main()
The reducer outputs each user ID along with that user's karma (reputation), badge counts, and post count:
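#!/usr/bin/python
import sys
import csv

def reducer():
    # Assumes the input is sorted so that each user's 'A' record
    # arrives before that user's 'B' records.
    oldKey = None
    rep = gold = silver = bronze = '0'
    count = 0
    reader = csv.reader(sys.stdin, delimiter='\t')
    for line in reader:
        if line[1] == 'A':
            # A new user starts: flush the previous one first.
            if oldKey:
                print('\t'.join([oldKey, rep, gold, silver, bronze, str(count)]))
            oldKey, rep, gold, silver, bronze = line[0], line[2], line[3], line[4], line[5]
            count = 0
        elif line[0] == oldKey:
            # 'B' record: one more post for the current user.
            # Posts whose author has no user record are ignored.
            count += 1
    # Flush the last user.
    if oldKey:
        print('\t'.join([oldKey, rep, gold, silver, bronze, str(count)]))

def main():
    reducer()

if __name__ == "__main__":
    main()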
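To sanity-check the two scripts before running them on a cluster, the map, sort, and reduce steps can be imitated in memory. This is only a sketch: the sample rows are made up, the helper names `map_rows` and `reduce_rows` are hypothetical, and a plain `sorted()` call stands in for the shuffle step that Hadoop would perform.

```python
# Hypothetical sample rows; the real files have more rows and columns.
users = [
    ["100", "50", "1", "2", "3"],   # user id, reputation, gold, silver, bronze
    ["200", "5", "0", "0", "1"],
]
posts = [
    ["1", "title a", "tags", "100"],  # fourth column: user id of the author (assumed)
    ["2", "title b", "tags", "200"],
    ["3", "title c", "tags", "100"],
]

def map_rows(rows):
    """Tag user rows with 'A' and post rows with 'B', keyed by user id."""
    for row in rows:
        if len(row) == 5:
            yield (row[0], "A", row[1], row[2], row[3], row[4])
        else:
            yield (row[3], "B")

def reduce_rows(sorted_rows):
    """Yield (user id, reputation, gold, silver, bronze, post count) per user."""
    current, count = None, 0
    for row in sorted_rows:
        if row[1] == "A":
            if current:
                yield current + (count,)
            current, count = (row[0],) + row[2:], 0
        elif current and row[0] == current[0]:
            count += 1
    if current:
        yield current + (count,)

# Sorting puts each user's 'A' row ahead of that user's 'B' rows.
shuffled = sorted(map_rows(users + posts))
result = list(reduce_rows(shuffled))
for record in result:
    print("\t".join(str(field) for field in record))
```

The sort matters: the reducer logic only counts correctly when the 'A' row of a user is seen before that user's 'B' rows, which is what sorting on the key followed by the tag gives here.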
I used Java Modelling Tools (http://jmt.sourceforge.net/) to run and visualize the k-means clustering. It looks like the users can be split into 3 clusters:
Cluster 1, 17432 users (99%), red:
Cluster 2, 157 users, blue:
Variable   | Center     | Std. Dev.  | Kurtosis   | Skewness
-----------|------------|------------|------------|-----------
Reputation | 555.198E1  | 267.607E1  | 284.950E-2 | 166.334E-2
Gold       | 712.739E-2 | 919.289E-2 | 112.543E-1 | 272.010E-2
Silver     | 211.210E-1 | 204.831E-1 | 512.309E-2 | 191.835E-2
Bronze     | 511.529E-1 | 324.686E-1 | 260.647E-2 | 133.968E-2
Count      | 302.185E0  | 238.706E0  | 122.783E-1 | 252.269E-2
Cluster 3, 18 users, pink:
Variable   | Center     | Std. Dev.  | Kurtosis    | Skewness
-----------|------------|------------|-------------|-----------
Reputation | 267.654E2  | 105.582E2  | 123.350E-2  | 143.433E-2
Gold       | 242.222E-1 | 307.142E-1 | 145.243E-2  | 154.269E-2
Silver     | 846.111E-1 | 768.483E-1 | -537.560E-3 | 863.259E-3
Bronze     | 134.889E0  | 103.588E0  | -104.684E-2 | 678.600E-3
Count      | 760.833E0  | 622.366E0  | -159.373E-2 | 331.745E-3
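The clustering itself was done in JMT, so the following is not the code behind the numbers above, only a minimal sketch of what k-means (Lloyd's algorithm) does with rows like the reducer output. The sample points, the choice of two features (reputation and post count), the fixed starting centers, and the function name `kmeans` are all assumptions made for illustration; JMT may initialize, scale, or stop differently.

```python
import math

def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm: assign points to the nearest center, then move centers."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Made-up (reputation, post count) pairs, not real forum data.
points = [(1, 0), (3, 1), (2, 0), (5000, 300), (6000, 280), (27000, 760), (26000, 800)]
start = [(0, 0), (5000, 300), (25000, 700)]  # hypothetical starting centers
labels, centers = kmeans(points, start)
print(labels)
print(centers)
```

Because reputation spans a much larger range than post count, it dominates the Euclidean distance in this sketch; standardizing each column before clustering is a common alternative.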
Plotting these 3 clusters against the two main variables gives this image:
y-axis: number of posts
x-axis: reputation
I can see that most users are not active (99% of them sit in cluster 1), while a very small group helps a lot: the 18 users in cluster 3 average about 761 posts each.
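As a quick check of how lopsided the split is, the share of each cluster can be worked out from the cluster sizes reported above. This is just arithmetic on those three numbers; the dictionary keys are labels chosen here for illustration.

```python
# Cluster sizes as reported in the clustering results.
sizes = {"cluster 1": 17432, "cluster 2": 157, "cluster 3": 18}

total = sum(sizes.values())
shares = {name: 100.0 * n / total for name, n in sizes.items()}

for name, share in shares.items():
    print("%s: %.1f%% of %d users" % (name, share, total))
```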
