Thursday, February 13, 2014

Content-Security-Policy issues with iOS Chrome

A few days ago I had an opportunity to trace an issue with iOS Chrome not loading a page. The page and all its resources were downloaded properly, but Chrome kept indicating that it was still loading the page; as a result, no 'onload' events fired. The problem occurred only when reloading the site. Copying the whole content to a local static web server didn't replicate the issue, so the content itself wasn't the problem. I cut the whole page down to a simple 'hello world' page and the problem still existed on the original web server, so it looked like I had to dig deeper, into the HTTP headers. I created a sample web server that demonstrates the problem I found:

import BaseHTTPServer

class HTTPHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    # CSP violation reports are POSTed to the report-uri; dump them to stdout
    def do_POST(s):
        length = int(s.headers['Content-Length'])
        print length
        data = s.rfile.read(length).decode('utf-8')
        print data
        s.send_response(204)
        s.end_headers()

    def do_GET(s):
        s.send_response(200)
        s.send_header("Content-Security-Policy", "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'")
        s.send_header("Content-Type", "text/html;charset=UTF-8")
        s.end_headers()
        s.wfile.write("<html><head><title>hello</title></head><body><p>hello world %s</p></body></html>" % s.path)

if __name__ == '__main__':
    server_class = BaseHTTPServer.HTTPServer
    httpd = server_class(('192.168.43.17', 80), HTTPHandler)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass

    httpd.server_close()

There is only one out-of-the-ordinary element here: the CSP header, which protects the site from cross-site scripting and provides a mechanism for reporting security violations. It looks like Chrome itself is violating the directives and reporting the problems:
1) frame-src with uri: chromeinvoke://cd931b8a0ca6aaed193d25b429ee4019
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "frame-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "chromeinvoke://cd931b8a0ca6aaed193d25b429ee4019",
   "source-file": "http://192.168.43.17/",
   "line-number": 1
}
2) connect-src with uri: https://localhost
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "connect-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "https://localhost",
   "source-file": "http://192.168.43.17/",
   "line-number": 1
}
3) frame-src with uri: chromenull://
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "frame-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "chromenull://",
   "source-file": "http://192.168.43.17/",
   "line-number": 21
}
4) frame-src with uri: chromeinvokeimmediate://3726692da42473af155b530fe0e48c61
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "frame-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "chromeinvokeimmediate://3726692da42473af155b530fe0e48c61",
   "source-file": "http://192.168.43.17/",
   "line-number": 2
}
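Such a report arrives as a JSON body POSTed to the report-uri; here is a minimal sketch of decoding one (the field values are copied from the reports above, but trimmed):

```python
import json

# A trimmed csp-report body, values taken from the reports above
body = '{"csp-report": {"violated-directive": "frame-src \'self\'", "blocked-uri": "chromenull://"}}'
report = json.loads(body)["csp-report"]
print(report["blocked-uri"])  # → chromenull://
```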

Further investigation showed that:
The issue with reporting internal/plugin URLs is known; it has already been submitted here.
Changing frame-src from 'self' to * solves the page-loading issue, but lowers security.
An interesting fact is that when switching from incognito mode to normal mode, I can briefly see an iframe:
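A minimal sketch of the relaxed policy, assuming the sample server above; only frame-src changes, everything else stays 'self':

```python
# Hypothetical relaxed header value: frame-src widened from 'self' to *
# so Chrome's internal chromeinvoke:// frames are no longer blocked.
relaxed_csp = (
    "script-src 'self' 'unsafe-inline' 'unsafe-eval'; "
    "style-src 'self' 'unsafe-inline'; object-src 'self'; "
    "img-src 'self'; media-src 'self'; frame-src *; "
    "font-src 'self'; connect-src 'self'"
)
# in do_GET: s.send_header("Content-Security-Policy", relaxed_csp)
print("frame-src *" in relaxed_csp)  # → True
```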

Tuesday, February 4, 2014

Clustering Udacity forum users

One of the questions I wanted to ask is whether I can cluster users into groups. For clustering I wanted to use k-means.
First I had to prepare a simple export.
The mapper takes the forum and user files and selects the proper data from them:

import sys
import csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        # skip the header rows of both input files
        if line[0] == "id" or line[0] == "user_ptr_id":
            continue
        if len(line) == 5:
            # user record: key by user id, tag 'A', keep the stats columns
            writer.writerow((line[0], 'A', line[1], line[2], line[3], line[4]))
        else:
            # forum post: key by author id (column 3), tag 'B'
            writer.writerow((line[3], 'B'))

if __name__ == '__main__':
    mapper()

The reducer outputs each user id along with their badges, karma and post count:

#!/usr/bin/python
import sys
import csv

def reducer():
    # assumes input sorted by key; the 'A' (user stats) record precedes
    # the 'B' (post) records for the same user
    oldKey = None
    rep = gold = silver = bronze = None
    count = 0
    reader = csv.reader(sys.stdin, delimiter='\t')
    for line in reader:
        if line[1] == 'A':
            if oldKey:
                print '\t'.join([oldKey, rep, gold, silver, bronze, str(count)])
            oldKey, rep, gold, silver, bronze = line[0], line[2], line[3], line[4], line[5]
            count = 0
        else:  # 'B' record: one post by the current user
            count += 1
    if oldKey:
        print '\t'.join([oldKey, rep, gold, silver, bronze, str(count)])

if __name__ == "__main__":
    reducer()
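The map-sort-reduce flow above can be sketched in plain Python on in-memory toy rows (the field layout is taken from the code above; the ids and values are made up):

```python
# Toy rows: user records have 5 fields (id, reputation, gold, silver, bronze);
# forum posts carry the author id in column 3 -- layout assumed from the mapper above.
users = [("u1", "10", "0", "1", "2"), ("u2", "99", "1", "3", "5")]
posts = [(None, None, None, "u1"), (None, None, None, "u1"), (None, None, None, "u2")]

# map step: tag user rows 'A' and post rows 'B' under the user id key
mapped = [(u[0], "A") + u[1:] for u in users] + [(p[3], "B") for p in posts]
mapped.sort()  # shuffle/sort: 'A' sorts before 'B', so stats precede posts per key

# reduce step: remember the stats, count the 'B' records that follow
result = {}
for rec in mapped:
    if rec[1] == "A":
        result[rec[0]] = [rec[2:], 0]
    else:
        result[rec[0]][1] += 1
print(result["u1"])  # → [('10', '0', '1', '2'), 2]
```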

I used Java Modelling Tools (http://jmt.sourceforge.net/) to visualize the k-means clustering, and it looks like we can split our users into 3 clusters:
17432 users (99%) are in cluster 1, red:

Info        Center      Std. Dev.   Kurt.       Skew.
Reputation  111.457E0   273.317E0   330.457E-1  518.127E-2
Gold        270.996E-3  930.424E-3  868.928E-1  717.735E-2
Silver      878.499E-3  244.692E-2  874.737E-1  683.432E-2
Bronze      421.489E-2  613.437E-2  843.112E-1  584.215E-2
Count       823.078E-2  199.274E-1  820.611E-1  723.129E-2

Cluster 2, 157 users, blue:

Info        Center      Std. Dev.   Kurt.       Skew.
Reputation  555.198E1   267.607E1   284.950E-2  166.334E-2
Gold        712.739E-2  919.289E-2  112.543E-1  272.010E-2
Silver      211.210E-1  204.831E-1  512.309E-2  191.835E-2
Bronze      511.529E-1  324.686E-1  260.647E-2  133.968E-2
Count       302.185E0   238.706E0   122.783E-1  252.269E-2

Cluster 3, 18 users, pink:

Info        Center      Std. Dev.   Kurt.        Skew.
Reputation  267.654E2   105.582E2   123.350E-2   143.433E-2
Gold        242.222E-1  307.142E-1  145.243E-2   154.269E-2
Silver      846.111E-1  768.483E-1  -537.560E-3  863.259E-3
Bronze      134.889E0   103.588E0   -104.684E-2  678.600E-3
Count       760.833E0   622.366E0   -159.373E-2  331.745E-3

Plotting those 3 clusters against the two main variables gives the image below:
y-axis – number of posts
x-axis – reputation


I can see that most of the users are not active, and there is a very small group which helps a lot.
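The clustering step itself can be sketched in a few lines of pure Python (the post used JMT for the real run; the (reputation, post count) pairs below are made-up toy data, not the actual export):

```python
# Minimal pure-Python k-means: alternate assignment and update steps.
def kmeans(points, k, iterations=10):
    centers = list(points[:k])  # naive init: first k points as centers
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # update step: move each center to the mean of its group
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# toy data: a big inactive mass plus a couple of power users
points = [(1, 0), (2, 1), (0, 0), (3, 1), (6000, 800), (5000, 700)]
centers, groups = kmeans(points, 2)
print(sorted(len(g) for g in groups))  # → [2, 4]
```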

Udacity search functionality improvements

In lesson 4 of Udacity's "Intro to Hadoop and MapReduce" there was an inverted index exercise. You can find the code below. Mapper:

import sys
import csv
import re

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # split post bodies on punctuation and whitespace
    delimiters = ['[', ']', '#', '$', '-', '=', '/', ' ', '\t', '\n',
                  '.', '!', '?', ':', ';', '\"', '(', ')', '<', '>', ',']
    regexPattern = '|'.join(map(re.escape, delimiters))
    for line in reader:
        # skip header
        if line[8] == "added_at":
            continue
        node = line[0]
        body = line[4]
        words = re.split(regexPattern, body.lower())
        for word in words:
            if len(word) > 0:
                print '%s\t%s' % (word, node)

if __name__ == "__main__":
    mapper()

Reducer:

#!/usr/bin/python
import sys

def reducer():
    oldKey = None
    nodes = []
    for line in sys.stdin:
        # each input line is "word<TAB>node_id", sorted by word
        thisKey, node = line.strip().split("\t")
        if oldKey and oldKey != thisKey:
            print oldKey, '\t', '\t'.join(nodes)
            nodes = []
        oldKey = thisKey
        nodes.append(node)
    if oldKey:
        print oldKey, '\t', '\t'.join(nodes)

if __name__ == "__main__":
    reducer()
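Once built, the index can answer a word lookup without rescanning the export. A toy sketch with made-up node ids, assuming the word<TAB>node ids line layout the reducer emits:

```python
# Toy index lines as the reducer would emit them: word, then the posts it appears in
index_lines = ["hadoop\t101\t205", "python\t205"]
index = {}
for line in index_lines:
    parts = line.split("\t")
    index[parts[0]] = parts[1:]
print(index["hadoop"])  # → ['101', '205']
```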

In the final project there was an exercise to create a Top 10 tags list, which required reading the whole export.
If we want to find the Top 10 contributors, we apply the same pattern and read the whole file again. That is not efficient.
We could use slightly modified code to create an index of posts per user or per tag.
Mapper for user activity (the reducer is unmodified):

#!/usr/bin/python
import sys
import csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    for line in reader:
        # skip header
        if line[8] == "added_at":
            continue
        user = line[3]
        post = line[0]
        print '%s\t%s' % (user, post)

if __name__ == "__main__":
    mapper()


Given that index, we can count posts very quickly:

#!/usr/bin/python
import sys

def mapper():
    for line in sys.stdin:
        # input line: user id followed by one field per post;
        # zero-pad the count so a textual sort orders numerically
        data = line.strip().split("\t")
        print str(len(data) - 1).zfill(10), data[0]

if __name__ == "__main__":
    mapper()

The mapper adds leading zeros so that the textual sort order matches numeric order; the MR job sorts the data and there is no reducer (identity). As a result we get all users with their post counts, sorted by count. Below you can find the top contributors:

0000000954 100008240
0000001015 100005156
0000001021 100008306
0000001064 100007518
0000001416 100008230
0000001419 100005396
0000001448 100000461
0000001494 100008518
0000001660 100008283
0000001793 100005361
0000001910 100001071

The same code applies to counting tags; instead of users we emit tags.
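The zero-padding trick above can be shown on toy counts (made-up numbers): a plain string sort puts "9" after "120", while padding restores numeric order.

```python
# Plain string sort vs zero-padded sort on toy post counts
counts = [9, 120, 15]
plain = sorted(str(c) for c in counts)
padded = sorted(str(c).zfill(10) for c in counts)
print(plain)   # → ['120', '15', '9']
print(padded)  # → ['0000000009', '0000000015', '0000000120']
```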