Thursday, February 13, 2014

Content-Security-Policy issues with iOS Chrome

A few days ago I had an opportunity to trace an issue with iOS Chrome not loading a page. The page and all its resources were downloaded properly, but Chrome kept indicating that it was still loading the page; as a result, no 'onload' events fired. The problem occurred only when reloading the site. Copying the whole content to a local static web server didn't replicate the issue, so the content itself wasn't the problem. I cut the whole page down to a simple 'hello world' page and the problem still existed on the original web server, so it looked like I had to dig deeper, into the HTTP headers. I created a sample web server that demonstrates the problem I found:

import BaseHTTPServer

class HTTPHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    # CSP violation reports are POSTed to the report-uri; dump them to stdout
    def do_POST(s):
        length = int(s.headers['Content-Length'])
        print length
        data = s.rfile.read(length).decode('utf-8')
        print data
        s.send_response(204)
        s.end_headers()

    def do_GET(s):
        s.send_response(200)
        s.send_header("Content-Security-Policy", "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'")
        s.send_header("Content-Type", "text/html;charset=UTF-8")
        s.end_headers()
        s.wfile.write("<html><head><title>hello</title></head><body><p>hello world %s</p></body></html>" % s.path)

if __name__ == '__main__':
    server_class = BaseHTTPServer.HTTPServer
    httpd = server_class(('192.168.43.17', 80), HTTPHandler)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass

    httpd.server_close()

There is only one out-of-the-ordinary element here: the CSP header, which protects the site from cross-site scripting and provides a mechanism for reporting security violations. It looks like Chrome itself is violating the directives and reporting the problems:
1) frame-src with uri: chromeinvoke://cd931b8a0ca6aaed193d25b429ee4019
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "frame-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "chromeinvoke://cd931b8a0ca6aaed193d25b429ee4019",
   "source-file": "http://192.168.43.17/",
   "line-number": 1
}
2) connect-src with uri: https://localhost
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "connect-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "https://localhost",
   "source-file": "http://192.168.43.17/",
   "line-number": 1
}
3) frame-src with uri: chromenull://
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "frame-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "chromenull://",
   "source-file": "http://192.168.43.17/",
   "line-number": 21
}
4) frame-src with uri: chromeinvokeimmediate://3726692da42473af155b530fe0e48c61
"csp-report":{
   "document-uri": "http://192.168.43.17/",
   "referrer": "",
   "violated-directive": "frame-src 'self'",
   "original-policy": "script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; object-src 'self'; img-src 'self' ; media-src 'self'; frame-src 'self'; font-src 'self' ;connect-src 'self'; report-uri '192.168.43.17/report'",
   "blocked-uri": "chromeinvokeimmediate://3726692da42473af155b530fe0e48c61",
   "source-file": "http://192.168.43.17/",
   "line-number": 2
}
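Such a report arrives as a JSON body POSTed to the report-uri; here is a minimal sketch of decoding one (the field values are copied from the reports above, but trimmed):

```python
import json

# A trimmed csp-report body, values taken from the reports above
body = '{"csp-report": {"violated-directive": "frame-src \'self\'", "blocked-uri": "chromenull://"}}'
report = json.loads(body)["csp-report"]
print(report["blocked-uri"])  # → chromenull://
```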

Further investigation showed that:
The issue with reporting internal/plugin URLs is known; it has already been submitted here.
Changing frame-src from 'self' to * solves the page-loading issue, but lowers security.
An interesting fact is that when switching from incognito mode to normal mode, I can briefly see an iframe:
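A minimal sketch of the relaxed policy, assuming the sample server above; only frame-src changes, everything else stays 'self':

```python
# Hypothetical relaxed header value: frame-src widened from 'self' to *
# so Chrome's internal chromeinvoke:// frames are no longer blocked.
relaxed_csp = (
    "script-src 'self' 'unsafe-inline' 'unsafe-eval'; "
    "style-src 'self' 'unsafe-inline'; object-src 'self'; "
    "img-src 'self'; media-src 'self'; frame-src *; "
    "font-src 'self'; connect-src 'self'"
)
# in do_GET: s.send_header("Content-Security-Policy", relaxed_csp)
print("frame-src *" in relaxed_csp)  # → True
```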

Tuesday, February 4, 2014

Clustering Udacity forum users

One of the questions I wanted to ask is whether I can cluster users into groups. For clustering I wanted to use k-means.
First I had to prepare a simple export.
The mapper takes the forum and user files and selects the proper data from them:

import sys
import csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        # skip the header rows of both input files
        if line[0] == "id" or line[0] == "user_ptr_id":
            continue
        if len(line) == 5:
            # user record: key by user id, tag 'A', keep the stats columns
            writer.writerow((line[0], 'A', line[1], line[2], line[3], line[4]))
        else:
            # forum post: key by author id (column 3), tag 'B'
            writer.writerow((line[3], 'B'))

if __name__ == '__main__':
    mapper()

The reducer outputs each user id along with their badges, karma and post count:

#!/usr/bin/python
import sys
import csv

def reducer():
    # assumes input sorted by key; the 'A' (user stats) record precedes
    # the 'B' (post) records for the same user
    oldKey = None
    rep = gold = silver = bronze = None
    count = 0
    reader = csv.reader(sys.stdin, delimiter='\t')
    for line in reader:
        if line[1] == 'A':
            if oldKey:
                print '\t'.join([oldKey, rep, gold, silver, bronze, str(count)])
            oldKey, rep, gold, silver, bronze = line[0], line[2], line[3], line[4], line[5]
            count = 0
        else:  # 'B' record: one post by the current user
            count += 1
    if oldKey:
        print '\t'.join([oldKey, rep, gold, silver, bronze, str(count)])

if __name__ == "__main__":
    reducer()
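The map-sort-reduce flow above can be sketched in plain Python on in-memory toy rows (the field layout is taken from the code above; the ids and values are made up):

```python
# Toy rows: user records have 5 fields (id, reputation, gold, silver, bronze);
# forum posts carry the author id in column 3 -- layout assumed from the mapper above.
users = [("u1", "10", "0", "1", "2"), ("u2", "99", "1", "3", "5")]
posts = [(None, None, None, "u1"), (None, None, None, "u1"), (None, None, None, "u2")]

# map step: tag user rows 'A' and post rows 'B' under the user id key
mapped = [(u[0], "A") + u[1:] for u in users] + [(p[3], "B") for p in posts]
mapped.sort()  # shuffle/sort: 'A' sorts before 'B', so stats precede posts per key

# reduce step: remember the stats, count the 'B' records that follow
result = {}
for rec in mapped:
    if rec[1] == "A":
        result[rec[0]] = [rec[2:], 0]
    else:
        result[rec[0]][1] += 1
print(result["u1"])  # → [('10', '0', '1', '2'), 2]
```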

I used Java Modelling Tools (http://jmt.sourceforge.net/) to visualize the k-means clustering, and it looks like we can split our users into 3 clusters:
17432 users (99%) are in cluster 1, red:

Info        Center      Std. Dev.   Kurt.       Skew.
Reputation  111.457E0   273.317E0   330.457E-1  518.127E-2
Gold        270.996E-3  930.424E-3  868.928E-1  717.735E-2
Silver      878.499E-3  244.692E-2  874.737E-1  683.432E-2
Bronze      421.489E-2  613.437E-2  843.112E-1  584.215E-2
Count       823.078E-2  199.274E-1  820.611E-1  723.129E-2

Cluster 2, 157 users, blue:

Info        Center      Std. Dev.   Kurt.       Skew.
Reputation  555.198E1   267.607E1   284.950E-2  166.334E-2
Gold        712.739E-2  919.289E-2  112.543E-1  272.010E-2
Silver      211.210E-1  204.831E-1  512.309E-2  191.835E-2
Bronze      511.529E-1  324.686E-1  260.647E-2  133.968E-2
Count       302.185E0   238.706E0   122.783E-1  252.269E-2

Cluster 3, 18 users, pink:

Info        Center      Std. Dev.   Kurt.        Skew.
Reputation  267.654E2   105.582E2   123.350E-2   143.433E-2
Gold        242.222E-1  307.142E-1  145.243E-2   154.269E-2
Silver      846.111E-1  768.483E-1  -537.560E-3  863.259E-3
Bronze      134.889E0   103.588E0   -104.684E-2  678.600E-3
Count       760.833E0   622.366E0   -159.373E-2  331.745E-3

Plotting those 3 clusters against the two main variables gives the image below:
y-axis – number of posts
x-axis – reputation


I can see that most of the users are not active, and there is a very small group which helps a lot.
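The clustering step itself can be sketched in a few lines of pure Python (the post used JMT for the real run; the (reputation, post count) pairs below are made-up toy data, not the actual export):

```python
# Minimal pure-Python k-means: alternate assignment and update steps.
def kmeans(points, k, iterations=10):
    centers = list(points[:k])  # naive init: first k points as centers
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # update step: move each center to the mean of its group
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# toy data: a big inactive mass plus a couple of power users
points = [(1, 0), (2, 1), (0, 0), (3, 1), (6000, 800), (5000, 700)]
centers, groups = kmeans(points, 2)
print(sorted(len(g) for g in groups))  # → [2, 4]
```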

Udacity search functionality improvements

In lesson 4 of Udacity's "Intro to Hadoop and MapReduce" there was an inverted index exercise. You can find the code below. Mapper:

import sys
import csv
import re

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # split post bodies on punctuation and whitespace
    delimiters = ['[', ']', '#', '$', '-', '=', '/', ' ', '\t', '\n',
                  '.', '!', '?', ':', ';', '\"', '(', ')', '<', '>', ',']
    regexPattern = '|'.join(map(re.escape, delimiters))
    for line in reader:
        # skip header
        if line[8] == "added_at":
            continue
        node = line[0]
        body = line[4]
        words = re.split(regexPattern, body.lower())
        for word in words:
            if len(word) > 0:
                print '%s\t%s' % (word, node)

if __name__ == "__main__":
    mapper()

Reducer:

#!/usr/bin/python
import sys

def reducer():
    oldKey = None
    nodes = []
    for line in sys.stdin:
        # each input line is "word<TAB>node_id", sorted by word
        thisKey, node = line.strip().split("\t")
        if oldKey and oldKey != thisKey:
            print oldKey, '\t', '\t'.join(nodes)
            nodes = []
        oldKey = thisKey
        nodes.append(node)
    if oldKey:
        print oldKey, '\t', '\t'.join(nodes)

if __name__ == "__main__":
    reducer()
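Once built, the index can answer a word lookup without rescanning the export. A toy sketch with made-up node ids, assuming the word<TAB>node ids line layout the reducer emits:

```python
# Toy index lines as the reducer would emit them: word, then the posts it appears in
index_lines = ["hadoop\t101\t205", "python\t205"]
index = {}
for line in index_lines:
    parts = line.split("\t")
    index[parts[0]] = parts[1:]
print(index["hadoop"])  # → ['101', '205']
```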

In the final project there was an exercise to create a Top 10 tags list, which required reading the whole export.
If we want to find the Top 10 contributors, we apply the same pattern and read the whole file again. That is not efficient.
We could use slightly modified code to create an index of posts per user or per tag.
Mapper for user activity (the reducer is unmodified):

#!/usr/bin/python
import sys
import csv

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    for line in reader:
        # skip header
        if line[8] == "added_at":
            continue
        user = line[3]
        post = line[0]
        print '%s\t%s' % (user, post)

if __name__ == "__main__":
    mapper()


Given that index, we can count posts very quickly:

#!/usr/bin/python
import sys

def mapper():
    for line in sys.stdin:
        # input line: user id followed by one field per post;
        # zero-pad the count so a textual sort orders numerically
        data = line.strip().split("\t")
        print str(len(data) - 1).zfill(10), data[0]

if __name__ == "__main__":
    mapper()

The mapper adds leading zeros so that the textual sort order matches numeric order; the MR job sorts the data and there is no reducer (identity). As a result we get all users with their post counts, sorted by count. Below you can find the top contributors:

0000000954 100008240
0000001015 100005156
0000001021 100008306
0000001064 100007518
0000001416 100008230
0000001419 100005396
0000001448 100000461
0000001494 100008518
0000001660 100008283
0000001793 100005361
0000001910 100001071

The same code applies to counting tags; instead of users we emit tags.
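The zero-padding trick above can be shown on toy counts (made-up numbers): a plain string sort puts "9" after "120", while padding restores numeric order.

```python
# Plain string sort vs zero-padded sort on toy post counts
counts = [9, 120, 15]
plain = sorted(str(c) for c in counts)
padded = sorted(str(c).zfill(10) for c in counts)
print(plain)   # → ['120', '15', '9']
print(padded)  # → ['0000000009', '0000000015', '0000000120']
```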