Tuesday, 18 March 2014

Udacity Data Wrangling with MongoDB: Las Vegas exercise

For the final project I chose the Las Vegas region, which was one of the stops on my last holiday trip. I remember how we chose the hotel back then: we opened one of the booking sites and searched for good prices and reviews.

I wanted to try a different approach: choose a hotel & casino based on its neighborhood, that is, how many other casinos and hotels are within a 10-minute walk (a 500 m radius).

Some information about the data loaded into MongoDB. This is the ideal case, how a document should look:
{
"id": "2406124091",
"type": "node",
"visible":"true",
"created": {
          "version":"2",
          "changeset":"17206049",
          "timestamp":"2013-08-03T16:43:42Z",
          "user":"linuxUser16",
          "uid":"1219059"
        },
"pos": [41.9757030, -87.6921867],
"address": {
          "housenumber": "5157",
          "postcode": "60625",
          "street": "North Lincoln Ave"
        },
"amenity": "restaurant",
"cuisine": "mexican",
"name": "La Cabana De Don Luis",
"phone": "1 (773)-271-5176"
}

What I really had:

{
  "building": "yes", 
  "website": "http://www.caesarspalace.com", 
  "amenity": "casino", 
  "node_refs": [
    "389482445", 
    "1483478762", 
    "1483478753", 
[...]
    "389482448", 
    "389482445"
  ], 
  "gnis:county_name": "Clark", 
  "created": {
    "uid": "336460", 
    "changeset": "9675966", 
    "version": "8", 
    "user": "robgeb", 
    "timestamp": "2011-10-28T12:11:39Z"
  }, 
  "tourism": "hotel", 
  "wheelchair": "yes", 
  "wikipedia": "en:Caesars Palace", 
  "ele": "644", 
  "visible": null, 
  "address": {
    "city": "Las Vegas", 
    "county": "Clark", 
    "state": "NV", 
    "street": "Las Vegas Boulevard", 
    "postcode": "89109", 
    "housenumber": "3570"
  }, 
  "gnis:feature_id": "2472987", 
  "type": "way", 
  "id": "115672893", 
  "name": "Caesars Hotel and Casino"
}

More complex elements (buildings, ways) don't have a position of their own. They reference other nodes, most of which exist only to carry a position, for example 4 nodes, one for each corner of a building.

MongoDB doesn't support joins, so in order to query by the location of a hotel I need to collect the locations of its referenced nodes first.

I created a script for this: for each document with node_refs I iterate over the array and build an array of locations. I don't want to reduce it to one specific location, because a single point is generally wrong for roads and long buildings. The MongoDB $geoNear stage in the aggregation pipeline can match documents whose location field is an array of points, but it does not accept an array as the center location. Here is the script:

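# db is assumed to be a pymongo database handle (e.g. obtained from MongoClient)
# update_one replaces save(), which newer pymongo versions no longer provide
for el in db.lv.find({'node_refs': {'$exists': 1}}):
    # skip documents that were already processed
    if 'pos_many' in el:
        continue

    # collect the position of every referenced node
    points = []
    for ref in el['node_refs']:
        one = db.lv.find_one({'id': ref})
        if one is None:
            continue
        if 'pos' in one:
            points.append(one['pos'])
            # flag the node as referenced by at least one other element
            if 'is_referenced' not in one:
                db.lv.update_one({'_id': one['_id']}, {'$set': {'is_referenced': 1}})
    db.lv.update_one({'_id': el['_id']}, {'$set': {'pos_many': points}})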

Apart from updating documents with locations, my script flags the nodes that were referenced. I would like to see what kinds of nodes are referenced: are they only bare locations, or can I find, for example, a reference to a bus stop or a tram station? I could filter or delete based on this flag.

The script updated more than 70 000 objects that have references and flagged almost 678 000 nodes that were referenced. Only 72 of those were named nodes such as a tram station, so I can't simply delete the referenced nodes, but I won't lose too much information if I filter them out.
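
The script above issues one find_one call per reference, which is likely to be slow for this many nodes unless the 'id' field is indexed. A possible variation, not what the script above does, is to load the node positions into a dictionary once and do the "join" in memory. The sketch below shows only that lookup step on made-up data; the function name collect_positions and the sample ids and coordinates are hypothetical.

```python
def collect_positions(way, pos_by_id):
    """Return the positions of all nodes referenced by a way, in order."""
    points = []
    for ref in way.get("node_refs", []):
        pos = pos_by_id.get(ref)
        # skip references that are missing from the extract or have no position
        if pos is not None:
            points.append(pos)
    return points


# Tiny in-memory stand-in for the collection; ids and coordinates are made up.
nodes = [
    {"id": "1", "type": "node", "pos": [36.0, -115.0]},
    {"id": "2", "type": "node", "pos": [36.0, -115.1]},
    {"id": "3", "type": "node"},  # a node without a position
]
pos_by_id = {n["id"]: n["pos"] for n in nodes if "pos" in n}

# A closed way: the first node is repeated at the end, "4" does not exist.
way = {"id": "10", "type": "way", "node_refs": ["1", "2", "3", "4", "1"]}
print(collect_positions(way, pos_by_id))
```

With pymongo the dictionary could presumably be filled from something like db.lv.find({'pos': {'$exists': 1}}, {'id': 1, 'pos': 1}), at the cost of keeping all positions in memory.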

Now I can run query to filter all the casinos in Las Vegas region:

db.lv.find({'$or':[{'name': {'$regex': 'Casino'}},{'amenity':'casino'}]})

It gives me more than 50 casinos, some of which I know.
For each casino I can now run this query:
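db.lv.aggregate([
    {
        '$geoNear': {
            # pos is the single location chosen for the current casino el
            'near': pos,
            'distanceField': 'dist.calculated',
            # 0.5 km expressed in degrees (about 111.12 km per degree)
            'maxDistance': 0.5/111.12,
            # other casinos only: exclude the current one by id
            'query': {'id': {'$ne': el['id']}, '$or': [{'name': {'$regex': 'Casino'}}, {'amenity': 'casino'}]},
            'includeLocs': 'dist.location',
            'uniqueDocs': 1
        }
    }
])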
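
The query above uses two variables, pos and el, that are set by the surrounding loop over casinos. As a rough sketch of how that pipeline could be built per casino, the helper below takes one casino document and returns the pipeline. The names build_pipeline, KM_PER_DEGREE and RADIUS_KM are mine, and the sample document is illustrative only: its id and name come from the Caesars example, but treating that coordinate as the first entry of pos_many is an assumption. The name filter is written with $regex because a quoted '/Casino/' would be matched as a literal string.

```python
KM_PER_DEGREE = 111.12  # conversion factor used in the query above
RADIUS_KM = 0.5         # the 500 m walking radius


def build_pipeline(el):
    """Build the $geoNear pipeline for one casino document `el`."""
    # Assumption: the first collected location stands in for the whole building.
    pos = el["pos_many"][0]
    return [
        {
            "$geoNear": {
                "near": pos,
                "distanceField": "dist.calculated",
                "maxDistance": RADIUS_KM / KM_PER_DEGREE,
                "query": {
                    "id": {"$ne": el["id"]},  # do not count the casino itself
                    "$or": [
                        {"name": {"$regex": "Casino"}},
                        {"amenity": "casino"},
                    ],
                },
                "includeLocs": "dist.location",
                "uniqueDocs": 1,
            }
        }
    ]


# Illustrative document; the coordinate is one Caesars location from the results.
el = {
    "id": "115672893",
    "name": "Caesars Hotel and Casino",
    "pos_many": [[36.1150138, -115.1744618]],
}
pipeline = build_pipeline(el)
print(pipeline[0]["$geoNear"]["maxDistance"])
```

The result would then be passed to db.lv.aggregate(build_pipeline(el)) for each casino returned by the previous query.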

The result of these queries is the table below, the top casinos by number of other casinos nearby:

Casino                              # of casinos nearby
Bill's Gamblin' Hall & Saloon       8
Bellagio Hotel and Casino           8
Imperial Palace Hotel and Casino    7
Flamingo Hotel and Casino           7
Harrah's Hotel and Casino           7
Tropicana Hotel and Casino          6
Paris Hotel and Casino              6
Caesars Hotel and Casino            6
Excalibur Hotel and Casino          5


Nearby casinos for the top 2 [name, distance in meters, location (lat, lon)]:

Bill's Gamblin' Hall & Saloon:
  1. Flamingo Hotel and Casino 63.6570179247 [36.1154373, -115.1723441]
  2. Bellagio Hotel and Casino 105.992429283 [36.1143679, -115.1733477]
  3. Caesars Hotel and Casino 206.917388636 [36.1154496, -115.1743421]
  4. Paris Hotel and Casino 209.970162638 [36.1130181, -115.1725101]
  5. Bally's Hotel and Casino 229.545641554 [36.1143607, -115.1705686]
  6. Imperial Palace Hotel and Casino 334.493011664 [36.1179157, -115.1726557]
  7. Harrah's Hotel and Casino 397.143685884 [36.118481, -115.172568]
  8. Planet Hollywood Hotel and Casino 466.059119841 [36.1109562, -115.1711528]

Bellagio Hotel and Casino:
  1. Bill's Gamblin' Hall & Saloon 109.676718942 [36.114907, -115.1725608]
  2. Caesars Hotel and Casino 144.503466085 [36.1150138, -115.1744618]
  3. Flamingo Hotel and Casino 165.401843786 [36.1155678, -115.1725383]
  4. Paris Hotel and Casino 173.189321322 [36.1130181, -115.1725101]
  5. Bally's Hotel and Casino 280.117818406 [36.1136115, -115.1709408]
  6. Imperial Palace Hotel and Casino 406.50188833 [36.1179157, -115.1726557]
  7. Planet Hollywood Hotel and Casino 447.491259701 [36.1109562, -115.1711528]
  8. Harrah's Hotel and Casino 470.026846719 [36.118481, -115.172568]

Both hotels are located in the very center of the city, at the corner of Las Vegas Boulevard and Flamingo Road. What I couldn't verify from this dataset is that Bill's Gamblin' Hall & Saloon is currently closed.
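
The distances in the lists above appear to be flat (planar) distances in degrees multiplied by the same 111.12 km per degree used for maxDistance. The sketch below checks this for one pair, assuming that the query point for Bill's was the location reported for it in the Bellagio list; the function names are mine. Under that assumption the result comes out at roughly 63.66 m, in line with the value listed for Flamingo.

```python
import math

KM_PER_DEGREE = 111.12  # same factor as in the maxDistance expression


def km_to_degrees(km):
    """Convert kilometers to degrees using the flat approximation."""
    return km / KM_PER_DEGREE


def flat_distance_m(a, b):
    """Planar distance between two [lat, lon] points, scaled to meters."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) * KM_PER_DEGREE * 1000


bills = [36.114907, -115.1725608]      # Bill's, as reported in the Bellagio list
flamingo = [36.1154373, -115.1723441]  # Flamingo, as reported in the Bill's list

print(km_to_degrees(0.5))
print(flat_distance_m(bills, flamingo))
```

Because a degree of longitude is shorter than a degree of latitude at this latitude, a flat conversion like this probably overstates east-west distances somewhat, so the 500 m radius is only approximate.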



Known issues:
When I needed a single location for a hotel I took the first location in the array (one of the corners); you can notice this in the picture above. It should be the center of the building.
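
One simple way to approximate the center, which I have not applied to the results above, would be to average the collected corner points. The sketch below assumes that points are [lat, lon] pairs as stored in pos_many, and that a closed way repeats its first node at the end (as the node_refs of the Caesars example do). The function name center_of is hypothetical.

```python
def center_of(points):
    """Arithmetic mean of [lat, lon] points; a rough center for small areas."""
    # Closed ways repeat their first node at the end; drop the duplicate so
    # that corner is not counted twice.
    if len(points) > 1 and points[0] == points[-1]:
        points = points[:-1]
    if not points:
        return None
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return [lat, lon]


# A closed square with made-up coordinates.
corners = [[0.0, 0.0], [0.0, 2.0], [2.0, 2.0], [2.0, 0.0], [0.0, 0.0]]
print(center_of(corners))  # [1.0, 1.0]
```

A plain average is only a rough stand-in for a true centroid, but for a building outline of a few hundred meters it should be closer to the middle than any single corner.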

