--- title: "Geospatial Queries using Pymongo in R" author: "Lampros Mouselimis" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Geospatial Queries using Pymongo in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Since I submitted the [*geojsonR* package](https://CRAN.R-project.org/package=geojsonR) I was interested in running geospatial MongoDB queries using GeoJson data. I decided to use PyMongo (through the reticulate package) after opening two Github issues [here](https://github.com/jeroen/mongolite/issues/7) and [here](https://github.com/rstudio/reticulate/issues/81). In my opinion, the PyMongo library is huge and covers a lot of things however, my intention was to be able to run geospatial queries from within R. ##### The GeoMongo package The GeoMongo package allows the user, * to insert and query **only** GeoJson data using the **geomongo** R6 class * to read data in either json (through the geojsonR package) or BSON format (I'll explain later when BSON is necessary for inserting data) * to validate a json instance using a schema using the **json_schema_validator()** function (input parameters are R named lists) * to utilize MongoDB console commands using the **mongodb_console()** function. The *mongodb_console()* function takes advantage of the base R *system()* function. For instance, MongoDB console commands are necessary in case of bulk import / export of data as documented [here](https://docs.mongodb.com/database-tools/mongoimport/) and [here](https://docs.mongodb.com/database-tools/mongoexport/).
I was able to reproduce the majority of geospatial MongoDB queries ( System Requirements : MongoDB (>= 3.4) and Python (>= 3.5) ) using a number of blog posts on the web, however I'll take advantage of the following two in order to explain how one can use the **GeoMongo** package for this purpose: * [first example blog post](http://thecodebarbarian.com/80-20-guide-to-mongodb-geospatial-queries) [ The 80/20 Guide to MongoDB Geospatial Queries, by Valeri Karpov] * [second documentation example](https://docs.mongodb.com/manual/geospatial-queries/) [ Geospatial Queries, in MongoDB documentation ]
##### queries based on [first example blog post](http://thecodebarbarian.com/80-20-guide-to-mongodb-geospatial-queries) When inserting data using the *geomongo* R6 class the user has the option (via the *TYPE_DATA* parameter) to either give a *character string (or vector)*, a *list*, a *file* or a *folder of files* as input. To start with, I'll use the following character strings ( they appear in the first example blog post , the "_id" 's were removed), ```{r, eval = F} library(GeoMongo) # important : the property-names of each geojson object should be of type character string loc1 = '{ "name" : "Squaw Valley", "location" : { "type" : "Point", "coordinates" : [ -120.24, 39.21 ] } }' loc2 = '{ "name" : "Mammoth Lakes", "location" : { "type" : "Point", "coordinates" : [ -118.9, 37.61 ] } }' loc3 = '{ "name" : "Aspen", "location" : { "type" : "Point", "coordinates" : [ -106.82, 39.18 ] } }' loc4 = '{ "name" : "Whistler", "location" : { "type" : "Point", "coordinates" : [ -122.95, 50.12 ] } }' # create a vector of character strings char_FILES = c(loc1, loc2, loc3, loc4) ```
Before inserting the data one should **make sure that MongoDB is running on the Operating System**. Information on how to **install MongoDB** can be found [here](https://docs.mongodb.com/manual/installation/).
The **geomongo** R6 class will be initialized and a database and collection will be created,
```{r, eval = F} init = geomongo$new(host = 'localhost', port = 27017) # assuming MongoDB runs locally getter_client = init$getClient() # get MongoClient() init_db = getter_client[["example_db"]] # create a new database init_col = init_db$create_collection("example_col") # create a new collection ```
After the preliminary steps, one can continue by inserting the *char_FILES* object to the relevant database / collection using the **geoInsert** method. The *TYPE_DATA* parameter equals here to *dict_many* meaning it can take either a *list of lists (nested list)* or a *character vector of strings*,
```{r, eval = F} init$geoInsert(DATA = char_FILES, # input data TYPE_DATA = 'dict_many', # character vector of strings as input COLLECTION = init_col, # specify the relevant collection GEOMETRY_NAME = "location") # give the 'geometry name' of each geo-object ```
One can now run various commands to check the correctness of the inserted data,
```{r, eval = F} init_db$collection_names() # prints out the collection names of the relevant database ``` ```{r, eval = F} "example_col" ``` ```{r, eval = F} init_col$find_one() # prints one of the inserted geometry objects ``` ```{r, eval = F} $`_id` 5984a0b742b2563fb5838f6a $location $location$type [1] "Point" $location$coordinates [1] -120.24 39.21 $name [1] "Squaw Valley" ``` ```{r, eval = F} init_col$count() # prints the number of the inserted geometry objects ``` ```{r, eval = F} [1] 4 ```
I'll continue reproducing some of the geo-queries of the [first example blog post](http://thecodebarbarian.com/80-20-guide-to-mongodb-geospatial-queries) from within an R-session.
The first query is about the number of locations in the state of Colorado, where Colorado is approximated as the below GeoJson square,
```{r, eval = F} { "type": "Polygon", "coordinates": [[ [-109, 41], [-102, 41], [-102, 37], [-109, 37], [-109, 41] ]] } ```
and the corresponding MongoDB query would be,
```{r, eval = F} db.locations.find({ ... location: { ... $geoIntersects: { ... $geometry: { ... type: "Polygon", ... coordinates: [[ ... [-109, 41], ... [-102, 41], ... [-102, 37], ... [-109, 37], ... [-109, 41] ... ]] ... } ... } ... } ... }) ```
This query can be *translated* in R in the following way: * *curly braces* correspond to *R-lists* * *arrays* (of size 2) to *R-vectors*,
```{r, eval = F} query_geoIntersects = list('location' = list('$geoIntersects' = list('$geometry' = list( type = "Polygon", coordinates = list( list( c(-109, 41), c(-102, 41), c(-102, 37), c(-109, 37), c(-109, 41) ) ) ) ) ) ) ```
and the **find** METHOD of **geoQuery** function will be used to return locations which are within the boundaries of Colorado,
```{r, eval = F} loc_intersect = init$geoQuery(QUERY = query_geoIntersects, # query from previous chunk METHOD = "find", # the method to use COLLECTION = init_col, # the collection to use GEOMETRY_NAME = "location", # the geometry name to use TO_LIST = FALSE) # returns a data.table loc_intersect ```
The output can be returned either as a *list* or as a *data.table*,
```{r, eval = F} # data.table format location.type location.coordinates1 location.coordinates2 name id 1: Point -106.82 39.18 Aspen 5984a0b742b2563fb5838f6c ```
The next few code chunks will show how to return documents that are *within a certain distance of a given point* using the *geoWithin* and *centerSphere* operators (locations with a square of circumradius 300 miles centered on San Francisco, approximately latitude 37.7, longitude -122.5). ```{r, eval = F} # MongoDB query db.locations.find({ ... location: { ... $geoWithin: { ... $centerSphere: [[-122.5, 37.7], 300 / 3963.2] ... } ... } ... }) ```
and the corresponding query in R,
```{r, eval = F} geoWithin_sph = list('location' = list('$geoWithin' = list('$centerSphere' = list( c(-122.5, 37.7), 300 / 3963.2) ) ) ) # no need to specify again the "COLLECTION" and "GEOMETRY_NAME" parameters # as we use the same initialization of the R6 class with the previous query res_geoWithin_sph = init$geoQuery(QUERY = geoWithin_sph, METHOD = "find") res_geoWithin_sph ``` ```{r, eval = F} # example output location.type location.coordinates1 location.coordinates2 1: Point -118.90 37.61 2: Point -120.24 39.21 name id Mammoth Lakes 5984a0b742b2563fb5838f6b Squaw Valley 5984a0b742b2563fb5838f6a ```
One can read more about the magic number 3963.2 (radius of the Earth) either in the [first example blog post](http://thecodebarbarian.com/80-20-guide-to-mongodb-geospatial-queries) or in the [MongoDB documentation](https://docs.mongodb.com/manual/tutorial/calculate-distances-using-spherical-geometry-with-2d-geospatial-indexes/).
Here one can also plot the output locations using the **leaflet** package,
```{r, eval = F} map_dat <- leaflet::leaflet() map_dat <- leaflet::addTiles(map_dat) map_dat <- leaflet::addMarkers(map_dat, lng = unlist(res_geoWithin_sph$location.coordinates1), lat = unlist(res_geoWithin_sph$location.coordinates2)) map_dat ```
![](leaflet1.png)
The next query utilizes the **aggregate** method to return the locations *sorted by distance from a given point*,
```{r, eval = F} # MongoDB query db.locations.aggregate([{ ... $geoNear: { ... near: { ... type: 'Point', ... coordinates: [-122.5, 37.1] ... }, ... spherical: true, ... maxDistance: 900 * 1609.34, ... distanceMultiplier: 1 / 1609.34, ... distanceField: 'distanceFromSF' ... } ... }]) ```
and the corresponding query in R,
```{r, eval = F} query_geonear = list('$geoNear' = list(near = list( type = "Point", coordinates = c(-122.5, 37.1) ), distanceField = "distanceFromSF", maxDistance = 900 * 1609.34, distanceMultiplier = 1 / 1609.34, spherical = TRUE) ) func_quer_geonear = init$geoQuery(QUERY = query_geonear, METHOD = "aggregate") func_quer_geonear ```
```{r, eval = F} # example output distanceFromSF location.type location.coordinates1 location.coordinates2 1: 190.8044 Point -120.24 39.21 2: 201.0443 Point -118.90 37.61 3: 863.9478 Point -106.82 39.18 name id Squaw Valley 5984a0b742b2563fb5838f6a Mammoth Lakes 5984a0b742b2563fb5838f6b Aspen 5984a0b742b2563fb5838f6c ```

##### queries based on the [second (MongoDB) documentation example](https://docs.mongodb.com/manual/geospatial-queries/)
I picked this documentation example in order to show how someone can use the **command** METHOD besides the **find** and **aggregate** methods.
First I'll build a new collection (*places*) and then I'll insert the example data,
```{r, eval = F} places_col = init_db$create_collection("places") # create a new collection ``` ```{r, eval = F} # important : the property-names of each geojson object should be of type character string place1 = '{ "name": "Central Park", "location": { "type": "Point", "coordinates": [ -73.97, 40.77 ] }, "category": "Parks" }' place2 = '{ "name": "Sara D. Roosevelt Park", "location": { "type": "Point", "coordinates": [ -73.9928, 40.7193 ] }, "category": "Parks" }' place3 = '{ "name": "Polo Grounds", "location": { "type": "Point", "coordinates": [ -73.9375, 40.8303 ] }, "category": "Stadiums" }' # create a vector of character strings doc_FILES = c(place1, place2, place3) ```
```{r, eval = F} init$geoInsert(DATA = doc_FILES, # insert data TYPE_DATA = 'dict_many', # character vector of strings as input COLLECTION = places_col, # specify the relevant collection GEOMETRY_NAME = "location") # give the 'geometry name' of each geo-object ```
```{r, eval = F} # outputs the collection names init_db$collection_names() ``` ```{r, eval = F} # example output [1] "places" "example_col" ```
```{r, eval = F} places_col$count() # number of geojson objects in collection ``` ```{r, eval = F} [1] 3 ```
After the data is inserted one can now query the data using the **command** METHOD.
Worth mentioning for this particular method are the differences between MongoDB and PyMongo. The following code chunk shows the MongoDB *runCommand*,
```{r, eval = F} db.runCommand( { geoNear: "places", near: { type: "Point", coordinates: [ -73.9667, 40.78 ] }, spherical: true, query: { category: "Parks" } } ) ```
which corresponds to the following query in *GeoMongo* (similar to PyMongo), ```{r, eval = F} Args_Kwargs = list("geoNear", "places", near = list("type" = "Point", "coordinates" = c(-73.9667, 40.78)), spherical = TRUE, query = list("category" = "Parks")) ```
Information about the various parameters of the *command* method can be found in the [PyMongo documentation](https://pymongo.readthedocs.io/en/stable/api/pymongo/database.html).
Then the GeoMongo **command** method takes the parameters in the same way as the *find* or *aggregate* methods,
```{r, eval = F} init$geoQuery(QUERY = Args_Kwargs, METHOD = "command", COLLECTION = places_col, DATABASE = init_db, # additionally I have to specify the database TO_LIST = FALSE) ```
which returns only the 'Parks' (of the *category* property name) from the input documents, ```{r, eval = F} obj.category obj.location.type obj.location.coordinates1 obj.location.coordinates2 1: Parks Point -73.9700 40.7700 2: Parks Point -73.9928 40.7193 obj.name dis id Central Park 1147.422 5985b4d242b2563fb5838f6e Sara D. Roosevelt Park 7106.506 5985b4d242b2563fb5838f6f ```
The following two blog posts include also a variety of geospatial queries ( [here](http://tugdualgrall.blogspot.com/2014/08/introduction-to-mongodb-geospatial.html) and [here](https://docs.mongodb.com/manual/tutorial/geospatial-tutorial/) ).
More details about the *geomongo* R6 class and each method (*read_mongo_bson()*, *geoInsert()*, *geoQuery()*) can be found in the *Details* and *Methods* of the package documentation.
##### When to input data in bson rather than in json format (applies to the geomongo R6 class)
When inserting data to MongoDB there are cases where the *id* appears in the following format, ```{r, eval = F} # data taken from : https://docs.mongodb.com/manual/tutorial/geospatial-tutorial/ example_dat = '{"_id": {"$oid":"55cba2476c522cafdb053add"}, "location": {"coordinates":[-73.856077,40.848447],"type":"Point"}, "name":"Morris Park Bake Shop"}' ```
```{r, eval = F} bson_col = init_db$create_collection("example_bson") # create a new collection ```
Inserting the *example_dat* in the *bson_col* will raise an error,
```{r, eval = F} init$geoInsert(DATA = example_dat, # insert data TYPE_DATA = 'dict_one', # single list as input COLLECTION = bson_col, # specify the relevant collection GEOMETRY_NAME = "location", # give the 'geometry name' of each geo-object read_method = "geojsonR") ``` ```{r, eval = F} # example output Error in py_call_impl(callable, dots$args, dots$keywords) : InvalidDocument: key '$oid' must not start with '$' ```
This error is explained also in a similar [StackOverflow question](https://stackoverflow.com/questions/42089045/bson-errors-invaliddocument-key-oid-must-not-start-with-trying-to-insert)
In such a case, one has to change the *read_method* to **mongo_bson** to correctly insert the data,
```{r, eval = F} init$geoInsert(DATA = example_dat, # insert data TYPE_DATA = 'dict_one', # single character string as input COLLECTION = bson_col, # specify the relevant collection GEOMETRY_NAME = "location", # give the 'geometry name' of each geo-object read_method = "mongo_bson") ```
Finally, we can check the correctness of the inserted data,
```{r, eval = F} bson_col$count() ``` ```{r, eval = F} # example output [1] 1 ```
```{r, eval = F} bson_col$find_one() ``` ```{r, eval = F} # example output $`_id` 55cba2476c522cafdb053add $location $location$type [1] "Point" $location$coordinates [1] -73.85608 40.84845 $name [1] "Morris Park Bake Shop" ```