Handling multiple, large datasets using GIS: current progress

Possibly the biggest challenge built into the EngLaId project is how we bring together and synthesise the diversely recorded datasets that we are using.  Whilst some consistencies exist between the recording methods used by different data sources, a considerable amount of diversity remains.  English Heritage (EH), England’s 84 Historic Environment Records (HERs), the Portable Antiquities Scheme (PAS) and other data providers and researchers all keep their own separate databases, each recorded in a different way, and their entries partly overlap: some relate to the same objects and some to different ones.

As a result, there is considerable duplication between the different data sources, and it is not at all easy to identify.  Where data objects have names, as in the case of many larger sites, these can be used to assess duplication (assuming all datasets use the same name for an object), but this does not apply to the much more common case of objects with no assigned names.

Therefore, the best way to discover duplication and to attempt a synthesis between different datasets is to test for spatial similarity.  In other words, if a Roman villa is present in the same space within two different datasets, we can assume that it is the same villa.  However, this in turn is complicated by the fact that different data sources record data to different levels of spatial precision and using different data types (e.g. points vs polygons).  The approach I am experimenting with to get around this problem is to apply a tessellation of grid squares over the map, test which objects in each input dataset fall within each square, record their type and period, and aggregate across datasets to assess the presence or absence of each site type for each period.

The first stage is to simplify the terms used in the input dataset down to a set of (currently) eight output terms (these are not yet fully defined and their number will undoubtedly grow).  This is partly so that the output text fields do not exceed the 254 character limit for text fields in ArcGIS shapefiles (I will be working on a solution to this, probably involving a move to the geodatabase format), and partly so that we can identify objects of similar type recorded using different terminologies.  This is accomplished through the use of a Python script.
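As a rough sketch of what that simplification script does (the mapping and term names below are purely illustrative placeholders, not the project’s actual thesaurus), its core is just a lookup from the varied monument-type terms used by data providers to the reduced set of output terms:

    # Illustrative only: collapse varied monument-type terms into broad output terms.
    # This mapping is a placeholder, not the EngLaId simplification thesaurus.
    SIMPLIFICATION = {
        "VILLA": "SETTLEMENT",
        "ROUNDHOUSE": "SETTLEMENT",
        "HUT CIRCLE": "SETTLEMENT",
        "ROUND BARROW": "FUNERARY",
        "INHUMATION": "FUNERARY",
        "FIELD SYSTEM": "AGRICULTURE",
    }

    def simplify_terms(raw_terms):
        """Map raw terms to the reduced set, dropping duplicates and unmapped terms."""
        out = set()
        for term in raw_terms:
            simplified = SIMPLIFICATION.get(term.strip().upper())
            if simplified:
                out.add(simplified)
        return sorted(out)

    # e.g. one record carrying several related terms
    print(simplify_terms(["Villa", "Roundhouse", "Mosaic"]))   # ['SETTLEMENT']

Unmapped terms are simply dropped in this sketch; in practice one would probably want to log them so that the output term list can be refined.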

The grid square tessellations were created using the tools provided as part of the Geospatial Modelling Environment (GME) software, which is free to download and use.  So far, I have created tessellations at resolutions of 1km x 1km, 2km x 2km, and 5km x 5km to cover the different scales of analysis to be undertaken (and ultimately to give flexibility in the resolution of published outputs, given the varying requirements of our data providers).  These were then cut down to the extent of England using a spatial query.
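For anyone without GME to hand, the tessellation step itself is simple to reproduce; the following minimal sketch generates square cells over a bounding box (the coordinates are arbitrary placeholders, not the actual extent of England):

    # Minimal sketch of a square-grid tessellation (GME provides tools for this).
    # The extent used below is an arbitrary placeholder.
    def make_grid(xmin, ymin, xmax, ymax, cell_size):
        """Yield (cell_id, (xmin, ymin, xmax, ymax)) for each square cell."""
        cell_id = 0
        y = ymin
        while y < ymax:
            x = xmin
            while x < xmax:
                yield cell_id, (x, y, x + cell_size, y + cell_size)
                cell_id += 1
                x += cell_size
            y += cell_size

    # e.g. a 5km x 5km grid over a 100km x 100km extent gives 400 cells
    cells = list(make_grid(300000, 100000, 400000, 200000, 5000))
    print(len(cells))   # 400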

ArcGIS’s Identity tool was then used to determine which input objects fell within which grid square (or squares, in the case of large polygons and long lines).  The attribute tables for these identity layers were then exported and run through another Python script to aggregate the entries for each grid square and to eliminate duplication within each square.  The table output by the script (containing the cell identifier, a text string of periods, and a text string of types per period) was then joined to the grid square tessellation layer on the identifier for each cell.  The result is a layer consisting of a series of grid squares, each of which carries a text string attribute recording the broad categories of site type (by period) falling within it.
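In outline, the aggregation script does something along the following lines (the field names and CSV input are simplified assumptions standing in for the exported identity tables, and the PERIOD_TYPE strings written out match the format queried in the second footnote below):

    import csv
    from collections import defaultdict

    # Sketch of the per-cell aggregation: collect the simplified period / type
    # terms falling within each grid square and write one de-duplicated row per
    # cell.  Field names and file paths are assumptions, not the real schema.
    def aggregate(identity_csv, output_csv):
        cells = defaultdict(lambda: defaultdict(set))   # cell_id -> period -> {types}
        with open(identity_csv, newline="") as f:
            for row in csv.DictReader(f):
                cells[row["CELL_ID"]][row["PERIOD"]].add(row["SIMPLE_TYPE"])

        with open(output_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["CELL_ID", "PERIODS", "TYPES"])
            for cell_id, periods in cells.items():
                period_str = ";".join(sorted(periods))
                type_str = ";".join(
                    f"{p}_{t}" for p in sorted(periods) for t in sorted(periods[p])
                )
                writer.writerow([cell_id, period_str, type_str])

    if __name__ == "__main__":
        # placeholder paths for illustration
        aggregate("identity_table.csv", "grid_summary.csv")

The resulting table can then be joined back onto the tessellation layer by the cell identifier to give the presence / absence layer described above.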

This methodology means that we can bring together different datasets within a single schema.  Input objects that overlap more than one output square can record their presence within several output squares* (assuming they are represented in the GIS as polygons / lines of appropriate extent).  Querying the data to produce broad-scale maps of our different periods and/or categories of data is simple (using the ‘LIKE’ operator in ArcMap’s attribute query system, remembering to use appropriate wildcards [% for shapefiles] to catch the full set of terms within each text field**).  The analysis can also be redone using different resolutions of grid tessellation (e.g. 1km x 1km, 2km x 2km or 5km x 5km squares), depending on the quality of the input data and the spatial scale of the research question considered.

So far, this methodology has only been tested using EH’s National Record of the Historic Environment (NRHE) data (as seen online at PastScape: the process described above also captures the relevant identifiers to link the data through to PastScape, with an eye on linked data output in our final website), together with an initial, rather arbitrary, set of simplification terms to produce test results.  It should, however, be straightforward to extend the system to encompass the various other datasets that we are in the process of gathering.  As an example of the output produced, here is a map of Roman settlement sites in the south west of England (settlement being defined here as entries containing any of the words: villa, house, settlement, hut, roundhouse, room, burh, town, barn, building, floor, mosaic; some of these terms obviously do not apply to the Roman period and the list will be subject to revision before final outputs are produced):

[Figure: Roman settlement in the south west of England]

As can be seen, at the scale of a region the output is both clear and instructive.  The result shows the presence or absence of a type of site within each cell, with no quantification of how many examples of each type (as ultimately we will not know whether a total count reflects duplication or genuine multiplicity).  This picture will only improve once we have fully defined the terms used in our simplification process and once we start building in more data from our other data sources.

I shall be presenting a paper on this subject at CAA in Southampton in March.

Chris Green

* Whether this is appropriate or whether they should fall only within the square within which the majority of the polygon falls is still open to debate.  I feel that under a strict rationale of presence / absence, they should appear in all squares they overlap, but this could present a misleading picture in cases where, for example, a small site overlapped the junction of four large grid squares.

** e.g. [Term] LIKE '%RO_SETTLEMENT%'


11 thoughts on “Handling multiple, large datasets using GIS: current progress”

  1. This is very interesting and I may use these techniques in my own research, which uses HER data. At the moment, I’m totally ignoring polygon data in favour of point data, since that is by far the easiest way to analyse the landscape using the geostatistical methods I intend to apply.

    1. Indeed, another nice thing about using grid squares is that you can easily convert them to points (i.e. their centres; GME records these on creation) or even to a raster dataset (if suitable) for use with further analytical tools. The technique would thus also form a good way to include polygon data in your input dataset, if you found it necessary to do so.
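As a trivial illustration of the grid-square-to-points conversion mentioned in the reply above (this is just the arithmetic; GME stores the real centre coordinates on creation, and the coordinates below are arbitrary):

    # Centre point of a grid square from its corner coordinates.
    def cell_centre(xmin, ymin, xmax, ymax):
        return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)

    print(cell_centre(300000, 100000, 305000, 105000))   # (302500.0, 102500.0)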
