GIS: latest developments

This is just a brief update to outline some of the latest developments to the project’s GIS element.

Web mapping: constructing an Open Source geostack

John, Xin and I have been thinking this month about how we might go about constructing the web-based mapping element seen as a vital component of the final website that we will produce as part of the project.  Xin constructed a good initial plan and then I found a very useful tutorial and software package on OpenGeo.  This OpenGeo Suite is well-documented and seems to function well as a route to provide webpage-embedded interactive maps.

The team installed the Suite on the OeRC’s server and used it to feed data to a test webpage (written based upon the OpenLayers API).  The results were very promising and we hope to use this software / process as part of the construction of our final website output, where appropriate.  What data and results the final website will map is still very much an open question!

Converting Ordnance Survey grid references

Whilst looking at and importing AIP data for field systems into the project’s GIS database, I wrote a small piece of Python code to convert Ordnance Survey National Grid References (NGRs) to numeric x and y coordinates for GIS implementation (in metres, not latitude / longitude).  The code should be available for download here.  It runs in Python (2.7) and, as it is only really of use for bulk conversion purposes, would need expanding to read NGRs from a source file and write the output x, y coordinates to a results file.  Apologies for the quality of my coding…
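The script itself is linked above; for readers curious about the underlying arithmetic, a minimal sketch of the conversion might look like the following (the function name and structure are mine, not necessarily those of the original code).  The two grid letters encode the 500 km and 100 km squares respectively (the lettering skips ‘I’), and the remaining digits are split in half and padded out to metres:

```python
def ngr_to_xy(ngr):
    """Convert an OS National Grid Reference (e.g. 'SU1234' or 'TQ 301 804')
    to numeric x, y coordinates in metres."""
    ngr = ngr.replace(' ', '').upper()
    letters, digits = ngr[:2], ngr[2:]

    def idx(c):
        i = ord(c) - ord('A')
        return i - 1 if c > 'I' else i  # the grid lettering skips 'I'

    l1, l2 = idx(letters[0]), idx(letters[1])
    # First letter gives the 500 km square, second the 100 km square within it
    e0 = (((l1 - 2) % 5) * 5 + l2 % 5) * 100000
    n0 = ((19 - (l1 // 5) * 5) - l2 // 5) * 100000

    half = len(digits) // 2
    scale = 10 ** (5 - half)  # pad the digits out to a resolution of 1 m
    return e0 + int(digits[:half]) * scale, n0 + int(digits[half:]) * scale
```

So, for example, SU1234 resolves to (412000, 134000): heeding my tip above, the easting comes first.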

My previous (brief) online guide to this subject can be found here.  Incidentally, my main tip for doing this manually is: make sure you do not confuse your easting and your northing!

Processing NMP data for GIS analysis

I have also been working on processing the NMP data for the south west and south east received from English Heritage.

It was found to be useful to copy the raster layers into a Raster Catalog object in ArcGIS.  This makes possible various analytical methods that would otherwise be closed to raster data (such as the Select by Location tool), and also makes it easier to handle multiple tiles in one batch (including suppressing the display of tiles at scales inappropriate for raster rendering, e.g. when zoomed out too far to show any detail).  The Raster Catalog will be updated as further data is received.

With the vector data (in CAD format), it was found to be useful to convert the files into a single Geodatabase (.gdb) object in ArcGIS.  This, again, makes it easier to handle multiple tiles at once (including maintaining symbology across tiles) and also makes it easier to output composites of multiple tiles to other formats (such as shapefiles).  Again, this Geodatabase will be updated as further data is received.

The results look good and should form a strong basis for the research undertaken by the project (in conjunction with our other data sources, of course).

Chris Green

Aerial photography and ground obscuration (part 2)

Following on from my previous post on this subject, I have now produced a second version of the obscuration layer.  On the suggestion of Graham Fairclough of English Heritage, this version includes the same obscuration factors as before (woodland, water, buildings / roads / railways), but also adds in areas of alluvial and peat sub-surface deposits.  Due to their thickness, these types of deposit tend to obscure archaeological features that were present on the former land surface before the deposits formed.  However, this is not as complete an obscuration as with the previous categories used, for several reasons:

1.  Peaty soils across England are being eroded by agricultural / drainage practices, revealing their buried archaeological material.

2.  Archaeological sites that were constructed after (or later on during) the formation of the deposits will not be (or will be less likely to be) obscured, i.e. the older a site is, the more likely it is to be obscured.

3.  Peat / alluvium deposits may be thin enough for substantial buried archaeological features to show through the masking effect, especially if denuded by more modern intervention.

As such, this result should be viewed more critically than the previous one, in that some areas showing as highly obscured may, in fact, show some archaeological features from the air (notably the region around the Wash), especially when dealing with sites from more recent times.  Also, as with all models, the result presented should not be taken as perfected.  Here, then, is the map showing percentage obscuration for 1km x 1km grid squares across England (built environment, water, woodland, peat, alluvium):

% obscuration of buried features from visual aerial prospection, including geological factors

The data for peat and alluvium deposits were taken from the British Geological Survey’s 1:625,000 geology dataset (superficial deposits) which they provide for free download and unrestricted usage (subject to providing appropriate acknowledgement) under their open data initiative.  This data is provided at the perfect scale for a task such as the one undertaken (i.e. national patterning), but would be less useful for more intensive surveys.  Together with the OS Open Data also used, however, it does demonstrate the excellent results that can be produced as a consequence of organisations opening up their data for free usage by researchers (and, by extension, the general public).

Chris Green

Handling multiple, large datasets using GIS: current progress

Possibly the biggest challenge built into the EngLaId project is in how we bring together and synthesise the diversely recorded datasets that we are using.  Whilst some consistencies exist between the recording methods used by different data sources, there remains a considerable amount of diversity.  English Heritage (EH), England’s 84 Historic Environment Records (HERs), the Portable Antiquities Scheme (PAS) and other data providers / researchers all keep their own separate databases, all recorded in different ways, and the entries within which relate to some of the same objects and to some different objects.

As a result, there is considerable duplication between different data sources, which is not at all easy to identify.  Where data objects have names, such as in the case of many larger sites, these can be used to assess duplication (assuming all datasets use the same names for an object), but this does not apply to the much more common case of objects with no assigned names.

Therefore, the best way in which to discover duplication and attempt to present a synthesis between different datasets is to test for spatial similarity.  In other words, if a Roman villa is present in the same space within two different datasets, we can assume that it is the same villa.  However, this in turn is complicated by the fact that different data sources contain data recorded to different levels of spatial precision and using different data types (e.g. points vs polygons).  The way that I am experimenting with to get around this problem is in applying a tessellation of grid squares over the map and testing the input datasets for which objects fall within each square, recording their type and period, and aggregating across datasets to assess presence or absence of each site type for each period.

The first stage is to simplify down the terms used in the input dataset to a set of (currently) eight output terms (these are still not fully defined as yet and the number of output terms will undoubtedly grow).  This is partly so that the output text fields do not exceed the 254 character limit for ArcGIS shapefiles (I will be working on a solution to this, probably involving moving to the geodatabase format), and partly so that we can identify objects of similar type recorded using different terminologies.  This is accomplished through the use of a Python script.
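In essence, that first script is a lookup from source terms to broad output classes.  A sketch of the idea follows; the mapping shown here is illustrative only, as the project’s actual term list is, as noted, still being defined:

```python
# Illustrative mapping from recorded monument terms to broad output classes;
# the real list is longer and still under revision.
SIMPLIFY = {
    'VILLA': 'SETTLEMENT',
    'ROUNDHOUSE': 'SETTLEMENT',
    'HUT': 'SETTLEMENT',
    'BARROW': 'FUNERARY',
    'CEMETERY': 'FUNERARY',
    'FIELD SYSTEM': 'AGRICULTURE',
}

def simplify_terms(raw_terms):
    """Collapse a list of recorded monument types to broad classes,
    deduplicated and sorted for a compact output text field."""
    out = set()
    for term in raw_terms:
        broad = SIMPLIFY.get(term.strip().upper())
        if broad:
            out.add(broad)
    return sorted(out)
```

Deduplicating at this stage also keeps the output strings short, which matters given the shapefile field limit mentioned above.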

The grid square tessellations were created using the tools provided as part of the Geospatial Modelling Environment software, which is free to download and use.  So far, I have created tessellations at resolutions of 1km x 1km, 2km x 2km, and 5km x 5km to cover the different scales of analysis to be undertaken (and ultimately to give flexibility in the resolution of outputs for publishing purposes with regard to the varying requirements of our data providers).  These were then cut down to the extent of England using a spatial query.

ArcGIS’s identity tool was then used to extract which input objects fell within which grid square (or squares in the case of large polygons and long lines).  The attribute tables for these identity layers were then exported and run through another Python script to aggregate the entries for each grid square and to eliminate duplication for each grid square.  The table output by the script (containing the cell identifier, a text string of periods, and a text string of types per period) was then joined to the grid square tessellation layer based upon the identifier for each cell.  The result is a layer consisting of a series of grid squares, each of which carries a text string attribute recording the broad categories of site type (by period) falling within itself.
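The aggregation step described above can be sketched as follows (the tuple layout and the period / type codes are illustrative, not the actual exported field names):

```python
from collections import defaultdict

def aggregate(rows):
    """rows: (cell_id, period, broad_type) tuples from the exported identity
    table.  Returns one record per grid cell, with duplicate entries removed
    and periods / types collapsed into text strings for joining back on."""
    cells = defaultdict(lambda: defaultdict(set))
    for cell_id, period, broad_type in rows:
        cells[cell_id][period].add(broad_type)  # sets eliminate duplication
    out = []
    for cell_id in sorted(cells):
        periods = sorted(cells[cell_id])
        types = '; '.join(
            '%s_%s' % (p, t) for p in periods for t in sorted(cells[cell_id][p]))
        out.append((cell_id, '; '.join(periods), types))
    return out
```

Combining period and type into tokens such as RO_SETTLEMENT is what makes the ‘LIKE’ queries mentioned below possible.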

This methodology means that we can bring together different datasets within a single schema.  Input objects that overlap more than one output square can record their presence within several output squares* (assuming they are represented in the GIS as polygons / lines of appropriate extent).  Querying the data to produce broad-scale maps of our different periods and/or categories of data is simple (using the ArcMap attribute query system’s ‘LIKE’ query, remembering to use appropriate wildcards [% for shapefiles] to catch the full set of terms within each text field**).  The analysis can also be redone using different resolutions of grid tessellation, depending on the quality of input data and the spatial scale of research question considered (e.g. 1km x 1km or 2km x 2km or 5km x 5km squares).

So far, this methodology has only been tested using EH’s National Record of the Historic Environment (NRHE) data (as seen online at PastScape: the process described above is also capturing the relevant identifiers to link through the data to PastScape, with an eye on linked data output in our final website) and using an initial, rather arbitrary, set of simplification terms to produce test results, but it should be straightforward to extend this system to encompass the various other datasets that we are in the process of gathering.  As an example of the output produced, here is a map of Roman settlement sites in the south west of England (settlement being defined here as entries containing any of the words: villa, house, settlement, hut, roundhouse, room, burh, town, barn, building, floor, mosaic; some of these terms obviously do not apply to the Roman period and the list will be subject to revision before final outputs are produced):

Roman settlement in the south west of England

As can be seen, on the scale of a region the output is both clear and instructive.  The result is one that shows presence or absence of a type of site within each cell, with no quantification given of how many of each type (as we ultimately will not know whether the total count is due to duplication or due to genuine multiplicity).  This picture will only get better once we have fully defined the terms used in our simplification process and once we start building in more data from our other data sources.

I shall be presenting a paper on this subject at CAA in Southampton in March.

Chris Green

* Whether this is appropriate or whether they should fall only within the square within which the majority of the polygon falls is still open to debate.  I feel that under a strict rationale of presence / absence, they should appear in all squares they overlap, but this could present a misleading picture in cases where, for example, a small site overlapped the junction of four large grid squares.

** e.g. [Term] LIKE '%RO_SETTLEMENT%'

Aerial photography and ground obscuration

Examination of aerial photography is one of the primary methods by which archaeologists have surveyed the landscape of England for new sites and for new information about known sites, in a process that continues to this day.  However, it is only possible to find buried archaeological features by this method under certain conditions.  One particular adverse condition that halts all aerial photographic survey work is the obscuration of the ground surface by human and natural features.  Woodlands / forests (LiDAR can see through these to some extent, but photography cannot), lakes, buildings, roads, railways, etc. can hide the ground surface and make the detection of surface and subsurface features impossible.

As a result, distributions of archaeological sites discovered through aerial prospection will inevitably be biased towards areas of open country, particularly arable and pasture lands.  If we wish to make quantitative statements about such distributions, we need a methodology by which to quantify the obscuration of the ground surface, in order to demonstrate which areas of apparent blankness on such a distribution map are, in fact, only blank due to the impossibility of aerial prospection.

When the Ordnance Survey made available some of its data under its OpenData initiative, it became possible to undertake this quantification of obscuration using some quite simple (albeit intensive) computational methods.  This is because the Vector Map product produced by the OS is organised thematically, making it quite simple to download and join together thematic map layers for the whole of the UK (as the current project is only concerned with England, however, the method discussed below has only been undertaken for England).  This forms a series of data layers that would have been very difficult to pull together prior to the OpenData initiative.

To build up a map of ground obscuration for England, the following OS OpenData layers were downloaded and joined together for several regions (European parliamentary constituencies) that together spanned the whole country*: buildings, water areas, forested areas (all polygons), roads, and railways (line data).  It would have been possible to include other layers (such as glasshouses), but it was decided that those listed above were sufficient to produce a good generalised map.  The spatial precision of these layers actually appears very good, especially for the resolution of analysis undertaken (see below).  Buffers were generated for the roads and railway lines, of varying width depending on the type of entity (based on a quick survey of a few entities of each type on Google Earth): 10m for most types of road and for narrow gauge railways; 15m for A roads and single track railways; 20m for trunk roads; and 25m for motorways and multi track railways.
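The buffer widths above amount to a simple lookup by feature type.  A sketch follows; the type labels here are placeholders rather than the actual OS VectorMap attribute values:

```python
# Buffer widths (in metres) by feature type, matching the values given above.
# The keys are placeholder labels, not the real OS VectorMap attribute codes.
BUFFER_WIDTHS = {
    'minor road': 10, 'narrow gauge railway': 10,
    'a road': 15, 'single track railway': 15,
    'trunk road': 20,
    'motorway': 25, 'multi track railway': 25,
}

def buffer_width(feature_type):
    """Return the buffer width in metres for a road / railway feature type,
    defaulting to the narrowest buffer for unlisted types."""
    return BUFFER_WIDTHS.get(feature_type.lower(), 10)
```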

The buffer layers, buildings, water areas and forest layers were then joined together using the union tool in ArcGIS to create a polygon map of ground obscuration for each region.  A 1km by 1km polygon grid square layer was generated using Geospatial Modelling Environment and then reduced down to the outline of England via a spatial overlap query.  The identity tool in ArcGIS was then used to calculate how the polygons in the obscuration layers overlapped with the grid polygons, and the area of each resulting overlap polygon was then calculated.  The attribute tables were exported for these output layers and joined together in Excel into one big table listing the ID number (CELLID) for the related grid square and the area of each obscuration polygon within that square.  A Python script was written which went through this table, adding together the total area of obscuration for each CELLID (this took around seven hours to process), and outputting a new table listing CELLID and total area of obscuration.
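The core of that summation is a single pass over the exported table, accumulating areas per cell.  A sketch, assuming the table is exported as CSV (the column names ‘CELLID’ and ‘AREA’ are assumptions here; the real export will use whatever ArcGIS wrote to the attribute table):

```python
import csv
from collections import defaultdict

def total_obscuration(table_path):
    """Sum the overlap-polygon areas per grid cell in a single pass over the
    exported table.  Column names 'CELLID' and 'AREA' are assumed."""
    totals = defaultdict(float)
    with open(table_path) as f:
        for row in csv.DictReader(f):
            totals[row['CELLID']] += float(row['AREA'])
    return dict(totals)
```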

This output table was joined to the 1km by 1km grid square layer in ArcGIS based upon the CELLID.  We now knew the total area of obscuration for each kilometre grid square of England.  The percentage obscuration was calculated and this percentage figure was then used to create a 1km resolution raster layer showing what percentage of each cell’s ground surface area was obscured by buildings, woodland, water, roads and railways:

% obscuration of ground cover for 1km grid squares in England

Obviously, as with all models, this is not a perfect or perfected result, but I do believe that it provides a very useful quantification of the extent to which the ground surface of England is obscured to any aerial visual observer (the picture would be somewhat different for LiDAR prospection, as then I would not have included trees as a form of obscuration).  There are undoubtedly other types of obscuration feature that could also have been included (areas of alluvium or peat, perhaps) and there may be some types of included feature that can, in certain circumstances, be seen through.  It does, however, provide a good basis for quantifying the extent to which gaps in aerial prospection results for England have resulted from the impossibility of achieving results through that method.  In the context of this project, this is particularly relevant when dealing with English Heritage’s National Mapping Program data, as this was constructed on the basis of aerial survey.

– Chris Green

* This division into regions was purely to ease the processing burden on ArcGIS and my computer.