Chris Green yesterday attended the first day of the Institute for Archaeologists (IfA) conference for 2012 at Oxford Town Hall. He presented a slightly updated version of the paper he delivered a short while ago at CAA2012. The day was informative and enjoyable, so many thanks to Martin Newman and Edmund Lee at EH for the invitation to take part.
Our next team conference appearance should be at the Landscape Archaeology Conference in Berlin in early June. Letty ten Harkel and Chris Green should both be in attendance.
On another note, we would like to thank the HEROs from the following regions for their recent supply of data to our project: Oxford UAD, Lancashire, Worcestershire, Coventry, North East Lincolnshire, Leicestershire, Lincoln UAD, South Yorkshire, Tees. It is greatly appreciated!
Last week, two members of the EngLaId team attended 2012’s international Computer Applications and Quantitative Methods in Archaeology (CAA) at the University of Southampton.
John Pybus spoke in the session Joined-Up Data: what are the new research questions? His paper discussed some of the ways in which our project is hoping to use semantic web technologies to help us answer our research questions.
Chris Green spoke in the session Large databases and datasets about the proposed methodology by which we are attempting to bring together our multiple and very varied datasets within a single analytical GIS environment (see previous blog post for some of the detail). Chris will be presenting a longer version of this paper at the IfA conference in Oxford on 18 April, in the session Where’s IT all going 2?
Useful comment on both papers was gratefully received. The Proceedings should be published in early 2013. Next year’s CAA is in Perth, Australia, again at the end of March: Chris is certainly hoping to attend.
This is just a quick post to advertise the fact that the deadlines have been extended for the three DPhil studentships associated with the EngLaId project. Details can be found here. These studentships are now open for applications until the end of June 2012, so please apply if you think you’re the right person for one of the positions and would like to be a part of our exciting project!
The EngLaId team had a productive meeting with Simon Crutchley and Martin Newman from English Heritage last week, where we presented our work to date on their data (the NMP and its associated attribute database AMIE/NRHE), and discussed how we might combine / synthesise their data with our other datasets. We also discussed many other topics, including the history of the NMP. Our thanks go to Simon and Martin for coming up to Oxford to meet the team.
Often when dealing with spatial data recorded using Ordnance Survey (OS) National Grid References (NGRs for short), you will find that different objects within your data are recorded to differing levels of spatial precision. Treating them all simply as point objects can produce misrepresentative results (especially for objects possessing only a very vague location) and, therefore, we need a method of representation that is truer to the actual spatial precision of the data objects.
To help explain, NGRs are formatted as follows (feel free to skip this paragraph if you understand NGRs): the OS divides the UK up into 100km by 100km grid squares, which are designated using two letters. Therefore, an NGR consists of two letters followed by a string of numbers, the numbers representing the object’s location within the large grid square. There should be an even number of these digits, with the first half representing the easting and the second half the northing. However, the exact number of digits will vary according to how precisely located the object is. For example, an NGR such as SQ2222233333 is in the square SQ at the location 22222 metres east, 33333 metres north; however, we would have to add zeroes to a shorter NGR as the precision of the point is less well known, e.g. SP444555 is in the square SP at the location 44400, 55500. For more on converting NGRs to numeric coordinates, see my previous post.
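The parsing just described can be sketched in a few lines of Python. This is only a minimal illustration, not the script from my previous post: it handles only well-formed NGRs and derives the 100km square offsets arithmetically from the two grid letters rather than using a lookup table:

```python
def ngr_to_xy(ngr):
    """Convert an OS National Grid Reference to numeric x, y in metres."""
    ngr = ngr.replace(" ", "").upper()
    # Index of each grid letter in the A-Z sequence (the letter I is not used)
    idx = lambda c: ord(c) - ord("A") - (1 if c > "I" else 0)
    i1, i2 = idx(ngr[0]), idx(ngr[1])
    # 100km square offsets implied by the two letters
    e100 = ((i1 - 2) % 5) * 5 + i2 % 5
    n100 = (19 - (i1 // 5) * 5) - i2 // 5
    digits = ngr[2:]
    if len(digits) % 2:
        raise ValueError("odd number of digits in NGR: " + ngr)
    half = len(digits) // 2
    # Pad each half with zeroes up to metre precision (5 digits)
    x = e100 * 100000 + int(digits[:half].ljust(5, "0"))
    y = n100 * 100000 + int(digits[half:].ljust(5, "0"))
    return x, y
```

For example, `ngr_to_xy("SP444555")` gives the point 44400 metres east and 55500 metres north of the origin of square SP (which itself sits at 400000, 200000 on the national grid).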
However, simply counting the digits is not always a sufficient measure of coordinate precision, as the person who entered the data may have tagged zeroes onto the end of the easting and northing. For example, the NGR ST5500066000 appears to be precise to the nearest metre, but it is unlikely that a point would fall exactly on the coordinates 55000, 66000; as such, it is much more likely that this point is recorded to the nearest kilometre. I have a small Python script to perform this test, which can be downloaded here.
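The trailing-zeroes test can be sketched as follows. This is an illustration of the idea rather than the downloadable script, and it is deliberately conservative: a coordinate that genuinely falls on a round value will be judged less precise than it really is.

```python
def true_precision(ngr):
    """Estimate the real precision of an NGR in metres."""
    digits = ngr.replace(" ", "")[2:]
    half = len(digits) // 2
    e, n = digits[:half], digits[half:]
    precision = 10 ** (5 - half)  # nominal precision from the digit count
    # While both halves end in zero, assume those zeroes were padding
    while half > 1 and e.endswith("0") and n.endswith("0"):
        e, n, half = e[:-1], n[:-1], half - 1
        precision *= 10
    return precision
```

Here `true_precision("ST5500066000")` returns 1000, i.e. nearest-kilometre, rather than the nearest-metre precision its ten digits suggest.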
In any event, the end result of this is that an NGR that is apparently a point is, in fact, properly the origin point (southwest corner) of a square of width and height equal to its precision, e.g. the NGR ST5566 represents a square 1km by 1km spreading northeast from its location, whereas the NGR SQ2222233333 is a 1m by 1m square. Therefore, if we want to represent these points properly in a GIS environment, we have to present them as squares of appropriate size.
So how can we automate creating this representation in ArcMap (version 10)?* I will assume reasonable familiarity with the software.
If we have a point shapefile of our locations (use a duplicate to preserve the original), with their associated attribute data including their spatial precision**, we can create these squares in four relatively simple steps. Here are our points on a map:
First, we open up the attribute table for the layer. Then we add a new field to the table, which we shall call MIDPOINT and which will be of the Float type (i.e. a floating point number). Then we open up the Field Calculator for the MIDPOINT field and enter the calculation [Precision field] / 2.0 as follows:
This new field will be used to add on to the x and y coordinates of each point in the layer to move them to the centre of each square to be created. We then open the Field Calculator for the Shape field and enter the code snippet seen here (make sure you have the Python parser selected at the top):
The expression can be downloaded here. This results in the points moving to a new location to the north east in the midpoint of the squares we will soon create (notice that some points have moved a long way yet others remain close to their origin):
Next we create buffers for our moved points using the Buffer tool in ArcToolbox. You want to set it to use the MIDPOINT field as the buffer size (we use the MIDPOINT rather than the full precision field as the size will be the radius of the circle created, not the diameter) and you do not want it to dissolve the buffers where they overlap, as we want one object for each point. The result should look something like this:
Finally, we use the Feature Envelope to Polygon tool in ArcToolbox using the buffer layer as input. This, helpfully, retrieves the envelopes that enclose each circle (used to make spatial queries faster etc.), which is equivalent to the squares that we were after in the first place. Therefore, this layer is the result that we desired:
Any points that do not seem to have output squares in the image do, in fact, have them but they are too small to see at this scale (e.g. 1m or 10m precision). Overall, this is a simple way to do this task in just a few minutes whilst avoiding any overly complex scripting. I think I will work on a Python script to achieve this task in one step, but the method just described works perfectly well in the interim.
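In the meantime, the geometry underlying the four ArcMap steps is simple enough to sketch in plain Python (assuming x, y is the southwest origin taken from the NGR and prec the precision in metres); the buffer-then-envelope trick produces exactly this square:

```python
def precision_square(x, y, prec):
    """Return the corner coordinates of the precision square for an NGR 'point'.

    The NGR locates the southwest corner; the square extends prec metres
    north and east, so its centre lies at (x + prec / 2.0, y + prec / 2.0)
    -- the same shift the MIDPOINT field performs in ArcMap.
    """
    return [
        (x, y),                # SW corner (the original NGR point)
        (x + prec, y),         # SE
        (x + prec, y + prec),  # NE
        (x, y + prec),         # NW
    ]
```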
The final result gives us both a visual representation of the area to which each of our NGR “points” actually relates and a layer that can be used in further analyses of our data: all of the attributes from the original layer should find their way automatically into the final output, so we can perform whatever analyses we wish on our new, more properly representative geographic objects.
* ArcMap is part of ESRI’s ArcGIS and is the GIS software that we are predominantly using in this project.
** If you did not record the precision of the NGRs whilst converting them, you could use the Field Calculator to work it out (in the same way as described above for the other calculations). You would have to create fields called something like ‘PRECISION’ and ‘TRUEPRECIS’, and then use the expression code downloadable here and here in turn on the first and then the second field.
The Archaeological Investigations Project (AIP) is an ongoing project at Bournemouth University that collates the brief results of commercial archaeological investigations in England for each year. With the very kind assistance of Ehren Milner, we have extracted data from their database for our period in respect of two specific classes of monument type: settlements and field systems. These two search queries were selected first as they seemed to cover two of the key monument types of interest to our project; we will most probably ask Ehren to undertake further or revised queries in the future.
The search terms used were taken from the English Heritage thesaurus, as that is used by many organisations and it seemed a sensible starting point. Terms were selected from the list where they were relevant to our period of interest (e.g. we did not include terms under the settlement class like ‘Olympic village’ or ‘railway workers temporary settlement’), leaving out some longer terms which ought to be captured by a subset of their content (e.g. by including ‘field system’, we also ought to catch items named ‘Celtic field system’, ‘coaxial field system’, ‘enclosed field system’, etc.).
The field system query term list was:
RIDGE AND FURROW
The settlement query term list was:
It is not claimed that these lists are exhaustive or likely to capture every instance of these two classes of monument. For instance, under settlement I was uncertain whether or not I should have included ‘villa’, especially as I did include terms like ‘enclosure’ and ‘fort’. The searches would also not capture objects recorded using different terminologies, or sites too small to be recorded under monument types at this scale (e.g. an intervention which only discovered one ditch of a field system or settlement could not properly be recorded as such by the excavator without further supportive evidence, so would likely be recorded in the AIP simply as ‘ditch’).
In any event, subject to these caveats, the data received from Ehren (in .mdb database format) was exported to a .csv table and processed using a Python script to convert the OS NGRs (see my previous post) to numeric x and y coordinates (only 3 sites of over 4,000 seemed to have incorrectly recorded NGRs, i.e. they had an odd number of digits) and to extract the periods recorded for each site to a separate field in the new output table. This processed table can then be imported into ArcGIS in the conventional fashion.
Here is a map of the results, plotted against our initial proposed case study locations:
This data is especially useful as it helps to fill gaps between areas where the National Mapping Programme (NMP) has not yet been undertaken. For example, Cambridgeshire has had much commercial archaeology done in recent years, but has not been surveyed by the NMP as yet. Hopefully, by using data from the HERs, EH data, PAS data, and AIP data in combination (alongside other more specific datasets), we ought to be able to build up an excellent picture of English archaeology for our period.
This is just a brief update to outline some of the latest developments to the project’s GIS element.
Web mapping: constructing an Open Source geostack
John, Xin and I have been thinking this month about how we might go about constructing the web-based mapping element seen as a vital component of the final website that we will produce as part of the project. Xin constructed a good initial plan and then I found a very useful tutorial and software package on OpenGeo. This OpenGeo Suite is well-documented and seems to function well as a route to provide webpage-embedded interactive maps.
The team installed the Suite on the OeRC’s server and used it to feed data to a test webpage (written based upon the OpenLayers API). The results were very promising and we hope to use this software / process as part of the construction of our final website output, where appropriate. What data and results the final website will map is still very much an open question!
Converting Ordnance Survey grid references
Whilst looking at and importing AIP data for field systems into the project’s GIS database, I wrote a small piece of Python code to convert Ordnance Survey National Grid References (NGRs) to numeric x and y coordinates for GIS implementation (in metres, not latitude / longitude). The code should be available for download here. It runs in Python (2.7); as it is only really of use for bulk conversion purposes, it would need expanding to read NGRs from a source file and to write the output x y coordinates to a results file. Apologies for the quality of my coding…
My previous (brief) online guide to this subject can be found here. Incidentally, my main tip for doing this manually is: make sure you do not confuse your easting and your northing!
Processing NMP data for GIS analysis
I have also been working on processing the NMP data for the south west and south east received from English Heritage.
It was found to be useful to copy the raster layers into a Raster Catalog object in ArcGIS. This makes possible various analytical methods that would otherwise be closed to raster data (such as the Select by Location tool), and also makes it easier to handle multiple tiles in one batch (including controlling the display of blank vector tiles at inappropriate scales for raster rendering, e.g. when zoomed out too far to show any detail). The Raster Catalog will be updated as further data is received.
With the vector data (in CAD format), it was found to be useful to convert the files into a single Geodatabase (.gdb) object in ArcGIS. This, again, makes it easier to handle multiple tiles at once (including maintaining symbology across tiles) and also makes it easier to output composites of multiple tiles to other formats (such as shapefiles). Again, this Geodatabase will be updated as further data is received.
The results look good and should form a strong basis for the research undertaken by the project (in conjunction with our other data sources, of course).
Following on from my previous post on this subject, I have now produced a second version of the obscuration layer. On the suggestion of Graham Fairclough of English Heritage, this version includes the same obscuration factors as before (woodland, water, buildings / roads / railways), but also adds in areas of alluvial and peat sub-surface deposits. These types of deposit tend to obscure archaeological features that were present on the former land surface before they formed, due to their thickness. However, this is not as complete an obscuration as with the previous categories used, for several reasons:
1. Peaty soils across England are being eroded by agricultural / drainage practices, revealing their buried archaeological material.
2. Archaeological sites that were constructed after (or later on during) the formation of the deposits will not be (or will be less likely to be) obscured, i.e. the older a site is, the more likely it is to be obscured.
3. Peat / alluvium deposits may be thin enough for substantial buried archaeological features to show through the masking effect, especially if denuded by more modern intervention.
As such, this result should be viewed more critically than the previous one, in that some areas showing as highly obscured may, in fact, show some archaeological features from the air (notably the region around the Wash), especially when dealing with sites from more recent times. Also, as with all models, the result presented should not be taken as perfected. Here, then, is the map showing percentage obscuration for 1km x 1km grid squares across England (built environment, water, woodland, peat, alluvium):
The data for peat and alluvium deposits were taken from the British Geological Survey’s 1:625,000 geology dataset (superficial deposits) which they provide for free download and unrestricted usage (subject to providing appropriate acknowledgment) under their open data initiative. This data is provided at the perfect scale for a task such as the one undertaken (i.e. national patterning), but would be less useful for more intensive surveys. Together with the OS Open Data also used, however, it does demonstrate the excellent results that can be produced as a consequence of organisations opening up their data for free usage by researchers (and, by extension, the general public).
Possibly the biggest challenge built into the EngLaId project is how we bring together and synthesise the diversely recorded datasets that we are using. Whilst some consistencies exist between the recording methods used by different data sources, there remains a considerable amount of diversity. English Heritage (EH), England’s 84 Historic Environment Records (HERs), the Portable Antiquities Scheme (PAS) and other data providers / researchers all keep their own separate databases, all recorded in different ways, with entries that partly relate to the same objects and partly to different ones.
As a result, there is considerable duplication between different data sources, which is not at all easy to extract. Where data objects have names, such as in the case of many larger sites, this can be used to assess duplication (assuming all datasets use the same names for an object), but this does not apply to the much more common case of objects with no assigned names.
Therefore, the best way in which to discover duplication and attempt to present a synthesis between different datasets is to test for spatial similarity. In other words, if a Roman villa is present in the same space within two different datasets, we can assume that it is the same villa. However, this in turn is complicated by the fact that different data sources contain data recorded to different levels of spatial precision and using different data types (e.g. points vs polygons). The way that I am experimenting with to get around this problem is in applying a tessellation of grid squares over the map and testing the input datasets for which objects fall within each square, recording their type and period, and aggregating across datasets to assess presence or absence of each site type for each period.
The first stage is to simplify the terms used in the input dataset down to a set of (currently) eight output terms (these are not yet fully defined and the number of output terms will undoubtedly grow). This is partly so that the output text fields do not exceed the 254 character limit for ArcGIS shapefiles (I will be working on a solution to this, probably involving moving to the geodatabase format), and partly so that we can identify objects of similar type recorded using different terminologies. This is accomplished through the use of a Python script.
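The simplification step amounts to mapping many input terms onto a few broad ones. The keywords below are illustrative only (the project’s real list is still being defined); substring matching means compound terms like ‘coaxial field system’ are caught by the shorter entry, though note that a naive substring test would also catch ‘village’ under ‘villa’, so word-boundary matching may be preferable in practice:

```python
# Illustrative mapping of input keywords to broad output terms
SIMPLIFY = {
    "villa": "settlement",
    "roundhouse": "settlement",
    "hut": "settlement",
    "field system": "field system",
    "ridge and furrow": "field system",
}

def simplify_terms(monument_types):
    """Collapse a list of recorded monument types to broad categories."""
    broad = set()
    for term in monument_types:
        term = term.lower()
        for keyword, category in SIMPLIFY.items():
            if keyword in term:
                broad.add(category)
    return sorted(broad)
```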
The grid square tessellations were created using the tools provided as part of the Geospatial Modelling Environment software, which is free to download and use. So far, I have created tessellations at resolutions of 1km x 1km, 2km x 2km, and 5km x 5km to cover the different scales of analysis to be undertaken (and ultimately to give flexibility in the resolution of outputs for publishing purposes with regard to the varying requirements of our data providers). These were then cut down to the extent of England using a spatial query.
ArcGIS’s identity tool was then used to extract which input objects fell within which grid square (or squares in the case of large polygons and long lines). The attribute tables for these identity layers were then exported and run through another Python script to aggregate the entries for each grid square and to eliminate duplication for each grid square. The table output by the script (containing the cell identifier, a text string of periods, and a text string of types per period) was then joined to the grid square tessellation layer based upon the identifier for each cell. The result is a layer consisting of a series of grid squares, each of which carries a text string attribute recording the broad categories of site type (by period) falling within itself.
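The aggregation and de-duplication step can be sketched as follows, assuming the exported attribute table has been reduced to (cell_id, period, type) tuples with types already simplified; duplicates collapse automatically because each cell holds a set:

```python
from collections import defaultdict

def aggregate_by_cell(rows):
    """Aggregate (cell_id, period, type) rows into one text string per cell."""
    cells = defaultdict(lambda: defaultdict(set))
    for cell_id, period, mtype in rows:
        cells[cell_id][period].add(mtype)  # set membership removes duplicates
    result = {}
    for cell_id, periods in cells.items():
        result[cell_id] = "; ".join(
            "%s: %s" % (period, ", ".join(sorted(types)))
            for period, types in sorted(periods.items())
        )
    return result
```

The output dictionary can then be written back out as a table and joined to the tessellation layer on the cell identifier.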
This methodology means that we can bring together different datasets within a single schema. Input objects that overlap more than one output square can record their presence within several output squares* (assuming they are represented in the GIS as polygons / lines of appropriate extent). Querying the data to produce broad-scale maps of our different periods and/or categories of data is simple (using the ArcMap attribute query system’s ‘LIKE’ query, remembering to use appropriate wildcards [% for shapefiles] to catch the full set of terms within each text field**). The analysis can also be redone using different resolutions of grid tessellation, depending on the quality of input data and the spatial scale of research question considered (e.g. 1km x 1km or 2km x 2km or 5km x 5km squares).
So far, this methodology has only been tested using EH’s National Record of the Historic Environment (NRHE) data (as seen online at PastScape; the process described above also captures the relevant identifiers to link each record back to PastScape, with an eye on linked data output in our final website), and using an initial, rather arbitrary, set of simplification terms to produce test results. It should, however, be straightforward to extend this system to encompass the various other datasets that we are in the process of gathering. As an example of the output produced, here is a map of Roman settlement sites in the south west of England (settlement being defined here as entries containing any of the words: villa, house, settlement, hut, roundhouse, room, burh, town, barn, building, floor, mosaic; some of these terms obviously do not apply to the Roman period and the list will be subject to revision before final outputs are produced):
As can be seen, on the scale of a region the output is both clear and instructive. The result is one that shows presence or absence of a type of site within each cell, with no quantification given of how many of each type (as we ultimately will not know whether the total count is due to duplication or due to genuine multiplicity). This picture will only get better once we have fully defined the terms used in our simplification process and once we start building in more data from our other data sources.
I shall be presenting a paper on this subject at CAA in Southampton in March.
* Whether this is appropriate or whether they should fall only within the square within which the majority of the polygon falls is still open to debate. I feel that under a strict rationale of presence / absence, they should appear in all squares they overlap, but this could present a misleading picture in cases where, for example, a small site overlapped the junction of four large grid squares.
Examination of aerial photography is one of the primary methods by which archaeologists have surveyed the landscape of England for new sites and for new information about known sites, in a process that continues to this day. However, it is only possible to find buried archaeological features by this method under certain conditions. One particular adverse condition that halts all aerial photographic survey work is the obscuration of the ground surface by human and natural features. Woodlands / forests (LiDAR can see through these to some extent, but photography cannot), lakes, buildings, roads, railways, etc. can hide the ground surface and make the detection of surface and subsurface features impossible.
As a result, distributions of archaeological sites discovered through aerial prospection will inevitably be biased towards areas of open country, particularly arable and pasture lands. If we wish to make quantitative statements about such distributions, we need a methodology by which to quantify the obscuration of the ground surface, in order to demonstrate which areas of apparent blankness on such a distribution map are, in fact, only blank due to the impossibility of aerial prospection.
When the Ordnance Survey made available some of its data under its OpenData initiative, it became possible to undertake this quantification of obscuration using some quite simple (albeit intensive) computational methods. This is because the Vector Map product produced by the OS is organised thematically, making it quite simple to download and join together thematic map layers for the whole of the UK (as the current project is only concerned with England, the method discussed below has only been undertaken for England, however). This forms a series of data layers that would have been very difficult to pull together prior to the OpenData initiative.
To build up a map of ground obscuration for England, the following OS OpenData layers were downloaded and joined together for several regions (European parliamentary constituencies) that together spanned the whole country*: buildings, water areas, forested areas (all polygons), roads, and railways (line data). It would have been possible to include other layers (such as glasshouses), but it was decided that those listed above were sufficient to produce a good generalised map. The spatial precision of these layers actually appears very good, especially for the resolution of analysis undertaken (see below). Buffers were generated for the roads and railway lines, of varying width depending on the type of entity (based on a quick survey of a few entities of each type on Google Earth): 10m for most types of road and for narrow gauge railways; 15m for A roads and single track railways; 20m for trunk roads; and 25m for motorways and multi track railways.
The buffer layers, buildings, water areas and forest layers were then joined together using the union tool in ArcGIS to create a polygon map of ground obscuration for each region. A 1km by 1km polygon grid square layer was generated using Geospatial Modelling Environment and then reduced down to the outline of England via a spatial overlap query. The identity tool in ArcGIS was then used to calculate how the polygons in the obscuration layers overlapped with the grid polygons, and the area of each resulting overlap polygon was then calculated. The attribute tables were exported for these output layers and joined together in Excel into one big table listing the ID number (CELLID) for the related grid square and the area of each obscuration polygon within that square. A Python script was written which went through this table, adding together the total area of obscuration for each CELLID (this took around seven hours to process), and outputting a new table listing CELLID and total area of obscuration.
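That summing pass is essentially a group-by, and a dictionary-based sketch like the one below should complete it in a single pass over the table (the AREA column name is an assumption here; CELLID is as described above):

```python
import csv
from collections import defaultdict

def total_obscuration(csv_path):
    """Sum obscuration-polygon areas per grid cell from an exported table."""
    totals = defaultdict(float)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            totals[row["CELLID"]] += float(row["AREA"])
    return totals
```

For 1km by 1km cells, the percentage obscuration for each cell is then simply the total divided by 1,000,000 square metres, times 100.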
This output table was joined to the 1km by 1km grid square layer in ArcGIS based upon the CELLID. We now knew the total area of obscuration for each kilometre grid square of England. The percentage obscuration was calculated and this percentage figure was then used to create a 1km resolution raster layer showing what percentage of each cell’s ground surface area was obscured by buildings, woodland, water, roads and railways:
Obviously, as with all models, this is not a perfect or perfected result, but I do believe that it provides a very useful quantification of the extent to which the ground surface of England is obscured to any aerial visual observer (the picture would be somewhat different for LiDAR prospection, as then I would not have included trees as a form of obscuration). There are undoubtedly other types of obscuration feature that could also have been included (areas of alluvium or peat, perhaps) and there may be some types of included feature that can, in certain circumstances, be seen through. It does, however, provide a good basis for quantifying the extent to which gaps in aerial prospection results for England have resulted from the impossibility of achieving results through that method. In the context of this project, this is particularly relevant when dealing with English Heritage’s National Mapping Programme data, as this was constructed on the basis of aerial survey.
– Chris Green
* This division into regions was purely to ease the processing burden on ArcGIS and my computer.