Using attribute data for legend labeling in QGIS?


For features in a map it is possible to stick data from the attribute table together and get a data-driven label. I am trying to do the same for my legend labels: I want to combine values from my attribute table into a customised label, so that the legend has rule-based label entries. This is rather similar to a "rule based style", but with combined column values.

For example: Klass_1 (3 %)

For instance Klass_1 is from a column "Klass" and 3 % from a column "Percentage".

That's the plan:

How can I do this?


I've sort of replicated your situation.

Not sure if there's an easier way to do it, but the following code snippet works for me (I'm using QGIS v.2.8). To use it, activate your layer in the ToC, open the QGIS Python console, and copy&paste the code there.

    lyr = iface.activeLayer()
    renderer = lyr.rendererV2()
    children = renderer.rootRule().children()
    fieldName1 = "Klass"
    fieldName2 = "Percentage"

    for child in children:  # Iterate through groups
        if child.filter():
            feat = next(lyr.getFeatures(QgsFeatureRequest(child.filter())), None)
            if feat:
                child.setLabel(feat.attribute(fieldName1) + " (" + feat.attribute(fieldName2) + ")")
        for subChild in child.children():  # Iterate through subgroups
            if subChild.filter():
                feat = next(lyr.getFeatures(QgsFeatureRequest(subChild.filter())), None)
                if feat:
                    subChild.setLabel(feat.attribute(fieldName1) + " (" + feat.attribute(fieldName2) + ")")

The code assumes that the fields to concatenate are of type String.

After the code runs, open the layer properties and you should see the new concatenated labels. Click on OK to refresh your legend.

I obtain this:

Let me know if you have troubles with it.
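Since the snippet above assumes that both fields are strings, a numeric "Percentage" column would make the `+` concatenation fail. A minimal sketch of a helper that accepts any value type is shown below. The function name `build_legend_label` and its `unit` parameter are hypothetical, not part of the QGIS API, and the handling of missing values only covers Python's `None` (QGIS may represent NULL differently depending on the version).

```python
def build_legend_label(klass, percentage, unit="%"):
    """Return a legend label such as 'Klass_1 (3 %)' from two attribute values.

    Hypothetical helper: both values are converted to text, so numeric
    fields work as well as string fields.
    """
    name = "" if klass is None else str(klass)
    if percentage is None:
        # No percentage available: fall back to the class name alone
        return name
    return "{} ({} {})".format(name, percentage, unit)


print(build_legend_label("Klass_1", 3))
print(build_legend_label("Klass_2", 12.5))
```

Inside the loop, the label could then be set with something like `child.setLabel(build_legend_label(feat.attribute(fieldName1), feat.attribute(fieldName2)))`.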


If this is a set of styles that you apply as part of a routine workflow, it might be worth setting up a QML style file that can be applied to any workspace of the same type.

If the styling of elements is thematic (and you know what columns drive the theme) you can build a QML file programmatically straight out of PostgreSQL (assuming you have your data in there… I put all our data into PostgreSQL using ogr2ogr, as a matter of routine).

For example, I have a file with ~150 styles for government zone polygons (subsets of the zones have the same style). We wanted to categorise by zone_code, but have a more-informative Legend entry - and we wanted to use a colour ramp as dictated by a specific table column:

To create my QML file, I wrote a python script that looks like this -

    import psycopg2

    # Set up connection to database
    conn_string = "host='localhost' dbname='[db]' user='[me]' password='[not telling]'"

    # Connect to database
    conn = psycopg2.connect(conn_string)

    # Set up default cursor (to run queries)
    cursor = conn.cursor()

    # Get LGA details; store them to a list
    # (the triple-quoted string lets the SQL keep its own single and double quotes)
    cursor.execute("""
        select xmlelement(name qgis, xmlattributes('2.8.2-Wien' as version),
                 xmlelement(name "renderer-v2", xmlattributes('zone_code' as attr, 0 as symbollevels, 'categorizedSymbol' as type),
                   xmlelement(name symbols,
                     xmlagg(xmlelement(name symbol, xmlattributes(1 as alpha, 'fill' as type, gid-1 as name),
                       xmlelement(name layer, xmlattributes(0 as pass, 'SimpleFill' as class, 0 as locked),
                         xmlelement(name prop, xmlattributes('color' as k, rgba as v)),
                         xmlelement(name prop, xmlattributes('outline_color' as k, '230,230,230,255' as v)))))),
                   xmlelement(name categories,
                     xmlagg(xmlelement(name category, xmlattributes('true' as render, gid-1 as symbol, zone_code as value, zone_code||' ('|| description || ')' as label))))))
        from se_zones""")

    xml_new = ""
    for row in cursor:
        print(row[0])
        xml_new = row[0]
    # print('XML_New: ', xml_new)  # testing only

    # Pre-pend QGIS-specific DOCTYPE to XML
    # (the DOCTYPE text itself is missing here; only its trailing newline is left)
    outStr = "\n" + xml_new

    # Open destination XML file and save XML to it
    outFile = open("Zones.qml", 'w')
    outFile.write(outStr)
    outFile.close()

The script is kludgy (there's no need for the for row in cursor loop, since cursor only returns one row), but the result is exactly what I wanted: a QML file that I could distribute to my colleagues so that they could style Zoned data with 3 clicks.

Note that in this particular instance, zone_code and description are concatenated to give the Legend entry.

When theme files need to be built for different sets of categories, only the query has to change: we have ~25 different theme sets, and they were all constructed from that one script by changing the query and the destination file.
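To illustrate the label concatenation without a database, here is a minimal sketch that builds only the categories part of such a style file in plain Python. The element and attribute names are the ones used in the query above. The function name `build_categories` and the sample rows are hypothetical. The symbols element is left out, so the result is not a complete QML file.

```python
import xml.etree.ElementTree as ET


def build_categories(rows):
    """Build a <renderer-v2> element with one <category> per row.

    rows: iterable of (zone_code, description) tuples (hypothetical input).
    """
    renderer = ET.Element(
        "renderer-v2", {"attr": "zone_code", "type": "categorizedSymbol"}
    )
    categories = ET.SubElement(renderer, "categories")
    for index, (zone_code, description) in enumerate(rows):
        ET.SubElement(categories, "category", {
            "render": "true",
            "symbol": str(index),  # plays the role of gid-1 in the query
            "value": zone_code,
            # Legend entry: zone code followed by its description
            "label": "{} ({})".format(zone_code, description),
        })
    return renderer


# Made-up sample data, for illustration only
sample_rows = [("R1", "General Residential"), ("C1", "Commercial")]
renderer = build_categories(sample_rows)
labels = [c.get("label") for c in renderer.iter("category")]
print(labels)
```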

The output looks like this -

          

For your specific theme you would have to write the appropriate query (which means you need to know the columns that 'drive' the style categories).


Demystifying Japan Official Topographical Map Shapefiles

Just recently here on the Hokkaido Wilds, we’ve begun providing free, downloadable, printable PDF topographic maps of the ski touring, hiking, and snowshoeing routes we publish on the site; just look for the TOPOMAP+ button on route pages. We use QGIS, a GIS application, for the mapping work. So far we’ve got maps for the Italian Route (PDF), Bankei Hut hike (PDF), Asahidake-Kurodake Traverse (PDF), Monbetsu-dake snowshoe route (PDF) and Sapporo-dake ski tour route (PDF), among others. The workflow to produce these maps is: make the topomap in QGIS, export it as a PDF, import the PDF into inDesign, then do the layout, description, photos etc.

NOTE: I’m an amateur who is, really, grossly under-qualified to be schooling anyone on GIS stuff. I do understand Japanese though, so if you’re experienced with GIS and you have a question about the shapefiles or anything else that might require some Japanese interpretation, please let me know in the comments section.

If you are familiar with QGIS, applying styles, and the Geospatial Information Authority of Japan (GSI) vector files, and you just want to have a go at reproducing the Hokkaido Wilds map styles, then download our QGIS styles here (ZIP file).

If you want more details about attributes and the different kinds of shapefiles, then read on.

For much of the process of using GSI vector files, including downloading DEM vector/shape files and applying for permissions (see below), you’ll need a GSI account. So, first, make an account.

  1. Access https://onestop.gsi.go.jp/onestopservice/onestop_login.html and click on 新規登録 or ‘sign up’ if using Google Translate or similar.
  2. Follow the on-screen instructions to create your account.

NOTE: I checked the account creation page using the built-in Google Translate feature on Google Chrome, and the English translation was mostly legible. Note however that the page will jump back to Japanese when fields are updated etc. Just re-translate the page and you should be OK. I also tried copy and pasting the account signup page URL into the Google Translate page, but this didn’t go as well (pages stalled etc). Where possible, try using your browser’s in-built translation features.

In order to use GSI map data for your map project, permission must be sought from the GSI director. This is all free, no charges required, even for commercial use. This can be done online, but the interface is in Japanese. You can, however, use the built-in English translation feature of your browser.

For the Hokkaido Wilds purposes, we are using (使用) the map data and visualizing it in our own way, so there is no need to apply for reproduction (複製) permissions (as would be the case if we use the GSI raster map tiles). So, I’ll only talk about the data usage permissions here.

For our purposes we applied for the use of the GSI Basic Geospatial Information Digital Maps (基盤地図情報 – actual map data including contour lines, DEM etc) and National Land Numerical Information (数値地図 – just for bus routes http://nlftp.mlit.go.jp/ksj-e/index.html).

Procedure for applying for permission to use GSI map data

  1. Log into your GSI account (https://onestop.gsi.go.jp/onestopservice/onestop_login.html). See account creation details above.
  2. Access https://onestop.gsi.go.jp/onestopservice/.
  3. Click on the 申請 (application) button for ①数値地図、紙地図、空中写真、基盤地図情報等の複製承認申請または使用承認申請 (Google Translate: Application for copying approval or application for approval for use such as numerical map, paper map, aerial photograph, foundation map information etc).
  4. Work through the application flow, according to this PDF, which will cover the use (使用) of GSI vector data for creation of paper and/or PDF maps.
  5. Once you’ve filled everything in and submitted it, a PDF copy of your permission will be sent to your email within about 3 working days from GSI.
  6. On any map project covered by your permission, you’ll need to display the Japanese text indicated on the PDF permissions document (see an example here, at the bottom of the top middle panel of the first page).

Use the styles here to reproduce the Hokkaido Wilds topomap styles in QGIS (ZIP file): download

Reproduce and/or build on the Hokkaido Wilds PDF map layouts/color schemes using the following inDesign files.

NOTE: The PDF map portions are created/edited in QGIS, exported as PDF from QGIS, and inserted into inDesign.

FONTS: These layouts use the free Lato and Jaapokki fonts. For any Japanese text, we’re using the Kozuka Gothic Pr6N font.


Open Journal of Forestry Vol.10 No.01(2020), Article ID:97100,19 pages
10.4236/ojf.2020.101004

Analysis of Spatio-Temporal Dynamics of Land Use in the Bouba Ndjidda National Park and Its Adjacent Zone (North Cameroun)

José Elvire Boukeng Djiongo 1,2 , André Desrochers 1 , Marie Louise Tiencheu Avana 3 , Damase Khasa 1 , Louis Zapfack 4 , Éric Fotsing 5

1 Department of Wood and Forest Sciences, Faculty of Forestry, Geography and Geomatics, Laval University, Quebec, Canada

2 Garoua Wildlife School, Garoua, Cameroun

3 Department of Forestry, Faculty of Agronomy and Agricultural Sciences, University of Dschang, Dschang, Cameroon

4 Department of Plant Biology, Faculty of Sciences, University of Yaoundé 1, Yaoundé, Cameroon

5 Department of Computer Engineering, University Institute of Technology Fotso Victor University of Dschang, Bandjoun, Cameroon

Copyright © 2020 by author(s) and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

Received: October 19, 2019 Accepted: December 10, 2019 Published: December 13, 2019

We evaluated the dynamics of land use in the Bouba Ndjidda National Park (BNNP) and adjacent areas, in northern Cameroon. Using a maximum likelihood supervised classification of satellite images from 1990 to 2016, coupled with field and a socio-economic survey, we performed a robust land-use classification. Between 1990 and 2016, the area included eight classes of land use, with the largest in 1990 being the woody savannah (42.9%) followed by the gallery forest (20.2%) and the clear forest (16.3%). Between 1990 and 1999, the gallery forest lost 64.8% of its area mostly to the benefit of woody savannahs. Between 1999 and 2016, the largest loss of area was that of the clear forest, which decreased generally by 43.2% in favor of woody savannah. Rates of increase of crop field areas were 59.6% and 78.8% respectively for the periods of 1990 to 1999 and 1999 to 2016 to the detriment of woody savannahs. We attribute the changes in land use observed mainly to the increasing human population and associated agriculture, overgrazing, fuelwood harvesting and bush fires. The exploitation of non-timber forest products and climatic factors may also have changed the vegetation cover. We recommend the implementation of farming techniques with low impact on the environment such as agroforestry.

Remote Sensing, Spatio-Temporal Dynamics, Bouba Ndjidda National Park, Vegetation Cover, Land Use

Protected Areas (PAs) are an important component of biodiversity conservation in most countries (Tardif & Sarrasin, 2014). They are central to protect endangered species and are also viewed as significant providers of ecosystem services and biological resources (Dudley, 2008). Yet, Central African PAs are undergoing degradation (Doumenge et al., 2015). Clearing of land for agriculture and logging for urban markets, increase pressure on PAs. Economic activities such as agriculture, animal husbandry, hunting, timber harvesting and conservation efforts through protected areas are usually made without land use planning in most African regions (Arouna et al., 2009). One of the consequences of these practices is the degradation of vegetation cover both inside and adjacent to Protected Areas.

The Bouba Ndjidda National Park (BNNP or “park”) is part of the network of protected areas of northern Cameroon. Of the three national parks in this network, it is the richest and most diversified in wildlife species (Omondi et al., 2008). It constitutes the Bouba Ndjidda Technical Operational Unit (TOU) with the hunting zones (ZIC) that surround it. Due to poaching, this protected area lost about 200 elephants in 2012 (Scholte, 2012). For about two decades, the growth of human population in this area has also favored the collection of wood and the extension of agricultural areas (Nature Information Tracks, 2017). However, these pressures remain less quantified at the spatial and temporal scales, limiting the development of strategies adapted to the context. Given the multifaceted pressure on the vegetative cover of the BNNP and adjacent areas, it is relevant to study the different states of the plant formations, and their tendency, and to infer the driving factors of these changes.

The availability of satellite images from Landsat TM, ETM+ and OLI-TIRS sensors, which have been used earlier for the analysis of vegetation cover in Protected Areas in tropical zones, provides an opportunity to characterize and monitor changes in vegetation cover in and around these PAs.

The purpose of this study is to understand the spatial and temporal changes of the vegetation cover in order to guide the decision making regarding the land use of this area. More specifically, the study aims to make a spatio-temporal analysis of land use in the BNNP and its periphery between 1990 and 2016, and to determine the factors driving those changes.

2.1. Characteristics of the Study Area

The BNNP is located in the Sudano-Sahelian zone of Cameroon, east of the network of Protected Areas of the Northern Region. Created in 1947, the BNNP extends between 08˚21'N - 09˚00'N and 14˚25'E - 14˚55'E. Its surface area is about 700,000 ha including the hunting zones (10, 11, 12, 17, 20, 21, 23) that surround it (Figure 1). The climate is Sudano-Sahelian as described by Gonné (2016). The locality of Bouba Ndjidda receives between 800 and 1250 mm of precipitation per year, the rainiest months being August and September. The average annual temperature is 28˚C. Soils are generally leached tropical ferruginous soils or a combination of ferruginous and hydromorphic soils (Brabant, 1972). According to Nature Information Tracks (2017), the vegetation within the Bouba Ndjidda National Park is composed of characteristic species of the Sudanian zone found in woody savannahs, gallery forests and dry clear forests. Around the BNNP, there are shrubby savannahs, crop fields and forest fallows (Ministry of Forest and Wildlife, 2010b). Twenty-eight villages are adjacent to the east, south and west limits of the park and belong to the districts of Tcholliré, Madingring and Rey Bouba. The population of these districts is characterized by a great diversity of ethnic groups (autochthonous and semi-migrant groups) totalling 27,500 inhabitants in 1987 (RGPH, 1987). It rose to 194,065 in 2005 (RGPH, 2005) and was estimated at 260,149 in 2016 (RGPH, 2010b). Agriculture and livestock rearing are the main activities and are practiced extensively, the main crops being cotton (the only cash crop), maize, groundnuts and sorghum (Djiongo, 2015; Ministry of Forest and Wildlife, 2010b).

Figure 1. Location of Bouba Ndjidda National Park and its adjacent zone.

2.2. Map Data and Digital Processing of Satellite Images

We classified land use from Landsat 4-5 (TM) images of November 24, 1990, Landsat 7 (ETM+) images of November 17, 1999 and Landsat 8 (OLI/TIRS) images of November 23, 2016. Identifying successive vegetation states over a period of twenty-six years allows a better appreciation of the changes in plant formations. Several reasons justify the choice of this period: 1) It was in the early 1990s that rural forestry emerged for foresters (Montagne & Dubus, 1992), by which “rural people must become the managers of their entire land, including woodlands, based on the notion of a multipurpose tree”. Priority is thus given to rural forestry, which is perfectly integrated with other peasant activities such as livestock and agricultural production (Goudet, 1992). 2) This period also includes the year 1994, from which the new Cameroonian forestry law provides a framework for the management of natural formations by offering local populations the opportunity to set up “community forests” (Gautier & Seignobos, 2003). 3) Landsat images were available at the same time of the year for these three dates. We chose images acquired at the beginning of the dry season because of their availability and quality. During this period, the differentiation of land-use elements such as crops, herbaceous plants and ligneous plants is maximal (Tabopda, 2009). We downloaded the satellite images from the Earth Explorer (EE) Geospace of the United States Geological Survey (USGS). They were all acquired in the same season of the year in order to reduce problems related to differences in solar angles, soil moisture and phenological changes in vegetation. We used GRASS 7.2.2 (GRASS Development Team, 2017) and QGIS 3.0 (Quantum Geographic Information System Development Team, 2018) for digital image processing and integration of results with other geographic data sources, respectively.

The first phase of the processing consisted of displaying false-color composites by superimposing the near-infrared, red and green channels for the three dates (4R/3G/2B for Landsat TM of 1990 and Landsat ETM+ of 1999, then 5R/3G/2B for Landsat OLI-TIRS of 2016). We limited classification to the park and the seven adjoining hunting areas that serve as a buffer zone around it. A mask was then applied to extract this area of interest. We identified and grouped similar pixels using a Maximum Likelihood Classifier (MLC). On the basis of the information from the analogue interpretation, our knowledge of the BNNP and its periphery, and field surveys, the twelve classes initially retained were merged, after reclassification, into eight major classes as shown in Figure 2.

To verify the validity of the classifications, we used ground-truthing data collected in the field and the control zones of each class to create a confusion matrix. After validation, we integrated all the relevant spatial information (road network, hydrographic network and subdivisional headquarters near the park) into a set of annual land use maps. We evaluated the areas of the different land cover classes for each year and estimated the area of land use change between 1990 and 1999, and between 1999 and 2016. We quantified vegetation cover dynamics by calculating the overall change rate (Tapobda et al., 2008; Arouna et al., 2009). To estimate the mean annual expansion rate T, we calculated the difference between the logarithms of the areas at two dates, divided by the number of years of change (Arouna et al., 2009; Taibou et al., 2017).
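The equations for the two rates are not reproduced in this text. The sketch below assumes a commonly used formulation: the overall change rate as the relative difference in area, and the mean annual expansion rate as the difference of the natural logarithms of the areas divided by the number of years. Under that assumption it reproduces the crop field figures quoted in the results (an overall increase of 59.6% for 1990 to 1999 and an annual rate of 3.4% for 1999 to 2016). The function names are hypothetical.

```python
import math


def overall_change_rate(area_start, area_end):
    """Overall change rate Tg in percent (assumed formulation)."""
    return (area_end - area_start) / area_start * 100.0


def annual_expansion_rate(area_start, area_end, years):
    """Mean annual expansion rate T in percent (assumed formulation)."""
    return (math.log(area_end) - math.log(area_start)) / years * 100.0


# Crop field areas in hectares, as reported in the results section
crop_1990, crop_1999, crop_2016 = 15322, 24455, 43714

tg_1990_1999 = overall_change_rate(crop_1990, crop_1999)
t_1999_2016 = annual_expansion_rate(crop_1999, crop_2016, 2016 - 1999)
print(round(tg_1990_1999, 1), round(t_1999_2016, 1))  # prints 59.6 3.4
```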

2.3. Collection and Analysis of Socio-Economic Data

We conducted a participatory rural appraisal from August to November 2016, through semi-structured interviews and direct observations, to identify the factors affecting the spatio-temporal changes. Out of the 28 villages adjacent to the park, we sampled 13, selected based on their geographical location (proximity to the park). In each village, we held a focus group to obtain additional information on infrastructure, land use patterns and farming techniques. At the end of these focus groups, household heads were randomly selected, based on their availability (the study period coincided with the crop harvest), and 126 household heads were surveyed. In each district, the representatives of local administrations (water and forest, agriculture, livestock, territorial administration) and of the Cotton Development Company were also interviewed. The main information sought during the surveys related to their status (migrant or native), household size, plot size, income-generating activities, exploitation of non-timber forest products and wood forest products in the park, sources of fuel wood supply and their perceptions of the degradation of vegetation cover. Socio-economic data processing was performed by calculating the relative frequencies of each parameter under study. To estimate the 2016 population of the subdivisions adjoining the study area, we made projections from the 2005 general population census, assuming an average annual population growth rate of 2.7% (RGPH, 2010b).

Figure 2. From initial classes to final classes by merging similar themes.
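The 2016 population projection can be sketched as a compound growth calculation. This assumes the 2.7% rate was compounded annually from the 2005 census figure, which the text does not state explicitly. Under that assumption, the 2005 total of 194,065 inhabitants given in the study area description yields the 2016 estimate of 260,149 reported there. The function name is hypothetical.

```python
def project_population(base_population, annual_rate, years):
    """Project a population forward with annual compound growth (assumed method)."""
    return base_population * (1.0 + annual_rate) ** years


# 2005 census total for the three districts, projected to 2016 at 2.7% per year
pop_2016 = project_population(194065, 0.027, 2016 - 2005)
print(round(pop_2016))  # prints 260149
```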

3.1. Land Use between 1990, 1999 and 2016 at the BNNP and Its Periphery

In 1990, crop fields occupied 2.2% (15,322 ha) of the study area, clear forest 16.3% (11,088 ha), gallery forest 20.2% (139,314 ha) and woody savannah 42.9% (295,387 ha) (Table 1). The vegetation cover consisting of the clear forest, gallery forest, saxicolous forest, woody savannah and grassland savannah represented 95% (654,370 ha) of the surface area of the park and its periphery. Clear forest and gallery forest were found mostly in the southeast and north of the park, while woody savannah covered the park almost entirely (Figure 3(a)). This high coverage rate of vegetation denotes the importance of the woody area in 1990. With 4000 ha, infrastructure accounted for only 0.6% of the surface area of the BNNP and its periphery. In 1999, crop fields covered 3.6% (24,455 ha), bare soil 0.8% (5527 ha) and infrastructure 1.6% (11,000 ha) of the territory. The vegetation cover occupied 94% of the BNNP and its periphery (Table 1), mainly distributed in the center and south of the park (Figure 3(b)). In 2016, with 43,714 ha, crop fields represented 6.3% of the area of the BNNP and its periphery. Crop fields were highly concentrated in the east and west of the park, especially around the villages located along roads. The vegetation cover was 82% (566,430 ha) compared to 1.8% (12,146 ha) for infrastructure and 9.7% (67,122 ha) for bare soil (Table 1). Only the central part of the park remained fairly well preserved (Figure 3(c)).

3.2. Change of the Vegetation Cover and Other Land Use Units between 1990 and 2016

Between 1990 and 1999, infrastructure, crop fields, grassland savannah, woody savannah and the saxicolous forest were the land use units that underwent progressive change. After the saxicolous forest, crop fields experienced the largest increase in area, from 15,322 ha in 1990 to 24,455 ha in 1999, an overall increase rate of 59.6%. In contrast, the clear forest and gallery forest underwent regressive change, with annual growth rates of −4.9% and −11.6%, respectively (Table 2). Between 1999 and 2016, the largest progressive changes were in bare soils and crop fields, with average annual growth rates of 14.6% and 3.4%, respectively. In 17 years, the clear forest was the type of vegetation that lost the largest area (31,300 ha), with an overall rate of −43.2%. Overall, the total vegetation cover decreased by 82,000 ha, an annual expansion rate of −0.8% (Table 2).

Table 1. Land use in the Bouba Ndjidda national park and adjacent areas between 1990 and 2016.

Legend: I: Infrastructure; BS: Bare soils or least vegetated soils; CF: Crop fields; GS: Grassland savannah; WS: Woody savannah; CF: Clear forest; SF: Saxicolous forest; GF: Gallery forest.

Figure 3. Map of land use in the BNNP and its periphery in (a) 1990, (b) 1999 and (c) 2016.

3.3. Matrix of Land Use Change

The matrix of land use change presents in a quantified manner all the changes that took place during the different time intervals. The matrices of land use change for the periods of 1990 to 1999 and 1999 to 2016 are shown in Table 3. The diagonals of the matrices represent the surface area of each land use type that remained stable from one period to another.
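As an illustration of how such a matrix can be derived, the sketch below cross-tabulates two classifications of the same pixels and expresses each cell as a percentage of the initial class. That the published table is normalised per initial class is an assumption, although the text's "70.6% of woody savannah remained unchanged" suggests it. The function name, the pixel lists and the "CROP" label are made up for the example.

```python
from collections import Counter


def change_matrix(classes_t1, classes_t2, labels):
    """Cross-tabulate two classifications of the same pixels.

    Returns {initial_class: {final_class: percent of the initial class}}.
    Diagonal cells are the share of each class that stayed unchanged.
    """
    if len(classes_t1) != len(classes_t2):
        raise ValueError("both dates must cover the same pixels")
    pair_counts = Counter(zip(classes_t1, classes_t2))
    row_totals = Counter(classes_t1)
    matrix = {}
    for src in labels:
        total = row_totals[src]
        matrix[src] = {
            dst: (100.0 * pair_counts[(src, dst)] / total) if total else 0.0
            for dst in labels
        }
    return matrix


# Toy data: six pixels classified at two dates (illustrative only)
pixels_t1 = ["WS", "WS", "WS", "WS", "GF", "GF"]
pixels_t2 = ["WS", "WS", "WS", "CROP", "WS", "GF"]
matrix = change_matrix(pixels_t1, pixels_t2, ["WS", "GF", "CROP"])
print(matrix["WS"]["WS"], matrix["GF"]["WS"])  # prints 75.0 50.0
```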

Between 1990 and 1999, 70.6% (208,443 ha) of woody savannah remained unchanged (Table 3). However, the vegetation lost 16% and 3% of its surface area in favor of grassland savannah and crop fields, respectively. The most important negative change was that of the clear forest, which lost 64.7% (72,470 ha) of its surface to woody savannah, representing more than half of its surface area, leading to a change in the physiognomy of the landscape.

Table 2 . Changes in land use area in the BNNP and its periphery between 1990 and 1999 and between 1999 and 2016.

Legend: I: Infrastructure; BS: Bare soils or least vegetated soils; CF: Crop fields; GS: Grassland savannah; WS: Woody savannah; CF: Clear forest; SF: Saxicolous forest; GF: Gallery forest; S1: Surface area in 1990; S2: Surface area in 1999; S3: Surface area in 2016; C1: Loss or increase in surface area between 1990 and 1999; C2: Loss or increase in surface area between 1999 and 2016; Tg: Total change rate; T: Average annual expansion rate.

Table 3 . Matrix of land use change in the BNNP and its periphery from 1990 to 1999 and from 1999 to 2016 (surface area in %).

Legend: I: Infrastructure; BS: Bare soils or least vegetated soils; CF: Crop fields; GS: Grassland savannah; WS: Woody savannah; CF: Clear forest; SF: Saxicolous forest; GF: Gallery forest.

Between 1999 and 2016, the highest conversion was still that of the clear forest, which lost 3.8% (2743 ha), 5.6% (4065 ha), 21.9% (15,875 ha) and 46.5% (33,675 ha) of its surface area respectively to crop fields, grassland savannah, gallery forest and to the woody savannah. At the same time, a considerable decline in woody savannah was observed in favor of the crop fields, with a loss of 5.8% (23,229 ha). The same trend was obtained for grassland savannah, which lost 20.3% and 6.1% of its surface area to bare soils and crop fields, respectively. Generally, between 1990 and 2016, the vegetation cover conversions amounted to 18.8% in favor of the anthropized area represented by crop fields, bare soils and buildings.

3.4. Local Perception Factors of Degradation of the BNNP and Its Periphery

The neighboring populations cited several factors responsible for the degradation of the vegetation cover in the BNNP and its periphery (Figure 4).

Demographic factors were the main cause of degradation of the vegetation cover in the BNNP and its periphery, as reported by 92.5% of the respondents. In fact, between 1987 and 2016, the population of the three subdivisions (Madingring, Tcholiré and Rey Bouba) that adjoin the park increased considerably (Figure 5). In 29 years, this population almost quintupled in the Rey Bouba Subdivision where it rose from 10,000 inhabitants in 1987 to approximately 146,700 inhabitants in 2016 (RGPH, 1987; RGPH, 2005; RGPH, 2010b). According to data from the third census of the population of Cameroon, a household is composed of an average of 5 persons, with more than 15% of rural households being composed of at least 9 persons. The latter situation reflects the size of households in the Mayo Rey Division. The migrations organized by the administration since 1972 with the aim of decreasing population densities in the Far North, and spontaneous migrations in the Benoue basin (North Cameroon), favored this strong demographic growth. According to Dugué et al. (1994), cited by Mfewou (2013), the extent of migrant settlement after the 1970s in the Benoue basin corresponds to a territory of about 20 inhabitants/km² with an annual growth rate of 3.14%. In our study, 40% of those surveyed were migrants.

Figure 4. Frequency of citation of causes of degradation of the vegetation cover by populations neighboring the BNNP.

Figure 5. Change in human population of the Subdivisions surrounding BNNP between 1987 and 2016 (RGPH, 1987; RGPH, 2005; RGPH, 2010b)/Projection 2016.

Agriculture: Cited by 84.2% of the neighboring populations surveyed, agriculture was the second cause of degradation of the vegetation cover in the BNNP and its periphery. Agriculture occupies 76.74% of the population of the Bouba Ndjidda TOU (WWF, 2007). It is dominated, in order of importance, by food crops (maize, groundnuts, cowpea, millet/sorghum, cassava, yam), which constitute the staple foods for the majority of the population, and by cash crop cultivation (cotton being the only cash crop in the Division). Cultivated land for the different crops in the Mayo Rey Division is increasing. The areas of the plots exploited per household varied from 1 to more than 20 ha according to the objectives, the financial means and the agricultural equipment owned by farmers. However, 84.13% of the respondents had plans to increase their arable plots by more than 2 ha (14.29%) and by 1 ha (1.50%). The extension of the area cultivated for a specific crop is also linked to its profitability during the previous agricultural season. For example, if during a given year the price of a 50 kg bag of maize varies from 12,000 FCFA to 20,000 FCFA, the farmer will tend to increase the area reserved for maize cultivation, without necessarily reducing that of other crops, in order to maximize profit the following year. Farmers open new agricultural plots after the total or partial destruction of natural vegetation cover.

Livestock rearing is the second economic activity after agriculture in the study area. It is practiced extensively by sedentary and transhumant herders. About two-thirds (65.8%) of the respondents indicated that overgrazing was responsible for the degradation of the vegetation cover through the setting up of camps and the pruning of trees. In the dry season, transhumant pastoralists entered and settled inside the Protected Areas. During their movements, they routinely prune fodder species (Afzelia africana, Ficus sp., Acacia sp. and Balanites aegyptiaca) to feed livestock, and harvest the bark of certain trees, such as Lannea kerstingii and Adansonia digitata, which serves as bedding. This high pastoral pressure is explained by the fact that the settling of migrants in the area has led to the occupation of the pastoralists' grazing areas, forcing them to look for new territories, which they found in the parks and the adjoining hunting areas. Pressure from herders was high in the northern part of the park due to the presence of the Vaimba river, which does not dry up completely in the dry season and which therefore attracted Mbororo herders from Chad. The herders’ pressure was low inside the park, due to increased surveillance around the office of the conservation service and the tourist camp. However, in hunting zone 23, abandoned in recent years, the herders had taken up residence.

Harvesting of fuel wood (firewood and charcoal production) is one of the most harmful practices for the natural regeneration of vegetation cover in Protected Areas. This harvesting, which takes the form of anarchic cutting, was frequent inside the BNNP and the surrounding hunting zones. According to local officials of the Ministry of Forests and Wildlife (MINFOF), the illegal cutting of woody species is the most reported offense after wildlife poaching. The populations adjacent to the BNNP use a variety of woody species non-selectively as firewood, while for charcoal production they are selective and use, in order of importance, Daniellia oliveri (42.11%), Anogeissus leiocarpus (38.6%), Prosopis africana (15.7%), Ficus capensis (1.7%) and Ziziphus mauritiana (1.7%). Per capita consumption of firewood around the BNNP is estimated at 1.17 kg/day and that of charcoal at 3.45 kg/inhabitant/day (field surveys). Some officials of MINFOF issue wood traders Authorizations for the Collection and Transport of Deadwood (ACTBM), subject to payment of 500 FCFA for a load equivalent to one stere of wood. The Divisional Delegate of Forests and Wildlife for Mayo Rey Division has signed a note, within the framework of the usufruct rights of local populations, authorizing free access to dead wood for domestic use. In spite of this provision, the collection sites, the quantities granted and the quality requirement (fresh wood is forbidden) are not always respected.

Bush fires, mainly of human origin, were used by farmers at the periphery of the park to clean their farms (slash-and-burn agriculture). Inside the park, they were used either for hunting by poachers or for the renewal of grazing by BNNP managers and by transhumant pastoralists who enter the park illegally. Bush fires affect plant resources by increasing soil temperature and destroying surface organic material, which in turn decreases the productivity of the vegetation.

The exploitation of non-timber forest products (NTFPs) was reported by 23.3% of respondents to be one of the causes of degradation of vegetation cover. For the local populations, the BNNP and its periphery constitute an important reserve of products for the traditional pharmacopoeia, the construction of huts, and food. There, they collect the bark of trees (Daniellia oliveri, Detarium microcarpum, Anogeissus leiocarpus, etc.), straw (Jardinea congoensis and Hyparrhenia rufa), fruits and leaves of trees (Tamarindus indica, Bridelia sp., etc.), and honey. This exploitation of NTFPs is usually anarchic and uncontrolled. The exploitation of NTFPs and wood forest products is prohibited inside the park and regulated in adjacent ZICs. The use of unsustainable harvesting techniques by farmers could lead to a decrease or even to the extinction of some plant species.

Climatic factors (rainfall): Between 1990 and 2016, the rainfall of the Bouba Ndjidda TOU exhibited a downward trend (Figure 6). Depending on whether the recorded rainfalls are in excess or in deficit, they affect land use differently, mainly on the extension of agricultural spaces. The dates marking the start of rainy seasons determined farming practices.

During the last 26 years, low rainfall was recorded in 1992 and 2011. This, together with other factors, could explain the increases in cultivated land in the following years.

4.1. Land Classes and Image Processing

We identified eight classes of land cover: five linked to natural vegetation (clear forest, saxicolous forest, gallery forest, woody savannah and grassland savannah) and three linked to anthropogenic activities (bare soil, crop fields and infrastructure). Vegetation surveys carried out by Bosch (1976) and Nature Information Tracks (2017) in the BNNP showed that the plant formations comprise six classes: clear forest, gallery forest, saxicolous forest, shrubby savannah, woody savannah and grassland savannah. As part of this study, for a better cartographic representation, we reclassified the type 1 savannah (shrubby) as woody savannah. Apart from the anthropic land uses, the mapping results confirmed the physiognomic descriptions given by Bosch (1976) and Nature Information Tracks (2017).

Figure 6. Interannual variation in mean rainfall of the study area from 1990 to 2016. Source: Garoua Weather Station, 2017.

Regarding the method for mapping land use, the strong influence of vegetation on the spectral signatures made it difficult to differentiate between the plant formations (clear forest, woody savannah, grassland savannah and crop fields). We ascribe the confusion between the grassland savannah and crop fields to the fact that the useful trees left in the fields give this class the appearance of a grassland in some places; the same is true for fallows, which were classified as crop fields due to the difficulty of differentiating them from the grassland savannah. The spectral confusion between water and bare soil created by bush fire also made the classification difficult. In this case, ground-truthing helped us to assign well-known, homogeneous pixels accurately to the different classes. In spite of these difficulties, the image treatment process as well as the mapping results remain satisfactory, since the kappa indices obtained were 76.5%, 73.6% and 72.03% for the images of 1990, 1999 and 2016, respectively. The quality of images and the choice of thematic classes could explain such results (Geymen & Baz, 2008). For all three dates, the overall classification rate was greater than 70%, which permits the validation of the maps (Kabba & Li, 2011; Pontius, 2000).
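For readers unfamiliar with the kappa index, the sketch below shows how it and the overall classification rate are commonly derived from a confusion matrix. It is only an illustration: the function names and the three-class matrix are hypothetical and are not the error matrices of this study.

```python
def overall_accuracy(confusion):
    """Share of validation pixels on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohen_kappa(confusion):
    """Cohen's kappa: agreement corrected for the agreement expected by chance.

    Rows are reference (ground-truth) classes, columns are classified classes.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(size)) for j in range(size)]
    observed = overall_accuracy(confusion)
    # Chance agreement: product of the marginal proportions, summed over classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical 3-class matrix (NOT data from this study).
example = [
    [45, 5, 0],
    [6, 40, 4],
    [1, 3, 46],
]
accuracy = overall_accuracy(example)
kappa = cohen_kappa(example)
print(f"overall accuracy = {accuracy:.3f}, kappa = {kappa:.2f}")
```

Kappa is lower than the overall accuracy because it discounts the agreement that would occur by chance alone, which is why both figures are usually reported together.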

4.2. Dynamics of Vegetation Cover and Explanatory Factors

The spatio-temporal analysis of land use in the BNNP and its periphery showed a 13.4% decline of natural plant formations (clear forest, gallery forest, saxicolous forest, woody savannah and grassland savannah) in 26 years in favour of anthropogenic cover, namely crop fields, bare soil, and infrastructure, the surface area of which increased fourfold. The clear forest in particular lost more than 50% of its surface. These results are similar to those of Temgoua et al. (2018b) in the classified forest of Djio-li-Kera in southeast Chad and those of Benoudjita & Ignassou (2017) in and around the W Regional Park of Benin, which showed a continuous regression of forest and savannah in favour of fallows, crop fields, bare soil and settlements. The search for new arable land to feed a steadily growing population could explain the increase of anthropogenic spaces over the years. Indeed, the average growth rate of the population in the study area is 3.9%, far above the national rate of 2.8% (RGPH, 2010a). Following this population growth and the influx of migrants recorded in the region since the 1970s, many plots were established in the park during the last two decades. In addition, the network of Protected Areas of North Cameroon occupied 44% of the total surface area of the region, with BNNP and its adjacent zone occupying about 50% of Mayo Rey Division (Ministry of Forest and Wildlife, 2010a). These Protected Areas benefited from a more or less strict protection status that deprived neighboring populations of arable surfaces, which forced them to extend agricultural lands towards the Protected Areas. Demography is one of the main causes of degradation of BNNP and its periphery. It significantly influences land use through the size of household farm assets (Houessou et al., 2013). In fact, the needs of a family vary according to the number of its members; thus, the larger the family, the greater the need.
In response to increasing household demands, farmers often decide to clear new fields to meet family needs. This corroborates the results of Ouedraogo et al. (2010), who concluded that there is a strong correlation between population growth and land degradation. We explain the influence of agriculture on land degradation by the presence of the Cotton Development Corporation (SODECOTON) in the region. Indeed, the cotton sector is supported by the state with subsidies via the provision of diverse inputs (fertilizer), varied supervision (farmers’ supervision), and an available market for the cotton produced (purchase of all production). The quantity of inputs provided free of charge or on credit depends on the size of the cotton field. Subsidies acquired by the farmer probably contribute to the increase in the area of cotton fields. To further optimize the use of fertilizers in cotton fields, farmers juxtapose cornfields and cotton fields (depending on the direction of runoff and the location of the cotton field upstream) or rotate crops by growing maize on former cotton fields, which are sometimes extended onto new wasteland. Generally, there is a net expansion of cotton cultivation in the study area (Bokagne, 2006), with SODECOTON opting to open extension service centres to help farmers increase cotton production through the occupation of new wasteland, and to facilitate the transport and recovery of the cotton produced. Regarding overgrazing, tree pruning and livestock grazing by transhumant pastoralists are the main threats to BNNP’s plant resources, the pressure being highest on Afzelia africana, which is systematically pruned to feed livestock. Hountondji (2008) also described the harmful effects of transhumant livestock farming on the vegetation cover in the Sahelian and Sudanian zones of West Africa.
In addition to the pressure on specific plant species, the regular passage of recent bush fires does not favor the recovery of natural vegetation and thus contributes to increasing bare soil surfaces (Lubalega et al., 2018). The exploitation of NTFPs is a usufruct right of riparian populations of Protected Areas in tropical countries. However, the understanding of this notion remains subject to conflicting and divergent interpretations. Although poaching was not cited by farmers as a cause of vegetation cover degradation, it indirectly affects woody resources. The same observation was made by the UICN (2014) in the study on the drivers of deforestation and degradation in the Sangha Tri-National (TNS) landscape. Poachers use wood for camp construction, game conservation and heating. The effects of climate on the vegetation cover could be explained by the fact that when rainfall corresponds to the expectations of the producers, the cultivated lands are not systematically increased between two cropping seasons. But when the rainfall is below average, there is a compensatory increase of farming area.

Several scientific studies in Cameroon and elsewhere have attempted to document and explain land use change. Tabopda (2009) documented deforestation through the harvesting of fuel wood in the Protected Areas of the Far North of Cameroon and Temgoua et al. (2018a) reported agricultural extension and grazing as factors leading to the degradation of the Ajei community forest in the North West region of Cameroon. In Burkina Faso, Tankoano et al. (2016) identified agriculture, timber harvesting and extension of residential areas as the main drivers of degradation in the Deux Balé National Park. Other studies at the global level have categorized factors related to land cover and land use change. Fotsing (2009) classified them into two broad categories, namely: human factors that include socio-economic and political factors and biophysical factors. All these factors constitute a complex network of interactions between them making it possible to understand the dynamics of land cover or specific land use in space. For Lambin et al. (2007) cited by Boko (2012), there are six categories of direct or indirect factors contributing to the dynamics of land cover and land use change namely demographic, economic and technological, cultural, globalization, institutional and natural variability. Apart from globalization, there is no difference between the factors cited by Fotsing et al. (2013) and Lambin et al. (2007) and those of this study. However, in their review, Lambin et al. (2003) conclude that changes in land use are due to a combination of causes that can also act as constraints in forcing actors to make decisions about degradation, innovation or displacement.

According to Boserup (2002), population growth can lead to land degradation in the short term, but can also stimulate innovation, especially the adoption and intensification of eco-farming and conservation techniques such as agroforestry, defined as any simultaneous or sequential association of trees, crops and/or animals in the same production unit (Atangana et al., 2014). The use of agroforestry systems in the buffer zone of a protected area can offer a variety of products and services that meet some of the needs of local populations. These benefits therefore allow them to avoid using the protected natural area to support themselves. In this context, considering the growing needs of riparian communities for arable land, pastures, fuelwood and NTFPs, the practice of agroforestry should be intensified. We also recommend strengthening surveillance of the park, especially in the northern zone, which is subject to high livestock rearing pressures.

The main aim of this research was to analyze and quantitatively assess the spatio-temporal changes in vegetation cover in the Bouba Ndjidda National Park and its surrounding area between 1990 and 2016. The vegetation cover has declined by approximately 13.44% in 26 years, representing a loss of approximately 87,940 ha to fields, bare soil and human infrastructure. The corresponding increase of anthropogenic land use has as a corollary a change of facies and physiognomy in this Protected Area. The combined effects of demographic, socio-economic and climatic factors could explain the observed changes. To be consistent with the current paradigm of participatory management, we recommend the development of income-generating agro-sylvopastoral activities. This involves improving the agricultural and pastoral production systems in line with population growth. A macro zoning that takes into account a multipurpose use of the zone is also recommended.

The authors declare no conflicts of interest regarding the publication of this paper.


Mapping the observed and modelled intracontinental distribution of non-marine ostracods from South America

Ecological niche modelling (ENM) has been used to quantify the potential occurrence of species, by identifying the main environmental factors that determine the presence of species across geographical space. We provide a large-scale survey of the distribution of ostracod species in South America, by using the domains of 25 river basins. From 221 known ostracod species, we estimate the potential distribution of 61 species, using ENM. Ten clusters of potential distribution patterns were found. Clusters 8 and 9 grouped most of the species, which presented high similarity of niche between them. Heterocypris paningi Brehm, 1934 (group 1) obtained higher niche variability. The minimum temperatures of the coldest month and the mean elevation of the river basin were most important to predict the potential distribution of ostracods of most groups. South America has a complex pattern of elevation, which affects species distributions indirectly through changes in local factors. For instance, the Andes mountains might impose a barrier for ostracod distribution in the southern part of South America because of the low temperatures and precipitation. The ENM indicated that some regions and/or basins of South America might be susceptible to the entry of several ostracod species, presently absent, including non-native species.



3. Diversity of beef production systems in the region

The Río de la Plata grasslands are a region of 700 000 km² comprising parts of Argentina and Brazil and the whole of Uruguay (28°–38°S, 47°–67°W). Landscape heterogeneity in the region is reflected in subregions that are defined by vegetation communities associated with edapho-topographic characteristics (Soriano 1992, Hasenack et al 2010, Brazeiro et al 2012) (figure 1 and table 1). Climate conditions differ following southwest to northeast gradients in annual precipitation (from 700 to 1600 mm) and average annual temperature (from 14 °C to 22 °C). The climatic gradients determine the relative dominance of C3 and C4 grass species (Burkart 1975), giving rise to two major biomes: Pampas and Campos. While C3 species dominate in the Pampas of Argentina, the Campos of Brazil and Uruguay are dominated by C4 grasses, although in winter the biomass of C3 species increases substantially in the Uruguayan Campos (Berretta et al 2000).

Figure 1. Subregions of Río de la Plata grasslands (León et al 1984, Soriano 1992, Boldrini 1997, Brazeiro et al 2012). Local names of the subregions are in parentheses.

Table 1. Characteristics of the subregions of the Río de la Plata grasslands region ordered by biome and country.

| Biome | Country | Area (million ha) [8] | Subregion [9, 10] | Dominant soil types [9] | Area with native grasslands (%) [11] | Average farm size (ha) [11] | Farms with cattle (%) [11] | Ownership (%) [11, 12] | Main production system [13] |
|---|---|---|---|---|---|---|---|---|---|
| Pampas (C3-dominated) | Argentina | 9.3 | Flooding (1) | Mollic Solonetz | 68 | 605 | 93 | 70 | Cattle (cow-calf) |
| Pampas (C3-dominated) | Argentina | 8.3 | Southern (2) | Haplic/Luvic Phaeozems | 29 | 697 | 81 | 66 | Crops-cattle (finishing) |
| Pampas (C3-dominated) | Argentina | 12.9 | Subhumid (3) | Phaeozems | 17 | 526 | 72 | 66 | Crops |
| Pampas (C3-dominated) | Argentina | 1.5 | Semiarid (4) | Calcaric Phaeozems | 7 | 824 | 89 | 70 | Crops |
| Pampas (C3-dominated) | Argentina | 7.4 | Rolling (5) | Phaeozems | 33 | 222 | 48 | 59 | Crops |
| Pampas (C3-dominated) | Argentina | 3.2 | Mesopotamic (6) | Phaeozems/Eutric Vertisols | 50 | 388 | 84 | 67 | Crops-cattle (finishing) |
| Campos [7] (C4-dominated) | Uruguay | 2.8 | West sediment (7) | Eutric Vertisols/Phaeozems | 54 | 357 | 78 | 59 | Cattle (finishing), crops |
| Campos [7] (C4-dominated) | Uruguay | 3.2 | Basalt (8) | Lithic Leptosols/Phaeozems | 89 | 728 | 87 | 67 | Cattle (cow-calf), sheep |
| Campos [7] (C4-dominated) | Uruguay | 1.2 | Gondwanic sediment (9) | Haplic Luvisols/Phaeozems | 79 | 362 | 65 | 66 | Cattle (cow-calf) |
| Campos [7] (C4-dominated) | Uruguay | 2 | Eastern sierras (10) | Phaeozems | 79 | 294 | 92 | 67 | Cattle (cow-calf), sheep |
| Campos [7] (C4-dominated) | Uruguay | 2.2 | Graven Merin (11) | Mollic Planosols | 71 | 393 | 74 | 63 | Cattle (cow-calf), rice |
| Campos [7] (C4-dominated) | Uruguay | 1 | Graven Santa Lucía (12) | Phaeozems | 41 | 57 | 74 | 64 | Cattle (full cycle), dairy, horticulture |
| Campos [7] (C4-dominated) | Uruguay | 2.6 | Crystalline shield (13) | Phaeozems | 69 | 395 | 93 | 59 | Cattle (full cycle), dairy |
| Campos [7] (C4-dominated) | Brazil | 0.7 | Litoral (14) | Dystri-Ferralic Arenosols | 64 | 126 | 68 | 81 | Horticulture, rice, cattle (cow-calf) |
| Campos [7] (C4-dominated) | Brazil | 3.9 | SE Sierras (15) | Alisols/Regosols/Lixisols | 54 | 56 | 79 | 82 | Cattle (cow-calf), sheep |
| Campos [7] (C4-dominated) | Brazil | 2.9 | Missons (16) | Ferrasols/Leptosols/Arenosols | 31 | 53 | 74 | 86 | Crops |
| Campos [7] (C4-dominated) | Brazil | 3.3 | Central depression (17) | Planosols/Alisols/Acrisols | 44 | 53 | 71 | 84 | Rice, cattle (finishing) |
| Campos [7] (C4-dominated) | Brazil | 2.4 | Campanha (18) | Leptosols/Plinthosols/Phaeozems/Vertisols | 70 | 204 | 82 | 83 | Cattle (cow-calf) |
| Campos [7] (C4-dominated) | Argentina | 2.7 | Corrientes (19) | Ferrasols/Luvisols/Solonetzs | 85 | 928 | 82 | 80 | Cattle (cow-calf) |

[7] This biome can be divided into the northern Campos, in southern Brazil and northeast Argentina, and the southern Campos in Uruguay.
[8] Calculated from (INDEC 2002, MGAP 2002, Berretta 2003, IBGE 2006, Viglizzo and Jobbágy 2010).
[9] Defined after (Soriano 1992, Boldrini 1997, Berretta 2003, Viglizzo and Jobbágy 2010, Brazeiro et al 2012).
[10] Numbers refer to the regions in figure 1.
[11] Calculated from (INDEC 2002, MGAP 2002, IBGE 2006, Viglizzo and Jobbágy 2010).
[12] Percentage of the area owned by farmers. The remainder is owned by corporations or rented.
[13] Defined after INDEC (2002), MGAP (2002), Antuña et al (2010), Viglizzo and Jobbágy (2010), SAGyP and INTA (2013), and Boldrini (2007).

The native grasslands constitute the main source of feed for 43 million head of cattle and sustain the livelihoods of 260 000 farm households (INDEC 2002, IBGE 2006, MGAP 2013b). Farm size, in terms of both land and head of cattle, varies among regions (table 1). Most of the farms (81%) and 66% of the land are owned by families. The remainder is owned by corporations (INDEC 2002, IBGE 2006, MGAP 2013b).

Two types of beef production systems can be distinguished: reproduction oriented or 'cow-calf' systems and meat production or 'finishing' systems (Beauchemin et al 2010). Farms may be specialised in one of these types, but combinations of both on a single farm are also found ('full cycle' systems) (figure 2).

Figure 2. Main livestock production systems in the Río de la Plata grasslands region. Numbers indicate the weight of an animal when leaving a production system. Arrows indicate the flow of animals from one production system to the next. From left to right the importance of native grassland as source of animal feed declines and feed from exotic species increases. Based on Arzubi et al (2013), Modernel et al (2013), Becoña et al (2014).

Cow-calf farms specialise in animal reproduction and derive their main income from selling calves and culled cows. These farms typically also raise sheep for wool or meat production (Royo Pallarés et al 2005), giving rise to competition between sheep and cattle for the grassland feed resource.

Finishing systems mainly fatten male calves. Farms may specialise in 'backgrounding' (the phase from male calf to young steer) and/or 'fattening' (the phase from young steer to slaughter weight). In both systems animals may be fed on native grasslands, leys or grains (feedlots), defining different production systems with distinct shares of native grassland, crop-ley rotations (mixed crop-livestock systems) or continuous crop rotations.

Steer-to-cow ratios of less than 0.4 indicate specialisation in the cow-calf system, ratios of 0.4–1.2 indicate full cycle farm systems, and ratios greater than 1.2 specialisation in finishing beef cattle (Rossanigo et al 2012) (figure 2).
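The thresholds above translate directly into a simple classification rule. The sketch below is illustrative only: the function name, the treatment of the exact boundary values (0.4 and 1.2 are assigned to the full cycle class, following the stated range 0.4–1.2) and the handling of farms without cows are assumptions, not part of the cited source.

```python
def classify_beef_system(steers, cows):
    """Classify a farm by its steer-to-cow ratio.

    Thresholds follow the text: < 0.4 cow-calf, 0.4-1.2 full cycle,
    > 1.2 finishing. Boundary handling is an assumption of this sketch.
    """
    if steers < 0 or cows < 0:
        raise ValueError("animal counts must be non-negative")
    if cows == 0:
        # Assumption: a herd with steers but no cows is a pure finishing farm.
        if steers == 0:
            raise ValueError("farm has no cattle")
        return "finishing"
    ratio = steers / cows
    if ratio < 0.4:
        return "cow-calf"
    if ratio <= 1.2:
        return "full cycle"
    return "finishing"


# Hypothetical herds (steers, cows), not data from the review.
for steers, cows in [(30, 100), (80, 100), (150, 100)]:
    print(steers, cows, classify_beef_system(steers, cows))
```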


Salt marsh vegetation on the Croatian coast: plant communities and ecological characteristics

A comprehensive study of eastern Adriatic salt marsh vegetation, with special attention to the plant–soil relationships that determine individual plant assemblages, is lacking. We surveyed 41 salt marsh sites on the Croatian coastline in order to classify their vegetation by numerical methods and to compare the resulting groups in terms of soil chemical properties. A clear zonation between plant communities along the hydro-sequence was identified and was well represented by the dominance of individual diagnostic species. Two large vegetation groups were detected, well distinguished by mean species richness and soil properties. The first group, assigned to the classes Thero-Salicornietea and Sarcocornietea fruticosae, contains three subgroups of succulent, sparse stands of species-poor vegetation on the mudflat zone flooded by sea water, characterised by high salinity, electric conductivity, exchangeable Mg and K, and low nutrient content (total nitrogen, organic carbon) of the substrate. In the second group, comprising tall rush communities (class Juncetea maritimi), three subordinate clusters were identified, occurring in the upper, brackish zone with infrequent tides. Their soils had low salinity and electric conductivity and increased total nitrogen, organic carbon and exchangeable Mg and Ca. Vegetation within the second group occurring in the uppermost tidal zone had the highest species richness and soil nutrient content and the lowest salinity. It has not been previously identified. Here, we describe it as the new association Limonio narbonensis-Caricetum divisae.



Results

Objective 1: research on ecosystem change at the circumpolar scale

Note: Percentages have been rounded and may not equal 100%. Ecological levels studied in more than 20% of the articles are in bold.

Objectives 2 and 3: case study of Inuit regions

Observed impacts of climate change on the SES

Fig. 2. (A) Social-ecological network of the observed impacts of climate change in Inuit regions (n = 63 articles; 43 with a primary focus on ecosystem processes and 20 on ES and human well-being). The size of the nodes is proportional to their weighted degree centrality. The thickness of the edges represents their weight and is proportional to the number of times an impact was reported. In (B), (C), and (D), we highlighted three nodes (sea ice decline, marine and coastal food, and basic material for good life) with particularly high degree centrality measures. The rest of the network is greyed out to highlight the direct connections of each of these three nodes. (B) Highlight on the node with the highest weighted degree centrality, sea ice decline (91; out-degree = 74), and its direct connections; (C) highlight on the ES with the highest weighted in-degree, marine and coastal food (33), and its direct connections; (D) highlight on the human well-being constituent with the highest weighted in-degree, basic material for good life (32), and its direct connections.
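As an aid to reading the network measures in this caption, the sketch below shows one common way to compute weighted in-degree, out-degree and total degree from a list of directed, weighted edges. The function name and the edge weights are hypothetical and are not the counts extracted from the reviewed articles; the node labels are borrowed from the caption only for readability.

```python
from collections import defaultdict


def weighted_degrees(edges):
    """Return weighted in-, out- and total degree for every node.

    `edges` is an iterable of (source, target, weight) tuples, where the
    weight stands for the number of times an impact was reported.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target, weight in edges:
        out_degree[source] += weight
        in_degree[target] += weight
    nodes = set(in_degree) | set(out_degree)
    return {
        node: {
            "in": in_degree[node],
            "out": out_degree[node],
            "total": in_degree[node] + out_degree[node],
        }
        for node in nodes
    }


# Hypothetical weights (NOT values from the study).
edges = [
    ("sea ice decline", "marine and coastal food", 5),
    ("sea ice decline", "basic material for good life", 3),
    ("marine and coastal food", "basic material for good life", 2),
]
degrees = weighted_degrees(edges)
for node, values in sorted(degrees.items()):
    print(node, values)
```

In this reading, a driver such as sea ice decline has a high out-degree because it affects many other nodes, whereas ecosystem services and well-being constituents accumulate in-degree because they receive impacts.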

Projected impacts of climate change on the SES

Fig. 3. Social-ecological network of the future impacts of climate change in Inuit regions (n = 28 articles; 27 with a primary focus on ecosystem processes and 1 on ES and Inuit well-being). The size of the nodes is proportional to their weighted degree centrality. The thickness of the edges represents their weight and is proportional to the number of times an impact was reported.


Over 8.6% of India’s population is made up of tribal communities 1. Constitutionally recognized for affirmative action as Scheduled Tribes (STs), they are mostly forest-dwelling and have lived in and around legally protected forest areas in south, central and north-east India 2. Among the various social categories in India, the ST population has poorer access to healthcare as well as poorer health outcomes than the national and state-level population averages 3–5. These include higher maternal and infant mortality and morbidity due to communicable diseases, higher childhood malnutrition and higher rates of non-communicable diseases, which have increasingly been reported in recent years. Comparison of various demographic, health and nutrition indicators shows nearly uniformly poor health status among ST populations but with variations across Indian states (see Extended data for a table showing comparison) 6, 7. The social group Scheduled Tribe (ST) has its origins in state-specified lists for implementing affirmative action policies as per the Indian constitution. Several communities that may not have a close association with forests are also in this list. Hence, several tribal communities closely associated with forests prefer the term Adivasi. However, this is not true across the country, and a common terminology covering India’s tribal communities is contested on linguistic, historical, ethnic and legal grounds 8–10. Our hypotheses refer mostly to forest-associated tribal communities (hence Adivasis), but the term ST has been preferred for the purposes of this paper due to its widespread usage in health literature. At the stage of dissemination in reports and peer-reviewed journals, we may reconsider this terminology based on our interactions, to reflect the communities' preferred labels.

Poor tribal health status in India mirrors a global pattern of worse-off health status among indigenous populations. A comprehensive meta-analysis of health outcomes in 104 million global tribal populations found that health, education and development indicators of Indian tribal populations are consistently poorer across the country, despite overall improvements in population health across Indian states 6 , 11 . This reflects a complex interplay between the socio-political, economic, and cultural conditions that contribute to this situation 9 , 11 . There is a disparity in health outcomes of tribal communities compared to non-tribal populations, as well as disparities within and across tribal communities 6 . This is true even in otherwise better performing Indian states (in terms of health services performance and coverage) such as Tamil Nadu and Kerala 6 , 12 . Hence, research in tribal health needs to generate context-specific evidence that could be used to design and deliver locally targeted interventions to reduce inequities. Further, a better understanding of the heterogeneity of inequity patterns and of the processes driving these inequities must guide the framing of policies and programs related to tribal health at state and national levels. The aim of the project titled “Towards Health Equity and Transformative Action on tribal health” (THETA) is to generate context-specific local evidence to guide action, as well as generate wider theoretical explanations on drivers of inequities in tribal populations.

According to the World Health Organization, inequities are “unjust differences in health between persons of different social groups, (which) can be linked to forms of disadvantage such as poverty, discrimination and lack of access to services or goods” 13, 14. A related term, health inequality, needs to be distinguished from health inequity. Whereas inequalities in health are differences between population groups arising from genetic, biological or other factors that may be randomly distributed, inequities have a strong social causation and a non-random pattern of distribution: they tend to aggregate in specific socially constructed groups due to underlying societal characteristics that mediate access to power and resources 14, 15. The conceptual underpinnings of inequities lie in social justice and, in line with this, health inequities are characterised by: (a) systematic and consistent patterns of advantage or disadvantage across specific population groups (for example, a pattern of consistent differences in access to health services between rural and urban populations); (b) social, rather than biological, processes (for example, the nearly global pattern of higher mortality among low-income groups, observed across countries and over time); and (c) origins in, and perpetuation by, unjust social arrangements, resulting in an unequal distribution of the resources essential to achieve or maintain good health 15, 16. Health inequity is a normative concept that does not lend itself to measurement. Hence, health inequities are assessed by monitoring health inequalities (observable differences between subgroups within a population) and identifying systematic patterns in these differences that are attributable to social phenomena 13. Since social processes underlie these health differences, we can expect that these gaps can be closed or significantly narrowed through suitable social policies and programs.

Work leading up to this study

The THETA project builds upon two research projects implemented in three of the proposed study areas.

Poverty traps study: Velho et al. (2018) studied the relationship between forest dependence and socio-economic status of communities (including both ST and non-ST communities) living in and around forest areas (tiger reserves), namely Biligiriranga Swamy Temple (BRT), Kanha and Pakke, in Karnataka, Madhya Pradesh, and Arunachal Pradesh, respectively 17. These states differ in terms of overall governance, health systems performance and implementation of forest rights for tribal populations. We found that within the same geographical area and at a finer scale than is available through nation-wide representative surveys, tribal and non-tribal communities differ in terms of access to and utilisation of healthcare. Besides, we were able to characterise inter-site (state) differences and hypothesise possible explanations for these differences.

Participation for local action (PLA) project: In 2013, as a part of the PLA project, the Institute of Public Health, Bengaluru (IPH) set up a research field station in BRT. The purpose was to explore the possibility of embedded community-based research using health policy and systems research approaches to strengthen health systems in tribal populations. The PLA project used a participatory action research (PAR) approach 18 to identify barriers and strengthen implementation of maternal health programs for the Soliga tribal community in the district. It was conducted by an interdisciplinary team consisting of researchers, implementers, and members from the tribal community-based organisation of the Soliga people. The PLA project identified the need for a health navigator to facilitate care for tribal patients referred to higher centres. The project was funded by the WHO Alliance for Health Policy and Systems Research under their Implementation Research Platform and was piloted in BRT in association with the Zilla Budakattu Girijana Abhivruddhi Sangha (ZBGAS, the district’s indigenous people’s welfare association run by members of the Soliga people). Through multiple iterations of inquiry, we identified the critical role played by social networks and various social determinants in determining whether tribal patients received timely and appropriate care 19.

We aim to document inequality patterns in major tribal regions of India and generate, validate, and test theoretical explanations for how social disadvantage could be driving health inequity. We will combine epidemiological methods with multiple health policy and systems research (HPSR) approaches in line with the three study objectives (see Figure 1). The field of HPSR is a question-driven functional research tradition that leans on multiple social science disciplines; researchers choose the method that is best suited to achieving a socially relevant goal, which is set within a socially constructed health system setting, in the process acknowledging possibly diverse philosophical bases underlying the research methods 20, 21. According to Sheikh et al. (2011), “the range of questions encompassed by HPSR is broad. there are different levels of analysis: macro-level analysis analyzes the architecture and oversight of systems, meso-level analysis focuses on the functioning of organizations and systemic interventions, and micro-level analysis considers the roles of individuals involved in activities of health provision, utilization, and governance, and how systems respectively shape and are shaped by their decisions and behaviour” 20.

Figure 1. Objectives of THETA project in relation to methods.

Research questions

For each objective, the hypotheses we examine are given below. Detailed methods, tools, and analysis for each objective follow in subsequent sections.

To describe and analyse the nature and extent of health inequalities among forest-dwelling tribal communities in three major tribal regions

a) Tribal communities have poor health and nutrition status indicators when compared to non-tribal people in the same area

b) Remoteness alone does not explain this difference in health and nutrition status indicators

To explain the underlying reasons for health inequity among tribal communities through a contextualized and empirically validated theory. Here, a theory is meant as an explanatory abstraction at the middle-level between micro-level working hypotheses and broad overarching social science theories 22 . Hypothesis building will be in the form of context-mechanism-outcome configurations that are derived from middle-range theories and working theories that we shall formulate based on empirically derived patterns and borrowing from wider social theories (see phase 2 under Methods).

To design and pilot an intervention to address health inequity in tribal communities. The design of the intervention shall be based on a (program) theory that draws from the refined theories from studying the processes for inequities, and hence the intervention is also an opportunity to validate/refine the program theories on processes driving inequities. This step shall follow a participatory action research approach with the ZBGAS, implementers (health managers and health workers from the district) and local non-governmental organisations, allowing for shared agenda-setting and co-production of the intervention along with communities and implementers. This objective builds upon previous experience of researcher-implementer-community engagement in the PLA project 19 .

For the three objectives, we shall use three distinct methodological approaches in line with best methodological approaches in relation to research question typology in health policy and systems research 20 . We use epidemiological methods for the descriptive and explanatory questions (objective 1), realist inquiry for the explanatory question (objective 2) and participatory action research and implementation research methods for objective 3. Overall, the study design is a multi-method interactive study in three phases, each sequentially mapping onto the three objectives. In phase 1, we will conduct a cross-sectional survey (patterns), followed by realist inquiry in phase 2 (process), and participatory action research with health services and community partners (action) in phase 3. The detailed methods of the three phases are below. Although phase 3 was originally envisioned to follow phase 2, we propose to begin both phases simultaneously in order to ensure sufficient time for both their outcomes to be developed and disseminated before the end of the THETA project. We foresee mutual learning between the people and the activities involved in the processes (phase 2) and the action (phase 3).

The three different methodological approaches we use have distinct ontological and epistemological foundations and shall hence be described in three different phases corresponding to the three objectives:

1. Phase 1: Cross-sectional survey for various tribal health indicators (Objective 1)

2. Phase 2: Realist evaluation (Objective 2)

3. Phase 3: Participatory action research with health services and community-based organisation (CBO) partners. (Objective 3)

Summary of phase 1: We will conduct a household survey of tribal and non-tribal households in three areas with tribal populations: Madhya Pradesh in central India (CI), Arunachal Pradesh in northeast India (NE) and the Nilgiri forest area at the junction of three states in southern India ( Figure 2 ). We will select both tribal and non-tribal households in a representative manner using a geographical information system (GIS) based on a decreasing gradient of socio-geographical disadvantage index (SGDI), calculated using several village-level variables that combine social, environmental and geographic attributes. The survey questionnaire will include standardised and tested tools (see Extended data ) 23 to assess maternal and infant deaths (mortality), illness profile (morbidity), and diet and anthropometry (nutritional status indicators). We will also collect data on individual and household level variables for socio-demographic characteristics, access and utilisation of healthcare, healthcare expenditure and health-seeking behaviour.

Figure 2. Field areas for the surveys in three different tribal regions spanning five states.

Map tiles by Stamen Design , under CC BY 3.0 , Data from OpenStreetMap . OpenStreetMap is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation (OSMF). Modified using CartoDB software.

Study setting: We shall establish temporary field stations in central and northeast India, while in southern India, the field station at Biligiriranga hills (BR Hills) in Chamarajanagar district will oversee activities in the south Indian sites. In each of the five states, we will identify sites which correspond to protected area boundaries. We will choose seven sites: three sites in Karnataka, one site each in Tamil Nadu, Madhya Pradesh (Kanha Tiger Reserve), Kerala (Wayanad Tiger Reserve) and Arunachal Pradesh (Pakke Tiger Reserve).

The Chamarajanagar district of southern Karnataka lags behind most other districts in terms of development indicators. It also has a relatively large area classified as protected area under the Wildlife Protection Act 1972, including Bandipur and Biligiriranga Hills (BR Hills), both tiger reserves, and the Malai Mahadeshwara (MM Hills) wildlife sanctuary. Together with contiguous forest areas in the neighbouring states of Tamil Nadu and Kerala, these forests are part of the Nilgiri Biosphere Reserve, with over 5000 sq. km of forests and at least 18 tribal communities. Table 1 lists the tribal communities who will be included in the survey across all sites.

Sample size: A site is typically a single protected area (tiger reserve or wildlife sanctuary), which could span multiple administrative sub-divisions of a district or could be across a few districts. For the purpose of sample size calculation, we considered prevalence of severe stunting among Adivasi children and anaemia among women in the 15� age group, as reported by the latest National Family Health Survey (NFHS) 24 , 25 . We considered ST status as being an attribute/explanator of poor health outcomes (an exposure). We used the standard sample size formula for cross-sectional/cohort study design on the OpenEpi tool 26 . We calculated a sample size that will allow us to make assertions about a given health outcome of interest (say the proportion of stunted children or anaemic women) among tribal and non-tribal populations within and across sites. The difference in the prevalence of severe stunting between Adivasi tribal and non-tribal rural children was estimated to be 9% (29% among Adivasi vs 20% among rural children). From NFHS-4 data for India, we calculated the overall sample size covering all sites for anaemia (rural women 54.2% vs ST women 59.9%, giving a risk ratio of 1.1). Between these two, we adopted the sample size from anaemia as our study sample size as it is the higher one. The higher risk ratio for severe stunting in the UNICEF report is more likely to be a realistic estimate of the difference between Adivasi and rural households than the NFHS, because Adivasi is not a separately defined group in NFHS-4 (ST includes both forest-dwelling and other communities). Hence, the sample size estimation from NFHS-4 is likely to be a higher one as we expect a higher risk ratio between the tribal and non-tribal population in our survey. Assuming a 95% confidence interval, 80% power and an alpha error of 0.05, an overall sample size of 2474 individual women is estimated. We added a 10% non-response rate to this and obtained a total sample size of 2722. We shall attain this sample size across seven sites, giving an average of 388 per site. The final dataset will include other individuals interviewed at the households in addition to the primary respondent (including children and the men who were present at the time of survey) and will hence be larger than the desired sample size.
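The arithmetic behind these figures can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the two-proportion formula shown is the Fleiss formula for equal-sized groups without continuity correction. The text does not state which OpenEpi option or group ratio was used, so the per-group figure computed here need not reproduce the reported 2474. The non-response inflation and per-site average use the numbers given in the text.

```python
import math
from statistics import NormalDist


def fleiss_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions.

    Assumes equal group sizes and no continuity correction
    (an assumption; the protocol does not state the option used).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


def inflate_for_non_response(n, rate):
    """Add a non-response allowance and round up to a whole respondent."""
    return math.ceil(n * (1 + rate))


# Anaemia prevalence from the text: ST women 59.9%, rural women 54.2%.
risk_ratio = 0.599 / 0.542
n_per_group = fleiss_n_per_group(0.599, 0.542)

# Figures reported in the text: 2474 women, 10% non-response, seven sites.
total_with_non_response = inflate_for_non_response(2474, 0.10)
average_per_site = total_with_non_response // 7

print(f"risk ratio: {risk_ratio:.2f}")
print(f"illustrative per-group n: {math.ceil(n_per_group)}")
print(f"total with non-response: {total_with_non_response}")
print(f"average per site: {average_per_site}")
```

The last two values correspond to the 2722 and 388 quoted above; the per-group value is shown only to make the structure of the calculation visible.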

Sampling strategy: A multi-stage sampling strategy shall be used. At the first stage of sampling, a list of the tribal and non-tribal villages will be selected from 2011 Census enumeration areas in each site and mapped in a geographical information system (GIS) platform using QGIS software 27 . A vector layer of protected forest area boundary shall be imported into QGIS and an additional vector layer of buffer zones from the edge of the protected forest area will be created ( Figure 3 ). The buffer zone area shall vary from site to site depending on an assessment of forest dependency and perceived effects of nearby protected areas on livelihoods and other socio-economic characteristics. This shall be determined based on a discussion with local researchers and other stakeholders. We estimate a smaller buffer zone in southern Indian sites (between 5� km) and larger one in CI and NE sites (10� km).
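The buffer-zone step would normally be carried out inside the GIS itself. Purely to make the logic concrete, the sketch below classifies settlements as inside the protected area, inside the buffer zone, or outside both. It assumes planar coordinates already projected to kilometres, a simple polygon for the protected area boundary, and an illustrative buffer width; all names and values are hypothetical and not taken from the study data.

```python
import math


def point_in_polygon(pt, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(pt, a, b):
    """Shortest planar distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(pt, polygon):
    n = len(polygon)
    return min(
        distance_to_segment(pt, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )


def classify_settlement(pt, polygon, buffer_km):
    if point_in_polygon(pt, polygon):
        return "protected_area"
    if distance_to_boundary(pt, polygon) <= buffer_km:
        return "buffer_zone"
    return "outside"


# Hypothetical 10 km x 10 km protected area and a 5 km illustrative buffer.
protected_area = [(0, 0), (10, 0), (10, 10), (0, 10)]
settlements = {"S1": (5, 5), "S2": (12, 5), "S3": (20, 5)}
zones = {
    name: classify_settlement(xy, protected_area, buffer_km=5)
    for name, xy in settlements.items()
}
print(zones)
```

Settlements classified as `protected_area` or `buffer_zone` would form the sampling frame; those classified as `outside` would be excluded.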

Figure 3. Left panel shows three sites in Karnataka; right panel is a close-up of two southern Karnataka sites showing human settlements as red dots (both tribal and non-tribal) within the green area (protected area of tiger reserves) and the yellow area (buffer zone from the edge of the protected area that is included in our sampling); black lines show metalled roads.

Participant selection and recruitment: All villages and settlements within and outside the protected area boundary up to the designated buffer zone shall be selected. For all selected villages, we shall create an aggregate index score of socio-geographical disadvantage using a list of pre-identified variables. We will begin with correlated variables of geographical access. These include public transport travel time to nearest municipal administrative city, district administration, access to a high school/secondary school, primary health centre, tertiary hospital, walking travel time to all-weather motor road, sub-centre (government health facility below a primary health centre typically catering to a population of about 3000� people, usually across a few villages), population, altitude, rainfall, forest thickness, proportion of houses with a supply of improved drinking water, proportion of houses with an electricity supply. We will identify strata (or groups) of settlements that have shared socio-geographical advantage/disadvantage parameters based on principal component analysis, striving for intra-strata homogeneity while ensuring inter-strata/group heterogeneity with respect to disadvantage. Such homogeneous strata/groups shall be the primary sampling units. We foresee 3–4 such strata per site, covering groups of villages in remote or core forest areas (typically exclusively tribal), villages at the edge/outside forest, but also groups of villages that may be inside the protected area but very well connected by all-weather roads (typically mixed tribal and non-tribal population), and groups of relatively well connected villages in the plains (typically non-tribal) (see Figure 4 ).
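As a rough illustration of how a principal-component-based index could be derived from such variables, the sketch below standardises three hypothetical village-level variables, extracts the first principal component of their correlation matrix by power iteration, and scores each village on it. The village names, variable choice, values and the decision to use only the first component are all assumptions made for the example; they do not describe the study's actual variable set or software.

```python
import math


def standardise(rows):
    """Convert each column to z-scores (sample standard deviation)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    n, k = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


def first_component(matrix, iterations=500):
    """Leading eigenvector by power iteration.

    Starts from the first unit vector, so it assumes the first variable
    has a non-zero loading on the leading component.
    """
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical villages: (travel time to town in minutes,
# walking time to all-weather road in minutes,
# proportion of houses with electricity).
villages = {
    "A": (240, 120, 0.10),
    "B": (200, 90, 0.20),
    "C": (60, 10, 0.90),
    "D": (45, 5, 0.95),
    "E": (150, 60, 0.50),
    "F": (30, 5, 1.00),
}

names = list(villages)
z_rows = standardise([villages[n] for n in names])
loadings = first_component(correlation_matrix(z_rows))
if loadings[0] < 0:  # orient so longer travel time means a higher score
    loadings = [-x for x in loadings]

scores = {
    name: sum(z * l for z, l in zip(row, loadings))
    for name, row in zip(names, z_rows)
}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
```

In this toy data the travel-time variables load positively and electricity coverage loads negatively, so a higher score corresponds to greater disadvantage.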

Figure 4. Choosing location of settlements based on a measure of socio-geographical disadvantage helps prevent us from treating settlements in Box B as if they are remote (well-connected despite being within tiger reserve).

In this figure, settlements marked in Box A and those in Box C are much more disadvantaged than those in Box B.

For each site, three strata corresponding to low, medium and high percentiles of an index of socio-geographic disadvantage (SGDI), determined based on their clustering together with respect to the index scores, will be identified, and one-third of the sample size for that site shall be allocated to each stratum. Then, we will list the villages/settlements in that stratum and randomly choose one-third of these as our secondary sampling units. The number of households to be sampled within each selected village/settlement will be calculated in proportion to the population size of that village/settlement.
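The allocation described above can be sketched in a few lines. The example assumes simple tercile cut-offs on the index (the text describes strata based on clustering of scores, so terciles are a simplification), a hypothetical site with 18 villages, and the average site sample of 388 from the sample size section; the largest-remainder rounding is one possible way of keeping the household counts whole and is not taken from the protocol.

```python
import math
import random


def tercile_strata(scores):
    """Split villages into low/medium/high strata by index score rank."""
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    cut1, cut2 = n // 3, (2 * n) // 3
    return {
        "low": ranked[:cut1],
        "medium": ranked[cut1:cut2],
        "high": ranked[cut2:],
    }


def proportional_allocation(populations, total):
    """Allocate `total` households in proportion to population.

    Uses largest-remainder rounding so the counts sum exactly to `total`.
    """
    pop_sum = sum(populations.values())
    quotas = {v: total * p / pop_sum for v, p in populations.items()}
    alloc = {v: math.floor(q) for v, q in quotas.items()}
    shortfall = total - sum(alloc.values())
    by_remainder = sorted(
        quotas, key=lambda v: quotas[v] - alloc[v], reverse=True
    )
    for v in by_remainder[:shortfall]:
        alloc[v] += 1
    return alloc


# Hypothetical site: 18 villages with made-up index scores and populations.
sgdi = {f"V{i:02d}": i * 0.1 for i in range(1, 19)}
population = {f"V{i:02d}": 150 + 25 * (i % 5) for i in range(1, 19)}

site_sample = 388
per_stratum = site_sample // 3  # one-third of the site sample per stratum

rng = random.Random(2019)  # fixed seed so the illustration is repeatable
plan = {}
for stratum, members in tercile_strata(sgdi).items():
    selected = rng.sample(members, max(1, len(members) // 3))
    plan[stratum] = proportional_allocation(
        {v: population[v] for v in selected}, per_stratum
    )

for stratum, allocation in plan.items():
    print(stratum, allocation)
```

Because 388 is not divisible by three, the remainder household would need to be assigned to one stratum by some agreed rule; the sketch leaves it unallocated.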

In the case of tribal villages, the local tribal community-based organisation (such as ZBGAS in southern Karnataka) will be approached for household details in each settlement, whereas for census/revenue villages, the local gram panchayat (the lowest level of local government at the village level in the decentralized government structure established by law in many Indian states) will be approached for these details. Then, for each village, a sampling interval shall be calculated. Depending on the number of households to be recruited in a given village/settlement, a random starting point from the centre of the village shall be defined and then every n-th household (n being the sampling interval) will be approached. Wherever the approached household is unavailable/does not consent to participate, the next available and consenting household shall be chosen.
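A minimal sketch of the household selection rule follows, assuming households can be represented as an ordered list (in the field the order would come from walking outward from the village centre). The function name, the availability check and the example village are hypothetical. The sampling interval is taken as the number of households divided by the number required, and an unavailable or non-consenting household is replaced by the next available one.

```python
import random


def systematic_sample(households, n_required, is_available, rng):
    """Pick every n-th household from a random start.

    If the approached household is unavailable or declines, move forward
    to the next available household that has not already been chosen.
    """
    interval = max(1, len(households) // n_required)
    start = rng.randrange(interval)
    chosen = []
    for idx in range(start, len(households), interval):
        if len(chosen) == n_required:
            break
        j = idx
        while j < len(households) and (
            households[j] in chosen or not is_available(households[j])
        ):
            j += 1
        if j < len(households):
            chosen.append(households[j])
    return interval, chosen


# Hypothetical village with 120 households, 15 of which are to be surveyed.
households = [f"HH{i:03d}" for i in range(1, 121)]
# Made-up pattern of unavailable households for the illustration.
unavailable = {households[i] for i in range(3, 120, 10)}

rng = random.Random(42)  # fixed seed so the illustration is repeatable
interval, sample = systematic_sample(
    households, 15, lambda hh: hh not in unavailable, rng
)
print("sampling interval:", interval)
print("selected households:", sample)
```

With 120 households and 15 required, the sampling interval is 8, so every eighth household from the random start is approached.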

For each selected household, a trained team of two data collectors shall conduct the survey. Upon approaching the household, the team shall invite any household member who is able and willing to provide information and obtain an audio-recorded verbal consent. After obtaining consent, respondents from the household shall be identified for the questionnaires listed in Table 2 . Members who are ill or are unable to provide consent shall not be included in the study.

Procedures for data collection: We shall collect both primary and secondary data. Primary data shall include (1) survey responses using a questionnaire, (2) measurements for anthropometry and (3) blood samples for clinical parameters. For phases 2 and 3, primary data shall include observational data, narratives captured during in-depth interviews, media files for case studies during the theory-driven inquiry, and intervention monitoring data under phase 3. Electronic tablets will be used for data collection. To collect data during the survey, we will use a custom-made app built on the Fulcrum cloud-based mobile data collection platform, with offline data collection, multiple Indian language support and cloud-sync support. Alternatively, Open Data Kit provides a suite of free and open-source tools that could be used to achieve comparable results. The survey questionnaire shall be administered using the mobile tablet-based app (see Figure 5 ).

Figure 5. Screenshot of the app showing automatic unique ID generation and language localisation.

Copyright 2019 Fulcrum Community / Spatial Networks Inc.

Data collectors will be recruited locally (from within/nearby districts) and trained in the use of the tablet and the mobile application as well as the administration of the survey questionnaire during a five-day workshop that will be held separately in each data collection site. Two rounds of piloting will be conducted: the first within the team and the second in the field among non-sample households. Following the piloting, we anticipate that the questionnaires may need minor changes, which will be incorporated.

Photographs of villages will be taken to document access, hygiene, living conditions, etc. without identifying information of houses/individuals. In the sampled households, data collectors will screen a video recording to the household members and then seek verbal consent. Verbal consent will be sought due to the low literacy rates in the study sites and our experience with people’s apparent comfort with discussing and clarifying consent orally rather than affixing their signatures onto previously printed text. Recorded verbal consent from the participant will be set as a prerequisite before administration of the questionnaires. The household questionnaire will begin with an initial module on socio-demographic household characteristics followed by other modules involving the following respondents: (a) youngest ever-married woman in reproductive age group (15�), (b) her partner/husband (failing that, another adult male from the household between 15�), (c) mothers of children below five and (d) all children for anthropometry. The questionnaire will have in-built skips, jumps and validity checks. At the end of the survey, the data collectors will seek consent for anthropometry tests from all members of the household who fulfil the inclusion criteria (assent from children in addition to mother/guardian consent). Permission for anthropometry shall be asked from children above 12 years of age. Anthropometry includes measuring the height (length for infants <1 year old), weight (measured using a standardised digital weighing scale), head circumference and mid-upper-arm circumference (for children <5 years), waist and hip circumference (using standardised measuring tape). Height will be measured using a stiff measuring tape.

Biological data: In two of the southern Karnataka sites, a trained health worker shall visit the households where the survey was conducted and invite an adult respondent of the household survey (identified randomly in advance using the Kish method 28 ) to participate in biological data collection. The reasons, procedure and benefits of blood tests will be explained to them and their consent re-established via a verbal consent process similar to the survey data obtained above. That the results of the tests will be made available to them will be explained clearly. Based on the information provided, they can choose to participate or refuse to do so in this component of the study. The health worker will measure the blood pressure of the participant, and ensure optimal general health status, enquire about any long-term medicine use and chronic disease. After explaining the procedure of drawing blood, under aseptic precautions, they will collect 5ml of blood in fasting state from superficial veins in the elbow of the participant. Of the total 5ml collected, 3ml of blood will be collected in plain vacutainer tubes and 2ml in ethylene diamine tetra-acetic acid (EDTA) tubes. The whole blood sample will be used to test for haemoglobin using a handheld point-of-care testing device (Hemocue) and for fasting blood glucose (FBS) using handheld glucometer (Accuchek). If FBS is >110mg, we will provide 75g sugar and test after two hours for post-prandial blood sugar (PPBS). For PPBS, we will obtain 2ml of blood and prepare 1 aliquot of serum after centrifugation. The team shall set up temporary work stations for field-level processing of samples prior to approaching settlements for data collection. Such sites need to have a safe space for processing as well as stable electric supply for the centrifuge. The results of Hb, FBS and PPBS will be delivered to the household.

The health worker will ensure clotting at the site of venipuncture before moving to the next household. All samples will be labelled with unique identification numbers at the site itself and transferred to vaccine boxes with ice packs for transport. Within one hour, the samples shall be centrifuged for separation of serum. Both EDTA samples and the serum shall be stored in vaccine carriers with ice packs and the serum sample will be transported to a deep freezer. An interim storage site (typically a primary health centre or government hospital) that has a 24-hour deep freezer facility and is accessible within half an hour (by vehicle) of the processing site shall be identified in collaboration with the district health and family welfare department and government health services. Pooled samples will be sent to the laboratory for testing (transport time not exceeding one hour). A few (2–4) aliquots of plasma and serum shall be stored in a bio-repository for future analysis in 500μl cryovials. The reason for storage of biological material is to minimise potential discomfort and optimise research costs involved, in case there is a need to obtain biological data again for testing of other hypotheses in the future.

From serum : Lipid profile (total cholesterol, triglycerides, very low density lipoprotein, low density lipoprotein and high density lipoprotein cholesterol levels), FBS and PPBS will be analysed using a fully automated analyser using spectrophotometric principles.

From plasma : Genetic analysis will be conducted to assess mutation for sickle cell disease.

In all sites outside Karnataka, we will use non-invasive point-of-care testing devices for haemoglobin estimation.

Data collection tools: The data collection tool kit consists of three modules (see Extended data for the tools) 23 . The modules are adapted from widely used standardised household and woman’s survey questionnaires used in district level household survey and the NFHS 25 , 29 , Integrated Disease Surveillance Program (IDSP) non-communicable diseases risk factor survey questionnaire 30 based on the WHO-STEPS tool 31 .

The second and third phase will be conducted in Karnataka and Kerala sites with a more limited engagement in Arunachal Pradesh in northeast India. The focus will be on using realist inquiry 32 to build a plausible theory that explains tribal health inequality patterns. We shall begin with a set of hypotheses, drawing from the WHO’s social determinants of health framework, which includes various drivers of inequities in health, adapted to the south Indian regional context (see Figure 6 ). We will then identify theories from wider social science literature to explain overall tribal development in India and create a conceptual framework that integrates contextual information from study sites as well as theoretical insights 33 . We will then develop an initial middle range theory (MRT) from which sub-theories (program theories) and hypothetical frames in the form of context-mechanism-outcome statements (CMO configurations) can be formulated. We will purposefully select case studies that will use both quantitative data and qualitative data to develop, iteratively test and refine an explanatory theory in three to four cycles 33 – 36 . The preparation of initial MRTs as well as the CMO configurations will closely align with discussions within the research team across the three proposed sites, such that they will aid the design and implementation of the case studies.

Figure 6. Left panel: the framework proposed by the WHO Commission on Social Determinants of Health, adapted for our study (especially at the level of socioeconomic and political context and at the interface of structural and intermediary determinants). Right panel: illustration of possible hypotheses that we will test in objectives 1 & 2, drawing from the SDH framework.

Initial phase of objective 2 will add more hypotheses based on literature (initial MRT).

The case studies shall focus on testing/refining the initial MRTs through three to four iterations ( Figure 7 ). At this stage, based on ongoing literature synthesis and preliminary results from three of the phase 1 cross-sectional surveys, initial MRTs are likely to focus on mechanisms of inequity across multiple levels ranging from governance (macro), health services (meso), community processes (micro) and their interfaces.

Explaining the contribution of historical and social factors in determining current geographical remoteness of a village

Explaining poor healthcare experience in secondary/tertiary care for tribal communities

Explaining intra- and inter-tribal differences based on site-specific inequality patterns observed in the survey

Contrasting above MRTs and their results in an area with a tribal majority (Arunachal) to explain/test if remoteness affects tribal communities similarly there as well.

Site-based case studies: We will develop case studies that focus on one or more of the following: geography (village/settlement or entire site), phenomena/experience (healthcare seeking experience in secondary/tertiary centre), socio-political role, ethnicity (Adivasi group) or at the interfaces between the community and non-governmental organisations (NGOs) etc. in order to further refine/test the initial MRT. Qualitative data using in-depth narrative inquiry, field notes and observational data will be used to prepare case studies.

Refining the MRTs: While each case study will aim to deepen the testing of one or more CMOs, upon completion of each site-based case study, we will refine, strengthen or refute (elements of) the initial MRT.

For phase 2, the sampling strategy shall be purposive. Each case study shall try to achieve a diversity of participants in terms of age, sex, location and social roles played in that society. Sample sizes shall be typically 4� participants (for in-depth interviews) in each case study, depending on the nature of CMOs designed in the initial MRT. The number in each case study could vary depending on the refining process. After each round of data collection and preliminary analysis, the next round of participants shall be determined based on the type of inquiry to be initiated. Data collection for each case study will be considered complete either on achieving saturation in terms of themes/content or upon achieving sufficient refining of the CMO. We will use the critical comparison of cases to also test hypotheses drawn from the social determinants of health (SDH) framework earlier at various levels (a few are illustrated in right panel in Figure 6 ).

The third phase shall only be implemented in the Karnataka site. This phase will closely follow participatory approaches within the implementation research framework 37 . We shall conduct multiple rounds of consultations with the ZBGAS, local NGOs and implementers based on the results of the cross-sectional survey and realist evaluation. In this process, we shall identify willing partners (either NGOs and/or a district administration or state partner interested in enhancing equity of their tribal populations) for co-production of one or a few interventions broadly aimed at enhancing health equity (equity-enhancing health system intervention, EeHSI). An initial intervention design shall be offered for discussion among partners and will be adapted based on discussions with partners over a series of meetings/workshops. The MRT will allow the identification of entry points into addressing health inequities of the district’s tribal population. Some of the plausible entry points are foreseen at three levels: health systems governance (agenda/priority setting at government/policy/institutional levels), health services (improving the interface between tribal communities and government health services), and community (strengthening community-based platforms/structures to facilitate care or improve accountability of existing services). Current discussions with the ZBGAS based on initial exploratory workshops indicate the following possible directions for co-produced interventions.

Interventions focusing on health systems governance: Inter-sectoral action for health with gram panchayats covering tribal populations on implementing existing programs related to tobacco/alcohol use.

Interventions focusing on improving access and care with health workers and health services: Depending on insights from phase two, this could focus on an interdisciplinary researcher-implementer-community platform for designing interventions for health problems specific to tribal populations and for health problems that are not yet being effectively responded to among tribal populations (cardiovascular disease care including stroke, chronic obstructive airway diseases, haemoglobinopathy, mental health including deaddiction for tobacco and alcohol). Other options include interventions that strengthen care for non-communicable diseases at primary health centres catering to tribal populations, or those that improve acceptability of care through improved cultural competence of health workers and hospital staff at distant/higher level hospitals where tribal patients are referred.

Community health and accountability: A package of interventions in partnership with the village health and sanitation committee and the ZBGAS to improve navigation across the health services and facilitation of benefits from various schemes designed for people below the poverty line (such as the recently launched Ayushman Bharat scheme that seeks to provide free care at the point of service delivery but may not be easily accessible for marginalised populations such as tribal communities). The intervention could also focus on strengthening the leadership of the ZBGAS in engaging more effectively with the district and state governments to address specific health needs of the tribal community.

The design and implementation of the EeHSI through the action-reflection cycles that are characteristic of participatory action research (see Figure 8 ) will also serve the purpose of validating the MRT. The qualitative data collection that began for objective 2 will continue during this phase. This phase will end with formulating how the intervention worked, for whom and under what conditions 38 , 39 . Data collected in phase 3 shall be (a) notes of consultative meetings and workshops, (b) intervention implementation data, and (c) in-depth interviews of people involved in the intervention and field notes from observations. Anonymised secondary data related to the intervention implementation will also be collected.

Figure 8. Action-reflection cycles in phase 3.

We hope that the health inequity patterns and other data revealed by our study may help characterise the population and establish a long-term cohort. We are unaware of well-designed long-term cohort studies among tribal populations and this will go a long way in understanding causality of poor outcomes in tribal populations over time. Along with this, we aim to initiate an embedded participatory research agenda involving community-based organisations, implementers and researchers in a collaborative platform to design and implement context-specific interventions to mitigate health inequities.

The analysis of the data shall be organized across multiple levels (individual, household, village, site/landscape).

At the village and landscape level, an index of socio-geographic remoteness for each village will be used to identify villages with similar socio-geographic characteristics using principal component analysis. Based on scores obtained from the principal component analysis, we will identify one or more indices that summarise different configurations of the input variables along multiple axes. This will allow us to compare villages in and around forests across and within sites (along these indices), and to examine if and how village averages of health and nutrition outcomes vary along a gradient of socio-geographic remoteness. A geospatial analysis to examine correlates of geographical disadvantage in terms of poor health and nutrition outcomes and type of village (predominantly tribal versus non-tribal) will be prepared for each site, and patterns will be examined across sites to generate site-specific hypotheses to explain these patterns.

For the statistical analysis of individual health parameters, we will track the following response variables: mortality rates (maternal and infant), nutritional status (body mass index, wasting and stunting), haemoglobin percentage, disease prevalence (hypertension). Intermediate outcomes such as access and utilisation of health services (outpatient care, inpatient care, maternal and child health), coverage rates (immunisation, select disease control programme indicators) and out-of-pocket expenditure will also be considered as response variables. We will model these response variables as a function of predictors such as protection regime (whether within a protected area such as a reserve forest/tiger reserve etc. under the wildlife protection laws), generations/years since resettlement, distance to nearest road, nearest primary health centre, nearest town, and proportion of forest cover. We will use Generalised Linear Mixed Models (GLMMs) to analyse data and apply an appropriate model-building approach to select a set of models that best explain health access and outcomes across tribal and non-tribal populations in relation to geographical access and/or social disadvantage.

The analysis approach has been explained earlier. Through iterative insights built from the case studies and the refining of the MRT, we will generate a refined MRT that explains the site-specific inequity patterns with analytical generalizability to similar tribal population contexts. The case study series coupled with quantitative data analysis from the cross-sectional survey will provide us with a systematically developed body of knowledge of the underlying causes of relative social disadvantage within and across tribal communities, as well as with nearby non-tribal communities.

Detailed documentation of the agenda-setting stage and the future iterations of participatory inquiry with the community-based organisation and other stakeholders shall be conducted. The intervention shall be monitored and a qualitative inquiry conducted to examine if, and how, the intervention addressed one or more drivers of inequities in this population. Multiple iterations of action-reflection cycle shall be attempted in line with the PAR approach.

Phase 1 has received ethics approval from the institutional ethics committee of Institute of Public Health, Bangalore (Study ID IEC-FR/03/2018 vide IEC letter number IPH/18-19/E/226 dated 5th July 2018, valid till July 2019, renewed vide IEC IPH/19-20/E/183 valid till March 2020). Relevant portions of phase 1 that relate to biomedical data collection have also received ethics clearance from Mysore Medical College and Research Institute (vide letter from ethics committee dated 2 August 2018). Ethics approval procedures are ongoing for phases 2 and 3. Extended data related to ancillary care, problems foreseen in the conduct of the study, data management and quality are available (see Extended data) 40 . The ancillary care plan outlines the course of action to be undertaken when particular health problems are either reported to or witnessed incidentally in/around households visited by the study team data collectors during the household survey or any other visit related to data collection in the course of phases 2 and 3. Given that the project study sites are in remote locations with limited earlier efforts and experiences with conducting surveys, potential problems foreseen in the conduct of the study have been outlined. These pertain to the logistics of conducting the study in sites that are very far from each other, preparations needed to obtain relevant permissions to enter protected forest areas and measures to be taken in case of high refusal rates at the sites.

The study involves close consultations, discussions and participation by several local (district and community-based) and state level actors in the southern Indian study site. Broadly, the dissemination will focus on public engagement and policy engagement.

Under our policy engagement strategy, we will organize multiple district-level meetings every year with implementers and community-based organisation representatives to share findings as and when they are available. This will include meetings to discuss survey findings (objective 1), to discuss the case study findings and challenge the middle range theories (objective 2), and finally co-create an intervention. State level engagement will include the preparation of a policy engagement plan that will begin with the formulation of a specific objective (based on the results) and strategies to achieve this policy change objective. We shall seek policymaker and implementer involvement early on in the project to avoid approaching them as passive consumers of research data, and rather invite them as active participants in the research (especially in objectives 2 and 3). We will work closely with tribal affairs, health, forests and women-child development departments at the state level.

Public engagement shall focus on making anonymised data and appropriate visualisations publicly available through open-data initiatives and platforms. We will also explore the possibility of involving local tribal youth in photo stories and of creating opportunities for local folk art to engage with research findings related to health inequities, by helping to integrate research themes into local theatre/art forms.

Phase 1 of the study has now completed data collection in three sites, and data collection is ongoing in two more sites; phase 1 data collection is expected to be completed by December 2020. Ethics clearance for phases 2 and 3 is in its final stages and data collection for phase 2 is expected to begin in January 2020. Phase 3 is expected to begin in June 2020.

Current strategies for improving tribal health draw upon the experience of vibrant (but geographically limited) NGOs and civil society. There are inspiring examples of organisations that have done pioneering work both in service delivery as well as activism/advocacy in geographically remote areas and among socially vulnerable communities. While these are valuable and could inspire new thinking about the nature of engagement with communities, systematic and participatory research embedded within forested areas with tribal population is still limited to few locations and organisations. The THETA study aims to initiate a multi-institutional and multi-stakeholder tribal health research and action agenda in southern Karnataka.

Partly, poor health outcomes among Indian tribal communities can be attributed to poor availability and quality of information on access and utilisation of health services, illness profiles, and health-seeking behaviour 6 . However, the availability of information to plan and manage health services for designing contextually relevant public health interventions is lacking. Whether their poor health status is due to their remote location or if, and how, social disadvantage plays a role in this is less well understood. Wherever systematic and historical social disadvantages exist, they in turn create adverse societal conditions that prevent these populations or sub-groups from realising individual measures to overcome health or social inequalities. In this sense, the existence of any social disadvantage is an essential pre-requisite for inequity. Hence, it is important to understand the role of social disadvantage in driving poorer access, utilisation, and health outcomes among tribal communities in order to achieve equitable health.

In the current proposal, we seek to build upon these preliminary insights from the field and from the literature on tribal health in India. Among the determinants of tribal health, environmental and social determinants are less well studied 14 . Further, there is limited research on “the pathways through which health inequities are created, and the political or policy environment that facilitates the processes” 41 . Similarly, research on interventions, either in health systems or among communities, that mitigate health inequities is scarce 5 , 6 , 41 . Among the social determinants of tribal health, geographical remoteness, proximity to forest areas, cultural distance from the “mainstream” population, historical isolation and social stratification have all been postulated to have a significant effect on health outcomes. However, a global explanation that lumps together all these social determinants will not address the specific differences within and across tribal and non-tribal communities. Understanding the specific nature of these interactions within particular contexts helps implementers and planners improve access to and utilisation of health services and plan equitable interventions in tribal populations. Especially in tribal health, the social determinants related to land ownership and access to forest resources, roads, and other amenities also have implications for forest conservation and are expected to be outcomes of a negotiated dynamic between restrictive forest protection legislation on the one hand and enabling tribal development policies and initiatives on the other 2 , 9 . We seek to explore how tribal health is a negotiated outcome of localised interaction between geographical and social factors. This includes examining local power dynamics within and across tribal and non-tribal communities and socio-political factors.

Data availability

Underlying data

No underlying data are associated with this article.

Figshare: Comparison of demographic, health and nutrition indicators between scheduled tribe (ST) and non-ST population across six states in India complied from various sources. https://doi.org/10.6084/m9.figshare.10028804.v3 7

Figshare: THETA project: Ancillary care, problems foreseen and quality. https://doi.org/10.6084/m9.figshare.10292999.v1 40

Figshare: THETA tribal health survey questionnaire including list of modules in the survey tool, their sources and intended respondents. https://doi.org/10.6084/m9.figshare.10292963.v1 23

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).


Results

Spatial distribution of HIV

To determine the health status of a population, Demographic and Health Surveys (DHS) periodically organizes surveys to gather relevant data, focusing on specific countries. In our study we used the DHS data collected in the Ivory Coast during their 2012 campaign 3 . Based on the measurement, DHS provides estimates of HIV prevalence at sub-national level with a low spatial resolution, determined by 10 administrative regions (Fig. 1(a)). Estimates of the HIV prevalence rate range from 2.2 to 5.1% and reveal the spatial variability of the distribution of HIV infection across the country.

(a) HIV prevalence rate by administrative regions. (b) HIV prevalence rate by departments for the 15–49-year-old population; estimated values range between 0.6 and 5.7%. (We used open source QGIS software 59 to create maps from (a) DHS data 3 and (b) UNAIDS estimates 25 .)

Due to initiatives to examine the spatial heterogeneity of HIV 24 , new methods emerged, aiming to provide HIV estimates at a finer resolution. An approach that employs kernel estimation based on spatial DHS measurements with an additional adjustment to UNAIDS data, made estimates for 50 departments of the Ivory Coast available 25 (see Methods). After redistributing disease frequencies across 50 departments, the HIV prevalence map (Fig. 1(b)) shows higher spatial variability in disease distribution from 0.6 to 5.7%. We can notice the hot spots of epidemics – departments severely hit by HIV. The map enables us to explore links between the connectivity and mobility patterns derived from D4D data and HIV prevalence with increased spatial resolution. Although the quality of HIV estimates (imposed by DHS measurement sampling) at department level varies from good and moderate to uncertain, the data has the highest spatial resolution currently available for studying the HIV epidemic in the Ivory Coast.

Along with the HIV distribution map, we provide additional maps showing the locations of the 10 largest cities of the Ivory Coast (Fig. 2(a)) and the population density aggregated at the department level (Fig. 2(b)), which are necessary for understanding the results.

(a) 10 largest cities of the Ivory Coast (b) Population distribution across departments. (We used open source QGIS software 59 to create maps).

Communication and mobility patterns

Social interactions and mobility mediate the spread of infectious diseases 17,26,27 . When examined in a spatio-temporal context, they can uncover how a disease propagates and can explain the variability in the prevalence distribution. Epidemic patterns can be studied at different scales, spanning from short-range commuting flows to long-range intercontinental connections 28 . The level of detail in quantifying social interactions and mobility can be chosen according to the scale of interest. While global epidemic patterns are mainly determined by the airline network, country-level epidemics require finer-resolution data sources. To better understand the spatial epidemiology of HIV across the 50 departments of the Ivory Coast, we analyzed the collective communication and mobility connections from mobile phone data. We estimated pairwise connections among departments by measuring communication and mobility flows. To accomplish that, we explored the “antenna-to-antenna data” (SET1) and the “long term individual trajectories” (SET3) in the D4D dataset 23 .

SET1 provided us with insight into the communication flow between each pair of antennas on an hourly basis. The strength of the communication flow is expressed through the number of calls. We assigned each antenna to a corresponding department and then aggregated the number of calls at the department level during a 5-month observation period. SET3 shed light on the mobility of people, providing the geographic location of users while using their phone to make calls or send messages. Since records in SET3 contain the user ID, location at the sub-prefecture level and time stamps indicating when the phone was used, we were able to use them to estimate the location of the user’s home. Based on the most frequent location, we assigned each user to their home department. Then we counted the user’s movements from home to other locations over the entire 5-month observation period and aggregated users’ movements at the department level.
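The home assignment and movement counting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(user_id, department_id, timestamp)` is hypothetical, locations are assumed to be already mapped from sub-prefectures to departments, and a movement is counted each time a user appears in a new department other than their home.

```python
from collections import Counter, defaultdict

# Hypothetical records: (user_id, department_id, timestamp).
records = [
    ("u1", 37, 1), ("u1", 37, 2), ("u1", 38, 3), ("u1", 37, 4),
    ("u2", 42, 1), ("u2", 42, 2), ("u2", 37, 3),
]

def home_departments(records):
    """Assign each user to the department where they were observed most often."""
    seen = defaultdict(Counter)
    for user, dept, _ in records:
        seen[user][dept] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in seen.items()}

def movement_counts(records, homes):
    """Count moves into non-home departments, keyed by (home, visited) department."""
    flows = Counter()
    last_seen = {}
    for user, dept, _ in sorted(records, key=lambda r: (r[0], r[2])):
        # A movement is recorded only when the department changes
        # and the new department is not the user's home.
        if dept != last_seen.get(user) and dept != homes[user]:
            flows[(homes[user], dept)] += 1
        last_seen[user] = dept
    return flows

homes = home_departments(records)
flows = movement_counts(records, homes)
```

The resulting `flows` counter is already aggregated at the department level and can serve as the raw pairwise mobility matrix.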

In the pairwise communication and mobility matrices, we identified strong ties for each department, which represent links to other departments with the connection strength significant at α = 0.01 (see Methods). Before searching for the strong ties, we normalized the matrices by the corresponding population sizes. SET1 encompasses 5 million users. We distributed them into departments using population frequencies provided by Afripop data 29 and used the resulting per-department populations to normalize the communication flows. To normalize the migration flows, we used estimates based on the derived home locations of the users to calculate the required population size per department. Each communication or mobility flow was normalized by the corresponding population size of the originating department. The overall flow between two departments was then quantified as the sum of the normalized flows in both directions. This enabled us to eliminate the bias caused by the different population sizes when identifying the strong links.
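The normalization step can be illustrated with a short sketch. The department labels, flow counts and populations below are invented for the example, and dropping within-department traffic is an assumption made here because only ties between departments are of interest.

```python
def normalized_overall_flow(flows, population):
    """Divide each directed flow by the origin population, then sum both directions."""
    overall = {}
    for (origin, dest), count in flows.items():
        if origin == dest:
            continue  # assumption: inner traffic is not a tie between departments
        pair = tuple(sorted((origin, dest)))
        overall[pair] = overall.get(pair, 0.0) + count / population[origin]
    return overall

# Hypothetical directed flows (e.g. number of calls) and department populations.
flows = {("A", "B"): 200, ("B", "A"): 50, ("A", "C"): 10}
population = {"A": 1000, "B": 100, "C": 500}

overall = normalized_overall_flow(flows, population)
```

Here the A to B flow contributes 200/1000 and the B to A flow contributes 50/100, so the smaller department B is not penalised for its size when the two directions are combined.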

The strong ties discovered in communication flows are shown in Fig. 3(a). This visualization further emphasizes the strongest links, and communication hubs emerge. Remarkably, the hubs correspond to HIV hot spots, and larger hubs have higher prevalence rates. The map in Fig. 2(a) shows how the identified hubs relate to the locations of urban centers. The largest hub corresponds to the department containing the two largest cities, Abidjan and Abobo, and has a degree of 46 significant links. The other highly connected hubs are located in the Southwest region, in the departments with IDs 38, 37 and 42 (see Fig. S1), with degrees of 22, 12 and 8, and all are severely hit by HIV. Interestingly, departments 38 and 42 do not contain any of the top 10 cities. San Pedro, the fifth-ranked city, is located in department 37. There are also hubs around the cities of Yamoussoukro and Bouake. Communication hubs usually, but not necessarily, correspond to departments with large urban centers, as we have also observed hubs without large cities.

Strong connectivity ties for (a) overall communication (b) night communication.

The hubs are labeled with the corresponding HIV prevalence rate shown in Fig. 1(b). Link thickness and color, ranging from yellow to red, are proportional to the strength of communication flow. (We used open source QGIS software 59 to create maps).

Additionally, we visualized the night communication, constrained to the time interval between 1 AM and 5 AM, and obtained a similar structure of the connectivity graph (Fig. 3(b)). The links of night communication are colored with the same palette as overall communication, but relative to their own maximum. Values of absolute flows are available in the legend. The largest hub is around Abidjan, which has a degree of 49. The size of the hubs in the Southwest region increased further.

In both graphs we can notice that departments in the northern part of the country have weaker links. The link strength, quantified as the normalized communication flow between departments, includes both residents of the departments and visitors. In this context, weaker links imply fewer social interactions, lower attractiveness of the department for visitors, or an interplay of both. As social connections shape movement patterns and increase the likelihood of contact between individuals 30 , the presented graphs could help in understanding the spatial distribution of the disease. The visibly sparser and weaker social connectivity in the northern part of the country may have affected epidemic spread by making it harder for the disease to propagate. This potentially explains the lower HIV prevalence in the north of the Ivory Coast.

The strong ties discovered in mobility flows (Fig. 4(a)) have an obvious localized character. They connect departments that are geographically close, but, on a global scale, we can also observe strong migratory pathways. One connects the two largest hubs: the largest city, Abidjan (5.1% prevalence rate), and the capital city, Yamoussoukro (3.1% prevalence rate). From the center of the country we can notice strong pathways to the region in the West (3.6% prevalence rate, Fig. 1(a)) and the North-central region (4.0% prevalence rate, Fig. 1(a)). The East-central region, with a prevalence rate of 4.0%, is strongly connected to Abidjan. The map of the mobility flows revealed the pathways that connect regions with higher prevalence.

Strong mobility ties discovered through summarizing (a) all mobilities (b) mobilities with 3 days or longer stay at the destination.

The hubs are labeled with the corresponding HIV prevalence rates shown in Fig. 1(b). The link thickness and color, ranging from yellow to red, are proportional to the strength of mobility flow. (We used open source QGIS software 59 to create maps).

In addition to the observed general mobility of users, we explored long-term mobility. We measured how long users stay at their destinations and, in our migration analysis, considered only those stays that lasted longer than 3 days. The strong ties discovered in long-term mobility flows are shown in Fig. 4(b). The resulting connectivity graph reveals how long-term migrations link departments. Abidjan emerged as the most prominent hub for those migrations, with a hub degree of 49. In this light, we can denote this city, with the largest prevalence rate and high connectivity, as a driver of the epidemic in the Ivory Coast. As such, Abidjan needs careful monitoring of mobility flows, especially the high-risk longer-term mobilities, in order to prioritize interventions and control the further spread of HIV.

Extracted features

For each department of the Ivory Coast, numerous features were extracted during the course of the study presented, with the goal to quantify behavioral and mobility patterns potentially relevant to the measured HIV prevalence rate. Overall, we extracted 224 different features and grouped them into 4 categories: connectivity, spatial, migration and activity (phone use).

The connectivity features were obtained from SET1, in which the communication flow is expressed through the number of calls and their duration. For each department we used the information on the originating and terminating antenna and aggregated its inner, originating, terminating and overall communication. The overall communication was further separated based on type-of-day and time-of-day constraints. We considered two types of days, weekdays and weekends, and used 1-hour time slots (00–01 h, 01–02 h, …, 23–24 h) and 8-hour time slots (00–08 h, 08–16 h, 16–24 h) to express the time within a day. For each of these discrete intervals, the features related to the number of calls represent the sum over the whole five-month observation period. Once extracted, they were normalized by the corresponding department population size, estimated from Afripop data 29 and rescaled to fit the 5 million users monitored in our data set. Features related to the duration of calls represent average values. In total, 120 connectivity features related to different time slots and types of days were extracted, half describing the number of calls and half describing the average duration of calls.

Spatial, migration and activity features were derived from SET3. To craft spatial features we explored the positions and the distribution of locations visited by users. We measured the radius of gyration, the area and perimeter of the convex hull of users’ movements, as well as the diameter of their range 31,32,33 . The features were derived both for all locations visited by a user and for specific subsets of locations: those visited at night, on weekdays, on weekends, and on weekday and weekend nights. In addition, we calculated the total distance travelled by each user. In total, 25 spatial features were created, representing 95th percentile values across users matched to departments based on their home location. We first considered averaged values instead of 95th percentile values for users in the corresponding departments, but the predictive models achieve better results when spatial features capture only the top five percent of users, i.e. the patterns of users who cover larger regions through their mobility have higher predictive power for the prevalence of HIV.
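As a hedged illustration of one spatial feature, the sketch below computes the radius of gyration per user and then takes the 95th percentile across the users of each home department. It assumes planar coordinates (for example, projected kilometres) and a nearest-rank percentile; the user IDs, coordinates and department IDs are invented, and the original study may have used different definitions.

```python
import math

def radius_of_gyration(points):
    """Root-mean-square distance of visited points from their mean location."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def percentile(values, q):
    """Nearest-rank percentile for an integer q (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical visited locations per user and home department per user.
user_points = {
    "u1": [(0.0, 0.0), (2.0, 0.0)],
    "u2": [(0.0, 0.0), (0.0, 6.0)],
    "u3": [(1.0, 1.0), (1.0, 1.0)],
}
homes = {"u1": 37, "u2": 37, "u3": 38}

by_department = {}
for user, pts in user_points.items():
    by_department.setdefault(homes[user], []).append(radius_of_gyration(pts))

# One spatial feature per department: 95th percentile of the users' gyration radii.
gyration_p95 = {dept: percentile(vals, 95) for dept, vals in by_department.items()}
```

Using a high percentile instead of the mean makes the feature reflect the most mobile users of a department, which is the behaviour the text reports as more predictive.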

To extract migration features we tracked the changes in locations. Every time a user changed department, we added a single migration link from their home to the observed department. We summarized all movements into a pairwise migration matrix by iterating this procedure for all users. Besides quantifying all movements, we also identified those where users were away from home for more than a defined number of days (1, 2, …, 10) to explore longer-term migrations. The features were divided further according to the direction of the mobility into “in” or “out” migration, bringing the total number of migration features to 22.

The activity features were extracted similarly to the connectivity features. However, in SET3, we cannot distinguish the direction of communication (in or out), nor do we have the duration of communication. Therefore, we refer to those features simply as activity since they can count only when and where users were active. As with the connectivity features we considered two types of days: weekdays and weekends. The time of day was again considered in 1-hour time slots, 8-hour time slots and whole days. The total number of activity features used was 57.

All the features capture the cumulative effect of human connectivity or mobility observed over a five-month period. We focused on this long-term perspective in our feature extraction, in order to understand the spatial distribution of HIV prevalence better.

Predictive models

HIV prevalence rates across the departments of the Ivory Coast range from 0.6 to 5.7%. Each of the 50 departments was represented with a vector of extracted features values and the corresponding prevalence rate. In this feature space, we built regression models and evaluated their performance when predicting a department’s prevalence rate. All features were normalized by dividing each feature with its mean value across the whole data set, before regression was attempted.
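The normalization described above amounts to dividing every feature column by its mean. A minimal sketch, with invented numbers and a hypothetical function name:

```python
def normalize_by_mean(matrix):
    """Divide every column (feature) by its mean across all rows (departments)."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[value / mean for value, mean in zip(row, means)] for row in matrix]

# Hypothetical feature matrix: one row per department, one column per feature.
features = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
scaled = normalize_by_mean(features)
```

After this step each feature has a mean of 1, so features measured on very different scales become comparable. The sketch assumes no feature has a zero mean.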

Experiments were conducted using two different regression methods: Ridge 34 and Support Vector Regression (SVR) 35 . The regression models were initially built using the four different groups of features separately. To select smaller subsets of the most relevant features, both regression methods were combined with the recursive feature elimination (RFE) method 36 . In the final stage, we considered an ensemble approach, stacked regression 37 , through which we fused the 4 heterogeneous feature sets into a single integrated prediction model.

The prediction of disease levels needs careful evaluation 38 in order to avoid situations in which models built on randomly generated data perform comparably to those created on possibly meaningful data. Therefore, to estimate the predictive capacity of a model, we measured the prediction errors and the correlations between the predicted and actual values for the models built on real data and for the same models built on random data sets, obtained by randomly permuting the values of each feature.
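A random counterpart of a feature matrix can be produced by shuffling each feature column independently, which preserves each feature's distribution while breaking its link to the prevalence values. The following is a sketch with invented data; the function name and the fixed seed are assumptions for reproducibility.

```python
import random

def permute_features(matrix, seed=0):
    """Shuffle each feature column independently across rows (departments)."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    # Transpose back to one row per department.
    return [list(row) for row in zip(*columns)]

# Hypothetical feature matrix: one row per department.
features = [[1, 10], [2, 20], [3, 30], [4, 40]]
shuffled = permute_features(features, seed=42)
```

A model trained on `shuffled` against the unchanged prevalence values gives the baseline that real-data models should clearly outperform.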

Experiments were divided into two parts: the first stage focused on the 15 departments with good and moderate estimates of HIV prevalence, while in the second we used data for all 50 departments. In Tables 1 and 2, we report the correlation coefficients (ρ) and relative root mean square errors (RRMSE) produced by the models during leave-one-out (LOO) cross-validation for two experimental setups (15 and 50 departments).
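The evaluation loop can be sketched as below. To keep the example self-contained, a one-feature least-squares line stands in for the Ridge and SVR models, and RRMSE is assumed to be the root mean square error divided by the mean of the actual values; that definition is an assumption of this sketch, not one confirmed by this section. The data are invented.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for a single feature (stand-in for Ridge/SVR)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def loo_predictions(xs, ys):
    """Predict each sample with a model trained on all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rrmse(actual, predicted):
    """RMSE divided by the mean actual value (assumed definition)."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n)

# Hypothetical feature values and prevalence rates (%) for six departments.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.8, 1.1, 2.0, 2.4, 3.1, 3.3]

preds = loo_predictions(xs, ys)
rho = pearson(ys, preds)
error = rrmse(ys, preds)
```

Because every prediction comes from a model that never saw the held-out department, `rho` and `error` estimate out-of-sample performance, which matters with as few as 15 or 50 samples.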

Leave-one-out (LOO) evaluation enabled us to select the best model among those we built. On the subsample of 15 departments, the models built with SVR and Recursive Feature Elimination (RFE) perform best. In the best models, RFE reduced the initial sets of features to subsets of 60, 6, 3 and 4 for connectivity, spatial, migration and activity features, respectively. Selected features are highlighted in Tables S2 and S3, which include all features and their descriptions. SVR models surpassed Ridge, and reducing the size of the feature set with RFE improved the performance of both, but the SVR method benefited more from the RFE procedure than Ridge. The highest correlation coefficient (0.753) between the predicted and actual values is achieved with SVR on a reduced set of the 6 most relevant spatial features. The lowest error of 0.287 is reached by combining regressions learned on different sets of features. Through the linear combination of the four models, the ensemble approach predicts HIV prevalence values that are well correlated with the actual values (ρ = 0.710). All models built on the real features outperformed their random counterparts.

The second part of the experiments evaluated the proposed methods and extracted features on the full set of 50 departments, including those with uncertain HIV estimates. Table 2 reports the obtained results. As expected, the performance declined. Predictions are moderately correlated with actual values. The best result, ρ = 0.627 and RRMSE = 0.509, is achieved with the SVR model on a reduced subset of activity features. The ensemble approach that combines four SVR + RFE models results in ρ = 0.518 and RRMSE = 0.514. The models created on randomly permuted features predict HIV with higher errors and without correlation with actual values, and underperform those built on real features.

Feature contribution

Once a regression model is built, we can use it to estimate the risk of disease in defined spatial units. Furthermore, we can examine what the model learned from the data. Model explanation techniques 39,40 can unveil black-box predictive models by estimating contributions of each feature over the whole range of its input values. For example, we can examine how changes in an activity feature affect the value of the HIV prevalence rate, obtained by the model built. The outcome is a plot of the contribution as a function of feature values. This model-explanation procedure provides us with the opportunity to identify specific features that impact prevalence rate most of all and to quantify their contribution. The features identified in this manner can later be continuously measured and leveraged for the monitoring of changes in the HIV prevalence rate and to create early warning signs for possible increase of the infected population.

To conduct the feature contribution analysis, we used the best model (SVR + RFE) built for each set of features, since the ensemble method is just an additive combination of models built on different sets. In the analysis we used models built on a subsample of departments (15 with good or moderate HIV estimation) and focused on the highly ranked features. By running the RFE procedure until only one feature remains, we obtained ranks for all features and then selected the top 3. For the selected features (f_{t,i}, where t denotes the set of features and i is the index of the feature in that set) we conducted the contribution analysis. We calculated the contribution of each feature over the full range from its minimal to its maximal value in m equally distributed points. The contribution analysis included a randomization process to create two instances as inputs to the regression model. The first instance is a vector where each feature value is sampled at random from the data set t. The second instance differs in the i-th feature, which is not random but takes a particular value from the set of previously defined m values that are currently under contribution analysis. The contribution of the feature is the difference between the outputs of the regression model produced using the first and the second instance as input. Due to the randomization process, this procedure is repeated for a defined number of iterations. By averaging the results from all iterations, we obtained the final value for the contribution. In addition to this value, we also report the standard deviation of the values obtained in each iteration, which provides information on the stability of the contribution and quantifies complex interactions among features. We created plots (Fig. 5) for 12 features, the top 3 for each of the four data sets, ordered from left to right according to RFE ranks, sampled in m = 12 points with contributions calculated through 100 iterations. In addition, the 12 graphs that correspond to features ranked from 4th to 6th place for each data set are provided in the supplement (Fig. S2). Models where RFE selected fewer than 6 features (migration, activity) were extended only for the purpose of visualisation. All graphs contain points of the mean contribution and error bars with the length of the standard deviation. Red color indicates points with feature values that are associated with increased HIV prevalence and orange color indicates feature values that are associated with decreased HIV prevalence. The gray part of each graph denotes the range where the standard deviation interval crosses zero, meaning that the contribution is neither strongly positive nor strongly negative.
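The contribution procedure described above can be sketched as follows. The toy linear `model`, the data and the function names are hypothetical stand-ins for the trained SVR + RFE model and the real feature sets, and the sign convention (output for the probed instance minus output for the random instance) is an assumption of this sketch.

```python
import random
import statistics

def feature_contribution(model, data, i, value, iterations=100, seed=0):
    """Mean and standard deviation of the contribution of feature i at a given value."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(iterations):
        # First instance: every feature value drawn at random from the data set.
        baseline = [rng.choice(data)[j] for j in range(len(data[0]))]
        # Second instance: identical, except feature i is fixed to the probed value.
        probe = list(baseline)
        probe[i] = value
        diffs.append(model(probe) - model(baseline))
    return statistics.mean(diffs), statistics.pstdev(diffs)

def contribution_curve(model, data, i, m=12, **kwargs):
    """Evaluate the contribution at m equally spaced values across the feature's range."""
    lo = min(row[i] for row in data)
    hi = max(row[i] for row in data)
    grid = [lo + k * (hi - lo) / (m - 1) for k in range(m)]
    return [(v, *feature_contribution(model, data, i, v, **kwargs)) for v in grid]

# Toy stand-in for a trained regression model (hypothetical).
def model(x):
    return 2.0 * x[0] + 0.5 * x[1]

# Hypothetical data set: one row per department, two features.
data = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]

curve = contribution_curve(model, data, i=0, m=12, iterations=100, seed=0)
```

Each entry of `curve` is a `(feature value, mean contribution, standard deviation)` triple, which corresponds to one point and its error bar in a contribution plot.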

Feature contribution graphs for 12 features: the top 3 features for each of the 4 feature types.

Points correspond to the mean contribution and error bars correspond to standard deviation. Red color indicates strong association to higher HIV prevalence and orange to lower HIV prevalence.

Contributions of the three connectivity features are presented in Fig. 5(a–c). The top three features represent the communication flow expressed as the number of calls per resident of a department during weekend days in the time slots 01–02 AM, 02–03 AM and 03–04 AM, over a 5-month period. We can notice that the top connectivity features are related to weekend night-time communication and all have a positive slope. A similar graph (Fig. S2) is obtained for the 5th ranked feature, related to weekday 03–04 AM communication. According to the model, the departments with higher night-time communication have a higher prevalence rate. In further analysis of the contribution plot shown in Fig. 5(a), values higher than 0.2 can be seen as indicators of behavior increasing the risk of infection and thus critical for HIV. For example, for the department where this feature has the maximum value, the expected HIV prevalence is higher than average by 0.3 ± 0.15. The plots for features ranked at 4th and 6th place (Fig. S2) refer to average call duration during the early morning hours (06–07 AM) and contribute to HIV prevalence in a different way. The graphs with negative slope indicate that, for departments where people have longer talks early in the morning, we can expect lower HIV prevalence. We can observe this as a social signature 41 and may hypothesize that longer talks early in the morning could be an indicator of emotionally close relationships and lower-risk behavior.

In the contribution analysis of spatial features, area and gyration stand out as the features with higher impact. Area is measured over weekdays, and gyration over weekday and weekend nights. The model suggests that departments where people tend to cover a larger area have a higher HIV prevalence rate (Fig. 5(f)). This is also confirmed by the 4th-ranked feature, which measures the area covered over weekends (Fig. S2). Gyration, a measure of the standard deviation from the mean location, negatively impacts HIV prevalence (Fig. 5(d,e) and also Fig. S2). It is no surprise that small gyration indicates higher HIV prevalence, since other studies have already shown that shorter movements are more likely in denser urban areas 42, and those urban areas are usually more affected by HIV. When the area covered is tracked only during night hours, the contribution graph has a negative slope, as it does in the case of gyration (see the graph for the 5th-ranked feature, area covered during weekday nights, Fig. S2).
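For readers unfamiliar with gyration, the following minimal sketch computes a radius of gyration as the root-mean-square distance of visited locations from their mean location. It assumes planar (x, y) coordinates and equal weight for every visit; the study's exact formulation (for example weighting by time spent, or using geographic distances) may differ.

```python
from math import sqrt


def radius_of_gyration(points):
    """Root-mean-square distance of points from their centroid.

    Assumptions: points is a non-empty list of (x, y) tuples in a planar
    coordinate system, and every visit has equal weight.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return sqrt(squared / n)


# Hypothetical visited locations at the corners of a 2 x 2 square
print(radius_of_gyration([(0, 0), (2, 0), (0, 2), (2, 2)]))
```

A user who stays near one location gets a value close to zero, while a user whose visits are spread out gets a larger value.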

The contributions of the overall in- and out-migration features are shown in Fig. 5(g,i). Both plots indicate that larger migration flows are associated with higher HIV prevalence. Incoming migrations have a notably strong impact: for the department where this feature has its maximum value, the expected HIV prevalence is 1.0 ± 0.5 higher than the average. Among the top three features is the one that quantifies the number of outbound migrations per resident of a department with a stay of more than 10 days. Its contribution plot, presented in Fig. 5(h), shows a negative impact. The plots for the features ranked between 4th and 6th place (Fig. S2) further show that out-migrations with stays longer than one day have a positive slope, while those with stays longer than 5 or 9 days exhibit a negative slope. The contribution analysis of the migration features thus uncovers an interesting phenomenon. The overall amount of migration is linked to higher HIV prevalence, and this positive slope holds for migrations of up to a few days, but beyond that the slope becomes negative. The slope changes once the thresholds of 4 days for out-migrations and 3 days for in-migrations are reached. The model suggests that the risk comes from shorter stays at host departments and higher dynamics in migrations, while longer stays are associated with lower HIV prevalence.

The contributions of the activity features, expressed as the number of calls and SMSs per resident of a department, are shown in Fig. 5(j–l). As with the connectivity features, night-time activity is strongly linked to HIV, and higher activity implies higher prevalence rates. This is further confirmed by the 4th- and 5th-ranked features, which encompass activity during weekday nights between 1 AM and 2 AM and during weekend nights between 4 AM and 5 AM. In contrast, the feature ranked 6th, which refers to early morning activity (07–08 AM), has a negative slope.

The RFE method helps us identify the subset of stronger factors that have the highest impact on HIV prevalence prediction. Contribution analysis further uncovers what the trained models learned from the data and allows us to compare them, analyze features and make decisions concerning the final model. The RFE ranking differs from the naïve approach that orders features based on their individual relevance (see Methods, Recursive feature elimination subsection, for further explanation). We can observe from Fig. 5 that feature impacts measured through contribution are not ordered exactly as with RFE, due to the interactions between features. The selected subset works in synergy to provide the prediction of HIV prevalence. If we use only the SVR-RFE model learned on spatial features, we have to measure the 6 selected features, make predictions and then estimate the corresponding contributions. If we want to rely on a combination of models, we need the coefficients used in the stacking regressors. The estimated values for combining the SVR-RFE models learned on connectivity, spatial, migration and activity features are 0.24, 0.27, 0.24 and 0.22, respectively. Feature selection and contribution analysis could also serve a new iteration of feature engineering. For example, the top 3 activity features cover the night-hour intervals 00–01, 01–02 and 02–03, and since they have similar contribution graphs they can be grouped into one feature that covers the 00–03 h interval. Evaluation reveals that this grouping produced a model with similar performance but lower complexity. In this way, we can search for better models. The resulting contribution plots can also be used to create new hypotheses in epidemiology, where disease distribution and spread are concerned, and, subsequently, to quantify the risk of an increase in HIV prevalence.
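To make the role of the stacking coefficients concrete, here is a minimal sketch that combines four per-model predictions using the reported values 0.24, 0.27, 0.24 and 0.22. It assumes a plain weighted sum without an intercept, which the text does not confirm; the function name and the example predictions are hypothetical.

```python
# Stacking coefficients reported for the four SVR-RFE models
WEIGHTS = {
    "connectivity": 0.24,
    "spatial": 0.27,
    "migration": 0.24,
    "activity": 0.22,
}


def stacked_prediction(predictions):
    """Weighted sum of per-model predictions.

    Assumption: the stacked estimate is a linear combination of the base
    model outputs with no intercept term.
    """
    return sum(WEIGHTS[name] * predictions[name] for name in WEIGHTS)


# Hypothetical HIV prevalence predictions (in percent) for one department
example = {
    "connectivity": 3.0,
    "spatial": 4.0,
    "migration": 2.0,
    "activity": 3.0,
}
print(stacked_prediction(example))
```

Note that the four reported coefficients sum to 0.97 rather than 1, so under this assumption the combined estimate is slightly smaller than a plain weighted average of the base predictions.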




A Taxonomic Update of Neotropical Pradosia (Sapotaceae, Chrysophylloideae)

Mário H. Terra-Araujo 1,2,*, Aparecida D. de Faria 3, Ulf Swenson 2

1 Instituto Nacional de Pesquisas da Amazônia, Programa de Pós-Graduação em Botânica (PPG-BOT), Av. A
2 Department of Botany, Swedish Museum of Natural History, Box 50007, 104 05 Stockholm, Sweden.
3 Universidade Estadual de Londrina, Departamento de Biologia Animal e Vegetal, Centro de Ciências Bi

* Author for correspondence ([email protected])


We provide a systematic update of Pradosia (Sapotaceae, Chrysophylloideae), including overall morphology, a key to all species, comprehensive morphological descriptions, geographic distributions, and important characteristics for each species. Phylogenetic analyses based on molecular data demonstrated that the genus is monophyletic and includes three main clades. Twenty-three species of Pradosia are accepted, which are mostly distributed in lowland rainforests on either white-sand or clayish soils in tropical South America. A rotate corolla with a short tube, lack of staminodes, a drupaceous fruit with plano-convex cotyledons, an exserted radicle below the cotyledons, and the absence of endosperm are diagnostic for the genus. Two names are reduced to synonymy, viz. Pradosia atroviolacea Ducke, syn. of P. grisebachii (Pierre) T. D. Penn., and Pradosia verrucosa Ducke, syn. of P. glaziovii (Pierre) T. D. Penn. The affinity of P. argentea (Kunth) T. D. Penn., a species known only from the type collection, remains uncertain, and the species is for now excluded from the genus.