
Add new field in PostgreSQL row while using ogr2ogr for shapefile to PostgreSQL


I am using the following ogr2ogr command to load my shapefiles into a PostgreSQL table. This command runs fine.

ogr2ogr.exe -append -lco GEOMETRY_NAME=geom -lco SCHEMA=public -f "PostgreSQL" PG:"host=localhost port=5433 user=postgres dbname=abc123 password=xxxxxx" myshapefile.shp -nln shapetable

What I want now is to add an extra field or fields (e.g. countryName=xyz, RoadName=street123, etc.) while using the above command to export my shapefile into the PostgreSQL table.

I tried to find the answer in another post, but that one was about adding an extra field while exporting GeoJSON into a PostgreSQL table.


Read the ogr2ogr documentation at http://www.gdal.org/ogr2ogr.html, especially the section about -addfields:

-addfields: (starting with GDAL 1.11) This is a specialized version of -append. Contrary to -append, -addfields has the effect of adding, to existing target layers, the new fields found in source layers. This option is useful when merging files that have non-strictly identical structures. This might not work for output formats that don't support adding fields to existing non-empty layers.

If this method does not suit you, it is always possible to alter the schema with ogrinfo:

ogrinfo PG:"dbname='test_db' host='localhost' port='5432' user='demouser' password='demopw'" -sql "ALTER TABLE test ADD COLUMN added_with_gdal TEXT"
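If you add the column this way, a follow-up SQL statement can fill it with a fixed value for the rows you just imported. A minimal sketch, assuming the target table from the question (public.shapetable) and the example field countryname:

ALTER TABLE public.shapetable ADD COLUMN countryname text;
UPDATE public.shapetable SET countryname = 'xyz' WHERE countryname IS NULL;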


5.3. Constraints

Data types are a way to limit the kind of data that can be stored in a table. For many applications, however, the constraint they provide is too coarse. For example, a column containing a product price should probably only accept positive values. But there is no standard data type that accepts only positive numbers. Another issue is that you might want to constrain column data with respect to other columns or rows. For example, in a table containing product information, there should be only one row for each product number.

To that end, SQL allows you to define constraints on columns and tables. Constraints give you as much control over the data in your tables as you wish. If a user attempts to store data in a column that would violate a constraint, an error is raised. This applies even if the value came from the default value definition.
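A minimal sketch of the kind of constraint described above, using a hypothetical products table whose price column accepts only positive values:

CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0)
);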


13.3.2. Row-level Locks

In addition to table-level locks, there are row-level locks, which can be exclusive or shared locks. An exclusive row-level lock on a specific row is automatically acquired when the row is updated or deleted. The lock is held until the transaction commits or rolls back, just like table-level locks. Row-level locks do not affect data querying; they block only writers to the same row.

To acquire an exclusive row-level lock on a row without actually modifying the row, select the row with SELECT FOR UPDATE. Note that once the row-level lock is acquired, the transaction can update the row multiple times without fear of conflicts.

To acquire a shared row-level lock on a row, select the row with SELECT FOR SHARE. A shared lock does not prevent other transactions from acquiring the same shared lock. However, no transaction is allowed to update, delete, or exclusively lock a row on which any other transaction holds a shared lock. Any attempt to do so will block until the shared lock(s) have been released.
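A short sketch of both lock types just described, using a hypothetical accounts table:

BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;   -- exclusive row-level lock
UPDATE accounts SET balance = balance - 100.00 WHERE id = 1;
COMMIT;

BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;    -- shared row-level lock; others may share it, but cannot update or delete the row
COMMIT;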

PostgreSQL doesn't remember any information about modified rows in memory, so there is no limit on the number of rows locked at one time. However, locking a row might cause a disk write, e.g., SELECT FOR UPDATE modifies selected rows to mark them locked, and so will result in disk writes.

In addition to table and row locks, page-level share/exclusive locks are used to control read/write access to table pages in the shared buffer pool. These locks are released immediately after a row is fetched or updated. Application developers normally need not be concerned with page-level locks, but they are mentioned here for completeness.



PostgreSQL is a relational database with many great features. There are also many so-called NoSQL databases, some of which, like CouchDB, are document databases. In CouchDB a document is automatically given an "_id" field if one is not present; when you want to fetch that one document, you can use this "_id" field, which behaves exactly like the primary key in relational databases. PostgreSQL stores data in table rows, while CouchDB stores data as JSON documents. On one hand CouchDB seems like a great solution, as you can have all the different data from different PostgreSQL tables in just one JSON document. This flexibility comes at the cost of having no constraints on the data structure, which can be really appealing at first and really frustrating when you have a huge database and some of the documents contain bad values or are missing some fields.

PostgreSQL 9.3 comes with great features which can turn it into a NoSQL database, with full transaction support, storing JSON documents with constraints on the field data.

Simple Example

I will show how to do it using a very simple example of a table with products. Each product has a name, description, some id number, price, currency and number of products we have in stock.

PostgreSQL Version

The simple table in PostgreSQL can look like:
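Something along these lines (a sketch, with column types assumed from the description above):

CREATE TABLE products (
    id INTEGER,
    name TEXT,
    description TEXT,
    price NUMERIC(10,2),
    currency TEXT,
    in_stock INTEGER
);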

This table allows us to insert products like:
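(a sketch, with made-up values)

INSERT INTO products (id, name, description, price, currency, in_stock)
VALUES (1, 'red running shoes', 'comfortable shoes for running', 79.90, 'USD', 10);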

Unfortunately the above table also allows for adding rows missing some important information:
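(again a sketch; the row below has an empty name, a negative price, and no description, currency or stock count, yet it is accepted)

INSERT INTO products (id, name, price) VALUES (2, '', -10.00);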

This should be fixed by adding constraints in the database. Assume that we want to always have unique not empty name, not empty description, non negative price and in_stock, and the currency should always be dollars. The table with such constraints is:
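(a sketch; the currency check uses 'USD' to stand for "dollars")

CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE CHECK (name <> ''),
    description TEXT NOT NULL CHECK (description <> ''),
    price NUMERIC(10,2) NOT NULL CHECK (price >= 0),
    currency TEXT NOT NULL CHECK (currency = 'USD'),
    in_stock INTEGER NOT NULL CHECK (in_stock >= 0)
);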

Now all the operations, like adding or modifying a row, which violate any of those constraints, just fail. Let’s check:
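(a sketch; the exact error text depends on the constraint name PostgreSQL generates)

INSERT INTO products (id, name, description, price, currency, in_stock)
VALUES (3, '', 'has no name', 10.00, 'USD', 5);
-- ERROR:  new row for relation "products" violates check constraint "products_name_check"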

NoSQL Version

In CouchDB, the row inserted into the above table would just be a JSON document looking like this:
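(a sketch, mirroring the example row above)

{
    "id": 1,
    "name": "red running shoes",
    "description": "comfortable shoes for running",
    "price": 79.90,
    "currency": "USD",
    "in_stock": 10
}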

The Trivial Solution

In PostgreSQL we can store this JSON as a row in the products table:
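(a sketch: a single text column holding the whole document)

CREATE TABLE products (
    data TEXT
);

INSERT INTO products (data) VALUES ('{"id": 1, "name": "red running shoes", "price": 79.90}');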

This works like most of the NoSQL databases: no checks, no errors for bad fields. As a result you can modify the data any way you want; the problem begins when your application expects the price to be a number and instead finds a string there, or no price at all.

Validate JSON

CouchDB validates JSON before saving the document into the database. In PostgreSQL 9.2 there is a nice type for that, named JSON. The JSON type can store only valid JSON; the value is validated before it is converted into this type.

Let’s change the definition of the table to:
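(a sketch)

CREATE TABLE products (
    data JSON
);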

We can insert correct JSON into this table:
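(a sketch, reusing the document from above)

INSERT INTO products (data) VALUES
('{"id": 1, "name": "red running shoes", "description": "comfortable shoes for running", "price": 79.90, "currency": "USD", "in_stock": 10}');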

This works, but inserting invalid JSON ends with an error:
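(a sketch; note the trailing comma)

INSERT INTO products (data) VALUES ('{"id": 1, "name": "red running shoes", "price": 79.90,}');
-- ERROR:  invalid input syntax for type json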

The formatting problem can be hard to notice (I've added a comma after the last field, which JSON doesn't allow).

Validating Fields

OK, so now we have a solution which looks almost like the first native PostgreSQL solution: we have data which validates as JSON. That doesn't mean the data is sensible.

Let’s add checks for validating the data.

In PostgreSQL 9.3, which has not been released yet, there are some great new features for manipulating JSON values. There are operators defined for the JSON type which give you easy access to the fields and values.

I will use only one operator, "->>", but you can find more information in the PostgreSQL documentation.

I also need to validate the types of the fields, including the id field. In the native version Postgres checks this automatically because of the column type definitions. Here I am going to use a different syntax for the checks, because I want to name each one: it will be easier to spot a problem with a specific field than to search through the whole huge JSON.

The table with the constraints looks like this:
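Something along these lines (a sketch; the constraint names and the 'USD' currency check are assumptions):

CREATE TABLE products (
    data JSON,
    CONSTRAINT validate_id CHECK ((data->>'id') IS NOT NULL AND (data->>'id')::integer >= 1),
    CONSTRAINT validate_name CHECK ((data->>'name') IS NOT NULL AND length(data->>'name') > 0),
    CONSTRAINT validate_description CHECK ((data->>'description') IS NOT NULL AND length(data->>'description') > 0),
    CONSTRAINT validate_price CHECK ((data->>'price') IS NOT NULL AND (data->>'price')::numeric >= 0),
    CONSTRAINT validate_currency CHECK ((data->>'currency') IS NOT NULL AND data->>'currency' = 'USD'),
    CONSTRAINT validate_in_stock CHECK ((data->>'in_stock') IS NOT NULL AND (data->>'in_stock')::integer >= 0)
);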

The “->>” operator allows me to get the value of a specific field from JSON, check if it exists and validate it.

Let’s add a JSON without a description:
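(a sketch; the failing constraint name matches the sketch table above)

INSERT INTO products (data) VALUES
('{"id": 2, "name": "blue walking shoes", "price": 59.90, "currency": "USD", "in_stock": 5}');
-- ERROR:  new row for relation "products" violates check constraint "validate_description"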

There is one more validation left. The id and name fields should be unique. This can be easily done with two indexes:
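(a sketch; the index names are made up)

CREATE UNIQUE INDEX ui_products_id ON products ((data->>'id'));
CREATE UNIQUE INDEX ui_products_name ON products ((data->>'name'));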

Now when you try to add a JSON document whose id already exists in the database, you will get an error like:
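(a sketch, using the index name from above)

-- assuming a document with "id": 1 is already in the table:
INSERT INTO products (data) VALUES
('{"id": 1, "name": "another name", "description": "something", "price": 1.00, "currency": "USD", "in_stock": 1}');
-- ERROR:  duplicate key value violates unique constraint "ui_products_id"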

Id Generation

In NoSQL databases the id field is usually some UUID. This is an identifier generated by algorithms that have a very small chance of producing the same value twice, even when they run on different machines. So I'm not going to touch it here.

Searching

You can search the JSON data just as you would search ordinary columns in a table. Let's search for the most expensive product we have in stock:
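(a sketch against the native table)

SELECT * FROM products WHERE in_stock > 0 ORDER BY price DESC LIMIT 1;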

The JSON version is very similar:
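(a sketch; the fields are extracted with ->> and cast before comparing and sorting)

SELECT *
FROM products
WHERE (data->>'in_stock')::integer > 0
ORDER BY (data->>'price')::numeric DESC
LIMIT 1;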

This query can be very inefficient. It needs to read all the rows, parse JSON fields and check the in_stock and price fields, convert into proper types and then sort. The plan of such a query, after filling the table with 100k rows, looks like this:

The “Seq Scan” line means that PostgreSQL needs to read the whole table. The time of 412 ms is not that bad, but can we make it better?

Fortunately PostgreSQL has a great feature: indexes on expressions, also known as functional indexes. Such an index stores the sorted values of an expression, and if the same expression occurs in a query, the index can be used.
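For the query above, the two expressions can be indexed like this (a sketch; the index names are made up):

CREATE INDEX idx_products_in_stock ON products (((data->>'in_stock')::integer));
CREATE INDEX idx_products_price ON products (((data->>'price')::numeric));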

Notice the double parentheses; they are required because of the non-trivial expression.

The plan now looks a little bit different, after creating indexes and running analyze on the products table:

So it is 664k percent faster.

The JSON Advantage

The JSON solution has one nice feature which the native PostgreSQL version hasn't. The application can add its own fields on the fly without altering any table. The JSON field is just text, though with some validation. The new fields won't be checked by the indexes and constraints I've shown you above.

What’s more, you can add a constraint for this field later. This way you can have the best from both worlds: easy data model changing and consistent JSON structure across the database.

On the other hand, you could of course add a trigger that checks the JSON against a list of allowed fields before saving it to the database. This way you could prevent the application from adding new fields.

Summary

So, I’ve shown you how you can use PostgreSQL as a simple NoSQL database storing JSON blobs of text. The great advantage over the simple NoSQL databases storing blobs is that you can constrain the blobs, so they are always correct and you shouldn’t have any problems with parsing and getting them from the database.

You can also query the database very easily, with huge speed. The ad-hoc queries are really simple, much simpler than the map-reduce queries which are needed in many NoSQL databases.


Memory Contention

With multiple, separate, heavy-hitting DML applications hitting the same PostgreSQL instance, we start to see inherent problems when unrelated SQL workloads (separate databases, schemas, and tables) compete for the same memory resources. With a single application, one good thing is that most of the disk-to-memory activity involves the same heavily used tables, so you usually get a 95% to 99% cache hit ratio. But when multiple, separate SQL workloads are at work within a single PostgreSQL instance, each with its own set of tables, you may begin to see contention for memory-resident pages between the separate workloads. In that case you have to make sure you have enough OS memory and shared_buffers memory to handle the surge when multiple SQL workloads compete for the same paging resources. Anticipating different degrees of load activity at any one point in time between the competing workloads makes tuning shared_buffers much harder, and perhaps impossible to tune for all of them at the same time, without adding significantly more reserved memory to shared_buffers even though it may not all be needed most of the time.
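The 95% to 99% figure can be checked directly; a minimal sketch using the standard pg_stat_database statistics view:

SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2) AS cache_hit_pct
FROM pg_stat_database;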


Aggregates — Why are min(), max(), and count() so slow?

A common complaint against PostgreSQL is the speed of its aggregates. People often ask why count(*) or min/max are slower than on some other database. There are actually two problems here, one that's easy to fix and one that isn't so easy.

The ORDER BY / LIMIT Hack

Prior to version 8.1, the query planner didn't know that you could use an index to handle min or max, so it would always table-scan. Fortunately, you can work around this by doing
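Something along these lines (a sketch, with hypothetical names some_table and some_field):

-- instead of SELECT max(some_field) FROM some_table:
SELECT some_field FROM some_table ORDER BY some_field DESC LIMIT 1;
-- and instead of min():
SELECT some_field FROM some_table ORDER BY some_field ASC LIMIT 1;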

Of course that's a bit of a pain, so in 8.1 the planner was changed so that it will make that substitution on the fly. Unfortunately, it's not perfect: while writing this article I discovered that SELECT max() on a field with a lot of NULL values will take a long time, even if it's using an index on that field. If you try the ORDER BY / LIMIT hack, it is equally slow. I suspect this is because the database has to scan past all the NULL values. In fact, if you create an index on the field and exclude NULL values from that index, the ORDER BY / LIMIT hack will use that index and return very quickly. But a simple max() on that field will continue using the index with NULLs in it.

COUNT(*)

The second problem isn't easy to solve. If you've read my past articles you'll recall that PostgreSQL's MVCC (Multi-Version Concurrency Control) does away with the need for expensive read locks by keeping multiple versions of table rows that have been updated, and not immediately removing deleted rows. This is done by storing 'visibility information' in each row. But for performance reasons, this information is not stored in indexes. This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted.

On the other hand, many other databases do not have this requirement: if a row is in the index, then it's a valid row in the table. This allows those databases to do what's known as 'index covering'. Simply put, if all the information a query needs is in an index, the database can get away with reading just the index and not reading the base table at all, providing much higher performance.

If you are using count(*), the database is free to use any column to count, which means it can pick the smallest covering index to scan (note that this is why count(*) is much better than count(some_field), as long as you don't care if null values of some_field are counted). Since indexes often fit entirely in memory, this means count(*) is often very fast.

But as I mentioned, PostgreSQL must read the base table any time it reads from an index. This means that, no matter what, SELECT count(*) FROM table must read the entire table. Fortunately, there are plans in the works for 8.2 that will allow partial index covering. In a nutshell, the database will keep track of table pages that are known not to contain any deleted rows. With that information available, the engine will be able to tell very quickly if it needs to look at the base table for any given row that it reads out of an index. This means that tables that don't see a lot of updates or deletes will see index scan performance that is close to what you would get on databases that can do true index covering.

Aside from that nice performance improvement for 8.2, there are still ways you might be able to improve your performance if you're currently using count(*). The key is to consider why you are using count(*) in the first place. Do you really need an exact count? In many cases, you don't. count(*) is arguably one of the most abused database functions there is. I've seen it used in many cases where there was no need. Perhaps the worst is as a means to see if a particular row exists, IE:
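(a sketch, with hypothetical names)

SELECT count(*) FROM sometable WHERE field = 'value';
-- application code then checks whether the count is greater than zero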

There's no reason you need an exact count here. Instead, try
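(a sketch using EXISTS, which can stop at the first matching row)

SELECT EXISTS (SELECT 1 FROM sometable WHERE field = 'value');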

Or, if you're using an external language (though if you're doing this in an external language, you should also be asking yourself whether you should instead write a stored procedure):
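(a sketch; the application just checks whether any row came back)

SELECT 1 FROM sometable WHERE field = 'value' LIMIT 1;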

Note that in this example you'll either get one row back or no rows back.

Maybe you're working on something where you actually need a count of some kind.

In that case, consider using an estimate. Google is a perfect example of this.

Ever noticed how when you search for something the results page shows that you're viewing "results 1-10 of about 728,000"? This is because there's no reason to provide an exact number.

How do you obtain estimates for count(*)? If you just want to know the approximate number of rows in a table you can simply select out of pg_class:
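(a sketch for a hypothetical table named sometable; reltuples is a per-table estimate maintained by the statistics system)

SELECT reltuples::bigint AS estimated_rows FROM pg_class WHERE relname = 'sometable';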

The number returned is an estimate of the number of rows in the table at the time of the last ANALYZE.

If you want an estimate of the number of rows that will be returned from an arbitrary query you unfortunately need to parse the output of explain.
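For example (a sketch; the planner's row estimate appears as "rows=..." in the top line of the plan):

EXPLAIN SELECT * FROM sometable WHERE field = 'value';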

Finally, if you have to have an exact count and performance is an issue you can build a summary table that contains the number of rows in a table. There are two ways to do this. The simplest is to create a trigger or rule that will update the summary table every time rows are inserted or deleted: http://www.varlena.com/varlena/GeneralBits/49.php is an example of how to do that. The downside to this approach is that it forces all inserts and deletes on a table you're keeping a count on to serialize. This is because only one transaction can update the appropriate row in the rowcount table at a time.

A variant of this that removes the serialization is to keep a 'running tally' of rows inserted or deleted from the table. Because that running tally only needs to insert into the tally table, multiple transactions can update the table you're keeping a count on at the same time. The downside is that you must periodically clear the tally table out. A summary of this technique can be found at http://archives.postgresql.org/pgsql-performance/2004-01/msg00059.php.
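A minimal sketch of the running-tally idea, with hypothetical names (sometable, rowcount_tally); the articles linked above cover the details:

CREATE TABLE rowcount_tally (
    table_name text NOT NULL,
    delta integer NOT NULL
);

CREATE FUNCTION track_rowcount() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO rowcount_tally VALUES (TG_TABLE_NAME, 1);
    ELSE  -- DELETE
        INSERT INTO rowcount_tally VALUES (TG_TABLE_NAME, -1);
    END IF;
    RETURN NULL;  -- the return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sometable_rowcount
    AFTER INSERT OR DELETE ON sometable
    FOR EACH ROW EXECUTE PROCEDURE track_rowcount();

-- the approximate count is then the sum of the tally (periodically folded into a base count)
SELECT sum(delta) FROM rowcount_tally WHERE table_name = 'sometable';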

Of course, neither of these tricks helps you if you need a count of something other than an entire table, but depending on your requirements you can alter either technique to add constraints on what conditions you count on.


PostgreSQL does not have time-based triggers.

If you want to store as much of your logic as possible in the database, you could create an ordinary function (not a trigger function) and then invoke it from your favorite job scheduler (cron for me). If you want to use the database for storing data, not logic, then you could put the queries you want to run in a .sql file and use your favorite job scheduler to invoke that file, using psql -f for example. Or, if you want all of your application logic in, for example, a Perl module, you could write a function in your Perl module which connects to the database (reusing whatever connection logic your module already has) and does what you want, then use your favorite job scheduler to invoke that function.
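A minimal sketch of the first option, with made-up names (nightly_cleanup, sessions, mydb); the crontab line is shown as a comment:

CREATE FUNCTION nightly_cleanup() RETURNS void AS $$
BEGIN
    DELETE FROM sessions WHERE expires_at < now();  -- whatever periodic work you want kept in the database
END;
$$ LANGUAGE plpgsql;

-- invoked from cron, e.g.:
-- 0 2 * * *  psql -d mydb -c "SELECT nightly_cleanup();"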

There are many ways to do it; the key is to pick a method which fits in with your overall design.

It would probably be possible to put the logic into a trigger on a dummy table, then have your favorite job scheduler do a dummy operation on the dummy table to get the trigger to fire. Pointless, but possible.

What sort of thing would you be expecting this "trigger" to do?

Please tell me you're not storing people's Age and updating them all every night! :-)

There are far more scheduling options outside of the database so that might be a better place to initiate this process, however you wind up writing it.


Connecting remotely

To connect to your Postgres database, you need to change a configuration file and open a port in the firewall on Google Cloud.

Configure PostgreSQL remote access

By default, Postgres doesn't allow remote connections. To change this setting, you edit the file named pg_hba.conf.

Caution: On production systems, or any system that has an internet connection, use strong authentication methods and restrict traffic to only those users and IP addresses that you want to connect to each database.

Edit pg_hba.conf

In the SSH terminal window, edit pg_hba.conf. This tutorial uses the nano editor, but you can substitute your favorite editor. For PostgreSQL version 9.3, you can enter:
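(assuming the default package layout on a Debian-based image)

sudo nano /etc/postgresql/9.3/main/pg_hba.conf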

Navigate to ip4.me to get the IPv4 address of your local computer.

You need this IP address in an upcoming step.

Scroll down to the bottom of the file and add the following lines:
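(a sketch of the kind of entry meant here; md5 password authentication is assumed)

# Allow remote connections from your local computer:
host    all    all    [YOUR_IPV4_ADDRESS]/32    md5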

Replace [YOUR_IPV4_ADDRESS] with the address of your local computer. Note that the CIDR suffix /32 is used for a single address, which is what you're providing in this tutorial.

Save the file and exit the editor. In nano, press Control+x and then use the Return key to accept the prompts to save the file. Note that nano might not clear the console screen properly, so if you have trouble reading the text in the console after closing nano, enter clear to clear the screen.

Edit postgresql.conf

For example, enter the following command:
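(again assuming the default package layout for PostgreSQL 9.3)

sudo nano /etc/postgresql/9.3/main/postgresql.conf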

Scroll down to the line that begins with #listen_addresses = 'localhost'.

Delete the # character to uncomment the line, and change 'localhost' to '*'.

The '*' setting enables Postgres to listen on all IP addresses. This is a commonly used setting. When you set the IP address in pg_hba.conf in the previous step, you restricted access to the database to only your computer.

Save the file and exit the editor.

Restart the database service. In the SSH terminal, enter:
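(a sketch; on the Debian-based image used here the service can typically be restarted with)

sudo service postgresql restart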

Open the network port

PostgreSQL accepts remote connections on port 5432. Follow these steps to add a firewall rule that enables traffic on this port.

In the Name field, enter postgres-tutorial.

In the Network field, leave the network as default.

In the Direction of traffic field, select Ingress.

In the Action on match field, select Allow.

In the Target tags field, enter the network tag ( postgres-tutorial ) that you used for the instance.

In Source IP Ranges, enter the same IP address that you used in pg_hba.conf.

This is the IP address of your local computer. Remember to include the /32 suffix, for example: 1.2.3.4/32.

In Specified protocols and ports, check tcp, and enter 5432 for the value.

Click Create.

Firewall rules are a global resource, so you'll only need to create this rule once for all instances.

Connect using pgAdmin

Now you can connect to your PostgreSQL database from your computer. This tutorial uses pgAdmin, which is a popular client application for working with Postgres databases.

Install pgAdmin on your local computer.

(macOS only) Move pgAdmin to a location from which you can run it:

  1. Right-click the pgAdmin icon and copy it.
  2. Open the macOS Application folder, and paste pgAdmin into this folder.

Start pgAdmin by clicking its icon in the Application folder.

Add the server. In pgAdmin4, you can click the first icon on the left side of the toolbar. Alternatively, click File > Add server.

In the New Server Registration window, in the Name field, enter the following:

On the VM instances page, find the external IP address of your Compute Engine instance in the External IP column.

In pgAdmin, in the Connection tab, in the Hostname/address field, enter the external IP address of your Compute Engine instance.

Note: Enter only the address as it appears in the Cloud Console; don't add any protocol specifiers, such as http://, or other characters.

In the Port field, enter 5432.

In the Password field, enter the password that you set previously for the user named postgres.

Click Save to close the window.

You should now be connected to your PostgreSQL database that is hosted on your Compute Engine instance. You can use pgAdmin to browse and modify the database and other settings. PgAdmin also includes a PSQL console that you can use to administer the database remotely.


Usage

dsn: data source name (interpretation varies by driver; for some drivers, dsn is a file name, but it may also be a folder)

layer: layer name (varies by driver, may be a file name without extension). From rgdal 1.2.*, layer may be missing, in which case ogrListLayers examines the dsn, and fails if there are no layers, silently reads the only layer if only one layer is found, and reads the first layer if multiple layers are present, issuing a warning that layer should be given explicitly.

p4s: PROJ4 string defining the CRS; if the default NULL is used, the value is read from the OGR data set

stringsAsFactors: logical: should character vectors be converted to factors? The default NA uses the deprecated default.stringsAsFactors() in R < 4.1.0. Before R 4, strings were converted to factors by default, as with argument value TRUE. See https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html for details of the changes.

drop_unsupported_fields: default FALSE; if TRUE, skip fields other than String, Integer, and Real; Date, Time and DateTime fields are converted to String

pointDropZ: default FALSE; if TRUE, discard third coordinates for point geometries; third coordinates are always discarded for line and polygon geometries

dropNULLGeometries: default TRUE: drop both declared NULL geometries and empty geometries with no coordinates; if FALSE, return a data frame with the attribute values of the NULL and empty geometries. From 1.3-6, setting FALSE also works when there are no geometries at all, returning a data.frame including all FIDs

useC: default TRUE; if FALSE, use the original interpreted code in a loop

disambiguateFIDs: default FALSE; if TRUE, and FID values are not unique, they will be set to unique values 1:N for N features (problem observed in GML files)

addCommentsToPolygons: default TRUE, may be set FALSE for legacy behaviour; used to indicate which interior rings are holes in which exterior rings, in conformance with the OGC SFS specifications

encoding: default NULL; if set to a character string, and the driver is “ESRI Shapefile”, and use_iconv is FALSE, it is passed to the CPL option SHAPE_ENCODING immediately before reading the DBF of a shapefile. If use_iconv is TRUE and encoding is not NULL, it will be used to convert input strings from the given value to the native encoding for the system/platform.

use_iconv: default FALSE; if TRUE and encoding is not NULL, encoding will be used to convert input strings from the given value to the native encoding for the system/platform.

swapAxisOrder: default FALSE; if TRUE, treat the y coordinate as Easting and x as Northing, that is, the opposite of the assumed order; this may be needed if some OGR read drivers do not behave as expected

require_geomType: default NULL; if one of c("wkbPoint", "wkbLineString", "wkbPolygon"), then in input with multiple geometry types only the chosen type will be read

integer64: default “no.loss” (from rgdal 1.2.*). From GDAL 2, fields to be read may also take Integer64 values. As R has no such storage mode, three options are offered, analogous with type.convert for numeric conversion: “allow.loss”, which clamps to 32-bit signed integer (default < rgdal 1.2); “warn.loss”, as “allow.loss” but warning when clamping occurs; and “no.loss”, which reads the values as character strings using the formatting applied by default by GDAL (default >= rgdal 1.2). The use of 64-bit integers is usually a misunderstanding, as such data is almost always a long key ID.

GDAL1_integer64_policy: default FALSE; if TRUE, Integer64 fields are read as doubles

morphFromESRI: default NULL, morph from the ESRI WKT1 dialect

dumpSRS: dump the SRS to stdout from inside GDAL to debug conversion (developer use only)

enforce_xy: (PROJ6+/GDAL3+) either use the global setting (default NULL) or override the policy for coordinate ordering: easting/x as first axis, northing/y as second axis

default NULL; driver found using ogrListLayers from the data source, otherwise already known and passed through from a calling function

other arguments to print method


Managing and Maintaining indexes

Indexes in Postgres do not hold all row data. Even when an index is used in a query and matching rows were found, Postgres will go to disk to fetch the row data. Additionally, row visibility information (discussed in the MVCC article) is not stored in the index either, so Postgres must also go to disk to fetch that information.

Having that in mind, you can see how in some cases using an index doesn’t really make sense. An index must be selective enough to reduce the number of disk lookups for it to be worth it. For example, a primary key lookup with a big enough table makes good use of an index: instead of sequentially scanning the table matching the query conditions, Postgres is able to find the targeted rows in an index, and then fetch them from disk selectively. For very small tables, for example a cities lookup table, an index may be undesirable, even if you search by city name. In that case, Postgres may decide to ignore the index in favor of a sequential scan. Postgres will decide to perform a sequential scan on any query that will hit a significant portion of a table. If you do have an index on that column, it will be a dead index that’s never used - and indexes are not free: they come at a cost in terms of storage and maintenance.

For more on running production, staging, and other environments for your Heroku application, take a look at our Managing Multiple Environments article.

When tuning a query and understanding what indexes make the most sense, be sure to use a database as similar as possible to what exists, or will exist in production. Whether an index is used or not depends on a number of factors, including the Postgres server configuration, the data in the table, the index and the query. For instance, trying to make a query use an index on your development machine with a small subset of “test data” will be frustrating: Postgres will determine that the dataset is so small that it’s not worth the overhead of reading through the index and then fetching the data from disk. Random I/O is much slower than sequential, so the cost of a sequential scan is lower than that of the random I/O introduced by reading the index and selectively finding the data on disk. Performing index tuning should be done on production, or on a staging environment that is as close to production as possible. On the Heroku Postgres database platform it is possible to copy your production database to a different environment quite easily.

When you are ready to apply an index on your production database, keep in mind that creating an index locks the table against writes. For big tables that can mean your site is down for hours. Fortunately Postgres allows you to CREATE INDEX CONCURRENTLY, which will take much longer to build, but does not require a lock that blocks writes. Ordinary CREATE INDEX commands require a lock that blocks writes but not reads.
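(a sketch, with hypothetical table, column, and index names)

CREATE INDEX CONCURRENTLY index_users_on_email ON users (email);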

Finally, indexes will become fragmented and unoptimized after some time, especially if the rows in the table are often updated or deleted. In those cases it may be necessary to perform a REINDEX, leaving you with a balanced and optimized index. However, be cautious about reindexing big indexes, as write locks are obtained on the parent table. One strategy to achieve the same result on a live site is to build an index concurrently on the same table and columns but with a different name, and then drop the original index and rename the new one. This procedure, while much longer, won't require any long-running locks on the live tables.
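A sketch of that rebuild strategy, with the same hypothetical names as above:

CREATE INDEX CONCURRENTLY index_users_on_email_new ON users (email);
DROP INDEX index_users_on_email;
ALTER INDEX index_users_on_email_new RENAME TO index_users_on_email;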

Postgres provides a lot of flexibility when it comes to creating B-tree indexes that are optimized to your specific use cases, as well as options for managing the ever-growing database behind your applications. These tips should help you keep your database healthy, and your queries snappy.