
Can you force CartoDB to read a file as a synced table


I'm writing a script that downloads from a data source every hour. I then do some processing on it and convert it to a KML file saved in my Google Drive. As the data updates every hour, I want to be able to sync it with CartoDB. However, the sync options are all greyed out. Do I need to upgrade to have access to synced table functions, or is there a way to force CartoDB to recognise it as a synced table?


Sync capabilities are only available to users on plans from John Snow and up. If you have the feature, syncing a table can be done from the UI or from our Import API.

If you don't have the feature, you can use the SQL API instead. This means that you will need to take care of the data yourself: extract it from the KML file so that you can include it in INSERT or UPDATE statements.

The complete documentation of the SQL API is available here. If you're planning to create tables directly through the SQL API, please check this guide, as it requires some extra steps (like CartoDBfying the tables).
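For example, the hourly script could push rows straight through the SQL API with a plain HTTP call. This is only a rough sketch: the account, table name, columns, and coordinates below are placeholders, and the geometry must be built with PostGIS functions so it lands in the_geom correctly.

    curl "https://YOUR_ACCOUNT.cartodb.com/api/v2/sql" \
      --data-urlencode "q=INSERT INTO my_table (the_geom, name) VALUES (ST_SetSRID(ST_MakePoint(-3.70, 40.42), 4326), 'example point')" \
      --data-urlencode "api_key=YOUR_API_KEY"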


Can a node broadcast a valid signed raw transaction while still syncing?

I built and signed a raw transaction spending a native bech32 UTXO, but when I attempt to broadcast it from my testnet node using the sendrawtransaction command, it returns "missing inputs". I don't have the private keys in my wallet, and the testnet node is still syncing, but I was able to broadcast it fine with a testnet API service that lets you broadcast signed transactions.

Is it possible to broadcast a raw signed transaction to the network while still syncing?


1 Answer

You need to read up on how backup and recovery works. WAL doesn't let you go backwards in time, only forwards.

When restoring, you would normally restore your data files from your backup (e.g. one taken with pg_basebackup). That will get you back to the moment the backup was started. If you want to go further forward in time, you need to supply the WAL files for changes made since the backup. In your recovery.conf you can specify that the restore should stop at the time just before the table was dropped, using recovery_target_time = as you have done.
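As a rough sketch, for PostgreSQL versions that still use recovery.conf (the archive path and timestamp are placeholders), the relevant settings look something like:

    # recovery.conf in the restored data directory
    restore_command = 'cp /path/to/wal_archive/%f "%p"'
    recovery_target_time = '2017-10-27 09:55:00'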

Normally such a restore would be done in a secondary environment. You would then export the dropped table(s) or modified data and import that into your production database (or run update/insert/delete statements to change the state of the database). That way you don't lose other modifications made since the table was dropped.


2 Answers

There are really two concerns here, and they are mostly solved problems in the domain of distributed computing.

How do I detect data out of sync?

Your objects should have versioning built-in. Each time an object is saved to permanent storage (whether a relational DB, NoSQL, OODB, whatever it is) you increment the version.

If an app goes to sync its changes to the cloud and detects that the version in the cloud does not match the version it has been editing, it knows the data was edited in two locations without syncing up in between.

  1. Devices A and B both sync up to the cloud and download version 1 of an object.
  2. User edits the object on device A, then gets frustrated because WiFi is broken.
  3. User goes to device B and edits the data.
  4. Device B syncs with the cloud: both records are on version 1, so it uploads the data and increments the version to 2.
  5. User goes back to device A, which is now connected.
  6. App sees that it has version 1 of the record, but the cloud has version 2 - it would sync, except there are local changes. Oops! This is a problem and it was easily detected.
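A minimal sketch of that version check, assuming the cloud store is a relational table with an integer version column (the table and column names are made up for illustration):

    -- optimistic update: only succeeds if nobody else bumped the version
    UPDATE objects
       SET payload = :new_payload,
           version = version + 1
     WHERE id = :object_id
       AND version = :version_we_downloaded;
    -- 0 rows affected means the cloud copy has moved on: conflict detected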

How do I resolve conflicts?

Depending on the nature of the data it may be easy or difficult to resolve this conflict.

Source control software works in a similar way, where multiple users may edit files independently. The server tracks revision IDs. When a user goes to sync up, it checks revisions to see if there is an update. Typically, if an update is not in conflict (i.e. the same lines were not changed by separate users) it can be resolved by the software automatically. Otherwise, it may require user intervention (different VCS strategies work slightly differently but the conceptual workflow is similar).

This is actually not too far different from the scenario you presented. There are two key factors here:

The same user is editing on both devices, so deciding what to do should be fairly simple for the user.

The data is discrete. Much like lines in a text file, these simple fields on a table are separate yet complete data entities.

One option would be to display a screen with the record, highlighting the conflicting fields that are different in the two versions. Give the following options:

Force the local edits to the cloud, overriding the more recent version (this would now be version 3 in the example above).


1 Answer

If I had to manage this, I'd look at what, if anything, modifies the data in the tables outside of the ETL process.

If nothing else besides your ETL modifies the data, I would simply update the ETL process to insert the finished data at both locations (and likewise carry out whatever index maintenance you're doing in both places).
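As a sketch of the "two targets" approach, assuming the secondary is reachable as a linked server (all names here are placeholders):

    -- load the local copy
    INSERT INTO dbo.FactSales (sale_date, amount)
    SELECT sale_date, amount FROM dbo.FactSales_Staging;

    -- repeat the same load against the secondary
    INSERT INTO SecondaryServer.WarehouseDB.dbo.FactSales (sale_date, amount)
    SELECT sale_date, amount FROM dbo.FactSales_Staging;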

If something else updates this data, but only on one server, then transactional replication is probably the most lightweight way to get the data to the secondary server. Even if the data isn't being modified outside of the ETL, this wouldn't be a terrible alternative to modifying the ETL process to update two targets. It sounds like a relatively small percentage of the data is being inserted daily.

If the data is being modified on both servers, then you'll probably want to consider merge replication. How simple that is will largely depend on whether the table has any identity columns.


Optimizing Impala Performance

If you come from a traditional database background, you might have engraved in your mind the notion that indexes are crucial for query speed. If your experience extends to data warehousing environments, you might be comfortable with the idea of doing away with indexes, because it’s often more efficient when doing heavy duty analysis to just scan the entire table or certain partitions.

Impala embraces this data warehousing approach of avoiding indexes by not having any indexes at all. After all, data files can be added to HDFS at any time by components other than Impala. Index maintenance would be very expensive. The HDFS storage subsystem is optimized for fast reads of big chunks of data. So the types of queries that can be expensive in a traditional database system are standard operating procedure for Impala, as long as you follow the best practices for performance.

Having said that, the laws of physics still apply, and if there is a way for a query to read, evaluate, and transmit less data overall, of course the query will be proportionally faster as a result. With Impala, the biggest I/O savings come from using partitioned tables and choosing the most appropriate file format. The most complex and resource-intensive queries tend to involve join operations, and the critical factor there is to collect statistics (using the COMPUTE STATS statement) for all the tables involved in the join.

The following sections give some guidelines for optimizing performance and scalability for queries and overall memory usage. For those who prefer to learn by doing, later sections show examples and tutorials for file formats (Tutorial: The Journey of a Billion Rows), partitioned tables (Making a Partitioned Table), and join queries and table statistics (Deep Dive: Joins and the Role of Statistics).

Optimizing Query Performance

The most resource-intensive and performance-critical Impala queries tend to be joins: pulling together related data from multiple tables. For all tables involved in join queries, issue a COMPUTE STATS statement after loading initial data into a table, or adding new data that changes the table size by 30% or more.
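For example, with placeholder table names:

    COMPUTE STATS sales;
    COMPUTE STATS customers;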

When a table has a column or set of columns that’s almost always used for filtering, such as date or geographic region, consider partitioning that table by that column or columns. Partitioning allows queries to analyze the rows containing specific values of the partition key columns, and avoid reading partitions with irrelevant data.

At the end of your ETL process, you want the data to be in a file format that is efficient for data-warehouse-style queries. In practice, Parquet format is the most efficient for Impala. Other binary formats such as Avro are also more efficient than delimited text files.
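Putting the last two recommendations together, a table definition might look something like this (the table and columns are illustrative only):

    CREATE TABLE sales (
      customer_id BIGINT,
      amount      DOUBLE
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET;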

See Tutorial: The Journey of a Billion Rows for a sequence of examples that explores all these aspects of query tuning. For more background information, see the related discussions of joins and statistics (Deep Dive: Joins and the Role of Statistics), file formats (File Formats) including Parquet (Parquet Files: The Biggest Blocks of All), and partitioning (Working with Partitioned Tables).

Optimizing Memory Usage

This section provides guidelines and strategies for keeping memory use low. Efficient use of memory is important for overall performance, and also for scalability in a highly concurrent production setup.

For many kinds of straightforward queries, Impala uses a modest and predictable amount of memory, regardless of the size of the table. As intermediate results become available from different nodes in the cluster, the data is sent back to the coordinator node rather than being buffered in memory. For example, SELECT column_list FROM table or SELECT column_list FROM table WHERE conditions both read data from disk using modestly sized read buffers, regardless of the volume of data or the HDFS block size.

Certain kinds of clauses increase the memory requirement. For example, ORDER BY involves sorting intermediate results on remote nodes. (Although in Impala 1.4 and later, the maximum memory used by ORDER BY is lower than in previous releases, and very large sort operations write to a work area on disk to keep memory usage under control.) GROUP BY involves building in-memory data structures to keep track of the intermediate result for each group. UNION and DISTINCT also build in-memory data structures to prune duplicate values.

The size of the additional work memory does depend on the amount and types of data in the table. Luckily, you don’t need all this memory on any single machine, but rather spread across all the data nodes of the cluster.

Calls to aggregation functions such as MAX() , AVG() , and SUM() reduce the size of the overall data. The working memory for those functions themselves is proportional to the number of groups in the GROUP BY clause. For example, computing SUM() for an entire table involves very little memory because only a single variable is needed to hold the intermediate sum. Using SUM() in a query with GROUP BY year involves one intermediate variable corresponding to each year, presumably not many different values. A query calling an aggregate function with GROUP BY unique_column could have millions or billions of different groups, where the time and memory to compute all the different aggregate values could be substantial.
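For example, with the hypothetical sales table sketched earlier, the first query below needs only one intermediate value per year, while the second could need one per customer:

    SELECT year, SUM(amount) FROM sales GROUP BY year;
    SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id;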

The UNION operator does more work than the UNION ALL operator, because UNION collects the values from both sides of the query and then eliminates duplicates. Therefore, if you know there will be no duplicate values, or there is no harm in having duplicates, use UNION ALL instead of UNION .
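For example, assuming duplicate customer IDs between the two branches are acceptable:

    SELECT customer_id FROM sales_2016
    UNION ALL
    SELECT customer_id FROM sales_2017;  -- no duplicate elimination, no extra in-memory structures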

The LIMIT clause puts a cap on the number of results, allowing the nodes performing the distributed query to skip unnecessary processing. If you know you need a maximum of N results, include a LIMIT N clause so that Impala can return the results faster.

A GROUP BY clause involving a STRING column is much less efficient than with a numeric column. This is one of the cases where it makes sense to normalize data, replacing long or repeated string values with numeric IDs.

Although INT is the most familiar integer type, if you are dealing with values that fit into smaller ranges (such as 1–12 for month and 1–31 for day), specifying the “smallest” appropriate integer type means the hash tables, intermediate result sets, and so on will use 1/2, 1/4, or 1/8 as much memory for the data from those columns. Use the other integer types ( TINYINT , SMALLINT , and BIGINT ) when appropriate based on the range of values.
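For instance (a sketch with arbitrary column names):

    CREATE TABLE calendar_facts (
      fact_id BIGINT,
      month   TINYINT,  -- 1-12 fits easily in one byte
      day     TINYINT   -- 1-31 fits easily in one byte
    );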

You can also do away with separate time-based fields in favor of a single TIMESTAMP column. The EXTRACT() function lets you pull out the individual fields when you need them.
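A sketch, using a hypothetical event_ts column:

    SELECT extract(event_ts, 'year')  AS event_year,
           extract(event_ts, 'month') AS event_month
    FROM events;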

Although most of the Impala memory considerations revolve around queries, inserting into a Parquet table (especially a partitioned Parquet table) can also use substantial memory. Up to 1 GB of Parquet data is buffered in memory before being written to disk. With a partitioned Parquet table, there could be 1 GB of memory used for each partition being inserted into, multiplied by the number of nodes in the cluster, multiplied again by the number of cores on each node.

Use one of the following techniques to minimize memory use when writing to Parquet tables:

  • Impala can determine when an INSERT ... SELECT into a partitioned table is especially memory-intensive and redistribute the work to avoid excessive memory usage. For this optimization to be effective, you must issue a COMPUTE STATS statement for the source table where the data is being copied from, so that Impala can make a correct estimate of the volume and distribution of data being inserted.
  • If statistics are not available for the source table, or the automatic memory estimate is inaccurate, you can force lower memory usage for the INSERT statement by including the [SHUFFLE] hint immediately before the SELECT keyword in the INSERT ... SELECT statement (see the sketch after this list).
  • Running a separate INSERT statement for each partition minimizes the number of memory buffers allocated at any one time. In the INSERT statement, include a clause PARTITION(col1 = val1, col2 = val2, …) to specify constant values for all the partition key columns.
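Here is a rough sketch of the last two techniques, reusing the hypothetical partitioned sales table from earlier and a made-up raw_sales staging table (adjust names to your own schema):

    -- shuffle the data so each node writes fewer partitions at once
    INSERT INTO sales PARTITION (year, month)
      [SHUFFLE]
      SELECT customer_id, amount, year, month FROM raw_sales;

    -- or load one partition at a time with constant partition key values
    INSERT INTO sales PARTITION (year = 2013, month = 7)
      SELECT customer_id, amount FROM raw_sales
      WHERE year = 2013 AND month = 7;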

Working with Partitioned Tables

In Impala, as in large-scale data warehouse systems, the primary way for a schema designer to speed up queries is to create partitioned tables. The data is physically divided based on all the different values in one column or a set of columns, known as the partition key columns. Partitioning serves a purpose similar to indexing: instead of looking up one row at a time from widely scattered items, the rows with identical partition key values are physically grouped together. Impala uses the fast bulk I/O capabilities of HDFS to read all the data stored in particular partitions, based on references to the partition key columns in WHERE or join clauses.

With Impala, partitioning is ready to go out of the box with no setup required. It’s expected that practically every user will employ partitioning for their tables that truly qualify as Big Data.

Frequently tested columns like YEAR , COUNTRY , and so on make good partition keys. For example, if you partition on a YEAR column, all the data for a particular year can be physically placed together on disk. Queries with clauses such as WHERE YEAR = 1987 or WHERE YEAR BETWEEN 2006 AND 2009 can zero in almost instantly on the data to read, and then read that data very efficiently because all the rows are located adjacent to each other in a few large files.

Partitioning is great for reducing the overall amount of data to read, which in turn reduces the CPU cycles to test column values and the memory to hold intermediate results. All these reductions flow straight through to the bottom line: faster query performance. If you have 100 years worth of historical data, and you want to analyze only the data for 1 year, you can do that 100 times as fast with a partitioned table as with an unpartitioned one (all else being equal).

This section provides some general guidelines. For demonstrations of some of these techniques, see Making a Partitioned Table.

Finding the Ideal Granularity

Now that I have told you how partitioning makes your queries faster, let’s look at some design aspects for partitioning in Impala (or Hadoop in general). Sometimes, taking an existing partitioned table from a data warehouse and reusing the schema as-is isn’t optimal for Impala.

Remember, Hadoop’s HDFS filesystem does best with a relatively small number of big files. (By big, we mean in the range of 128 MB to 1 GB ideally, nothing smaller than 64 MB.) If you partition on columns that are so fine-grained that each partition has very little data, the bulk I/O and parallel processing of Hadoop mostly goes to waste. Thus, often you’ll find that an existing partitioning scheme needs to be reduced by one level to put sufficient data in each partition.

For example, if a table was partitioned by year, month, and day in pre-Hadoop days, you might get more efficient queries by partitioning only for year and month in Impala. Or if you have an older table partitioned by city and state, maybe a more efficient layout for Impala is only partitioned by state (or even by region). From the Hadoop point of view, it’s not much different to read a 40 MB partition than it is to read a 20 MB one, and reading only 5 MB is unlikely to see much advantage from Hadoop strengths like parallel execution. This is especially true if you frequently run reports that hit many different partitions, such as when you partition down to the day but then run reports for an entire month or a full year.

Inserting into Partitioned Tables

When you insert into a partitioned table, again Impala parallelizes that operation. If the data has to be split up across many different partitions, that means many data files being written to simultaneously, which can exceed limits on things like HDFS file descriptors. When you insert into Parquet tables, each data file being written requires a memory buffer equal to the Parquet block size, which by default is 1 GB for Impala. Thus, what seems like a relatively innocuous operation (copy 10 years of data into a table partitioned by year, month, and day) can take a long time or even fail, despite a low overall volume of information. Here again, it’s better to work with big chunks of information at once. Impala INSERT syntax lets you work with one partition at a time:
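(The book's original example is not reproduced here; the following is only a sketch, again using the made-up sales and raw_sales tables.)

    INSERT INTO sales PARTITION (year = 2000, month = 1)
      SELECT customer_id, amount FROM raw_sales
      WHERE year = 2000 AND month = 1;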

It’s easy to write a query that generates a set of INSERT statements like this by finding all the distinct values for the partition key columns. Then you can run the resulting statements in a SQL script. For example:
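A sketch of such a generator query (placeholder names again); you would save its output to a script and run that with impala-shell:

    SELECT DISTINCT
      concat('INSERT INTO sales PARTITION (year=', CAST(year AS STRING),
             ', month=', CAST(month AS STRING),
             ') SELECT customer_id, amount FROM raw_sales WHERE year=',
             CAST(year AS STRING), ' AND month=', CAST(month AS STRING), ';')
    FROM raw_sales;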

Pro Tip

When you run Impala queries to generate other SQL statements, start impala-shell with the -B option. That option suppresses the ASCII boxes around query results, making the output easier to redirect or copy and paste into a script file. See Tutorial: Verbose and Quiet impala-shell Output for examples.

Adding and Loading New Partitions

One of the convenient aspects of Impala partitioned tables is that the partitions are just HDFS directories, where you can put data files without going through any file conversion or even Impala INSERT statements. In this example, you create the partitions individually and use the LOAD DATA statement or some mechanism outside Impala to ingest the data.
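A sketch of that workflow, with placeholder names and an HDFS path that exists only for illustration:

    ALTER TABLE sales ADD PARTITION (year = 2014, month = 1);
    LOAD DATA INPATH '/staging/sales/2014/01'
      INTO TABLE sales PARTITION (year = 2014, month = 1);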

See Anti-Pattern: A Million Little Pieces for some other tricks you can use to avoid fragmentation and excessive memory use when inserting into partitioned Parquet tables.


Enlarge the partition: fdisk -u /dev/sda .

Use p to print the partition table, and take note of the number, start, end, and type of sda1.

Recreate it using command n with the same number (1), start, and type, but with a bigger end (taking care not to overlap with other partitions). Try to align things on a megabyte boundary; that is, for the end, make it a multiple of 2048 minus 1. Change the type if needed with t (for partitions holding an extX or btrfs filesystem, the default of 83 is fine).

Then w to write and q to quit.

The partition table will have been modified but the kernel will not be able to take that into account as some partitions are mounted.

However, if in-use partitions were only enlarged, you should be able to force the kernel to take the new layout with:
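For example (assuming partx from util-linux, or partprobe from parted, is available; the device name is just the one used above):

    partx -u /dev/sda    # or: partprobe /dev/sda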

If that fails, you'll need to reboot. The system should boot just fine.

Then, resize the filesystem so it spreads to the extent of the enlarged partition:
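With the device from the example above, that would be something like:

    resize2fs /dev/sda1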

For ext4, that will work just fine even on a live FS.

You can't do it safely while the partition is mounted, meaning you need to boot some other partition and do it from there.

gparted is a nice, easy GUI for this purpose. In our deleted comment exchange you mentioned it would not start because of "can't access display" -- this implies you aren't in X; since gparted is a GUI, it won't work without that.

Of course, if you don't have another partition to use, you'll need a live CD or something -- I think they usually come with gparted. Your best bet is probably the actual gparted live CD, which looks to have a reasonably recent latest stable version (and will fit on a CD, which is nice since the "live CD" is rapidly becoming the "live DVD").

I've never had gparted cause a problem but of course do back your important tish up first.

I know this is a very old issue, but many people are looking for a resolution.

In this example you have the following typical situation: at the beginning of the disk is a single partition, and at the end a swap partition is located. That isn't good, because swap can be heavily loaded and the end of a rotating disk is its slowest part. What do I suggest?

  1. Create a boot partition at the beginning. Why at the beginning? Because many tools have problems with the end of a large disk, above the 2 TB barrier. After the boot partition there should be a swap partition; this is for performance. The rest of the disk should be used for other partitions.

But what about this situation? I don't recommend expanding sda1. I suggest creating additional partitions after sda1 and sda2, mounted as /home and /usr . Those directories hold most of the user and system data, and it is possible to safely move the data from them to the new partitions.

But (a second "but"), if you still want to keep your current structure, you should first remove the swap partition. Run swapoff, and comment out the swap entry in /etc/fstab . When swapon (see its man page) tells you no swap is in use, you can remove the partition with a partitioning tool (fdisk or similar).

Once that partition is gone, you can enlarge sda1. Using fdisk, first print the partition table to note where sda1 starts. Then delete the partition with the 'd' command. Don't panic, nothing is written to the hard drive yet :). If you print the table again, you will see that no partition exists on the disk.

Next, create a new partition, but be careful. Check which sector the original partition started on and enter the same number. Then look at the end: fdisk prompts you with the last available sector. Work out how much swap you need in kilobytes, multiply it by 2 (to convert to 512-byte sectors), and subtract that number from the last sector fdisk offered. Create the partition, print the table (it is still only in fdisk's temporary memory), and check that everything looks OK.

After this, press the 'w' key; this will try to write the new partitioning to the drive. You will see a message that everything is synced, or that the sync failed. If it failed, you can run partx /dev/sda to re-read the partition table. If it still fails, reboot the system. After that you have a larger partition, but the filesystem on it is still the smaller size, so you need to grow it. ext4 is growable on the fly :), use resize2fs /dev/sda1 to do it.

You don't need any remounting, rebooting, etc. Finally, restore the swap partition: simply use fdisk again and create a new partition of type swap. After writing with 'w', the device sda2 will be back. Prepare the swap structure on it with mkswap /dev/sda2 , uncomment the swap entry in /etc/fstab , and finally run swapon -a . Check with swapon or the top command that the swap is active.

I know it's a very long explanation; I hope it will be useful to someone. Note that in my opinion the XFS filesystem is much better; unfortunately it doesn't support shrinking without temporarily copying data elsewhere, but shrinking is rarely needed. XFS takes extremely little space for its own metadata and is faster than ext4 in many ways.

Another hint: it's better to use LVM as a middle layer for partitioning; after that, any resizing is much easier. Performance is comparable, and of course you can mix approaches, using raw partitions and LVM together.


Updating calendar events linked to Outlook

I have a group calendar in SharePoint that is synced to show in Outlook. Because everyone in my office uses Outlook, we copy meetings from the SharePoint calendar to each person's Outlook calendar. This lets us see who is busy both in the SharePoint calendar and when using the Scheduling Assistant in Outlook.

My problem is that the events are not linked. If I update a meeting in Outlook, the update doesn't change the SharePoint event (and vice versa). I have to actually go into the calendar I didn't change, delete the event, and copy the new event to the calendar. What I am looking for is a way to link the events so that when one is updated, the other is updated as well.

I DO have my SharePoint calendar synced to show in Outlook. We use Outlook to schedule meetings, but we want scheduled events and meetings and training to show on the Calendar in SharePoint. So when there is a meeting, I create it in my personal Outlook calendar and then copy it to the SharePoint calendar that shows in Outlook. I do this because the Scheduling Assistant in Outlook doesn't show me as busy during meetings scheduled on the SharePoint calendar (in Outlook) and the same applies with the Group Calendar in SharePoint. But when a meeting time changes or is otherwise modified, the changes only apply to the event on either my personal calendar or the SharePoint calendar (wherever I made the change). I need to delete the existing event on the other calendar and then recreate the event.


Very slow sync on low end PC - what to upgrade?

I am a newbie to Bitcoin. I am trying to sync a full node on my low-end PC: 2 GB RAM, a dual-core 2 GHz CPU, and a 250 GB regular HDD, with a fresh Linux install. The node is running in pruned mode. The internet connection tests at 5 Mb/s download / 1 Mb/s upload. The initial 15% synced in a few days, but the estimated time remaining now fluctuates wildly between 2 and 7 weeks. It's been running for a month and is only at 35% right now. Suffice to say, it is very slow.

My main questions are: how much will adding some RAM improve the syncing speed? And will adding new RAM reset the blockchain download? Also, I have heard repeatedly that using an SSD will help. Is there a way to carry over my present 35% progress onto the SSD, or do I have to restart the process?

I have already adjusted these settings: dbcache=900, banscore=10, listen=0, server=1 (upnp=0 also shows up in the Bitcoin Core GUI options, but not in the config file). I don't really understand what these settings do; I am just going off recommendations, so hopefully they are good(?).

I am worried that putting in the RAM will "reset" the blockchain download and I will have to start syncing from 0% again. This already happened to me when I was tinkering with the dbcache setting, which I will have to adjust again in order to take advantage of any new RAM (right?).


2 Answers

Currently these are the only three options available. However, after the release that was scheduled for today (October 27th, 2017), there will be added functionality for filtering records at the synchronization level based on a selected boolean field.

Synchronized Data Sources can use a boolean field value to define the object rows that are synchronized to Marketing Cloud. Only rows that meet the filter requirement synchronize to Marketing Cloud. When an update to the boolean field no longer meets the filter requirement, we remove the row from the synchronized data extension.

As far as I can tell this hasn't been released yet, but if you wait a little while longer it should be made available.