First, I searched and found several posts about this error. Most of them point to either a RAM problem or an SSL problem. I tried to rule out SSL by adding sslmode=disable on the command line:
psql -U waypoint -d waypoint -W -c "alter table telemetria_data.historico alter clase type smallint, alter valor type real[], alter power type smallint, alter voltaje type real;" -h localhost -v sslmode=disable
But the same message appeared:
SSL SYSCALL error: EOF detected
connection to server was lost
As for a possible memory problem, I don't know how to troubleshoot it.
The data structure is the one described in this question and, as you may notice, altering the table will be a very long-running query, because the change has to be applied to all of the inherited tables.
OS:
Linux ip-10-1-0-9 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
PostgreSQL:
PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
Update 2017-06-01 13:50 GMT
Changed the command to the following (on @Daniel Vérité's recommendation):
time PGSSLMODE=disable psql -U waypoint -d waypoint -W -c "alter table telemetria_data.historico alter clase type smallint, alter valor type real[], alter power type smallint, alter voltaje type real;" -h localhost
The error message has now changed to the following:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
Update 2017-06-01 15:34 GMT
Found some log entries (in /var/log/postgresql/postgresql-9.4-main.log) like these:
2017-06-01 13:48:49 UTC [22899-357] LOG: checkpoints are occurring too frequently (19 seconds apart)
2017-06-01 13:48:49 UTC [22899-358] HINT: Consider increasing the configuration parameter "checkpoint_segments".
So I will follow the suggested hint.
I also found this group of entries, which record the crash itself and the subsequent recovery:
2017-06-01 13:49:04 UTC [4982-17] LOG: server process (PID 6569) was terminated by signal 9: Killed
2017-06-01 13:49:04 UTC [4982-18] DETAIL: Failed process was running: alter table telemetria_data.historico alter clase type smallint, alter valor type real[], alter power type smallint, alter voltaje type real;
2017-06-01 13:49:04 UTC [4982-19] LOG: terminating any other active server processes
2017-06-01 13:49:04 UTC [22902-2] WARNING: terminating connection because of crash of another server process
2017-06-01 13:49:04 UTC [22902-3] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-06-01 13:49:04 UTC [22902-4] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-06-01 13:49:04 UTC [16383-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [16384-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [16386-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [16385-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [4982-20] LOG: all server processes terminated; reinitializing
2017-06-01 13:49:05 UTC [16402-1] LOG: database system was interrupted; last known up at 2017-06-01 13:48:45 UTC
2017-06-01 13:49:05 UTC [16403-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:05 UTC [16404-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:05 UTC [16414-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:05 UTC [16415-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16452-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16453-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16462-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16463-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16472-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16473-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16482-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16483-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:09 UTC [16402-2] LOG: database system was not properly shut down; automatic recovery in progress
2017-06-01 13:49:09 UTC [16402-3] LOG: redo starts at 11EC/9960F440
2017-06-01 13:49:21 UTC [16402-4] LOG: unexpected pageaddr 11E6/52726000 in log segment 00000001000011EC000000C9, offset 7495680
2017-06-01 13:49:21 UTC [16402-5] LOG: redo done at 11EC/C9723D60
2017-06-01 13:49:32 UTC [16402-6] LOG: MultiXact member wraparound protections are now enabled
2017-06-01 13:49:32 UTC [4982-21] LOG: database system is ready to accept connections
Any suggestions about this last part of the log?
The OOM killer is enabled, and this is the output of /var/log/messages:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.672817] psql invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.675652] psql cpuset=/ mems_allowed=0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.677274] CPU: 1 PID: 16367 Comm: psql Not tainted 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680406] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] 0000000000000000 ffffffff815123b5 ffff88003dcda1d0 0000000000000000
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] ffffffff8150ff8d 0000000000000000 ffffffff810d6e3f 0000000000000000
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] ffffffff81516d2e 0000000000000200 ffffffff810689d3 ffffffff810c43e4
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] Call Trace:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff815123b5>] ? dump_stack+0x5d/0x78
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff8150ff8d>] ? dump_header+0x76/0x1e8
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff810d6e3f>] ? smp_call_function_single+0x5f/0xa0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81516d2e>] ? mutex_lock+0xe/0x2a
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff810689d3>] ? put_online_cpus+0x23/0x80
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff810c43e4>] ? rcu_oom_notify+0xc4/0xe0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81153d1c>] ? do_try_to_free_pages+0x4ac/0x520
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff811427dd>] ? oom_kill_process+0x21d/0x370
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81142f43>] ? out_of_memory+0x473/0x4b0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81148e0f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81065c86>] ? copy_process.part.25+0x116/0x1c50
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffffa00f6bba>] ? call_filldir+0x9a/0x160 [ext4]
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81058301>] ? __do_page_fault+0x1d1/0x4f0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff811ac3f9>] ? get_empty_filp+0xc9/0x1c0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81067990>] ? do_fork+0xe0/0x3d0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff811c6a1c>] ? __alloc_fd+0x7c/0x120
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff815188f9>] ? stub_clone+0x69/0x90
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff8151858d>] ? system_call_fast_compare_end+0x10/0x15
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.734210] Mem-Info:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.735151] Node 0 DMA per-cpu:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.736512] CPU 0: hi: 0, btch: 1 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.738300] CPU 1: hi: 0, btch: 1 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.740072] Node 0 DMA32 per-cpu:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.741335] CPU 0: hi: 186, btch: 31 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.743092] CPU 1: hi: 186, btch: 31 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] active_anon:370484 inactive_anon:549110 isolated_anon:24
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] active_file:240 inactive_file:1425 isolated_file:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] unevictable:0 dirty:173 writeback:0 unstable:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] free:20427 slab_reclaimable:9729 slab_unreclaimable:3425
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] mapped:567547 shmem:587500 pagetables:4209 bounce:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] free_cma:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.760872] Node 0 DMA free:15224kB min:184kB low:228kB high:276kB active_anon:228kB inactive_anon:188kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:144kB shmem:204kB slab_reclaimable:4kB slab_unreclaimable:80kB kernel_stack:80kB pagetables:4kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.777646] lowmem_reserve[]: 0 3757 3757 3757
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.779553] Node 0 DMA32 free:69956kB min:44868kB low:56084kB high:67300kB active_anon:1481708kB inactive_anon:2196252kB active_file:1772kB inactive_file:1748kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:3915776kB managed:3849676kB mlocked:0kB dirty:0kB writeback:0kB mapped:2267676kB shmem:2349796kB slab_reclaimable:38712kB slab_unreclaimable:13620kB kernel_stack:2032kB pagetables:16832kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:14 all_unreclaimable? no
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.795757] lowmem_reserve[]: 0 0 0 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.797650] Node 0 DMA: 11*4kB (EM) 8*8kB (EM) 1*16kB (E) 2*32kB (UE) 1*64kB (E) 1*128kB (E) 2*256kB (UE) 2*512kB (EM) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15228kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.804011] Node 0 DMA32: 13752*4kB (UEM) 85*8kB (EM) 54*16kB (M) 43*32kB (M) 17*64kB (M) 15*128kB (M) 10*256kB (M) 3*512kB (M) 2*1024kB (M) 0*2048kB 1*4096kB (R) = 71176kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.811528] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.815461] 588017 total pagecache pages
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.816967] 0 pages in swap cache
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.818205] Swap cache stats: add 0, delete 0, find 0/0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.820215] Free swap = 0kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.821475] Total swap = 0kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.822543] 982941 pages RAM
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.823640] 0 pages HighMem/MovableOnly
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.825110] 16525 pages reserved
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.826310] 0 pages hwpoisoned
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.827473] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.830523] [ 159] 0 159 8242 800 21 0 0 systemd-journal
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.833717] [ 162] 0 162 10200 135 22 0 -1000 systemd-udevd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.836679] [ 316] 0 316 6351 1726 14 0 0 dhclient
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.839533] [ 351] 0 351 7181 72 18 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.842112] [ 353] 0 353 4964 68 14 0 0 systemd-logind
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.844894] [ 362] 107 362 10531 96 26 0 -900 dbus-daemon
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.847504] [ 376] 106 376 8345 154 21 0 0 ntpd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.850085] [ 377] 0 377 65721 457 30 0 0 rsyslogd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.852755] [ 388] 0 388 3909 39 12 0 0 agetty
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.855305] [ 389] 0 389 3864 40 13 0 0 agetty
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.857914] [ 451] 0 451 13796 168 29 0 -1000 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.860562] [ 481] 1002 481 26362 5081 54 0 0 perfmon_loop.rb
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.863918] [ 486] 1002 486 15211 3146 31 0 0 cht_perfmon
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.867273] [ 625] 0 625 9560 144 22 0 0 master
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.869878] [ 630] 108 630 10164 234 24 0 0 qmgr
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.872356] [ 3443] 0 3443 20130 213 41 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.874903] [ 3445] 1000 3445 20164 222 39 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.877451] [ 3446] 1000 3446 3176 43 9 0 0 sftp-server
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.880261] [ 4982] 105 4982 614831 42946 132 0 -900 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.882988] [ 6634] 0 6634 1570 23 9 0 0 collectdmon
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.885784] [ 6635] 0 6635 174485 156 36 0 0 collectd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.888347] [22899] 105 22899 615399 541666 1105 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.890970] [22900] 105 22900 615395 14251 88 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.893836] [22901] 105 22901 615088 4252 53 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.896365] [22902] 105 22902 615305 1316 60 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.899083] [22903] 105 22903 21336 378 40 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.901954] [ 2946] 108 2946 10076 137 22 0 0 pickup
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.904462] [ 6376] 0 6376 20130 213 42 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.907440] [ 6378] 1000 6378 20130 209 40 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.910534] [ 6379] 1000 6379 5795 151 16 0 0 bash
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.913144] [ 6382] 0 6382 11515 107 28 0 0 sudo
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.915635] [ 6383] 0 6383 11895 96 27 0 0 su
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.918149] [ 6384] 105 6384 5796 139 16 0 0 bash
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.920900] [ 6561] 105 6561 18289 236 40 0 0 psql
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.923394] [ 6569] 105 6569 925161 853454 1718 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.926097] [16319] 0 16319 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.928611] [16320] 0 16320 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.931063] [16321] 0 16321 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.933758] [16322] 0 16322 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.936197] [16323] 1000 16323 1084 20 7 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.938762] [16324] 1000 16324 1084 20 7 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.941450] [16325] 1000 16325 1084 21 7 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.943849] [16326] 1000 16326 1084 21 6 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.946409] [16327] 1000 16327 3612 54 12 0 0 telemetria.sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.949258] [16328] 1000 16328 3613 57 12 0 0 instantaneo.sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.952633] [16329] 1000 16329 21335 4808 48 0 0 mon-put-instanc
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.955744] [16330] 1000 16330 3612 54 12 0 0 conexiones.sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.958555] [16366] 1000 16366 10744 1513 26 0 0 psql
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.961216] [16367] 1000 16367 10735 1512 26 0 0 psql
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.963722] [16368] 1000 16368 7725 1044 19 0 0 aws
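To make the process table above easier to read, here is a small sketch that converts the rss column (a page count) into MiB. It assumes 4 KiB pages, which is the usual x86_64 default but is not stated in the log; the function name `pages_to_mib` is just a helper made up for this illustration. Note that rss also counts shared memory pages the process has touched, so the figures for different postgres processes overlap and cannot simply be added up.

```shell
#!/bin/sh
# Convert a page count from the OOM-killer process table to MiB.
# Assumption: 4 KiB pages (check with `getconf PAGESIZE` on the server).
pages_to_mib() {
    echo $(( $1 * 4 / 1024 ))
}

# PID 6569 is the backend that was running the ALTER TABLE and got killed.
echo "backend PID 6569 rss:   $(pages_to_mib 853454) MiB"
echo "postgres PID 22899 rss: $(pages_to_mib 541666) MiB"
# "982941 pages RAM" from the Mem-Info section.
echo "total RAM:              $(pages_to_mib 982941) MiB"
```

Under that assumption, the killed backend alone had most of the machine's RAM resident at the moment the OOM killer fired, and there is no swap to fall back on (Total swap = 0kB).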
Update 2017-06-01 16:19 GMT
Changed the settings to:
checkpoint_segments = 100 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30s # range 30s-1h
And I filled up the hard disk :( I increased checkpoint_segments generously without first checking the available space. Fortunately, I'm testing this procedure in a non-production environment. So I may have to clone the production server again, or is there some way to free the temporary space that is now in use?
ERROR: could not extend file "base/16384/3940428": No space left on device
HINT: Check free disk space.
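For reference, a rough estimate of how much disk the WAL alone can take with a given checkpoint_segments. This is only a sketch: it uses the upper bound given in the PostgreSQL 9.4 documentation, (2 + checkpoint_completion_target) * checkpoint_segments + 1 segment files, and it assumes checkpoint_completion_target is at its default of 0.5, which is not shown anywhere in this post. The function name `max_wal_mb` is made up for the example.

```shell
#!/bin/sh
# Normal upper bound on WAL segment files in PostgreSQL 9.4:
#   (2 + checkpoint_completion_target) * checkpoint_segments + 1
# The target is passed as a percentage so plain integer arithmetic works;
# fractional files are truncated, so treat the result as approximate.
max_wal_mb() {
    segments=$1
    target_pct=$2   # 50 means checkpoint_completion_target = 0.5 (assumed)
    files=$(( (200 + target_pct) * segments / 100 + 1 ))
    echo $(( files * 16 ))   # each segment is 16 MB
}

echo "checkpoint_segments=100: up to about $(max_wal_mb 100 50) MB of WAL"
echo "checkpoint_segments=3:   up to about $(max_wal_mb 3 50) MB of WAL"
```

The error above is about extending a file under base/, not the WAL directory, so the WAL is probably only part of the picture: changing a column type normally rewrites the table, and the new copy sits next to the old one until the statement finishes.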
In response to @deszo's question, the memory overcommit settings are:
vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
Update 2017-06-01 18: 107 GMT
The server instance is an AWS c4.large (2 vCPU, 3.75 GB RAM)
Some more parameters from postgresql.conf:
shared_buffers = 2GB # min 128kB
work_mem = 32MB # min 64kB
max_connections =800 # (change requires restart)
What are shared_buffers, work_mem and max_connections set to?
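The three settings listed above can be turned into a rough worst-case memory budget. This is a deliberately pessimistic sketch: it assumes every one of the max_connections sessions runs exactly one work_mem-sized sort or hash at the same time, which real workloads rarely do, and it ignores maintenance_work_mem and per-backend overhead. The function name `worst_case_mib` is made up for the example.

```shell
#!/bin/sh
# Worst-case memory budget in MiB:
#   shared_buffers + work_mem * max_connections
# Assumption: one work_mem-sized operation per connection, all at once.
worst_case_mib() {
    shared_buffers_mib=$1
    work_mem_mib=$2
    max_connections=$3
    echo $(( shared_buffers_mib + work_mem_mib * max_connections ))
}

ram_mib=3840   # 3.75 GB on the c4.large instance
echo "worst case: $(worst_case_mib 2048 32 800) MiB against $ram_mib MiB of RAM"
```

Even before any query runs, shared_buffers = 2GB already claims more than half of the 3.75 GB on this instance, which leaves little headroom for a large backend process.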