[Eisfair] Fehlermeldung beim Einsatz von rsync - NEU

Matthias Prill m.prill at gmx.de
Mi Mär 13 17:24:54 CET 2019


Am 13.03.2019 um 04:33 schrieb Jürgen Witt:
> Hallo NG,
> 
> ich habe vor wenigen Tagen zu dem Thema im Betreff hier schon etwas
> geschrieben und war guter Hoffnung, daß das Thema erledigt wäre. Stimmt
> aber leider nicht.
> 
> Der rsync der Daten hat nur genau 1x funktioniert Und zwar nachdem ich
> den Ordner, in dem es zu dem Abbruch kam (TDAMP), auf dem
> Ziel/Backup-Server in TDAMP- umbenannt und danach den Ordner TDAMP neu
> angelegt habe. Am folgenden Abend ist es wieder mit der selben
> Fehlermeldung in die Hose gegangen. Aber auch nach dem Austausch der
> defekten Festplatte (1 pending sector) heute, ist die Sicherung per
> rsync wieder an gleicher Stelle/Datei abgebrochen.
> 
> Email vom Backup-Server eis mit dem Betreff "fcron <root at eis>
> /usr/bin/server-sichern"
> 
> rsync: connection unexpectedly closed (2266062707 bytes received so far)
> [receiver]
> rsync error: error in rsync protocol data stream (code 12) at io.c(235)
> [receiver=3.1.3]
> rsync: connection unexpectedly closed (32617488 bytes received so far)
> [generator]
> rsync error: error in rsync protocol data stream (code 12) at io.c(235)
> [generator=3.1.3]
> 
> Neue Platte /dev/sdb:
> 
> eis # smartctl -a /dev/sdb
> smartctl 7.0 2018-12-30 r4883 [i686-linux-3.16.62-eisfair-1-PAE] (local
> build)
> Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Red
> Device Model:     WDC WD20EFRX-68EUZN0
> Serial Number:    WD-WCC4M6NCH89F
> LU WWN Device Id: 5 0014ee 2bb0ff2cc
> Firmware Version: 82.00A82
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Wed Mar 13 02:22:50 2019 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x80) Offline data collection activity
>                                         was never started.
>                                         Auto Offline Data Collection:
> Enabled.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED 
> WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
> Always       -       0
>   3 Spin_Up_Time            0x0027   100   253   021    Pre-fail
> Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age
> Always       -       1
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
> Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age
> Always       -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age
> Always       -       17
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
> Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
> Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> Always       -       1
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always
>       -       0
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always
>       -       1
> 194 Temperature_Celsius     0x0022   118   116   000    Old_age   Always
>       -       29
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always
>       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always
>       -       0
> 198 Offline_Uncorrectable   0x0030   100   253   000    Old_age
> Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always
>       -       0
> 200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age
> Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%        17     -
> # 2  Short offline       Completed without error       00%        17     -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> Vorhandene Festplatte /dev/sda, die mit /dev/sdb ein Software-Raid-1 bildet
> 
> eis # smartctl -a /dev/sda
> smartctl 7.0 2018-12-30 r4883 [i686-linux-3.16.62-eisfair-1-PAE] (local
> build)
> Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Red
> Device Model:     WDC WD20EFRX-68EUZN0
> Serial Number:    WD-WCC4M4FF0JJZ
> LU WWN Device Id: 5 0014ee 2b92310de
> Firmware Version: 82.00A82
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2 (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Wed Mar 13 02:26:24 2019 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x85) Offline data collection activity
>                                         was aborted by an interrupting
> command from host.
>                                         Auto Offline Data Collection:
> Enabled.
> Self-test execution status:      (   0) The previous self-test routine
> completed
>                                         without error or no self-test
> has ever
>                                         been run.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED 
> WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail
> Always       -       0
>   3 Spin_Up_Time            0x0027   177   177   021    Pre-fail
> Always       -       4125
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age
> Always       -       10
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail
> Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age
> Always       -       0
>   9 Power_On_Hours          0x0032   081   081   000    Old_age
> Always       -       13892
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age
> Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age
> Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age
> Always       -       10
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always
>       -       3
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always
>       -       292
> 194 Temperature_Celsius     0x0022   116   108   000    Old_age   Always
>       -       31
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always
>       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always
>       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age
> Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always
>       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age
> Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%     13892     -
> # 2  Short offline       Completed without error       00%     13892     -
> # 3  Short offline       Completed without error       00%     13869     -
> # 4  Short offline       Completed without error       00%     13845     -
> # 5  Short offline       Completed without error       00%     13821     -
> # 6  Extended offline    Completed without error       00%     13803     -
> # 7  Short offline       Completed without error       00%     13797     -
> # 8  Short offline       Completed without error       00%     13773     -
> # 9  Short offline       Completed without error       00%     13749     -
> #10  Short offline       Completed without error       00%     13725     -
> #11  Short offline       Completed without error       00%     13701     -
> #12  Short offline       Completed without error       00%     13677     -
> #13  Short offline       Completed without error       00%     13653     -
> #14  Extended offline    Completed without error       00%     13635     -
> #15  Short offline       Completed without error       00%     13629     -
> #16  Short offline       Completed without error       00%     13605     -
> #17  Short offline       Completed without error       00%     13581     -
> #18  Short offline       Completed without error       00%     13557     -
> #19  Short offline       Completed without error       00%     13533     -
> #20  Short offline       Completed without error       00%     13437     -
> #21  Short offline       Completed without error       00%     13414     -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> 
> Per cron wird um 23:30 Uhr auf dem Server eis (10.20.3.100) täglich das
> Script server-sichern gestartet.
> 
> eis # cat /usr/bin/server-sichern
> #! /bin/bash
> ssh 10.20.3.101 touch /public/TDAMP/DS/end.txt
> ssh 10.20.3.101 sleep 45
> rsync -avzu --delete 10.20.3.101:/home /data >/var/log/rsync-home_log
> rsync -avzu --delete 10.20.3.101:/public /data >/var/log/rsync-public_log
> rsync -avzu --delete 10.20.3.101:/var/www/htdocs/webdav /var/www/htdocs
>>/var/log/rsync-webdav_log
> ssh 10.20.3.101 rm /public/TDAMP/DS/end.txt
> rm -f /public/TDAMP/DS/end.txt
> /usr/bin/server-sichern-report
> 
> Dieses Script sorgt für den Report, der per Email zugestellt wird.
> 
> eis # cat /usr/bin/server-sichern-report
> #!/bin/sh
> ( echo "To: root"
>   echo "Subject: Praxis Borgweg Abgleich Server + Backup-Server"
>   echo "Groesse und Datum der log-Dateien"
>   ls -l /var/log/rsync-*
>   echo "----------------------------------------"
>   echo "Sicherung von /home"
>   # cat /var/log/rsync-home_log
>   echo "-----------------------------------------"
>   echo "Zeitpunkt der Sicherung"
>   ls -la /var/log/rsync-home_log
>   echo "-----------------------------------------"
>   echo "Sicherung von /public"
>   cat /var/log/rsync-public_log
>   echo "-----------------------------------------"
>   echo "Zeitpunkt der Sicherung"
>   ls -la /var/log/rsync-public_log
>   echo "-----------------------------------------"
> 
>   ) | /usr/lib/sendmail root
> 
> Ergebnis einer unvollständigen Sicherung:
> 
> Anfang des zugesendeten Report:
> 
> Groesse und Datum der log-Dateien
> -rw-r--r-- 1 root root   3314 Mar 12 23:30 /var/log/rsync-home_log
> -rw-r--r-- 1 root root 186499 Mar 13 00:33 /var/log/rsync-public_log
> -rw-r--r-- 1 root root    174 Mar 13 00:33 /var/log/rsync-webdav_log
> 
> ...
> public/TDAMP/DS/daten/prax1/PSI.DBF
> public/TDAMP/DS/daten/prax1/QMETA.DBF
> public/TDAMP/DS/daten/prax1/QMETA.FPT
> -----------------------------------------
> Zeitpunkt der Sicherung
> -rw-r--r-- 1 root root 186499 Mar 13 00:33 /var/log/rsync-public_log
> 
> Hier endet der Report.
> 
> Der zugesendete Report endet wieder mit der selben Datei QMETA.FPT in
> einem Unterordner von public. Diese Datei ist 2700 MB groß. Der Report
> des dritten rsync-Prozesses (webdav) wird nicht gesendet, obwohl er ja
> vorhanden ist und erst nach dem Abruch des zweiten rsync-Prozesses
> (public) ausgeführt wird.
> 
> Auszug aus /var/log/messages auf dem Produktiv-Server debian (10.20.30.101)
> 
> Mar 12 23:30:01 debian sshd[30074]: Accepted publickey for root from
> 10.20.3.100 port 48907 ssh2: RSA
> SHA256:We20RCnHw5WERqLEu5nbmonNrhYB/AgkeIT+yGXBFLY
> Mar 12 23:30:01 debian sshd[30074]: pam_unix(sshd:session): session
> opened for user root by (uid=0)
> Mar 12 23:30:01 debian sshd[30074]: Received disconnect from 10.20.3.100
> port 48907:11: disconnected by user
> Mar 12 23:30:01 debian sshd[30074]: Disconnected from user root
> 10.20.3.100 port 48907
> Mar 12 23:30:01 debian sshd[30074]: pam_unix(sshd:session): session
> closed for user root
> Mar 12 23:30:01 debian sshd[30233]: Accepted publickey for root from
> 10.20.3.100 port 48908 ssh2: RSA
> SHA256:We20RCnHw5WERqLEu5nbmonNrhYB/AgkeIT+yGXBFLY
> Mar 12 23:30:01 debian sshd[30233]: pam_unix(sshd:session): session
> opened for user root by (uid=0)
> Mar 12 23:30:46 debian sshd[30233]: Received disconnect from 10.20.3.100
> port 48908:11: disconnected by user
> Mar 12 23:30:46 debian sshd[30233]: Disconnected from user root
> 10.20.3.100 port 48908
> Mar 12 23:30:46 debian sshd[30233]: pam_unix(sshd:session): session
> closed for user root
> Mar 12 23:30:47 debian sshd[31049]: Accepted publickey for root from
> 10.20.3.100 port 48910 ssh2: RSA
> SHA256:We20RCnHw5WERqLEu5nbmonNrhYB/AgkeIT+yGXBFLY
> Mar 12 23:30:47 debian sshd[31049]: pam_unix(sshd:session): session
> opened for user root by (uid=0)
> Mar 12 23:30:53 debian sshd[31049]: Received disconnect from 10.20.3.100
> port 48910:11: disconnected by user
> Mar 12 23:30:53 debian sshd[31049]: Disconnected from user root
> 10.20.3.100 port 48910
> Mar 12 23:30:53 debian sshd[31049]: pam_unix(sshd:session): session
> closed for user root
> Mar 12 23:30:53 debian sshd[31109]: Accepted publickey for root from
> 10.20.3.100 port 48911 ssh2: RSA
> SHA256:We20RCnHw5WERqLEu5nbmonNrhYB/AgkeIT+yGXBFLY
> Mar 12 23:30:53 debian sshd[31109]: pam_unix(sshd:session): session
> opened for user root by (uid=0)
> 
> 
> Mar 13 00:33:12 debian sshd[31109]: Received disconnect from 10.20.3.100
> port 48911:11: disconnected by user
> Mar 13 00:33:12 debian sshd[31109]: Disconnected from user root
> 10.20.3.100 port 48911
> Mar 13 00:33:12 debian sshd[31109]: pam_unix(sshd:session): session
> closed for user root
> Mar 13 00:33:14 debian sshd[25188]: Accepted publickey for root from
> 10.20.3.100 port 48976 ssh2: RSA
> SHA256:We20RCnHw5WERqLEu5nbmonNrhYB/AgkeIT+yGXBFLY
> Mar 13 00:33:14 debian sshd[25188]: pam_unix(sshd:session): session
> opened for user root by (uid=0)
> Mar 13 00:33:14 debian sshd[25188]: Received disconnect from 10.20.3.100
> port 48976:11: disconnected by user
> Mar 13 00:33:14 debian sshd[25188]: Disconnected from user root
> 10.20.3.100 port 48976
> Mar 13 00:33:14 debian sshd[25188]: pam_unix(sshd:session): session
> closed for user root
> Mar 13 00:33:15 debian sshd[25242]: Accepted publickey for root from
> 10.20.3.100 port 48977 ssh2: RSA
> SHA256:We20RCnHw5WERqLEu5nbmonNrhYB/AgkeIT+yGXBFLY
> Mar 13 00:33:15 debian sshd[25242]: pam_unix(sshd:session): session
> opened for user root by (uid=0)
> Mar 13 00:33:15 debian sshd[25242]: Received disconnect from 10.20.3.100
> port 48977:11: disconnected by user
> Mar 13 00:33:15 debian sshd[25242]: Disconnected from user root
> 10.20.3.100 port 48977
> Mar 13 00:33:15 debian sshd[25242]: pam_unix(sshd:session): session
> closed for user root
> Mar 13 00:39:32 debian su: pam_unix(su:session): session closed for user
> root
> 
> Auszug aus /var/log/messages auf dem Backup-Server eis (10.20.3.100)
> 
> Mar 12 23:30:00 eis fcron[9517]: Job '/usr/bin/server-sichern' started
> for user root (pid 9526)
> Mar 13 00:05:00 eis fcron[11911]: Job '/var/install/bin/smartmon-plot'
> started for user root (pid 119
> Mar 13 00:05:10 eis fcron[11911]: Job '/var/install/bin/smartmon-plot'
> completed
> Mar 13 00:25:02 eis su: (to root) root on none
> Mar 13 00:25:02 eis su: pam_unix(su:session): session opened for user
> root by (uid=0)
> Mar 13 00:33:16 eis fcron[9517]: Job '/usr/bin/server-sichern' completed
> (mailing output)
> 
> Während ich das hier schreibe, habe ich auf dem Backup-Server wieder den
> Ordner TDAMP in TDAMP- umbenannt, danach TDAMP wieder neu angelegt und
> mein Script server-sichern neu gestartet. Nun sind wieder knapp 86000 MB
> (so groß ist der Ordner TDAMP auf dem Produktiv-Server) per rsync auf
> den Backup-Server zu holen. Inzwischen sind 57000 MB übertragen und die
> Datei QMETA.FPT (die letzte Zeile im rsync-Report, wenn es bisher
> fehlgeschlagen ist) ist dabei. Sieht aus, als wenn es jetzt
> funktionieren könnte.
> 
> Verstehe aber nicht, weshalb die Sicherung per rsync auch mit heilen
> Festplatten im Backup-Server nicht funktioniert hat.
> 
> Hat jemand eine Idee dazu?
> 
Hallo,
ich erinnere mich dunkel, daß ich vor längerer Zeit (einige Jahre!) ein
ähnliches Problem mit Abbrüchen hatte. Der Fehler bzw. der Workarround
war damals, daß ich die "Richtung" geändert hatte, wie gesichert wurde!
Situation:

Wenn A das rsync initierte um Daten per rsync nach B zu "schieben"
klappte es nicht!

Wenn B das rsync initierte um die Daten von A zu "ziehen" klappte es!

Die Ursache lag wohl daran, daß A zu schnell die Daten nach B geschoben
hat und B damit überlastet wurde. Wenn der langsame/schwächere B sich
das gemütlich gezogen hat, dann hatte ich keine Probleme.

Gruß
Matthias


Mehr Informationen über die Mailingliste Eisfair