54 posts tagged with "Maintenance"

· 3 min read

Publication date: June 24, 2024

Contents

  1. Notice of software update
  2. Notice of operating system change to Ubuntu Linux 22.04 for Medium node
  3. Notice of replacement to the next generation of the NIG supercomputer
  4. !!NOTICE!! Data of users who have not renewed the account by the end of June will be deleted from 1 July.

1. Notice of software update

Software updates are scheduled for the following dates and times.

Period

Monday 1 July 2024, 10:00 - Wednesday 3 July 2024, 21:00

Details of the software upgrade

Table: Software upgrade plan for development/analysis

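| # | Software | Version before update | Version after update |
|---|---|---|---|
| (1) | Apptainer | 1.2.4 | 1.3.2 |
| (2) | SingularityCE | 4.0.0 | 4.1.3 |
| (3) | NVIDIA HPC SDK (former PGI compiler) | 23.7 | 24.3 |
| (4) | Intel OneAPI | 2023.2.0 | 2024.1.0 |
| (5) | AMD C compiler (AOCC) | - | 4.2 |
| (6) | NVIDIA CUDA | 12.1 | 12.3 |
| (7) | Parabricks | 4.1 | 4.3.1 |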

A server restart is scheduled for the GPU nodes. Users who occupy GPU nodes under the billing service will be contacted separately with the date and time.

2. Notice of operating system change to Ubuntu Linux 22.04 for Medium node

The Medium nodes currently run CentOS 7.9. Because CentOS 7 reaches end of life (EOL) at the end of June, we will gradually move them to Ubuntu Linux 22.04.

We will create a new medium-ubuntu.q queue in GridEngine and move each Medium node into this queue once it has been changed to Ubuntu Linux 22.04. Jobs submitted to the medium-ubuntu.q queue will run on Medium nodes with Ubuntu Linux 22.04, and jobs submitted to the Medium.q queue will run on Medium nodes with CentOS 7.9. One or two CentOS 7.9 Medium nodes will remain.
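
As a rough illustration of the queue choice described above, the following Python sketch builds a `qsub` command line for either queue. The helper name `build_qsub_command`, the script name `job.sh`, the OS labels used as dictionary keys, and the use of Grid Engine's standard `-q` option to name the queue are all assumptions for illustration. This notice does not specify the exact submission options, so check the system's user guide before relying on it.

```python
import shlex

# Queue names are copied from this notice; the OS labels used as keys
# are hypothetical and exist only for this sketch.
QUEUE_BY_OS = {
    "ubuntu22.04": "medium-ubuntu.q",
    "centos7.9": "Medium.q",
}


def build_qsub_command(script, target_os="ubuntu22.04"):
    """Return the argument list for submitting `script` to the queue for `target_os`."""
    try:
        queue = QUEUE_BY_OS[target_os]
    except KeyError:
        raise ValueError(f"unknown target OS: {target_os!r}") from None
    # `-q` is the standard Grid Engine option for requesting a queue
    # (assumed here; the site may prefer a different option).
    return ["qsub", "-q", queue, script]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try anywhere.
    print(shlex.join(build_qsub_command("job.sh")))
```

Printing the command rather than executing it keeps the sketch harmless on machines without GridEngine; the returned list could be passed to `subprocess.run` on a login node.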

3. Notice of replacement by the next NIG supercomputer system

At the end of FY2024, the current NIG supercomputer will be replaced by the next NIG supercomputer system. The current system is contracted until 28 February 2025, and the next system is scheduled to start operation on 1 March 2025. Further details will be announced around October 2024, after the opening of the tender.

4. NOTICE: Data of users who have not applied for a renewal of the account by the end of June will be deleted from 1 July onwards.

4.1. If you would like to continue using your account

If you did not apply for the end-of-fiscal-year renewal of your account between 4 Jan and 31 Mar, your account has been suspended since 1 Apr and you are unable to log in to the account system.

If you wish to continue using your account, send us an email to lift the suspension of your account. After the suspension is lifted, click on the link below to complete the end-of-year renewal and performance report (progress report).

4.2. If you would like to discontinue using your account

Click on the link below to stop using your account.

Notes

  • 4.1. Once an account has been locked, its home directory will be deleted, in sequence, from 1 July onwards.
  • 4.2. If you apply to discontinue your account, it will be deactivated in sequence.

Thank you for your understanding and cooperation.

If you have any questions, contact us.

· One min read

Publication date: June 21, 2024

The restoration work was completed at 9:00am on Tuesday, 25 June 2024.

Also, the gateways (gw.ddbj.nig.ac.jp, gw2.ddbj.nig.ac.jp) have been restored and are available for logging in.

As of 12:00 on Friday, June 21, the high-speed storage system Lustre7 in the General Analysis Division is still experiencing issues. These issues appear to be related to the power outage on May 28, and similar issues occurred on June 5, June 17, and June 20.

To prevent similar issues from recurring, we will be conducting emergency maintenance to repair Lustre7 by halting the General Analysis Division for several days.

Schedule:

June 21, 2024 (Friday) - June 27, 2024 (Thursday)

During this period, all computing nodes in the General Analysis Division will be halted.

Scope of Impact:

  • All computing nodes in the General Analysis Division will be halted. All currently running jobs will be stopped. Please restart your jobs from the beginning after the maintenance is completed.
  • There will be no impact on the Personal Genome Analysis Division.
  • There will be no impact on DDBJ services.

We apologize for any inconvenience caused and appreciate your understanding.

· 2 min read

Publication date: June 19, 2024

The restoration work was completed at 9:00am on Tuesday, 25 June 2024.

Also, the gateways (gw.ddbj.nig.ac.jp, gw2.ddbj.nig.ac.jp) have been restored and are available for logging in.

  • At 18:21:14 on Mon 17 Jun, a fault occurred on the Lustre7 high-speed storage system in the General Analysis division, resulting in a partial write failure. Specifically, one of the 88 RAID groups (Lustre OSTs), OST0029, was not writable.
  • The recovery work started at around 14:00 on Tue 18 Jun and finished at around 20:00.
  • However, at 20:00, it was confirmed that some compute nodes could not access OST0029 at all (neither read nor write). The affected nodes are as follows:
    • at017,at025,at026,at028,at029,at030,at031,at032,at033,at034,at035,at036,at037,at043,at044,at045,at046,at047,at048,at050,at051,at052,at053,at054,at055,at057,at058,at059,at060,at061,at062,at063,at064,at073,at074,at083,at084,at085,at087,at090,at095,at096,at097,at098,at099,at100,at101,at102,at103,at126,at127,at128,at129,at130,at131,at132,at133,at134,at135,at136 (60 of 136 Thin compute nodes Type 1a, AMD EPYC 7501 CPU)
    • at139,at140,at141,at142,at143,at144,at145,at146,at147,at148,at149,at150,at151,at152,at153,at154,at155,at156,at157,at159,at160,at161,at162,at163,at164 (25 of 28 Thin compute nodes Type 1b, AMD ROMA CPU)
    • it001,it002,it004,it006,it007,it008,it009,it010,it013,it014,it015,it017,it024,it025,it026,it027,it028,it029,it031,it032,it034,it035,it036,it040,it041,it048,it049,it050,it051,it052 (30 of 52 Thin compute nodes Type 2a, Intel CPU)
    • igt001,igt003,igt005,igt006,igt007,igt008,igt011,igt012,igt013,igt014 (10 of 16 Thin compute nodes Type 2b, Intel CPU)
    • gw.ddbj.nig.ac.jp, gw2.ddbj.nig.ac.jp (all of 2 gateways for the general analysis division)
    • m01,m02,m03,m04 (4 of 10 medium nodes)
    • dtn4 (one of 4 data transfer nodes used for DDBJ services)

Scope of impact

  • From around 18:20 on 17 June to 14:00 on 18 June, no compute node could write to OST0029. From around 14:00 to 20:00 on 18 June, no compute node could read from or write to OST0029. The compute nodes listed above still had no read/write access as of 19 June. Please check your calculation results for any anomalies. (Jobs that do not use OST0029 are not affected, but whether a job uses OST0029 is determined randomly.)
  • The personal genome analysis division will not be affected.
  • Communication breakdowns will occur for DDBJ services that use data transfer nodes dtn2 and dtn4.
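
Because whether a job touched OST0029 depends on where Lustre placed each file's stripes, one way to narrow down which output files to re-check is to look at the OST indices reported by `lfs getstripe`. The sketch below parses that command's text output. It assumes the classic tabular layout with an `obdidx` column, and it assumes the number in the OST name is hexadecimal (so OST0029 would be index 41), which is the usual Lustre naming convention. The function names and the sample text are hypothetical, and the output format can differ between Lustre versions.

```python
def ost_index(ost_name):
    """Convert a label such as 'OST0029' to its decimal index (hex digits assumed)."""
    if not ost_name.startswith("OST"):
        raise ValueError(f"not an OST label: {ost_name!r}")
    return int(ost_name[3:], 16)


def stripe_ost_indices(getstripe_output):
    """Collect the decimal OST indices listed under the 'obdidx' column."""
    indices = []
    in_table = False
    for line in getstripe_output.splitlines():
        fields = line.split()
        if not fields:
            in_table = False  # a blank line ends the stripe table
            continue
        if fields[0] == "obdidx":
            in_table = True  # header row; data rows follow
            continue
        if in_table and fields[0].isdigit():
            indices.append(int(fields[0]))
    return indices


def uses_ost(getstripe_output, ost_name):
    """Return True if any stripe of the file lives on the named OST."""
    return ost_index(ost_name) in stripe_ost_indices(getstripe_output)


# Hypothetical sample in the assumed `lfs getstripe` layout.
SAMPLE = """\
result.dat
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_stripe_offset: 41
    obdidx       objid       objid       group
        41     1234567    0x12d687           0
        12     7654321    0x74cbb1           0
"""

if __name__ == "__main__":
    print(uses_ost(SAMPLE, "OST0029"))
```

In practice the text would come from running `lfs getstripe` on each output file from a node that mounts Lustre7; files with no stripe on the affected OST would then be lower priority for re-checking.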

· One min read

Publication date: June 18, 2024

The restoration work was completed at 9:00am on Tuesday, 25 June 2024.

Also, the gateways (gw.ddbj.nig.ac.jp, gw2.ddbj.nig.ac.jp) have been restored and are available for logging in.

At 18:21:14 on Mon 17 Jun, OST0029, one of the 88 RAID groups (Lustre OSTs) of the Lustre7 high-speed storage system in the General Analysis division, failed and is currently partially un-writable.

The system is currently being restored.

This is the same kind of failure as on 6 June, and it may have been caused by the power outage on 28 May. We apologise for any inconvenience caused.

Scope of impact

  • In the General Analysis division, an area (OST0029 on Lustre7) has been partially un-writable since 17:46 on Monday 17 June. While this area is being restored, read/write access to OST0029 will be unavailable.
  • The personal genome analysis division will not be affected.
  • The impact on DDBJ services is under investigation.

· One min read

Publication date: April 19, 2024

Due to maintenance work on SINET6 equipment, the network will be temporarily out of service during the following times.

Date and time

0:00 - 1:30, Monday, June 10, 2024

  • Up to two network interruptions of 15 minutes will occur during the above time period.

Scope of impact

  • During the disconnection, you will not be able to log in to the supercomputer or transfer data.
  • There will be no suspension of active jobs.

Thank you for your understanding and cooperation.

· 2 min read

Publication date: June 6, 2024

The restoration work was completed at around 12:00 (24-hour notation) on Thursday, 6 June 2024.

  • At 1:34:21 am on Wed 5 Jun, a fault occurred on the Lustre7 high-speed storage system in the General Analysis division, resulting in a partial write failure. Specifically, one of the 88 RAID groups (Lustre OSTs), OST0031, was not writable.
  • The recovery work started at around 15:30 and finished at around 20:00.
  • However, at 20:00, it was confirmed that some compute nodes could not access OST0031 at all (neither read nor write). The affected nodes are as follows:
    • at017,at025,at054,at049,at051,at052,at047,at045,at050,at053,at085,at099,at102,at101,at132 (15 of 136 Thin compute nodes Type 1a, AMD EPYC 7501 CPU)
    • at140,at141,at149,at155 (4 of 28 Thin compute nodes Type 1b, AMD ROMA CPU)
    • it001,it040,igt003,it050,it049 (5 of 52 Thin compute nodes Type 2a, Intel CPU)
    • gw1,gw4 (2 gateways for the general analysis division)
    • m01 (one of 10 medium nodes)
    • dtn2,dtn4 (data transfer nodes used for DDBJ services)

Scope of impact

  • From around 1:30 to 20:00 on 5 June, no compute node could write to OST0031, and between 15:30 and 20:00 no compute node could read from it either. The compute nodes listed above still had no read/write access as of 6 June. Please check your calculation results for any anomalies. (Jobs that do not use OST0031 are not affected, but whether a job uses OST0031 is determined randomly.)
  • As you cannot log in to the SSL-VPN, you may not be able to log in to the personal genome analysis division either.
  • Communication breakdowns will occur for DDBJ services that use data transfer nodes dtn2 and dtn4.

· One min read

Publication date: June 5, 2024

The restoration work was completed at around 12:00 (24-hour notation) on Thursday, 6 June 2024.

The Lustre7 high-speed storage system in the General Analysis division has experienced a failure and is currently partially un-writable.

The system is currently being restored. It is expected to take approximately two hours to recover.

Scope of impact

  • In the General Analysis division, an area (OST0031 on Lustre7) has been partially un-writable since 1:34am on Wednesday 5 June. While this area is being restored, read/write access to OST0031 will be unavailable for about 2 hours (from 15:30 to 17:30).
  • The personal genome analysis division will not be affected.
  • DDBJ services and other services are not affected.

· One min read

Publication date: June 28, 2024

Summary

On Tuesday, May 28, 2024, at around 21:30 and 23:15, brief power outages occurred in the Yata area of Mishima City, Shizuoka Prefecture, affecting the network and other facilities.

https://teideninfo.tepco.co.jp/day/teiden/index-j.html

Service has now been restored.

Scope of impact

  • External network, etc.
    • The connection to the external network was interrupted during the following time period.
      • May 28, 2024 for approximately 5 minutes at around 21:30 (recovered)
      • May 28, 2024 for approximately 10 seconds at around 23:15 (recovered)
  • General analysis division
    • Not affected
  • Personal genome analysis division
    • Not affected
  • DDBJ Service
    • Not affected

· One min read

Publication date: March 8, 2024

Due to maintenance work on SINET6 device, the network will be temporarily out of service during the following times.

Date and time

00:00 - 01:00, Saturday, March 9, 2024

  • A network interruption of up to 5 minutes will occur during the above time period.

Scope of impact

  • During the disconnection, you will not be able to log in to the supercomputer or transfer data.
  • There will be no suspension of active jobs.

Thank you for your understanding and cooperation.

· One min read

Publication date: February 27, 2024

On Tuesday 27 February 2024, an issue occurred in which the email containing the token code for the SSL-VPN connection was not sent.

Date and time of occurrence: 10:00 - 15:45, Tuesday, 27 February 2024

The issue has now been resolved and the emails are being sent again.

We apologise for any inconvenience caused to users of the Personal Genome Analysis division.