  • That’s genuinely great to hear, and I’m glad it worked out.

    You did the hard part here: you kept testing methodically, provided solid data, and were willing to slow down and verify assumptions instead of guessing. That’s why this ended in a clean recovery instead of a dead drive.

    For what it’s worth, I’ve hit more than a few of these bumps myself. I started out self-taught on an IBM XT back in 1987, when I was about six years old, and the learning process has never really stopped. Situations like this are just part of how you build real understanding over time.

    This is also a good example of how enterprise hardware behaves very differently from consumer gear. Nothing here was “obvious” as a beginner, and the outcome reinforces an important lesson: unusable does not mean broken. You handled it the right way.

    I’m especially glad if this thread is kept around. These kinds of issues come up regularly, and having a complete, factual troubleshooting trail will help the next person who runs into the same thing.

    Enjoy the RAIDZ2 setup, and good luck with the additional vdev. Paying this forward is exactly how these communities stay useful.

    Happy holidays, and all the best in the new year. 🥳


  • That’s good news — what you’re seeing now is the expected state.

    A quick clarification first:

    Power cycle means exactly what you did: shut the machine down completely and turn it back on. There is no command involved. You did the right thing.

    Regarding the current status:

    The drive showing up in Disks but marked as unknown is normal

    At this point the disk has:

    • No partition table

    • No filesystem

    “Unknown” here does not indicate a problem, only that nothing has been created on it yet

    About sg_readcap:

    sg_readcap -l is correct

    There is no direct “comparison” mode; running it separately on sda and sdb is exactly what was intended

    The important thing is that both drives now report sane, consistent values (logical block size, capacity, no protection enabled)
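    If you would rather compare the two outputs programmatically than by eye, here is a minimal sketch. The regex patterns assume the usual `sg_readcap -l` output layout (prot_en, "Logical block length", "Number of logical blocks"); adjust them if your version prints differently.

```python
import re

# Fields worth comparing between the two drives; the patterns assume
# the typical `sg_readcap -l` output layout (adjust if yours differs).
PATTERNS = {
    "prot_en": r"prot_en=(\d+)",
    "block_length": r"Logical block length=(\d+)",
    "num_blocks": r"Number of logical blocks=(\d+)",
}

def parse_readcap(text):
    """Pull the interesting fields out of captured sg_readcap output."""
    return {name: (m.group(1) if (m := re.search(pat, text)) else None)
            for name, pat in PATTERNS.items()}

def compare(out_a, out_b):
    """Return only the fields that differ between two captured outputs."""
    a, b = parse_readcap(out_a), parse_readcap(out_b)
    return {k: (a[k], b[k]) for k in PATTERNS if a[k] != b[k]}
```

    Feed it the captured text from each drive; an empty result means the values you care about match.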

    Next steps:

    Yes, the next step is normal disk setup, just like with any new drive:

    1. Create a partition table (GPT is typical)

    2. Create one or more partitions

    3. Create a filesystem (or add it back into ZFS if that’s your goal)

    At this stage the drive has transitioned from “unusable” to functionally recovered. From here on, you’re no longer fixing a problem — you’re just provisioning storage.
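    If you do go the manual route on Linux, a rough sketch of those three steps (assuming the drive is /dev/sdb and you want a single ext4 partition; this is destructive, so double-check the device name first):

```shell
# WARNING: destroys everything on /dev/sdb -- verify the device name first!
sudo parted -s /dev/sdb mklabel gpt                        # 1. new GPT partition table
sudo parted -s -a optimal /dev/sdb mkpart primary 0% 100%  # 2. one partition spanning the disk
sudo mkfs.ext4 /dev/sdb1                                   # 3. filesystem (ext4 as an example)
```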

    If you plan to put it back into TrueNAS/ZFS, it’s usually best to let TrueNAS handle partitioning and formatting itself rather than doing it manually on Linux.

    Nice work sticking with the process and verifying things step by step.


  • Thanks for the update, that’s helpful.

    Confirming that the controller is a Broadcom / LSI SAS2308 and that it’s the same HBA that was used in the original TrueNAS system removes one major variable. It means the drive is now being tested under the same controller path it was previously attached to.

    The device mapping you described is clear:

    sda = known-good identical drive

    sdb = the problematic drive

    Running:

    sudo sg_format --format --size=512 --fmtpinfo=0 --pfu=0 /dev/sdb

    as you did is the correct next step to normalize the drive’s format and protection settings.

    A few general notes while this is in progress:

    • Some drives report completion before all internal state has fully settled; the format can continue in the background, so performance may be reduced until it finishes
    • A power cycle after completion is recommended before testing the drive again

    At this point it makes sense to pause any further investigation until the current sg_format has fully completed and the system has been power-cycled.

    Once that’s done, the next step will be a direct comparison between sdb and the known-good sda using:

    sudo sg_readcap -ll /dev/sdX

    run on each drive, checking:

    • Reported logical and physical sector sizes

    • Protection / PI status

    As a general note going forward: it’s safer to reference disks by persistent identifiers rather than /dev/sdX. On Linux that means /dev/disk/by-id/ or UUID (stable, though less human-readable); on FreeBSD, glabel. Device names can change across boots or with hardware reordering, as you’ve now experienced first-hand.
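    To see which persistent names point at your drives right now, the symlink targets show the matching sdX device:

```shell
# List persistent identifiers and the kernel names they currently map to
ls -l /dev/disk/by-id/ | grep -v part    # whole-disk entries only
# Or, for a single device, show all of its stable aliases:
udevadm info -q symlink -n /dev/sdb
```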

    Post the results when you’re ready and the sg_format has completed, and we can continue from there.



  • One more hopefully happy update:

    Based on everything you’ve shown so far, the most probable cause is that the drive was formatted with T10 DIF / Protection Information enabled (PROTECT=1), and you are now accessing it through a controller path that does not support DIF.

    This is a very common failure mode with enterprise SAS drives and sg_format.

    (edit: oh, how I am in a love/hate relationship with my brain on delayed thoughts…)

    In your paste from sg_format you can see this flag:

    sudo sg_format -vv /dev/sda
    open /dev/sda with flags=0x802
        inquiry cdb: [12 00 00 00 24 00]
    SEAGATE   ST4000NM0023   XMGG
      peripheral_type: disk [0x0]
      PROTECT=1

    (end of edit)

    What this means in practice:

    • PROTECT=1 = the drive was formatted with DIF Type 1
    • Logical blocks are no longer plain 512/4096 bytes (e.g. 520/528 instead)
    • The HBA + driver must explicitly support T10 PI
    • If the controller does not support DIF, the drive may:
      • Be detected
      • But fail all I/O
      • Appear “dead” even though it is healthy

    This is not bricking. It is a configuration mismatch.
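    You can also re-check this state on any system that at least detects the drive, without touching its contents:

```shell
# Non-destructive checks -- neither command modifies the drive
sudo sg_inq /dev/sdX | grep -i protect       # Protect=1 means PI was enabled at format time
sudo sg_readcap -l /dev/sdX | grep prot_en   # prot_en=1 confirms protection is currently active
```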

    How to fix it (most reliable path)

    You need to connect the drive to a DIF-capable SAS HBA (LSI/Broadcom, same type as originally used if possible).
    Best option is to do this on the original hardware, even via a USB live Linux environment.

    Once the drive is on a T10-capable controller, reformat it with protection disabled.

    Example (this will ERASE the drive and might take a LONG time to complete):

    sudo sg_format --format --size=512 --fmtpinfo=0 --pfu=0 /dev/sdX

    Key flags:

    • --fmtpinfo=0 → disables DIF / PROTECT
    • --size=512 (or 4096 if you prefer standard 4K)
    • --pfu=0 (sets the protection field usage to zero; the command your GPT gave you omitted this flag, which is needed to actually disable protection)
    • Use the correct /dev/sdX

    After this completes and the system is power-cycled, the drive should behave like a normal disk again on non-DIF controllers.

    Important notes

    • sg_format alone almost never permanently damages SAS drives
    • This exact scenario happens frequently when drives are moved between controllers
    • Until tested on a DIF-capable HBA, there is no evidence of permanent failure

    If you cannot access a T10-capable controller, the drive may remain unusable on that system, but still be perfectly recoverable elsewhere.

    A case of a user with a different problem who also needed to disable DIF, and got his drive fixed after a reformat with these parameters (found via Google):

    https://www.truenas.com/community/threads/drives-formatted-with-type-1-protection-eventually-lead-to-data-loss.86007/


  • Thanks for the additional details, that helps, but there are still some critical gaps that prevent a proper diagnosis.

    Two important points first:

    The dmesg output needs to be complete, from boot until the moment the affected drive is first detected.
    What you posted is cut short and misses the most important part: the SCSI/SAS negotiation, protection information handling, block size reporting, and any sense errors when the kernel first sees the disk.

    Please reboot, then run as root or use sudo:

    dmesg -T > dmesg-full.txt

    1. Do not filter or truncate it. Upload the full file.

    2. All diagnostic commands must be run with sudo/root, otherwise capabilities, mode pages, and protection features may not be visible or may be incomplete.

    Specifically, please re-run and provide full output (verbatim) of the following, all with sudo or as root, on the problem drive and (if possible) on a working identical drive for comparison:

    sudo lspci -nnkvv

    sudo lsblk -o NAME,MODEL,SIZE,PHY-SeC,LOG-SeC,ROTA

    sudo fdisk -l /dev/sdX

    sudo sg_inq -vv /dev/sdX

    sudo sg_readcap -ll /dev/sdX

    sudo sg_modes -a /dev/sdX

    sudo sg_vpd -a /dev/sdX

    Replace /dev/sdX with the correct device name as it appears at that moment.
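    To make collecting all of that less tedious, a loop like this would capture each diagnostic into its own file for upload. The device name is an assumption; set DEV to whatever the problem drive is called at that moment.

```shell
#!/bin/sh
# Capture each diagnostic into its own file for upload.
# DEV is an assumption -- set it to the problem drive's current name.
DEV=/dev/sdb
for cmd in "sg_inq -vv" "sg_readcap -ll" "sg_modes -a" "sg_vpd -a" "fdisk -l"; do
    out="$(echo "$cmd" | tr ' /' '__').txt"   # e.g. sg_inq_-vv.txt
    sudo $cmd "$DEV" > "$out" 2>&1
done
sudo lspci -nnkvv > lspci.txt 2>&1
sudo lsblk -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC,ROTA > lsblk.txt 2>&1
```

    Repeat it with DEV pointed at the known-good drive to get the comparison set in one go.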

    Why this matters:

    • The Intel SATA controller you listed is not the LSI HBA. We need to see exactly which controller the drive is currently attached to and what features the kernel believes it supports.

    • That Seagate model is a 520/528-capable SAS drive with DIF/T10 PI support. If it was formatted with protection enabled and is now attached to a controller/driver path that does not expect DIF, Linux will report I/O errors even though the drive itself is fine.

    • sg_format -vv output alone does not tell us the current logical block size, protection type, or mode page state.

    Important clarification:

    • Formatting the drive under TrueNAS (with a proper SAS HBA) and then attaching it to a different system/controller is a very common way to trigger exactly this situation.

    • This is still consistent with a recoverable configuration mismatch, not a permanently damaged disk.

    Once we have:

    • Full boot-time dmesg

    • Root-level SCSI inquiry, mode pages, and read capacity

    • Confirmation of which controller is actually in use

    …it becomes possible to say concretely whether the drive needs:

    • Reformatting to 512/4096 with protection disabled

    • A controller that supports DIF

    • Or if there is actual media or firmware failure (less likely)

    At this point, the drive is “unusable”, not proven “bricked”. The missing data is the deciding factor.

    One more important thing to verify, given the change of machines:

    Please confirm whether the controller in the original TrueNAS system is the same type of LSI/Broadcom SAS HBA as the one in the current troubleshooting system.

    This matters because:

    DIF/T10 PI is handled by the HBA and driver, not just the drive.

    A drive formatted with protection information on one controller may appear broken when moved to a different controller that does not support (or is not configured for) DIF.

    Many onboard SATA/RAID controllers and some HBAs will enumerate a DIF-formatted drive but fail all I/O.

    If the original TrueNAS machine used:

    • A proper SAS HBA with DIF support

    then the best recovery path may be to put the drive back into that original system and either:

    • Reformat it there with protection disabled, or

    • Access it normally if the controller and OS were already DIF-aware

    If the original controller was different:

    • Please provide lspci -nnkvv output from that system as well (using sudo or run as root)

    • And confirm the exact HBA model and firmware used in the TrueNAS SAS controller

    At the moment, the controller change introduces an unknown that can fully explain the symptoms by itself. Verifying controller parity between systems is necessary before assuming the drive itself is at fault.

    (edit:)

    One last thing, how long did you let sg_format run for?

    On a large drive it can take hours to advance even one percent, so probably a full day or more in total given the capacity of yours.

    I was just wondering if it might have been cut short for some reason and just needs to be restarted on the original hardware to complete the process and bring the drive back online.


  • Right now there isn’t enough information to conclude that the drive is “bricked”.

    sg_format on a SAS drive with DIF enabled can absolutely make the disk temporarily unusable to the OS if the format parameters no longer match what the HBA/driver expects, but that is very different from a dead drive.

    To make any determination, more data is required. At minimum (boot with a live Linux USB drive if you are unable to get to this information):

    Please provide verbatim output from:

    • dmesg -T (from boot and when the drive is detected)
    • lsblk -o NAME,MODEL,SIZE,PHY-SEC,LOG-SEC
    • fdisk -l /dev/sdX
    • sg_inq /dev/sdX
    • sg_readcap -l /dev/sdX
    • sg_modes -a /dev/sdX

    Also specify:

    • Exact drive model
    • HBA model and firmware
    • Kernel version / distro
    • Whether the controller supports DIF/DIX (T10 PI)
    • Whether other identical drives still work in the same slot/cable

    Common possibilities (none can be confirmed without logs):

    • Drive formatted with DIF enabled but HBA/OS not configured for it
    • Logical/physical block size mismatch (e.g. 520/528 vs 512/4096)
    • Format still in progress or left the drive in a non-ready state
    • Mode pages changed that Linux does not like by default

    Things that are usually recoverable on SAS drives:

    • Re-formatting with correct sector size and DIF disabled
    • Clearing protection information
    • Power-cycling the drive after format completion
    • Formatting from a controller that fully supports the drive’s feature set

    Actual permanent bricking from sg_format alone is rare unless firmware flashing or vendor-specific commands were involved.

    Until logs are posted, all anyone can honestly say is:

    The drive is not currently usable, but there is no evidence yet that it is permanently damaged.

    If you can share this information it might be possible to get the drive back online, though I make no promises.

    (edit typos)





  • If you want a second attempt, this might help.

    To get USB devices working inside a container, you need to map the device into the container, which can be tricky—especially if you’re running rootless containers.

    If you’re on Linux and want to avoid complicated setups with user namespaces, groups, or messing with udev rules, the easiest way to start is by manually recreating the device node inside a folder you control (like where your config is stored) using mknod.

    For example, if your USB device is /dev/ttyUSB0:

    1. Run ls -l /dev/ttyUSB0 You should see output like: crw-rw---- 1 root dialout 188, 0 Jan 1 1970 /dev/ttyUSB0

    2. Note the major (188) and minor (0) numbers.

    3. Change directory to the folder where you want to create the “clone” device node, then run: sudo mknod -m 666 ttyUSB0 c 188 0 (use the major/minor numbers from your own device; they differ per device). This creates a device node readable and writable by anyone on the system, so consider changing the mode from 666 to 660 and/or chown-ing the file to your user and group afterwards. As I said, this is HACKY and not a secure solution.

    You will now have a device file you can pass into your container with the Docker/Podman option: --device /path/to/your/folder/ttyUSB0:/dev/ttyUSB0

    I realize this is a pretty hacky and insecure workaround—feel free to downvote or ignore if you want something cleaner. But it’s a quick way to get your USB device accessible inside the container to get started. Later on, you can look into proper handling with udev or other methods if security is important.
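    Putting it together, a hypothetical run command would look roughly like this; the image name and host paths are placeholders, not anything from your setup:

```shell
# Rootless Podman example -- image name and paths are placeholders
podman run -d \
  --device /home/you/myapp/ttyUSB0:/dev/ttyUSB0 \
  -v /home/you/myapp/config:/config \
  docker.io/library/someimage:latest
```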

    If you use Windows, you are on your own unfortunately, I do not have experience with podman/docker in Windows environments.





  • I know this doesn’t directly solve your issue, and it might not help much now, but I wanted to share my experience just in case it’s useful.

    When I had a similar problem after switching phones, what ended up helping was that I had 2FA enabled beforehand. In that case, selecting the option to recover my account suddenly let me receive a verification code via SMS, an option that didn’t appear on the usual login screen and was greyed out until I chose recovery.

    It probably won’t work if 2FA is disabled, but it may still be worth checking whether any recovery option that shows up helps. There might be a choice there that resolves your problem as well.

    In any case, good luck—I hope you’re able to get it sorted soon!



  • sorry, since you asked a question I just felt the need to clarify 🙂

    The ISP products you mentioned really don’t seem consumer-friendly. I understand that ISPs might benefit from setting byte limits, since they incur costs for both inbound and outbound traffic to transit providers. However, from a consumer perspective, it’s a poor deal—especially since most people don’t have the tools to manage their usage effectively and can burn through their quota far too quickly, just like you pointed out.

    It all comes down to costs and earnings in the end for all products unfortunately.


    I do not work there; I only referenced the terms and conditions from their website, so you would need to ask them directly. That said, a 1Gb connection with 30TB of seeding will eat through that pretty fast either way, and also cause a mayhem of incoming connections, so it can hardly be considered private use (based on their definition).

    Again, I have no reference to the company, so all questions should be forwarded to them not me. I simply gave a possible reasoning of the ban from their terms.

    edit: added info about their definition of private use