SC-F001/TODO.md
2026-03-30 11:39:04 -05:00

SC-F001 Firmware — TODO

    • [clauded] sdkconfig audit
    • [clauded] Enable CONFIG_ESP_TASK_WDT_PANIC=y — added to sdkconfig.defaults and sdkconfig
    • [clauded] Verify CONFIG_FREERTOS_CHECK_STACKOVERFLOW=2 — confirmed canary method active
    • [clauded] Verify CONFIG_ESP_SYSTEM_PANIC_PRINT_REBOOT — confirmed active
    • [clauded] Confirm brownout detector level — ~2.43V is correct (ESP32 rail protection; battery low-V handled by FSM's LOW_PROTECTION_V)
    • [clauded] Research sdkconfig management best practices — documented in CLAUDE.md "sdkconfig Management" section
    • [clauded] Fix managed_components: removed unused littlefs and tca95x5 deps, pinned mdns to ~1.9.1, bumped IDF min to >=5.0; documented in CLAUDE.md
    • [clauded] OTA rollback via consecutive-reset counter
    • [clauded] Add RTC_DATA_ATTR uint8_t ota_reset_counter — incremented on panic/WDT resets, cleared on power-on/ext reset
    • [clauded] On counter ≥ 5, call esp_ota_mark_app_invalid_rollback_and_reboot()
    • [clauded] After POST passes and FSM starts, call esp_ota_mark_app_valid_cancel_rollback() and clear counter
    • [clauded] Health check = POST passes + all critical inits + FSM task started + non-critical inits attempted
    • [clauded] Critical init failures (ADC, storage, log, I2C, FSM, UART) → init_critical() retries 3×, then esp_restart()
    • [clauded] Non-critical init failures (RF, BT, webserver) → log error, continue booting
    • [clauded] WiFi/BT/RF retry once on init failure at boot (200ms delay for RF/BT, 500ms for WiFi), then log and continue
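The consecutive-reset counter described above can be sketched on the host as follows. This is only an illustration: `reset_kind_t`, `ota_counter_on_boot()` and `ota_counter_on_healthy()` are hypothetical names, and leaving the counter untouched for other reset reasons is an assumption. In the firmware the counter would live in `RTC_DATA_ATTR` memory and the reset reason would presumably come from `esp_reset_reason()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the reset reasons the firmware inspects. */
typedef enum {
    RESET_POWER_ON,
    RESET_EXTERNAL,
    RESET_PANIC,
    RESET_TASK_WDT,
    RESET_OTHER
} reset_kind_t;

#define OTA_RESET_LIMIT 5   /* threshold named in the TODO item */

/* Update the counter for this boot; returns true when a rollback is due. */
static bool ota_counter_on_boot(uint8_t *counter, reset_kind_t why)
{
    switch (why) {
    case RESET_PANIC:
    case RESET_TASK_WDT:
        if (*counter < UINT8_MAX) {
            (*counter)++;       /* saturate so the count never wraps to 0 */
        }
        break;
    case RESET_POWER_ON:
    case RESET_EXTERNAL:
        *counter = 0;           /* clean boot: forget earlier crashes */
        break;
    default:
        break;                  /* assumption: other reasons leave it unchanged */
    }
    return *counter >= OTA_RESET_LIMIT;
}

/* Called once the health check passes (POST ok, FSM task started). */
static void ota_counter_on_healthy(uint8_t *counter)
{
    *counter = 0;
}
```

When `ota_counter_on_boot()` returns true, the firmware would call `esp_ota_mark_app_invalid_rollback_and_reboot()`; after a healthy boot it would call `esp_ota_mark_app_valid_cancel_rollback()` alongside `ota_counter_on_healthy()`.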
    • [clauded] Power-on self-test (POST) — init_critical() wrapper + dedicated POST checks after init
    • [clauded] ADC: adc_post() reads all 4 channels twice with 5ms delay, warns if frozen
    • [clauded] I2C: i2c_post() verifies TCA9555 responds (read port 0)
    • [clauded] Flash: storage_post() write-read-verify on last sector of storage partition
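A minimal sketch of the retry behaviour attributed to `init_critical()`, written so it compiles on a host. The real wrapper is not shown in this file, so `init_with_retry()`, the `init_fn_t` signature, the "three attempts in total" reading of "retries 3×", and the demo stubs are all assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

#define INIT_ATTEMPTS 3            /* assumption: three attempts in total */

typedef int (*init_fn_t)(void);    /* assumption: 0 means success, like ESP_OK */

/* Try an init function up to INIT_ATTEMPTS times; true on success. */
static bool init_with_retry(const char *name, init_fn_t fn)
{
    for (int attempt = 1; attempt <= INIT_ATTEMPTS; attempt++) {
        int err = fn();
        if (err == 0) {
            return true;
        }
        fprintf(stderr, "%s init failed (attempt %d/%d, err %d)\n",
                name, attempt, INIT_ATTEMPTS, err);
    }
    return false;   /* the firmware would call esp_restart() at this point */
}

/* Demo stubs standing in for real driver init functions. */
static int flaky_calls = 0;
static int flaky_init(void)
{
    flaky_calls++;
    return flaky_calls < 3 ? -1 : 0;   /* succeeds on the third call */
}

static int dead_calls = 0;
static int dead_init(void)
{
    dead_calls++;
    return -1;                          /* never succeeds */
}
```

Returning a flag instead of rebooting inside the helper keeps the retry logic testable off-target; the caller decides whether a failure is critical.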
    • [clauded] Parameter validation
    • [clauded] Add per-param bounds to PARAM_LIST macro (min, max) — extended PARAM_DEF 6-arg macro
    • [clauded] NaN/Inf → reset to default; out-of-range → clamp to min/max — validate_param() in storage.c
    • [clauded] Enforce validation in storage_init() (after flash load) and commit_params() (before flash write)
    • [clauded] Audit set_param_value_t calls outside comms.c — deleted dead code: rf_433_set_keycode(), FSM_CMD_CALIBRATE_*_FINISH handlers + FSM cases + fsm_set_cal_val() (web JS does cal math client-side, commits via standard param POST)
    • [clauded] Audit abandoned parameters — JACK_IS_DOWN marked deprecated (may duplicate JACK_I_DOWN); BOOT_TIME is informational-only
    • [clauded] Factory reset: erases params + log + post_test partitions, requires 10s button hold on cold boot, LEDs flash during hold → solid when triggered
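The validation rule listed above (NaN/Inf resets to the default, out-of-range values clamp to min/max) could look roughly like this for float parameters. `param_bounds_t` and `sanitize_param()` are hypothetical names; the actual `validate_param()` in storage.c and the layout produced by `PARAM_DEF` are not shown here.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-parameter bounds record. */
typedef struct {
    float min;
    float max;
    float def;
} param_bounds_t;

/* Return a safe value; *changed (optional) reports whether it was altered. */
static float sanitize_param(float value, const param_bounds_t *b, bool *changed)
{
    float out = value;

    if (!isfinite(value)) {
        out = b->def;          /* NaN or +/-Inf: fall back to the default */
    } else if (value < b->min) {
        out = b->min;          /* below range: clamp up */
    } else if (value > b->max) {
        out = b->max;          /* above range: clamp down */
    }

    if (changed != NULL) {
        *changed = !isfinite(value) || out != value;
    }
    return out;
}
```

The non-finite check comes first because every ordered comparison with NaN is false, so a clamp alone would let NaN through unchanged.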
    • [clauded] Ensure RTC_DATA_ATTR variables survive panics/WDT resets
    • [clauded] Verified sync_unix_us, sync_rtc_us, rtc_set — no init path zeroes them; rtc_restore_time() recovers via RTC HW counter
    • [clauded] Verified remaining_distance, fsm_error: fsm_init() does not touch them; only cleared by explicit user action
    • [clauded] Verified log_head_offset, log_tail_offset: log_init() always recovers from flash scan; RTC_DATA_ATTR is historical/harmless
    • [clauded] Measure flash log write duration — test_log_write_timing() in log_test.c, runs 200 iterations of 39-byte writes, reports min/max/avg/sector-crossing times, compares to 5s WDT
    • [clauded] WiFi STA mode with event-group signaling
    • [clauded] STA-first with softAP fallback was already implemented in start_wifi()
    • [clauded] Added EventGroupHandle_t comms_event_group in comms_events.h with WIFI_READY_BIT / BT_READY_BIT
    • [clauded] Replaced blind 500ms vTaskDelay on alarm wake with xEventGroupWaitBits(COMMS_ALL_BITS, 5s timeout)
    • [clauded] soft_idle_exit() → webserver_restart_wifi() / bt_hid_resume() set bits; webserver_stop() / bt_hid_stop() clear bits
    • [clauded] Bits set even on permanent init failure so alarm-wake never blocks forever
    • [clauded] Verify sensors_init() placement and ISR safety
    • [clauded] Moved sensors_init() to main.c as init_critical("SENSORS", sensors_init) — runs before FSM
    • [clauded] Removed dead commented-out sensors_init() / sensors_stop() from sensors.c
    • [clauded] Audited ISR: sensor_isr_handler is IRAM_ATTR, uses only esp_timer_get_time() (IRAM-safe), gpio_get_level(), xQueueSendFromISR() — no logging/malloc/flash
    • [clauded] sensors_init() failure is now critical (→ reboot via init_critical)
    • [clauded] External 32kHz crystal not needed (deep sleep disabled, soft idle instead) — removed crystal config from sdkconfig.defaults; rtc_xtal_init() already a no-op; crystal remains on PCB but unused
    • [clauded] Removed rtc_wakeup_cause() — was unused (informational only, never called)
    • [clauded] Confirmed rtc_check_shutdown_timer() uses unsigned TickType_t subtraction — wraps correctly; removed esp_timer overflow TODO comment from main.c
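The wraparound claim in the item above can be checked with a small host-side sketch. `tick_t` stands in for a 32-bit `TickType_t` (an assumption about the FreeRTOS configuration) and `ticks_elapsed()` is a hypothetical helper, not the firmware's function.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tick_t;   /* stand-in for a 32-bit TickType_t */

/* True once at least `timeout` ticks have passed since `start`. */
static bool ticks_elapsed(tick_t now, tick_t start, tick_t timeout)
{
    /* Unsigned subtraction is modulo 2^32, so the difference stays
     * correct even when `now` has wrapped past zero. */
    return (tick_t)(now - start) >= timeout;
}
```

For example, with `start = 0xFFFFFFF0` and `now = 0x00000010` the difference is 32 ticks. The result is only meaningful while the true interval is shorter than 2^32 ticks.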
    • [test] Logtool GUI output (matplotlib)
    • [test] Verify naming convention adherence across codebase
    • [test] Verify WiFi SSID rename triggers comms reboot
    • [clauded] Documentation restructure
    • [clauded] Move project/hardware documentation from CLAUDE.md → README.md; keep CLAUDE.md for AI-specific instructions and conventions only
    • [clauded] Document all FreeRTOS tasks and priorities in README.md
    • [clauded] Add terse comments to FSM state transitions in control_fsm.c (focus on "why", not "what")
    • [clauded] Fix compile warnings — unused vars (uart_comms.c, rf_433.c), const-correctness (log_write signatures), fallthrough annotation (control_fsm.c)
    • [clauded] NVS is required: WiFi blob stores RF cal data (CONFIG_ESP_WIFI_NVS_ENABLED), Bluedroid stores bonding/GATT cache unconditionally, bt_hid.c stores last-connected BDA. Cannot remove nvs_flash_init().
    • [clauded] NVS vs custom params: NVS serves WiFi/BT internals + BDA storage; custom flash partition serves app params with CRC32 protection. Different purposes, no consolidation needed.
    • [clauded] BUG FIX: FSM_CMD_START fallthrough was overwriting this_move_dist = MIN(...) with unconditional DRIVE_DIST — replaced fallthrough with goto to shared start logic so leash limit is preserved
    • [clauded] General bug scan (FSM, power, sensors, storage, comms, RTC, peripherals)
    • Ran 4 parallel deep-dive reviews across entire codebase. Findings below.
    • False positives eliminated: override fallthrough (breaks present), soft idle during motor ops (FSM resets timer), JACK_DOWN_TIME uninitialized first move (jack_finish_us always set before use)

Suspected Bugs (from item 24 scan)

    • BUG [CRITICAL]: get_is_safe() hardcoded return true — safety sensor completely bypassed
    • sensors.c:182 — return true; with //return is_safe; commented out below
    • All FSM safety checks (STATE_JACK_UP_START, JACK_UP, DRIVE_START_DELAY, DRIVE, DRIVE_END_DELAY, calibration states) are no-ops
    • Safety break will NOT trigger STATE_UNDO_JACK_START — machine runs through hazard conditions
    • Debounce logic in sensors_check() still runs but output is discarded
    • BUG [CRITICAL]: E-fuse INOM params allow min=0.0 → division by zero
    • power_mgmt.c:380 — float I_norm = fabsf(channel->current / I_nominal);
    • storage.h EFUSE_INOM_1/2/3 bounds: min=0.0, max=200.0
    • If param=0 → I_norm=Inf → instant trip on any current (motor won't run)
    • If param=NaN (flash corruption) → I_norm=NaN → all comparisons false → e-fuse NEVER trips (motor can burn)
    • Fix: raise min bound to 0.1 or add explicit NaN/zero guard before division
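One possible shape for the guard suggested in the fix above, as a host-compilable sketch. `efuse_norm_current()` and `INOM_MIN_A` are hypothetical; only the division itself mirrors the line quoted from power_mgmt.c. How the caller reacts to a rejected value (for example by forcing a trip) is left open here.

```c
#include <math.h>
#include <stdbool.h>

#define INOM_MIN_A 0.1f   /* lower bound proposed in the fix above */

/* Compute |current| / i_nominal into *i_norm.
 * Returns false when either input is non-finite or i_nominal is below the
 * minimum, so the caller can fail safe instead of dividing. */
static bool efuse_norm_current(float current, float i_nominal, float *i_norm)
{
    if (!isfinite(current) || !isfinite(i_nominal) || i_nominal < INOM_MIN_A) {
        return false;
    }
    float magnitude = (current < 0.0f) ? -current : current;
    *i_norm = magnitude / i_nominal;
    return true;
}
```

Rejecting NaN explicitly matters because a NaN ratio makes every later threshold comparison false, which is the "never trips" case described above.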
    • BUG [HIGH]: No timeout on STATE_UNDO_JACK_START
    • control_fsm.c:486-493 — waits for !efuse_get(BRIDGE_JACK) with no max wait
    • If jack efuse never cools (hardware fault, thermal runaway), FSM stuck indefinitely
    • User CAN send FSM_CMD_STOP to escape, but no automatic recovery
    • Fix: add timeout (e.g. 30-60s) before forcing transition to IDLE with error
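The timeout proposed above might be structured like this. It is a sketch only: `wait_result_t`, `undo_jack_wait()` and the 30 s constant (the low end of the suggested range) are assumptions, and timestamps are taken to be microseconds as returned by `esp_timer_get_time()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit: 30 s, the low end of the range suggested above. */
#define UNDO_JACK_WAIT_TIMEOUT_US (30LL * 1000 * 1000)

typedef enum {
    WAIT_CONTINUE,    /* e-fuse still tripped, keep waiting */
    WAIT_PROCEED,     /* e-fuse cleared, continue the undo sequence */
    WAIT_TIMED_OUT    /* give up: go to IDLE and raise an error */
} wait_result_t;

/* Decide what the waiting state should do on this FSM tick. */
static wait_result_t undo_jack_wait(bool efuse_tripped,
                                    int64_t entered_us, int64_t now_us)
{
    if (!efuse_tripped) {
        return WAIT_PROCEED;
    }
    if (now_us - entered_us >= UNDO_JACK_WAIT_TIMEOUT_US) {
        return WAIT_TIMED_OUT;
    }
    return WAIT_CONTINUE;
}
```

Keeping the decision in a pure function means the timeout path can be unit-tested without hardware by passing synthetic timestamps.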
    • BUG [HIGH]: No e-fuse checks in calibration movement states
    • control_fsm.c:495-512 — STATE_CALIBRATE_JACK_MOVE and STATE_CALIBRATE_DRIVE_MOVE
    • Only check get_is_safe() and timer_done(), NOT efuse_get()
    • Relay outputs (lines 625-640) drive motors regardless of efuse status
    • Jack cal runs up to 3s, drive cal up to 6s without overcurrent protection
    • Fix: add efuse_get() check and abort calibration on trip
    • BUG [HIGH]: BLE HID scan task missing watchdog registration
    • bt_hid.c — bt_hid_scan_task() never calls esp_task_wdt_add(NULL)
    • Task blocks on xSemaphoreTake(s_scan_sem, portMAX_DELAY) — if GAP callback never signals, hangs forever
    • Unlike rf_433 task (which registers WDT), BT task has no WDT coverage
    • Fix: add esp_task_wdt_add(NULL) and periodic esp_task_wdt_reset() (or use timeout on semaphore)
    • BUG [HIGH]: ISR sensor queue full → events silently dropped
    • sensors.c:57 — queue size 16, xQueueSendFromISR() return value not checked
    • If sensors_check() consumer falls behind (4 sensors firing edges), events lost
    • Encoder counts become inaccurate → drive distance wrong
    • Fix: check return value, optionally increment a dropped-event counter for diagnostics
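A host-side sketch of the dropped-event counter suggested above. The ring buffer below merely imitates a fixed-length queue; in the firmware the check would be on the return value of `xQueueSendFromISR()`. `event_queue_t`, `queue_try_send()` and `on_sensor_edge()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SENSOR_QUEUE_LEN 16   /* queue size quoted from sensors.c */

typedef struct {
    uint8_t pin;
    int64_t t_us;
} sensor_event_t;

/* Minimal ring buffer imitating a fixed-length queue. */
typedef struct {
    sensor_event_t buf[SENSOR_QUEUE_LEN];
    unsigned head;      /* index of the oldest event */
    unsigned count;     /* number of queued events */
    uint32_t dropped;   /* diagnostics: events lost because the queue was full */
} event_queue_t;

/* Append an event; returns false (and stores nothing) when full. */
static bool queue_try_send(event_queue_t *q, sensor_event_t ev)
{
    if (q->count == SENSOR_QUEUE_LEN) {
        return false;
    }
    q->buf[(q->head + q->count) % SENSOR_QUEUE_LEN] = ev;
    q->count++;
    return true;
}

/* What the ISR would do on each edge: enqueue, or count the loss. */
static void on_sensor_edge(event_queue_t *q, uint8_t pin, int64_t t_us)
{
    sensor_event_t ev = { pin, t_us };
    if (!queue_try_send(q, ev)) {
        q->dropped++;
    }
}
```

On target, a counter written in the ISR and read from a task would normally be declared `volatile` and exposed through a diagnostics or log path, so lost encoder edges become visible instead of silently skewing the drive distance.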
    • BUG [HIGH]: Params not validated on set, only on commit — FSM reads unvalidated values
    • storage.c:268-273 — set_param_value_t() writes directly to parameter_table[] with no bounds check
    • validate_param() only called in commit_params() (before flash write)
    • Between a param set (e.g. an HTTP POST) and the commit, the FSM can read out-of-range values (e.g. DRIVE_DIST=999999)
    • Fix: call validate_param() inside set_param_value_t(), or at least in comms.c after setting
    • BUG [MEDIUM]: Solar FSM timer uninitialized
    • solar.c:17 — RTC_DATA_ATTR int64_t timer; has no initializer
    • RTC memory may contain garbage on first cold boot before solar_reset_fsm() sets it to -1
    • solar_run_fsm() is called (main.c:253) before solar_reset_fsm() has run on first boot path
    • Fix: initialize to -1 in declaration: RTC_DATA_ATTR int64_t timer = -1;
    • BUG [MEDIUM]: E-fuse param bounds too loose
    • EFUSE_HEAT_THRESH min=0.0 — allows instant trip on any current draw (storage.h)
    • EFUSE_INRUSH_US max=10000000 (10s) — allows 10s of unlimited current with no e-fuse protection
    • Fix: tighten bounds (e.g. HEAT_THRESH min=1.0, INRUSH_US max=2000000)
    • BUG [MEDIUM]: No mutex on parameter_table[] — concurrent access from HTTP/UART/FSM tasks
    • storage.c — parameter_table[] read/written by HTTP POST handlers, UART handlers, and FSM task
    • 32-bit aligned reads/writes are atomic on ESP32, so u16/u32/i16/i32/f32 are safe
    • f64 (8 bytes) and str16 (16 bytes) could be torn reads — but no f64 or str params are read by FSM in hot path
    • Severity is low in practice but architecturally unsound
    • Extract pure logic (e-fuse thermal model, param serialization, sensor debounce) into host-testable modules with Unity/CMock?
    • UART integration test framework: Python runner + ESP-side test commands
    • Bug: WiFi fails to connect in STA mode except at first boot