{
"query": "You are a super intelligent assistant. Please answer all my questions precisely and comprehensively.\n\nThrough our system KIOS you have a Knowledge Base named upload chatbot status with all the informations that the user requests. In this knowledge base are following Documents crawler-issues-19MAR2025.txt, crawler-issues-19MAR2025(1).txt, crawler-issues-19MAR2025(2).txt, apacare-primer.txt, apacare-primer(1).txt, dupes.txt, apacare-primer(2).txt, chatbot-error.txt, link.txt, gpt-vector-dimension-error.txt, gemini-quota-error.txt, crawler-issues-19MAR2025 - Copy.txt, crawler-issues-19MAR2025 - Copy (2).txt, crawler-issues-19MAR2025 - Copy (2) - Copy.txt, crawler-issues-19MAR2025 - Copy (3).txt, crawler-issues-19MAR2025 - Copy (4).txt, crawler-issues-19MAR2025 - Copy (5).txt, crawler-issues-19MAR2025 - Copy (6).txt, crawler-issues-19MAR2025 - Copy (7).txt, crawler-issues-19MAR2025 - Copy (2) - Copy(1).txt, crawler-issues-19MAR2025 - Copy (8).txt, crawler-issues-19MAR2025 - Copy (9).txt, crawler-issues-19MAR2025 - Copy (10).txt, crawler-issues-19MAR2025 - Copy (11).txt, crawler-issues-19MAR2025 - Copy (12).txt, crawler-issues-19MAR2025 - Copy (13).txt, crawler-issues-19MAR2025 - Copy (14).txt, crawler-issues-19MAR2025 - Copy (15).txt, crawler-issues-19MAR2025 - Copy (16).txt, crawler-issues-19MAR2025 - Copy (17).txt, crawler-issues-19MAR2025 - Copy (18).txt, crawler-issues-19MAR2025 - Copy (19).txt, crawler-issues-19MAR2025 - Copy(1).txt, crawler-issues-19MAR2025 - Copy (22).txt, crawler-issues-19MAR2025 - Copy(2).txt, crawler-issues-19MAR2025 - Copy (23).txt, crawler-issues-19MAR2025 - Copy (24).txt, crawler-issues-19MAR2025 - Copy (25).txt, crawler-issues-19MAR2025 - Copy (26).txt, crawler-issues-19MAR2025 - Copy(3).txt, crawler-issues-19MAR2025 - Copy (3) - Copy.txt, crawler-issues-19MAR2025 - Copy (4) - Copy.txt, crawler-issues-19MAR2025 - Copy (5) - Copy.txt, crawler-issues-19MAR2025 - Copy (2) - Copy(2).txt, crawler-issues-19MAR2025 - Copy (3) - Copy(1).txt, crawler-issues-19MAR2025 - Copy (4) - Copy(1).txt, crawler-issues-19MAR2025 - Copy (2) - Copy - Copy.txt, crawler-issues-19MAR2025 - Copy (4) - Copy(2).txt, crawler-issues-19MAR2025 - Copy (3) - Copy(2).txt, crawler-issues-19MAR2025 - Copy (20).txt, crawler-issues-19MAR2025 - Copy (21).txt, crawler-issues-19MAR2025 - Copy (3) - Copy(3).txt, crawler-issues-19MAR2025 - Copy (3) - Copy(4).txt, crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy.txt, crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy - Copy.txt, crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy - Copy - Copy.txt, crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy - Copy - Copy(1).txt\n\nThis is the initial message to start the chat. Based on the following summary/context you should formulate an initial message greeting the user with the following user name [Gender] [Vorname] [Surname] tell them that you are the AI Chatbot Simon using the Large Language Model [Used Model] to answer all questions.\n\nFormulate the initial message in the Usersettings Language German\n\nPlease use the following context to suggest some questions or topics to chat about this knowledge base. List at least 3-10 possible topics or suggestions up and use emojis. The chat should be professional and in business terms. At the end ask an open question what the user would like to check on the list. Please keep the wildcards incased in brackets and make it easy to replace the wildcards. 
\n\n Hier ist eine Zusammenfassung des gesamten Kontexts, einschlie\u00dflich einer Zusammenfassung f\u00fcr jede Datei:\n\n**crawler-issues-19MAR2025 - Copy (4) - Copy(1).txt:** Diese Datei listet mehrere Probleme mit dem Crawler auf. Statusaktualisierungen funktionieren nicht korrekt, wenn verschiedene Crawler-Jobs fehlschlagen. Die Abschlusslogik ist in mehreren Jobs dupliziert, was zu Inkonsistenzen f\u00fchren kann. S3-Dateioperationen haben minimale Fehlerbehandlung. Es wird vorgeschlagen, die `knowledgebase_crawler_imports`-Tabelle anstelle des Caches zu verwenden und Z\u00e4hlungen in regelm\u00e4\u00dfigen Abst\u00e4nden statt in Echtzeit zu aktualisieren.\n\n**crawler-issues-19MAR2025 - Copy (3) - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (4) - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (5) - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy(2).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (3) - Copy(1).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy(3).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (21).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (20).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (3) - Copy(2).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (3) - Copy(4).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy (4) - Copy(2).txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025 - Copy.txt:** Identisch mit der vorherigen Datei.\n\n**apacare-primer(1).txt:** Diese Datei enth\u00e4lt Anweisungen f\u00fcr einen digitalen Vertriebsmitarbeiter von ApaCare. Der Mitarbeiter soll Kunden bei zahn\u00e4rztlichen Fragen in deutscher Sprache unterst\u00fctzen, Produkte von ApaCare empfehlen und relevante Videos einbetten.\n\n**apacare-primer(1).txt:** Identisch mit der vorherigen Datei.\n\n**apacare-primer.txt:** Identisch mit der vorherigen Datei.\n\n**apacare-primer.txt:** Identisch mit der vorherigen Datei.\n\n**crawler-issues-19MAR2025(2).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025.txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025(1).txt:** Identisch mit der ersten Datei.\n\n**dupes.txt:** Diese Datei enth\u00e4lt ein JSON-Objekt mit Daten, die scheinbar Duplikate von Webseiten-URLs darstellen. Die Daten enthalten IDs, `knowledgebase_crawler_id`, `page_id`, Zeitstempel, UUIDs, URLs und Dateipfade.\n\n**apacare-primer(2).txt:** Identisch mit apacare-primer(1).txt\n\n**apacare-primer(2).txt:** Identisch mit apacare-primer(1).txt\n\n**apacare-primer(2).txt:** Identisch mit apacare-primer(1).txt\n\n**apacare-primer(2).txt:** Identisch mit apacare-primer(1).txt\n\n**chatbot-error.txt:** Diese Datei zeigt einen `IndexError` in der Anwendung, der auftritt, weil die Liste `config.settings.GEMINI_API_KEYS` leer ist. 
Dies f\u00fchrt zu einem 500er Fehler im Chatbot.\n\n**link.txt:** Enth\u00e4lt einen YouTube-Link zu Rebecca Blacks \"Friday\".\n\n**gpt-vector-dimension-error.txt:** Diese Datei dokumentiert einen 500er Fehler im Chatbot aufgrund eines `PineconeApiException`. Die Fehlermeldung zeigt an, dass die Vektordimension (3072) nicht mit der Indexdimension (1536) \u00fcbereinstimmt.\n\n**gemini-quota-error.txt:** Diese Datei zeigt einen 429er Fehler (Quota \u00fcberschritten) f\u00fcr Google Vertex AI Gemini an. Der Fehler tritt auf, selbst ohne Lasttests.\n\n**crawler-issues-19MAR2025 - Copy (25).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (26).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (2).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy.txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (3).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (5).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (6).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy(1).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (4).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (7).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (14).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (13).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (12).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (8).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (9).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (10).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (11).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (22).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (23).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (24).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (18).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy(2).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (15).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy(1).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (19).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (17).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (16).txt:** Identisch mit der ersten Datei.\n\n**crawler-issues-19MAR2025 - Copy (2) - Copy - Copy - Copy - Copy - Copy(1).txt:** Identisch mit der ersten Datei.\n\n\nZusammenfassend l\u00e4sst sich sagen, dass der Kontext haupts\u00e4chlich aus mehreren Kopien derselben Datei mit Problemen im Zusammenhang mit einem Webcrawler besteht, sowie aus Anweisungen f\u00fcr einen ApaCare-Vertriebsmitarbeiter und Fehlerprotokollen f\u00fcr einen Chatbot. Ein bedeutender Teil des Kontexts wiederholt sich.\n",
"namespace": "02aaa563-baaa-43bf-b820-2dddbf405a51",
"messages": [],
"stream": false,
"language_level": "",
"chat_channel": "",
"language": "German",
"tone": "neutral",
"writing_style": "standard",
"model": "gemini-1.5-flash",
"knowledgebase": "ki-dev-large",
"seed": 0,
"client_id": 0,
"all_context": true,
"follow_up_for": null,
"knowledgebase_files_count": 0,
"override_command": "",
"disable_clarity_check": true,
"high_consistency": false,
"custom_primer": "",
"logging": true,
"query_route": "",
"web_search": false
}
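
For orientation, a minimal sketch of how a client might submit the request body above. The endpoint URL and the use of the requests library are assumptions; the log does not show how the payload is actually delivered:

import requests  # assumption: the payload is sent as a plain HTTP POST

# Hypothetical endpoint -- the real KIOS chat URL does not appear in this log.
CHAT_URL = "https://kios.example.com/api/chat"

payload = {
    "query": "You are a super intelligent assistant. ...",  # full primer as above
    "namespace": "02aaa563-baaa-43bf-b820-2dddbf405a51",
    "messages": [],
    "stream": False,
    "language": "German",
    "tone": "neutral",
    "model": "gemini-1.5-flash",
    "knowledgebase": "ki-dev-large",
    "all_context": True,
    "logging": True,
}

response = requests.post(CHAT_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())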
QUERY ROUTE
Query Route: summary
==================================================
**Elapsed Time: 1.67 seconds**
==================================================
RAG PARAMS
RAG Parameters: {'dynamically_expand': False, 'top_k': 120, 'actual_k': 120, 'satisfying_score': 0}
==================================================
**Elapsed Time: 0.00 seconds**
==================================================
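These parameters describe a fixed-size retrieval: top_k equals actual_k at 120, with no dynamic expansion and a satisfying score of 0. A minimal sketch of the equivalent lookup with the Pinecone Python client (the PineconeApiException in gpt-vector-dimension-error.txt suggests Pinecone is the vector store); the API key handling, index name, and placeholder embedding are assumptions:

from pinecone import Pinecone

pc = Pinecone(api_key="...")      # assumption: credentials come from app config
index = pc.Index("ki-dev-large")  # assumed to match the 'knowledgebase' field

# Placeholder embedding. Per gpt-vector-dimension-error.txt, this index expects
# dimension 1536; an embedding of dimension 3072 triggered a PineconeApiException.
query_embedding = [0.0] * 1536

results = index.query(
    vector=query_embedding,
    top_k=120,                                        # 'top_k'/'actual_k' above
    namespace="02aaa563-baaa-43bf-b820-2dddbf405a51",
    include_metadata=True,   # matches the 'metadata' blocks in the results below
    include_values=False,    # matches the empty 'values' lists below
)
# results["matches"] has the same shape as the 'main_results' entries below.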
VECTOR SEARCH RESULTS
Results: {'main_results': [{'id': '58fa642a-5787-4e3d-b93e-bbb22401d5ee',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%2821%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '65c26262-284f-4b9f-84fd-e54d21891c86',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%2820%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '9ff593a1-8c3d-41fa-afb8-0b59fb3ae6e4',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%283%29%20-%20Copy%282%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'a5256fdb-6511-400f-a60f-773ec1cf21da',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%283%29%20-%20Copy%284%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '0dd1ad7f-896b-463b-ae54-b246891a3695',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%284%29%20-%20Copy%282%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '2dffd0ae-05b5-4108-835e-7f4b7a5e93d7',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%282%29%20-%20Copy%20-%20Copy%20-%20Copy%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '87b2cdc1-dd18-4ed1-819c-830797299054',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%282%29%20-%20Copy%20-%20Copy%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '71442661-a788-473c-a84b-a7c9e41bae11',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%283%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'ca5dbe5a-f448-4afb-839d-92d439e37f96',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%284%29%20-%20Copy%281%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'cace0fce-afd5-461c-801c-81355a90e16d',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%283%29%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'bbad564e-be05-4efa-b789-29dc6fd18b82',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%284%29%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '98c459bf-ab60-419d-8e69-9ef5eb9ff116',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%285%29%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'bed67768-922b-4c7d-8165-eee3163909a0',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%282%29%20-%20Copy%282%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'e9692916-cddb-4f9b-b62a-3940420a1bcc',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%282%29%20-%20Copy%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '5f7830ee-054e-4360-8f52-115d00616401',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%283%29%20-%20Copy%281%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '76094636-cc16-4d62-82b3-12553d03f819',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%2825%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'b93ce251-56e3-4c7c-92e8-b9a2d1710560',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%2826%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'b092b516-4e4c-4dba-80ab-f0e487b57742',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%282%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '4610d233-61c0-4f5c-87e6-2a30dae73ed7',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%282%29%20-%20Copy.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': '517e3caa-34e0-4e9d-a7f4-6fc6b2feb700',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%283%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'a0f6eb82-8f33-4811-8c3a-645f729b6c4d',
'metadata': {'chunk': 0.0,
'file_name': 'crawler-issues-19MAR2025%20-%20Copy%20%285%29.txt',
'is_dict': 'no',
'text': '- if CrawlerJob fails statues will never update, import '
'status wont update\r\n'
'(add failed() method -> create CrawlerProcess with '
'failed status, record last process time??)\r\n'
'- if CrawlerProcessJob fails before recording last '
'process time '
'("Cache::put($processCrawler->lastCrawlerProcessTimeCacheKey(), '
'now());") the status will never upate\r\n'
'- importing failed Crawler pages still marked '
'success\r\n'
'- if CrawlerFilesJob fails CrawlerProcess status wont '
'update\r\n'
'- if CrawlerPrepareKnowledgebaseTrainingJob fails '
'import status wont update\r\n'
'- CrawlerFilesProcessTrainingJob@handleProcessingError '
'-- failed items are marked as processed/success.\r\n'
'should be markItemAsFailed() same as in '
'CrawlerPageProcessTrainingJob?\r\n'
'\r\n'
'- Finalizing Logic Duplication\r\n'
'The completion checking and finalization logic is '
'duplicated across multiple jobs:\r\n'
'\r\n'
'CrawlerPageProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CrawlerFilesProcessTrainingJob::checkCompletionAndFinalize\r\n'
'CheckKnowledgebaseCrawlerImportCompletion::handle\r\n'
'\r\n'
'Each has subtle differences, creating opportunities for '
'inconsistent behavior.\r\n'
'\r\n'
'- Unreliable S3 File Operations\r\n'
'File operations on S3 have minimal error handling:\r\n'
'\r\n'
'$this->filesystem->put($s3Path, $newContent);\r\n'
'return $this->filesystem->url($s3Path);\r\n'
'\r\n'
'If the S3 put operation fails silently, subsequent code '
'would continue with a URL to a non-existent file.\r\n'
'\r\n'
'- try using knowledgebase_crawler_imports table instead '
"of cache for counting since it's already "
'implemented?\r\n'
'update counts every x seconds instead of realtime '
'updates?\r\n'
'\r\n'
'- CrawlerFileProcessTrainingJob and/or '
'CrawlerPageProcessTrainingJob failure not marking '
'KnowledgebaseCrawler as fail\r\n'
'- KnowledgebaseCrawlerImport fails getting deleted '
'after'},
'score': 0.0,
'values': []}, {'id': 'd39117c1-58d3-439c-aa3e-424b4b01a2d6',
'metadata': {'chunk': 0.0,
'file_name': 'apacare-primer%281%29.txt',
'is_dict': 'no',
'text': 'You are a digital sales rep for ApaCare, a dental care '
'company. Please assist clients with their '
'dental-related questions.\r\n'
'Use German in your responses.\r\n'
'\r\n'
'Start by asking a general question:\r\n'
'"Are you looking for a specific type of dental product '
'or advice?"\r\n'
'\r\n'
'If they are looking for advice, proceed with a '
'questionnaire about their dental care needs:\r\n'
'Are they focusing on whitening, sensitivity, gum '
'health, or general hygiene?\r\n'
'Ask questionnaire-style questions to have clients '
'describe their problems.\r\n'
'If they are looking for dental products:\r\n'
'give them a product suggestion from ApaCare only.\r\n'
'If they are not looking for dental products or advice, '
'skip to general suggestions or conversation.\r\n'
'\r\n'
'Once the questionnaire is complete:\r\n'
'Suggest a product and do not repeat the questionnaire '
'unless explicitly requested.\r\n'
'Format the questionnaire to be readable for the users, '
'like a list or similar.\r\n'
'\r\n'
'When suggesting a product:\r\n'
"Look for the relevant product's page in the context.\r\n"
'Provide a detailed suggestion with an anchor tag link. '
'Ensure the target attribute is set to "_blank" and use '
'this format:\r\n'
'\r\n'
'<a href="[product page URL]" target="_blank">[replace this with the product name]</a>\r\n'
'\r\n'
'All links should have the "_blank" target attribute.\r\n'
"Don't translate link hrefs to German.\r\n"
'\r\n'
'Include related video suggestions:\r\n'
'\r\n'
'Search YouTube for videos about the product or topic '
'(e.g., how to use an electric toothbrush, flossing '
'techniques).\r\n'
'Embed the video in an iframe using this format:\r\n'
'\r\n'
'<iframe src="[YouTube embed URL]" allowfullscreen></iframe>\r\n'
'\r\n'
'For Google Drive videos, append /preview to the link '
'and embed it:\r\n'
'\r\n'
'<iframe src="[Google Drive link]/preview"></iframe>\r\n'
'\r\n'
'For public URL video links, use the
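
The deduplicated crawler notes above converge on three fixes: a failed() hook so dying jobs still update statuses, a guarded S3 write, and table-based counting. The following is a minimal PHP sketch of those ideas only, assuming the Laravel queue and filesystem APIs visible in the snippets; the crawler relation, the 'failed' status value, and the cache key format are illustrative assumptions, not the actual codebase.

<?php

use Illuminate\Support\Facades\Cache;

class CrawlerJob // implements ShouldQueue in the real codebase
{
    // Laravel invokes failed() after the job exhausts its attempts, so a
    // dying CrawlerJob can no longer leave the crawler in a stale status.
    public function failed(\Throwable $e): void
    {
        // Create a CrawlerProcess with failed status, as the notes suggest.
        // The processes() relation name is hypothetical.
        $this->crawler->processes()->create(['status' => 'failed']);

        // Record the last process time so later status checks still advance.
        Cache::put($this->crawler->lastCrawlerProcessTimeCacheKey(), now());
    }

    // Guarded S3 write: put() returns false on failure, so abort instead of
    // handing callers a URL to a file that was never written.
    private function storeOnS3(string $s3Path, string $newContent): string
    {
        if (! $this->filesystem->put($s3Path, $newContent)) {
            throw new \RuntimeException("S3 put failed for {$s3Path}");
        }

        return $this->filesystem->url($s3Path);
    }
}

For the counting idea, a scheduled query against the knowledgebase_crawler_imports table could replace the realtime cache increments (everyThirtySeconds() needs Laravel 10+; the cache key format below is an assumption):

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;

$schedule->call(function () {
    // Recount imports per crawler in one query instead of realtime updates.
    DB::table('knowledgebase_crawler_imports')
        ->selectRaw('knowledgebase_crawler_id, count(*) as total')
        ->groupBy('knowledgebase_crawler_id')
        ->pluck('total', 'knowledgebase_crawler_id')
        ->each(fn ($total, $crawlerId) => Cache::put(
            "crawler:{$crawlerId}:import_count", $total
        ));
})->everyThirtySeconds();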