RAG ATT&CK

This notebook illustrates how to integrate Large Language Models (LLMs) with the MITRE ATT&CK framework. By retrieving relevant ATT&CK knowledge and passing it to an LLM, analysts can dynamically generate context-rich content tailored for threat intelligence, enhancing cybersecurity research and analysis.

Original notebook can be found here: https://otrf.github.io/GPT-Security-Adventures/experiments/ATTCK-GPT/notebook.html#generate-knowledge-base-embeddings

Initialization

NB: The foundational knowledge and the associated markdown files for the ATT&CK groups were pre-generated using attackcti, courtesy of @cyb3rward0g.
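For reference, a simplified sketch of how such group files could be regenerated with attackcti is shown below. It only writes each group's name, aliases, and description (not the full techniques table produced by the original group.md template), and the STIX field names are assumptions based on the ATT&CK data:

import os
from attackcti import attack_client

# Pull all ATT&CK group objects (STIX format) from the TAXII server
lift = attack_client()
groups = lift.get_groups()

os.makedirs("knowledge", exist_ok=True)
for group in groups:
    name = group["name"]
    aliases = ", ".join(group.get("aliases", []))
    description = group.get("description", "")
    filename = os.path.join("knowledge", f"{name.replace(' ', '_')}.md")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"{name}\n\nAliases\n\n{aliases}\n\nDescription\n\n{description}\n")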

Here, we use LangChain's document loaders to load our ATT&CK Markdown files, setting the stage for the data-driven and interactive tasks that follow.

In [2]:
# Import
import os
# Define local variables
# Note: os.path.dirname("__file__") evaluates to "" in a notebook, so the paths below resolve relative to the current working directory
current_directory = os.path.dirname("__file__")
knowledge_directory = os.path.join(current_directory, "knowledge")
db_directory = os.path.join(current_directory, "db")
templates_directory = os.path.join(current_directory, "templates")
group_template = os.path.join(templates_directory, "group.md")
In [3]:
import glob
from langchain.document_loaders import UnstructuredMarkdownLoader
In [4]:
# Using glob to find all Markdown files in the knowledge_directory
# The "*.md" means it will look for all files ending with .md (Markdown files)
group_files = glob.glob(os.path.join(knowledge_directory, "*.md"))

# Initializing an empty list to store the content of Markdown files
md_docs = []

# Start of the Markdown file loading process
print("[+] Loading Group markdown files..")

# Loop through each Markdown file path in group_files
for group in group_files:
    # print(f' [*] Loading {os.path.basename(group)}')
    
    # Create an instance of UnstructuredMarkdownLoader to load the content of the current Markdown file
    loader = UnstructuredMarkdownLoader(group)
    
    # Load the content and extend the md_docs list with it
    md_docs.extend(loader.load())

# Print the total number of Markdown documents processed
print(f'[+] Number of .md documents processed: {len(md_docs)}')
[+] Loading Group markdown files..
[+] Number of .md documents processed: 134
In [5]:
# Display the page content of one of the loaded documents
print(md_docs[5].page_content)
APT-C-36 - G0099

Created: 2020-05-05T18:53:08.166Z

Modified: 2021-05-26T20:17:53.085Z

Contributors: Jose Luis Sánchez Martinez

Aliases

APT-C-36,Blind Eagle

Description

APT-C-36 is a suspected South America espionage group that has been active since at least 2018. The group mainly targets Colombian government institutions as well as important corporations in the financial sector, petroleum industry, and professional manufacturing.(Citation: QiAnXin APT-C-36 Feb2019)

Techniques Used

APT-C-36 obtained and used a modified variant of

Imminent Monitor.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1105|Ingress Tool Transfer|

APT-C-36 has downloaded binary data from a specified domain after the malicious document is opened.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Windows,macOS,Linux|T1059.005|Visual Basic|

APT-C-36 has embedded a VBScript within a malicious Word document which is executed upon the document opening.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Windows,Linux,macOS|T1036.004|Masquerade Task or Service|

APT-C-36 has disguised its scheduled tasks as those used by Google.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1571|Non-Standard Port|

APT-C-36 has used port 4050 for C2 communications.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1204.002|Malicious File|

APT-C-36 has prompted victims to accept macros in order to execute the subsequent payload.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Windows|T1053.005|Scheduled Task|

APT-C-36 has used a macro function to set scheduled tasks, disguised as those used by Google.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|Linux,macOS,Windows|T1027|Obfuscated Files or Information|

APT-C-36 has used ConfuserEx to obfuscate its variant of

Imminent Monitor, compressed payload and RAT packages, and password protected encrypted email attachments to avoid detection.(Citation: QiAnXin APT-C-36 Feb2019)|
|mitre-attack|enterprise-attack|macOS,Windows,Linux|T1566.001|Spearphishing Attachment|

APT-C-36 has used spearphishing emails with password protected RAR attachment to avoid being detected by the email gateway.(Citation: QiAnXin APT-C-36 Feb2019) |

Tokenization

Tokenization is the process of converting a sequence of text into individual units, known as "tokens." These tokens can be as small as characters or as long as words, depending on the specific requirements of the task and the language of the text. Tokenization is a crucial pre-processing step in Natural Language Processing (NLP) and text analytics applications.

How Tokenization Works:

  1. Input Text: The process starts with a raw text string.
  2. Token Identification: The tokenizer identifies the boundaries that separate tokens. These boundaries could be spaces, punctuation marks, or specific characters.
  3. Token Extraction: Once the boundaries are identified, the text is split into individual tokens.
  4. Optional: Token Encoding: In some cases, tokens are further encoded into numerical values, which are more easily processed by machine learning models.
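As a quick standalone illustration of these four steps, the same tiktoken encoder used in the next cell can round-trip a short, arbitrary string:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("Spearphishing remains a common initial access technique.")
print(len(ids), "tokens")   # token count
print(ids)                  # integer token IDs
print(enc.decode(ids))      # decoding the IDs reproduces the original string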
In [6]:
# Import the tiktoken library
import tiktoken

# Initialize the tokenizer for the GPT-4 model
# The function encoding_for_model returns a tokenizer configured for the specified model ('gpt-4' in this case)
tokenizer = tiktoken.encoding_for_model('gpt-4')

# Tokenize the content of the first Markdown document in the md_docs list
# The encode method converts the text into a list of integers, each representing a token
# disallowed_special=() ensures that certain special tokens are not included in the output
token_integers = tokenizer.encode(md_docs[0].page_content, disallowed_special=())

# Count the number of tokens generated
# This is useful for understanding the size of the text and for cost estimation if using OpenAI's API
num_tokens = len(token_integers)

# Decode the integer tokens back to bytes
# This is done using the decode_single_token_bytes method
# This step is optional and is generally used for debugging or analysis
token_bytes = [tokenizer.decode_single_token_bytes(token) for token in token_integers]

# Print the results
# Display the total number of tokens, the integer representation of tokens, and their byte representation
print()
print(f"token count: {num_tokens} tokens")
print(f"token integers: {token_integers}")
print(f"token bytes: {token_bytes}")
token count: 532 tokens
token integers: [2953, 31, 18633, 482, 480, 4119, 23, 271, 11956, 25, 220, 679, 22, 12, 2304, 12, 2148, 51, 1691, 25, 2148, 25, 4331, 13, 24847, 57, 271, 19696, 25, 220, 2366, 15, 12, 2839, 12, 972, 51, 777, 25, 4370, 25, 2946, 13, 4364, 57, 271, 54084, 9663, 25, 350, 1900, 45644, 423, 1339, 16900, 11, 34711, 16777, 10181, 11, 4953, 382, 96309, 271, 2953, 31, 18633, 271, 5116, 271, 2953, 31, 18633, 374, 264, 5734, 6108, 21516, 6023, 1912, 13, 1102, 706, 8767, 1511, 502, 2332, 34594, 4455, 439, 326, 1439, 311, 6493, 40831, 323, 706, 15871, 17550, 11351, 6532, 304, 6020, 11, 7100, 11, 323, 6696, 4947, 11, 11383, 1701, 17880, 2561, 98980, 82, 1778, 439, 52212, 40, 14029, 11, 439, 1664, 439, 1063, 2536, 57571, 1203, 28404, 13, 320, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 696, 29356, 8467, 12477, 271, 2953, 31, 18633, 706, 3288, 14633, 449, 39270, 5210, 8410, 9477, 12673, 13127, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 8, 7511, 91, 1800, 265, 12, 21208, 91, 79034, 12, 21208, 91, 47424, 11, 12214, 3204, 11, 13466, 91, 51, 4364, 19, 13, 6726, 91, 30700, 9824, 2958, 44838, 2953, 31, 18633, 706, 17644, 311, 636, 12697, 311, 7195, 39270, 5210, 9506, 34779, 12886, 4669, 41963, 764, 11218, 14633, 13127, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 8, 7511, 91, 1800, 265, 12, 21208, 91, 79034, 12, 21208, 91, 47424, 11, 13466, 11, 12214, 3204, 91, 51, 4364, 18, 91, 8193, 385, 7709, 369, 8589, 32028, 44838, 2953, 31, 18633, 706, 51763, 3016, 3241, 52227, 369, 11572, 11, 1778, 439, 5210, 9506, 46869, 12, 679, 17, 12, 16037, 23, 13127, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 8, 7511, 91, 1800, 265, 12, 21208, 91, 79034, 12, 21208, 91, 47424, 11, 12214, 3204, 11, 13466, 91, 51, 6640, 22, 13, 4119, 91, 7469, 8785, 44838, 2953, 31, 18633, 20142, 1511, 279, 2768, 11545, 2768, 40761, 315, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 13555, 1217, 9815, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 3290, 311, 30174, 832, 315, 872, 7526, 311, 264, 65309, 1052, 836, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 3290, 2768, 40761, 315, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 1160, 2254, 5315, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 11545, 1306, 71701, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 6994, 2038, 922, 279, 10293, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 3290, 1306, 71701, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 21953, 2038, 922, 2254, 14488, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 3290, 2768, 40761, 315, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 6994, 2038, 922, 3600, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 3290, 2768, 40761, 315, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 3113, 4009, 13537, 1473, 2953, 31, 18633, 20142, 1511, 279, 2768, 11545, 1306, 71701, 264, 5780, 449, 271, 9628, 79580, 40831, 311, 6994, 2038, 922, 3626, 323, 29725, 1473, 9628, 79580, 40831, 3638, 2953, 31, 18633, 20142, 3549, 264, 1052, 8649, 264, 1160, 315, 11545, 311, 387, 16070, 389, 279, 44500, 6500, 13127, 34, 7709, 25, 6785, 51158, 4074, 31, 18633, 18419]
token bytes: [b'admin', b'@', b'338', b' -', b' G', b'001', b'8', b'\n\n', b'Created', b':', b' ', b'201', b'7', b'-', b'05', b'-', b'31', b'T', b'21', b':', b'31', b':', b'53', b'.', b'579', b'Z', b'\n\n', b'Modified', b':', b' ', b'202', b'0', b'-', b'03', b'-', b'18', b'T', b'19', b':', b'54', b':', b'59', b'.', b'120', b'Z', b'\n\n', b'Contrib', b'utors', b':', b' T', b'ats', b'uya', b' D', b'ait', b'oku', b',', b' Cyber', b' Defense', b' Institute', b',', b' Inc', b'.\n\n', b'Aliases', b'\n\n', b'admin', b'@', b'338', b'\n\n', b'Description', b'\n\n', b'admin', b'@', b'338', b' is', b' a', b' China', b'-based', b' cyber', b' threat', b' group', b'.', b' It', b' has', b' previously', b' used', b' new', b'sw', b'orthy', b' events', b' as', b' l', b'ures', b' to', b' deliver', b' malware', b' and', b' has', b' primarily', b' targeted', b' organizations', b' involved', b' in', b' financial', b',', b' economic', b',', b' and', b' trade', b' policy', b',', b' typically', b' using', b' publicly', b' available', b' RAT', b's', b' such', b' as', b' Poison', b'I', b'vy', b',', b' as', b' well', b' as', b' some', b' non', b'-public', b' back', b'doors', b'.', b' (', b'C', b'itation', b':', b' Fire', b'Eye', b' admin', b'@', b'338', b')\n\n', b'Techn', b'iques', b' Used', b'\n\n', b'admin', b'@', b'338', b' has', b' sent', b' emails', b' with', b' malicious', b' Microsoft', b' Office', b' documents', b' attached', b'.(', b'C', b'itation', b':', b' Fire', b'Eye', b' admin', b'@', b'338', b')', b'|\n', b'|', b'mit', b're', b'-', b'attack', b'|', b'enterprise', b'-', b'attack', b'|', b'Linux', b',', b'mac', b'OS', b',', b'Windows', b'|', b'T', b'120', b'4', b'.', b'002', b'|', b'Mal', b'icious', b' File', b'|\n\n', b'admin', b'@', b'338', b' has', b' attempted', b' to', b' get', b' victims', b' to', b' launch', b' malicious', b' Microsoft', b' Word', b' attachments', b' delivered', b' via', b' spear', b'ph', b'ishing', b' emails', b'.(', b'C', b'itation', b':', b' Fire', b'Eye', b' admin', b'@', b'338', b')', b'|\n', b'|', b'mit', b're', b'-', b'attack', b'|', b'enterprise', b'-', b'attack', b'|', b'Linux', b',', b'Windows', b',', b'mac', b'OS', b'|', b'T', b'120', b'3', b'|', b'Exp', b'lo', b'itation', b' for', b' Client', b' Execution', b'|\n\n', b'admin', b'@', b'338', b' has', b' exploited', b' client', b' software', b' vulnerabilities', b' for', b' execution', b',', b' such', b' as', b' Microsoft', b' Word', b' CVE', b'-', b'201', b'2', b'-', b'015', b'8', b'.(', b'C', b'itation', b':', b' Fire', b'Eye', b' admin', b'@', b'338', b')', b'|\n', b'|', b'mit', b're', b'-', b'attack', b'|', b'enterprise', b'-', b'attack', b'|', b'Linux', b',', b'mac', b'OS', b',', b'Windows', b'|', b'T', b'108', b'7', b'.', b'001', b'|', b'Local', b' Account', b'|\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' commands', b' following', b' exploitation', b' of', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' enumerate', b' user', b' accounts', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' command', b' to', b' rename', b' one', b' of', b' their', b' tools', b' to', b' a', b' benign', b' file', b' name', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' command', b' following', b' exploitation', b' of', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' list', b' local', b' groups', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' 
following', b' commands', b' after', b' exploiting', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' obtain', b' information', b' about', b' the', b' OS', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' command', b' after', b' exploiting', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' acquire', b' information', b' about', b' local', b' networks', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' command', b' following', b' exploitation', b' of', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' obtain', b' information', b' about', b' services', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' command', b' following', b' exploitation', b' of', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' display', b' network', b' connections', b':\n\n', b'admin', b'@', b'338', b' actors', b' used', b' the', b' following', b' commands', b' after', b' exploiting', b' a', b' machine', b' with', b'\n\n', b'LOW', b'BALL', b' malware', b' to', b' obtain', b' information', b' about', b' files', b' and', b' directories', b':\n\n', b'LOW', b'BALL', b' malware', b',\n\n', b'admin', b'@', b'338', b' actors', b' created', b' a', b' file', b' containing', b' a', b' list', b' of', b' commands', b' to', b' be', b' executed', b' on', b' the', b' compromised', b' computer', b'.(', b'C', b'itation', b':', b' Fire', b'Eye', b' admin', b'@', b'338', b')|']
In [7]:
# Define a function called tiktoken_len to calculate the number of tokens in a given text
def tiktoken_len(text):
    # Use the tokenizer's encode method to tokenize the input text
    # The disallowed_special=() parameter ensures that special tokens are not included in the tokenization
    tokens = tokenizer.encode(
        text,
        disallowed_special=()  # To disable this check for all special tokens
    )
    # Return the number of tokens generated
    return len(tokens)

# Create a list called token_counts to store the number of tokens for each Markdown document in md_docs
# The tiktoken_len function is used to calculate the token count for each document's content
token_counts = [tiktoken_len(doc.page_content) for doc in md_docs]

# Print the statistics related to token counts
# Calculate and display the minimum, average, and maximum number of tokens across all Markdown documents
print(f"""[+] Token Counts:
Min: {min(token_counts)}  # Minimum number of tokens across all documents
Avg: {int(sum(token_counts) / len(token_counts))}  # Average number of tokens across all documents
Max: {max(token_counts)}  # Maximum number of tokens across all documents
""")
[+] Token Counts:
Min: 176  # Minimum number of tokens across all documents
Avg: 1619  # Average number of tokens across all documents
Max: 7346  # Maximum number of tokens across all documents

Split documents

The goal of the "Recursively split by character" method is to split a text into smaller chunks based on a list of characters. The method tries to split the text on these characters in order until the resulting chunks are small enough. The default list of characters used for splitting is ["\n\n", "\n", " ", ""]. This method aims to keep paragraphs, sentences, and words together as much as possible, as these are typically semantically related pieces of text. By default the chunk size is measured in characters; below we pass our tiktoken_len function as the length_function, so chunks are measured in tokens instead.
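Before applying it to the ATT&CK documents, here is a small, self-contained sketch of the splitter's behaviour on a throwaway string; the tiny chunk size is purely illustrative and, since no length_function is supplied here, it is counted in characters (the default):

from langchain.text_splitter import RecursiveCharacterTextSplitter

demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=60,       # characters, the default unit when no length_function is given
    chunk_overlap=10,    # characters shared between adjacent chunks
    separators=["\n\n", "\n", " ", ""],
)
sample = (
    "First paragraph about a threat group.\n\n"
    "Second paragraph describing one of its techniques in more detail."
)
for chunk in demo_splitter.split_text(sample):
    print(repr(chunk))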

In [8]:
# Import the RecursiveCharacterTextSplitter class from the langchain library
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Print a message indicating the initialization of RecursiveCharacterTextSplitter
print('[+] Initializing RecursiveCharacterTextSplitter..')

# Create an instance of RecursiveCharacterTextSplitter with specified parameters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Maximum number of tokens in each chunk
    chunk_overlap=50,  # Number of tokens that will overlap between adjacent chunks
    length_function=tiktoken_len,  # Function to calculate the number of tokens in a text
    separators=['\n\n', '\n', ' ', '']  # List of separators used to split the text into chunks
)
[+] Initializing RecursiveCharacterTextSplitter..
In [9]:
print('[+] Splitting documents in chunks..')
chunks = text_splitter.split_documents(md_docs)

print(f'[+] Number of documents: {len(md_docs)}')
print(f'[+] Number of chunks: {len(chunks)}')
[+] Splitting documents in chunks..
[+] Number of documents: 134
[+] Number of chunks: 534
In [10]:
print(chunks[1])
page_content='LOWBALL malware to obtain information about services:\n\nadmin@338 actors used the following command following exploitation of a machine with\n\nLOWBALL malware to display network connections:\n\nadmin@338 actors used the following commands after exploiting a machine with\n\nLOWBALL malware to obtain information about files and directories:\n\nLOWBALL malware,\n\nadmin@338 actors created a file containing a list of commands to be executed on the compromised computer.(Citation: FireEye admin@338)|' metadata={'source': 'knowledge\\admin@338.md'}

Embedding

What it is: Embedding is a way to convert words or phrases into numbers (vectors) so that a computer can understand and work with them.

Why it's useful: Once text is converted into numbers, it's easier to see how similar different words or sentences are, and to perform tasks like searching and classification.

FAISS (Facebook AI Similarity Search)

What it is: FAISS is a tool developed by Facebook that helps you quickly find items that are similar to a given item, based on their numerical (vector) representation.

Why it's useful: Imagine you have a huge library of books, and you want to find the ones most similar to a particular book. FAISS helps you do this very quickly, even if your library is enormous.

Vectors

What they are: A vector is just a list of numbers. In the context of embeddings and FAISS, each number in the vector represents some feature or characteristic of the text.

Why they're useful: Vectors make it easy for computers to understand and compare things. For example, the vector for the word "apple" might be closer to the vector for "fruit" than to the vector for "car," helping the computer understand that apples are more related to fruits than to cars.

So, in summary:

Embedding turns text into vectors. Vectors are lists of numbers that computers can easily work with. FAISS uses these vectors to quickly find similar items in a large dataset.
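As a hedged, concrete version of the "apple vs. fruit vs. car" intuition, the OpenAIEmbeddings class used in the next cells also exposes embed_query for single strings, and a simple cosine similarity shows which pair of vectors is closer (this call requires a valid OPENAI_API_KEY, and the example words are arbitrary):

import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

emb = OpenAIEmbeddings()

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple, fruit, car = (emb.embed_query(word) for word in ("apple", "fruit", "car"))
print("apple vs fruit:", cosine(apple, fruit))   # expected to be the larger value
print("apple vs car:  ", cosine(apple, car))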

In [11]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
import openai
import os
In [12]:
# Get your key: https://platform.openai.com/account/api-keys
openai.api_key = os.getenv("OPENAI_API_KEY")
In [13]:
print("[+] Starting embedding..")
embeddings = OpenAIEmbeddings()

# Send text chunks to OpenAI Embeddings API
print("[+] Sending chunks to OpenAI Embeddings API..")
db = FAISS.from_documents(chunks, embeddings)
[+] Starting embedding..
[+] Sending chunks to OpenAI Embeddings API..
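Note that db_directory was defined during initialization but is never used above. To avoid re-embedding on every run, the FAISS store can be persisted and reloaded; save_local and load_local are standard LangChain FAISS methods, though the exact signature (e.g. extra safety flags) varies across LangChain versions, so treat this as a sketch:

# Persist the index to the db directory defined during initialization
db.save_local(db_directory)

# On a later run, reload it instead of calling FAISS.from_documents again
db = FAISS.load_local(db_directory, embeddings)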

Vector store retrieval

In [14]:
retriever = db.as_retriever(search_kwargs={"k":5})
In [15]:
query = "What are some phishing techniques used by threat actors?"
In [16]:
print("[+] Getting relevant documents for query..")
relevant_docs = retriever.get_relevant_documents(query)
relevant_docs
[+] Getting relevant documents for query..
Out[16]:
[Document(page_content='TA505 has used spearphishing emails with malicious attachments to initially compromise victims.(Citation: Proofpoint TA505 Sep 2017)(Citation: Proofpoint TA505 June 2018)(Citation: Proofpoint TA505 Jan 2019)(Citation: Cybereason TA505 April 2019)(Citation: ProofPoint SettingContent-ms July 2018)(Citation: Proofpoint TA505 Mar 2018)(Citation: Trend Micro TA505 June 2019)(Citation: Proofpoint TA505 October 2019)(Citation: IBM TA505 April 2020)|', metadata={'source': 'knowledge\\TA505.md'}),
 Document(page_content="APT33 utilized PowerShell scripts to establish command and control and install files for execution. (Citation: Symantec March 2019) (Citation: Dragos)|\n|mitre-attack|enterprise-attack,ics-attack|Engineering Workstation,Human-Machine Interface,Control Server,Data Historian|T0865|Spearphishing Attachment|\n\nAPT33 sent spear phishing emails containing links to HTML application files, which were embedded with malicious code. (Citation: Jacqueline O'Leary et al. September 2017)\n\nAPT33 has conducted targeted spear phishing campaigns against U.S. government agencies and private sector companies. (Citation: Andy Greenburg June 2019)|", metadata={'source': 'knowledge\\APT33.md'}),
 Document(page_content='Nomadic Octopus has targeted victims with spearphishing emails containing malicious attachments.(Citation: Security Affairs DustSquad Oct 2018)(Citation: ESET Nomadic Octopus 2018)|', metadata={'source': 'knowledge\\Nomadic_Octopus.md'}),
 Document(page_content='APT39 has maintained persistence using the startup folder.(Citation: FireEye APT39 Jan 2019)|\n|mitre-attack|enterprise-attack|Linux,macOS,Windows,Office 365,SaaS,Google Workspace|T1566.002|Spearphishing Link|\n\nAPT39 leveraged spearphishing emails with malicious links to initially compromise victims.(Citation: FireEye APT39 Jan 2019)(Citation: FBI FLASH APT39 September 2020)|\n|mitre-attack|enterprise-attack|macOS,Windows,Linux|T1566.001|Spearphishing Attachment|\n\nAPT39 leveraged spearphishing emails with malicious attachments to initially compromise victims.(Citation: FireEye APT39 Jan 2019)(Citation: Symantec Chafer February 2018)(Citation: FBI FLASH APT39 September 2020)|', metadata={'source': 'knowledge\\APT39.md'}),
 Document(page_content='DarkHydrus leveraged PowerShell to download and execute additional scripts for execution.(Citation: Unit 42 DarkHydrus July 2018)(Citation: Unit 42 Playbook Dec 2017)|\n|mitre-attack|enterprise-attack|macOS,Windows,Linux|T1566.001|Spearphishing Attachment|\n\nDarkHydrus has sent spearphishing emails with password-protected RAR archives containing malicious Excel Web Query files (.iqy). The group has also sent spearphishing emails that contained malicious Microsoft Office documents that use the “attachedTemplate” technique to load a template from a remote server.(Citation: Unit 42 DarkHydrus July 2018)(Citation: Unit 42 Phishery Aug 2018)(Citation: Unit 42 Playbook Dec 2017)|', metadata={'source': 'knowledge\\DarkHydrus.md'})]

ATT&Chatter: Your Own MITRE Assistant

In [18]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
In [19]:
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
chain.run(input_documents=relevant_docs, question=query)
Out[19]:
' Threat actors have used spearphishing emails with malicious attachments, links to HTML application files embedded with malicious code, password-protected RAR archives containing malicious Excel Web Query files, and malicious Microsoft Office documents that use the “attachedTemplate” technique to load a template from a remote server.'
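Since each retrieved chunk carries its source file in metadata (visible in the retriever output above), a small variation of the same call can also surface which group files the answer was drawn from:

result = chain.run(input_documents=relevant_docs, question=query)
print(result)
for doc in relevant_docs:
    print(" [source]", doc.metadata.get("source"))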

ATT&Chatter: Your Own MITRE Assistant (Interactive)

In [21]:
import ipywidgets as widgets
from ipywidgets import interact_manual, Layout

text_layout = Layout(
    width='80%',  # Set the width to 80% of the container
    height='50px',  # Set the height
)

retriever = db.as_retriever(search_kwargs={"k":3})

def execute_query(query):
    print(f"Your query: {query}")
    print("[+] Getting relevant documents for query..")
    relevant_docs = retriever.get_relevant_documents(query)
    
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import OpenAI
    chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
    result = chain.run(input_documents=relevant_docs, question=query)
    print(result)

interact_manual(execute_query, query=widgets.Text(value='', placeholder='Type your query here', description='Query:', layout=text_layout));

ATT&Chatter with Memory

In [99]:
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts.prompt import PromptTemplate
import json

# Initialize your Langchain model
model = ChatOpenAI(model_name="gpt-4", temperature=0.3)

# Initialize the retriever from the FAISS vector store ('db') built earlier
retriever = db.as_retriever(search_kwargs={"k": 8})

# Define your custom template
custom_template = """You are an AI assistant specialized in MITRE ATT&CK interacting with a threat analyst. Answer the follow-up question. If you do not know the answer, reply with 'I am sorry'.
Chat History:
{chat_history}
Follow Up Input: {question}
Answer: """
CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)

# Initialize memory for chat history
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Initialize the ConversationalRetrievalChain
qa_chain = ConversationalRetrievalChain.from_llm(model, retriever, condense_question_prompt=CUSTOM_QUESTION_PROMPT, memory=memory)

def execute_conversation(question):
    # Load conversational history from file
    try:
        with open('conversational_history.json', 'r') as f:
            conversational_history = json.load(f)
    except FileNotFoundError:
        conversational_history = []
    
    # Update conversational history with the user's question
    conversational_history.append(("User", question))
    
    # Use the ConversationalRetrievalChain to get the answer
    result = qa_chain({"question": question})
    
    # Extract the 'answer' part from the result
    response_text = result.get('answer', 'Sorry, I could not generate a response.')
    
    # Update conversational history with the bot's response
    conversational_history.append(("Bot", response_text))
    
    # Limit the history to the last 10 turns
    if len(conversational_history) > 10:
        conversational_history = conversational_history[-10:]
    
    # Save conversational history to file
    with open('conversational_history.json', 'w') as f:
        json.dump(conversational_history, f)
    
    # Print only the last message in the conversational history
    last_message = conversational_history[-1]
    print(f"Discussion:\n{last_message[0]}: {last_message[1]}")
In [100]:
# Call the function with a question
execute_conversation("Who is Lazarus?")
Discussion:
Bot: Lazarus Group is a North Korean state-sponsored cyber threat group that has been active since at least 2009. It has been attributed to the Reconnaissance General Bureau of North Korea. The group is known for various cyber campaigns and attacks, including the destructive wiper attack against Sony Pictures Entertainment in November 2014. It uses various techniques such as custom hashing methods, shellcode, and spearphishing, among others. It is also known by other aliases such as Labyrinth Chollima, HIDDEN COBRA, Guardians of Peace, ZINC, and NICKEL ACADEMY.
In [101]:
execute_conversation("List all the techniques used by this group")
Discussion:
Bot: The Lazarus Group uses a variety of techniques in their cyber operations. Some of these include:

1. Social Engineering: They create new Twitter accounts and use platforms like LinkedIn to conduct social engineering against potential victims.

2. Spearphishing: They send spearphishing messages via social media and email, often tailoring their efforts to specific departments or individuals within a targeted organization.

3. Server Compromise: They have been known to compromise servers to stage malicious tools.

4. Use of Tools: They obtain a variety of tools for their operations, including Responder and PuTTy PSCP.

5. Email Operations: They create new email accounts for spearphishing operations and collect email addresses belonging to various departments of a targeted organization for use in phishing campaigns.

6. Malware Execution: They use methods like rundll32 to execute malicious payloads on a compromised host.

7. Code Signing: They digitally sign malware and utilities to evade detection.

8. Network Connections Discovery: They use tools like nmap to scan ports on systems within the restricted segment of an enterprise network.

9. Use of Macros: They use VBA and embedded macros in Word documents to execute malicious code.

10. PowerShell: They use PowerShell to execute commands and malicious code.

11. Internal Proxy: They use a compromised router to serve as a proxy between a victim network's corporate and restricted segments.

12. SSH: They use SSH and the PuTTy PSCP utility to gain access to a restricted segment of a compromised network.

This list is not exhaustive and the Lazarus Group's techniques can evolve over time.
In [102]:
execute_conversation("Tell me more about the third point you mentioned")
Discussion:
Bot: The Lazarus Group, a cybercrime group, has been known to compromise servers to stage their malicious tools. This means they gain unauthorized access to servers and use them to store and launch their malicious software. This is a significant threat as it can lead to widespread damage and data loss.