In the last post we analyzed a technique for launching amplified DDoS attacks using Quake 3 servers by spoofing UDP requests for game status information. In this post I show potential mitigations, how to detect this kind of attack at the network level, and how to automatically parse the attack traffic and generate firewall rules.
If we search a bit for Quake 3 servers being used to carry out DDoS attacks, we will find that this kind of attack has been known for years and that, in fact, not only Quake 3 is prone to it: other games based on the Quake 3 engine (such as Call of Duty) are as well.
I decided to dig into the ioq3 server code, grep in hand, to see whether there is any kind of mitigation for this type of attack:
[listing omitted]
It seems the ioq3 developers have integrated some mitigation mechanisms against DDoS attacks, both when the Q3 server is being used as an amplifier and when it is directly flooded with traffic, so let's take a deeper look into those mechanisms:
[listing omitted]
When an IP address sends a “getstatus” command, some checks are performed before the command is allowed through; the “SVC_BucketForAddress( from, burst, period )” call looks up the data associated with the sender's IP address:
[listing omitted]
Now ioq3 checks whether the sender's IP address has exceeded the established rate limit, which is 10 commands per one-second period (remember the previous call “if ( SVC_RateLimitAddress( from, 10, 1000 ) )”):
[listing omitted]
As seen, the ioq3 server implements some mitigation techniques to avoid servers being used as amplifiers but, because they are based on rate limits, an attacker could circumvent them by sending requests at lower rates so the amplifiers never filter them out. A better approach to this type of attack would be a challenge-response mechanism in the game protocol, so that large answers are only sent to requests carrying a valid challenge token. With such a protection in place, an attacker should not be able to obtain a token via a spoofed request and reuse it in a spoofed “getstatus” query, nor predict a valid token to skip the token-request phase, nor replay previously used tokens. I will probably write a more detailed post about this, and other things I found while doing this research, in the future.
In a similar way to the mitigation techniques implemented in the ioq3 server, we could set up an iptables rate-limiting policy to automatically drop traffic from spoofed IP addresses (the victim or victims) at layer 3 and avoid wasting resources processing it.
I have simply ripped these iptables rules off from here, so credit goes to RawShark:
[listing omitted]
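The original rules are not reproduced above; as a rough sketch of the same idea (these are not RawShark's exact rules - the port and the 10 requests/second budget are assumptions based on the ioq3 defaults discussed earlier), a hashlimit-based version could look like this:

```bash
# Dedicated chain for Quake 3 "getstatus" traffic (sketch, not the original RawShark rules)
iptables -N Q3GETSTATUS
# Send UDP datagrams to the game port containing "getstatus" through that chain
iptables -A INPUT -p udp --dport 27960 -m string --algo bm --string "getstatus" -j Q3GETSTATUS
# Allow up to 10 requests per second per source IP, drop everything above that
iptables -A Q3GETSTATUS -m hashlimit --hashlimit-upto 10/sec --hashlimit-mode srcip \
         --hashlimit-name q3getstatus -j ACCEPT
iptables -A Q3GETSTATUS -j DROP
```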
It only remains to say that these iptables rules filter just the “getstatus” command; remember to cover the other commands as well.
Once we know the ins and outs of this type of DDoS attack, have analyzed the traffic it generates and have read the tool's code, we are closer to being able to spot this kind of flood and try to mitigate it. Keep in mind that the lower the TCP/IP layer used to detect anomalous traffic patterns, the lower the resource usage: it is much easier to stop a datagram at the network layer - maybe based on the IP addresses of known Quake 3 servers ;) - than to go up to the application layer and stop it based on its payload, and when dealing with attacks of dozens or hundreds of Gbps the difference is crucial.
tshark is the terminal-based version of Wireshark for quick and powerful packet capture/analysis; it is really useful in network forensics because we can use Wireshark's display filters, which cover a lot of protocols.
Also, if you are interested in tshark/network analysis, I highly recommend the ebook “Instant Traffic Analysis with Tshark How-to” by Borja Merino; it offers a quick and really useful set of recipes for analyzing traffic with tshark and is totally worth it.
For example, let's tell tshark to show Quake 3 datagrams (using the quake3 dissector) with a UDP length of 22 bytes (we could set more specific options):
[listing omitted]
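The original command is not shown above; a minimal equivalent would be something along these lines (the capture file name is a placeholder, and older tshark releases take the display filter with -R instead of -Y):

```bash
tshark -r ddos_capture.pcap -Y "quake3 && udp.length == 22"
```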
By default tshark prints each packet as “frame number; relative time; source IP; destination IP; dissected protocol; frame size (bytes); dissected protocol info”, as shown above, but that is not well suited to automated processing, so let's tell tshark to output formatted CSV instead:
[listing omitted]
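A field-based invocation similar to the omitted one could be the following (again, the file name and the chosen fields are illustrative):

```bash
tshark -r ddos_capture.pcap -Y "quake3 && udp.length == 22" \
       -T fields -E header=y -E separator=, \
       -e frame.number -e frame.time_relative -e ip.src -e ip.dst -e udp.srcport -e udp.dstport
```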
Now we could write a script that consumes the tshark output and deploys firewall rules in almost real time or, maybe, produces some pretty statistics for the unavoidable report once the attack has finished or been mitigated.
You are probably asking yourself why I specified a UDP length of 22 bytes, so take a look at the structure of a UDP datagram:
As we saw when analyzing the udp.c code and its mistakes, the UDP header is 8 bytes plus any payload; since in this case the payload is 14 bytes (“….disconnect”), that adds up to 22 bytes for the triggered “disconnect” response (the response provoked by the badly set UDP length in the original udp.c code). This filter is therefore useful against this specific, badly coded version of the attackers' tool, but it should be improved and/or adapted for other versions of the script, or for a properly executed spoofed attack in which Quake 3 servers answer with server info instead of a “disconnect” command.
Last but not least, tshark also allows us to use Wireshark's “contains” and “matches” filters to show only those packets containing a specific pattern:
[listing omitted]
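For instance, matching the “disconnect” marker anywhere in the UDP payload, regardless of the port in use (file name is a placeholder):

```bash
tshark -r ddos_capture.pcap -Y 'udp contains "disconnect"'
```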
Comparing these results against the previous ones, we can see more amplifier servers because we are no longer relying on the UDP source port but on the UDP payload content to detect them.
ngrep is a network troubleshooting tool that allows us to analyze previously captured traffic in a pcap file, or a live sniffing session, in a similar way to the Unix “grep” tool; its primary goal is to match and display plaintext protocols like HTTP or SMTP.
In this case we are going to “grep” for the “….disconnect” string, telling ngrep not to print hash marks (-q):
[listing omitted]
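A minimal sketch of such an ngrep invocation, reading from a capture file (the file name is a placeholder):

```bash
# -q: quiet, no '#' hash marks; -I: read packets from a pcap file; last argument is a bpf filter
ngrep -q -I ddos_capture.pcap "disconnect" 'udp'
```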
While this format is easy for a human to read, we would need to parse it before doing any kind of filtering. For example, we could parse this output to show just the Quake 3 amplifier servers being used in the attack and then generate some kind of firewall rule:
[listing omitted]
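One way to do that parsing, assuming ngrep's usual header line format (“U A.B.C.D:PORT -> W.X.Y.Z:PORT”), could be:

```bash
ngrep -q -I ddos_capture.pcap "disconnect" 'udp' \
  | awk '$1 == "U" { split($2, src, ":"); print src[1] }' \
  | sort -u > amplifiers.txt
```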
We could go a step further and, beyond just parsing the ngrep output, create a set of DROP rules for a Linux router with iptables:
[listing omitted]
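Building on the amplifier list produced above, generating the DROP rules could be as simple as:

```bash
while read ip; do
  echo "iptables -A INPUT -s ${ip} -p udp -j DROP"
done < amplifiers.txt > drop_amplifiers.sh
```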
OK, analyzing network traffic and spotting attack patterns is fun, but analyzing traffic for a previously spotted pattern and automatically blocking the attacking IP addresses at perimeter routers is far better, so I'm going to explain how to write such an easy but powerful script in a few lines of Python.
We are going to need scapy again, as well as the Exscript module to interact with Cisco routers. Then we just need to analyze the UDP datagrams and look for “….disconnect” or “….statusResponse” in the payload to list the Quake 3 servers being used as amplifiers; once that is done, it only remains to create access-list entries for those IP addresses.
Here is an example of this process:
[listing omitted]
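The original script is not reproduced above. As a rough sketch of the same idea - read the capture with scapy, collect the source addresses of datagrams whose payload contains the attack markers and emit Cisco access-list entries - something like the following would do (the original post pushes the commands to the router with Exscript; here they are just printed, and the file name and ACL number are made up):

```python
#!/usr/bin/env python
from scapy.all import rdpcap, IP, UDP, Raw

PATTERNS = (b"disconnect", b"statusResponse")   # payload markers of the amplified responses
amplifiers = set()

# rdpcap loads the whole file into memory; fine for a pre-filtered capture
for pkt in rdpcap("ddos_capture.pcap"):
    if IP in pkt and UDP in pkt and Raw in pkt:
        payload = pkt[Raw].load
        if any(marker in payload for marker in PATTERNS):
            amplifiers.add(pkt[IP].src)

# one extended ACL entry per amplifier (ACL number 120 is arbitrary)
for ip in sorted(amplifiers):
    print("access-list 120 deny udp host %s any" % ip)
print("access-list 120 permit ip any any")
```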
Time to execute it and wait; it's going to take its time when processing a real DDoS capture (millions of packets), so it's highly recommended to pre-filter with tshark and to adapt this script to use multiple CPUs (or rewrite it in C):
[listing omitted]
Now we connect to our router and check that everything went OK:
[listing omitted]
Well, it seems the access-list was created correctly; attack traffic should start being dropped (at the filtering router) within seconds. Time to figure out the next attack vector that will be tried, because attackers will surely move on to another technique.
If we have a Snort sensor or an IPS, we could create specific rules based on the detected attack pattern to protect against the analyzed DDoS technique. That said, going up to the application layer is highly discouraged when mitigating a real DDoS attack: it requires more CPU and RAM to process each packet, not only because more layers have to be unwrapped, but also because “simple” filtering actions such as matching on IP addresses and/or ports are performed in packet-forwarding hardware (ASICs), which will always beat the CPU-based filtering done in the vast majority of appliances.
For testing purposes I have used a Security Onion virtual machine with Snort and Snorby running, for capturing and visualizing alerts respectively. To create a Snort rule that detects an inbound DDoS amplification attack using Quake 3 servers we are going to look for “….disconnect” in the UDP payload (again, this works only for the analyzed script and should be extended to the other cases already discussed); now it's time to read “Writing Snort Rules”:
[listing omitted]
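The rule itself is omitted above; a minimal rule along the lines described - the message, sid and source port (the default Quake 3 port) are my own choices - could be:

```
alert udp $EXTERNAL_NET 27960 -> $HOME_NET any (msg:"Quake 3 getstatus amplification - disconnect response"; content:"|FF FF FF FF|disconnect"; classtype:attempted-dos; sid:1000001; rev:1;)
```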
If we create this rule and use scapy as shown before to send a UDP datagram with this pattern, an alert is triggered and a new event shows up in our Snorby interface:
If we look closely at the previous image we can see the IP ToS, TTL, UDP length and payload analyzed earlier, so it seems our pattern works fine (although it should be improved).
After spending some weeks researching this attack vector - using game servers as amplifiers - I'm convinced it can be a really powerful way to launch storms of spoofed UDP datagrams at almost no cost or effort. It's really easy to get an almost real-time list of online servers without any port scanning, just by parsing online gaming directories; to make matters worse, amplification factors can reach several dozen times the original throughput and, because this kind of attack is less well known, IT staff are less aware of it and less prepared to face such techniques.
The fact that this kind of attack is being actively used in DDoS-as-a-service platforms, launched from several web booters, makes it important to know it, how to detect it and how to defend against it, so stay alert and see you in the next post!
DDoS attacks based on DNS amplification have lately been growing in popularity, especially after the attack on Spamhaus. While that kind of attack is becoming more and more common in DDoS scenarios, there are other, less common DDoS techniques in use that should be known before being hit by them. In this post I want to introduce amplification attacks using the Quake 3 network protocol - UDP based - as well as how to analyze them in several ways to really understand them in depth, find a pattern and create a fingerprint to try to mitigate them.
This kind of DDoS is very similar to a DNS amplification attack: an attacker sends thousands of UDP datagrams pretending to be a legitimate Quake 3 client asking for the game status, with the source IP address spoofed to the address to be flooded; the queried Quake 3 servers then answer with the game status - including some server configuration options and the user list - to the spoofed source IP address, flooding it with thousands of unsolicited UDP datagrams.
I have made a basic drawing to illustrate it:
As shown, the amplifier servers - the Quake 3 ones - flood the victim with an aggregate throughput much higher than the one used by the attacker (hence the term “amplifier”); let's see the traffic generated if we make this “getstatus” request:
[listing omitted]
If we calculate the amplification ratio, we find that sending a UDP datagram of 56 bytes triggers a response of 1373 bytes, an amplification ratio of about x24.5 - not bad at all.
We are going to need a Quake 3 server to be used as the “amplifier” against our victim when doing some local tests, so we need an original copy of Quake 3 and to compile ioquake3, an open source Quake 3 engine based on the id Software source code (publicly released in 2005).
[listing omitted]
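The build commands are omitted above; roughly, it boils down to cloning the ioquake3 repository and running make (the repository URL is the current one; the original post may have used a release tarball):

```bash
git clone https://github.com/ioquake/ioq3.git
cd ioq3
make
sudo make install   # optional
```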
Now we need to copy the original Quake 3 pak files (the ones with models and textures) from our CD-ROM to our hard drive before we can run an ioq3 server:
[listing omitted]
Our ioq3 server is ready to make some frags! We can check it with the quake3-info.nse script:
[listing omitted]
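The omitted check boils down to an nmap UDP scan of the game port with that script, something like the following (the target address is a placeholder):

```bash
nmap -sU -p 27960 --script quake3-info 192.168.1.10
```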
The scripts commonly used for this kind of attack have been leaked repeatedly, so I'm not going to hide it (those who run DDoS attacks already have it, or more powerful attack vectors); here is a C Quake 3 amplification flooder built on top of a generic UDP flooder.
The way the UDP datagrams are assembled is interesting, so let's analyze it (thanks to NighterMan for helping me with my rusty C); I have made some comments below about the mistakes I found, particularly the networking ones (the tool doesn't even work right, as it fails to trigger the amplified response):
[listing omitted]
There isn't much more to say about this, other than to highlight that the use of htonl() instead of htons() means we cannot rely on a fixed IP ID value when building an attack fingerprint. I also found the chosen IP ToS value interesting, meant to mess with congestion detection and avoidance mechanisms, but it contrasts with mistakes that make it easier to spot the datagrams generated by this tool and, even more unbelievable, an incorrect UDP length that turns the amplification attack into a plain spoofed UDP flood (probably the result of copy-pasting from here and there); it seems some guys need to read a bit about network protocols before playing with DDoS tools…
Probably the quickest and easiest way to craft packets for network tests is Scapy, a Python tool to create and manipulate network packets that can be used from its own interactive shell or simply as a Python package.
Below is an example of crafting this kind of attack with scapy, without spoofing the IP address (we want to see the answer) and with correct UDP length and checksum values (scapy automagically computes fields like length and checksum before sending any packet):
[listing omitted]
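The original scapy listing is omitted; a minimal reconstruction of a non-spoofed “getstatus” probe could look like this (the server address is a placeholder):

```python
#!/usr/bin/env python
from scapy.all import IP, UDP, Raw, sr1

server = "192.0.2.10"   # placeholder Quake 3 server address
probe = (IP(dst=server) /
         UDP(sport=27960, dport=27960) /            # scapy fills in UDP length and checksum
         Raw(load=b"\xff\xff\xff\xffgetstatus\n"))  # connectionless Q3 packets start with 0xFFFFFFFF

answer = sr1(probe, timeout=2, verbose=0)
if answer and Raw in answer:
    print("%d bytes received" % len(answer))
    print(answer[Raw].load[:120])
```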
If you have never used scapy before, take a look at this scapy guide by Adam Maxwell; it's really useful as a first-steps guide.
OK, so we have read the flooder code and the referenced RFC sections (because we did, right? ;) ); time to sniff some attack traffic, analyze it with tshark/Wireshark and observe the behaviour pattern described.
First, I compiled and ran the “dns.c” flooder without any modification; this is a Wireshark screenshot while analyzing it:
I have marked in red important aspects such as the ToS/DS field, the UDP length/checksum and the “direction” (for the Quake 3 protocol). As shown, Wireshark's Quake 3 dissector itself flags it as a malformed packet; because the UDP length is 8, the application layer receives an empty payload, which the Q3 server treats as a client with connectivity problems, so it sends a “disconnect” message:
If we compare the request size against the response, there is no amplification factor at all; this program would only be useful for provoking Q3 servers into sending unsolicited traffic to a third host - the attacked one - with the intention of splitting the originating AS path of the attack or something similar.
I have sent one “getstatus” request forged with scapy to a public Quake 3 server (for obvious reasons I have changed some of the response content):
The server now correctly decodes the UDP payload, processes the “getstatus” command and answers with the server status, including several server options and config values as well as statistics (a response of 1373 bytes for a 56-byte request).
So far we have seen how this attack works, as well as the (badly coded) programs being used in the wild to launch DDoS attacks from web panels (the so-called DDoS booters). We have also seen how to replay the amplification attack with Scapy and analyzed a bit of this network traffic with Wireshark.
In the next post we are going to see how to mitigate this kind of attack on the Quake 3 server side - at the application and network layers - and also on the side of the victim being flooded. We are also going to analyze it more deeply with tshark to find potential ways to spot this attack and try to block it.
See you in the next post!
First, we need to erase the indexed data and, optionally, the user data too; for this, Moloch includes a Perl script for managing the database:
[listing omitted]
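The exact commands are omitted above; from memory they look roughly like this (the script path assumes an easybutton install and the command names should be double-checked against db.pl's own usage output):

```bash
# erase everything, including the users index
/data/moloch/db/db.pl localhost:9200 init
# erase session data but leave the users index untouched
/data/moloch/db/db.pl localhost:9200 wipe
```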
So, if we want to restore the database state - users included - we have to do the following:
[listing omitted]
Now Elasticsearch only has the basic schema (with the users database restored); to learn more about what db.pl has done, take a look at its source code:
[listing omitted]
Once this is done, it only remains to remove the pcap data from the “raw” directory:
[listing omitted]
That's all folks, enjoy your freshly baked Moloch!
As we saw in the last post, it's really easy to detect the language of a text by analyzing stopwords. Another way to detect language - including when syntax rules are not being followed - is N-gram-based text categorization (also useful for identifying the topic of a text, not just its language), as William B. Cavnar and John M. Trenkle described in 1994, so I decided to mess around a bit and wrote ngrambased-textcategorizer in Python as a proof of concept.
To perform N-gram-based text categorization we need to compute the N-grams (with N=1 to 5) of each word - apostrophes included - found in the text, doing something like this (for the word “TEXT”):
Once every N-gram has been computed we keep just the top 300 - Cavnar and Trenkle observed that this range works well for language detection, with subject categorization starting at around rank 300 - and save them as a “text category profile”. To categorize a text we just repeat the same steps and calculate the “out-of-place” measure against the pre-computed profiles:
Then we choose the nearest one among them - the one with the lowest distance.
The procedure for creating a text category profile is well explained in section “3.1 Generating N-Gram Frequency Profiles” of the paper, and it's really easy to implement in Python with the help of the powerful NLTK toolkit.
We need to tokenize the text into strings made up only of letters and apostrophes, so we can use NLTK's RegexpTokenizer for this:
[listing omitted]
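The omitted snippet is essentially NLTK's RegexpTokenizer configured to keep only letters and apostrophes; a sketch (the sample sentence is just an illustration):

```python
from nltk.tokenize import RegexpTokenizer

# keep only runs of letters and apostrophes, dropping digits and punctuation
tokenizer = RegexpTokenizer(r"[a-zA-Z']+")
tokens = tokenizer.tokenize("It's a proof of concept, nothing more")
# ["It's", 'a', 'proof', 'of', 'concept', 'nothing', 'more']
```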
It's just a proof of concept and should be tuned, but it will be enough for now.
Now it's time to generate the N-grams (with N=1 to 5) using blanks as padding; again, NLTK has a function for this:
[listing omitted]
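A sketch of that step using nltk.util.ngrams, padding the word with blanks and joining each tuple back into a string (the helper name is my own, and the padding is done by hand here to stay compatible across NLTK versions):

```python
from nltk.util import ngrams

def word_ngrams(word, max_n=5):
    grams = []
    padded = " %s " % word                 # blank padding around the word
    for n in range(1, max_n + 1):
        for gram in ngrams(padded, n):     # ngrams() yields tuples of characters
            grams.append("".join(gram))    # join them back into the n-gram string
    return grams

print(word_ngrams("TEXT"))
```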
The ngrams function returns tuples, so we need to join each tuple's elements to get the N-gram string itself:
[listing omitted]
The easiest way to count occurrences is with a Python dictionary, incrementing the count when an N-gram has been seen before or creating a new key otherwise:
[listing omitted]
Finally, we need to sort the dictionary in reverse order of N-gram occurrences and keep just the top 300 most repeated N-grams. Python dicts can't be sorted, so we transform the dictionary into a sorted list, which we can easily do with the operator module:
[listing omitted]
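Reusing the word_ngrams helper from the previous sketch, the counting and sorting steps could look like this:

```python
import operator

# count occurrences of each n-gram
ngrams_statistics = {}
for gram in word_ngrams("TEXT"):
    ngrams_statistics[gram] = ngrams_statistics.get(gram, 0) + 1

# sort by occurrences, highest first, and keep the 300 most frequent n-grams
ngrams_statistics_sorted = sorted(ngrams_statistics.items(),
                                  key=operator.itemgetter(1),
                                  reverse=True)[:300]
```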
Now it only remains to save “ngrams_statistics_sorted” to a file as a “text category profile”, or to keep just the N-grams without their occurrence counts when comparing them against other profiles.
To categorize a text we first need to load the pre-computed categories into a list/dict or something similar and then walk through it, calculating the distance to each previously computed profile:
[listing omitted]
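A sketch of the out-of-place measure, assuming each profile is simply a ranked list of N-grams (most frequent first):

```python
def out_of_place_distance(document_profile, category_profile):
    # Cavnar & Trenkle out-of-place measure between two ranked n-gram lists
    max_penalty = len(category_profile)
    distance = 0
    for rank, gram in enumerate(document_profile):
        if gram in category_profile:
            distance += abs(rank - category_profile.index(gram))
        else:
            distance += max_penalty   # n-gram missing from the category profile
    return distance
```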
N-gram-based text categorization is probably not the state of the art - the paper is nearly two decades old and the technique is simple compared with newer ways of categorizing text - but it can be useful in some situations and as a basis to build upon and, what the heck, I learned by doing it and had a great time, so it was totally worth it for me ;)
See you soon!
Most of us are used to Internet search engines and social networks being able to show only data in a certain language, for example, only results written in Spanish or English. To achieve that, the indexed text must have been analyzed beforehand to “guess” the language and store it alongside the text.
There are several ways to do that; probably the easiest is a stopword-based approach. The term “stopword” is used in natural language processing to refer to words that should be filtered out of a text before any kind of processing, commonly because they are of little or no use at all when analyzing it.
OK, so we have a text whose language we want to detect based on the stopwords it uses. The first step is to “tokenize” it - convert the given text into a list of “words” or “tokens” - using one approach or another depending on our requirements: should we keep contractions or split them? Do we need punctuation or do we want to split it off? And so on.
In this case we are going to split all punctuation into separate tokens:
[listing omitted]
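The omitted snippet boils down to NLTK's wordpunct_tokenize; a sketch with an illustrative sentence (not necessarily the exact quote used in the original post):

```python
from nltk.tokenize import wordpunct_tokenize

text = "Say 'what' again. I dare you, I double dare you!"
tokens = wordpunct_tokenize(text)
# ['Say', "'", 'what', "'", 'again', '.', 'I', 'dare', 'you', ',', 'I', 'double', 'dare', 'you', '!']
```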
As shown, the famous quote from Mr. Wolf has been split and now we have “clean” words to match against the stopword lists.
At this point we need stopwords for several languages, and this is where NLTK comes in handy:
[listing omitted]
Now we need to compute each language's likelihood depending on which stopwords are used:
[listing omitted]
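A sketch of that computation (the function name is my own; “language_ratios” is the dictionary mentioned in the post):

```python
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

def compute_language_ratios(text):
    tokens = [token.lower() for token in wordpunct_tokenize(text)]
    words = set(tokens)
    language_ratios = {}
    # stopwords.fileids() lists every language shipped with the NLTK stopwords corpus
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        # score = number of distinct stopwords of this language present in the text
        language_ratios[language] = len(words & stopwords_set)
    return language_ratios
```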
First we tokenize using the wordpunct_tokenize function and lowercase all the resulting tokens; then we walk through the languages included in NLTK and count how many unique stopwords of each language appear in the analyzed text, storing the counts in the “language_ratios” dictionary.
Finally, we just have to get the key with the biggest value:
[listing omitted]
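Which is a one-liner over the dictionary built above:

```python
language_ratios = compute_language_ratios("Thirty minutes away, I'll be there in ten")
most_rated_language = max(language_ratios, key=language_ratios.get)
```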
So yes, it seems this approach works fine with well-written texts - those that respect grammatical rules - (and not too short ones) and is really easy to implement.
If we put everything explained above into a script, we get something like this:
[listing omitted]
There are other ways to “guess” the language of a given text, such as N-gram-based text categorization, which we will probably see in the next post.
See you soon and, as always, I hope you find it interesting and useful!
These days I'm messing around with an application that indexes thousands of documents per day and performs hundreds of queries per hour, so query performance is crucial. The main aim is to detect URLs and IP addresses (want to play a bit? take a look at a previous post), but full-text search capabilities are also desired, although less used, so I have tried to improve performance - specifically query times - and here are my test results.
Currently the core's schema.xml looks something like this:
[listing omitted]
As can be seen, it only indexes and stores the given text, URL and hash (used to avoid dupes), lowercasing the text and tokenizing on whitespace. This means that a document with the content “SPAM Visit blog.alejandronolla.com” will be tokenized to “['spam', 'visit', 'blog.alejandronolla.com']”, so if we want to search for documents mentioning any subdomain of alejandronolla.com we would have to search for something like “text:*alejandronolla.com” (this could vary depending on decisions such as looking for domains like alejandronolla.com.PHISHINGSITE.com or just whatever.alejandronolla.com).
This kind of query, with leading/trailing wildcards, is really expensive for Solr because it cannot rely on the indexed tokens alone but has to walk through them matching up to “n” more characters.
When dealing with a lot of documents concurrently you are probably going to face heap space problems sooner or later, so I strongly recommend increasing the RAM assigned to the Java virtual machine.
In this case I use Tomcat to serve Solr, so I needed to modify JAVA_OPTS in catalina.sh (located at “/usr/share/tomcat7/bin/catalina.sh”):
[listing omitted]
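The omitted change amounts to a line like the following near the top of catalina.sh (exact placement may vary):

```bash
JAVA_OPTS="$JAVA_OPTS -Xms2048m -Xmx16384m"
```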
By adding “-Xms2048m -Xmx16384m” we tell Tomcat to preallocate at least 2048MB and use at most 16384MB of heap space, avoiding heap space problems (in my tests I used almost 2GB indexing about 300k documents in two different cores, so there is plenty of RAM left):
We also have to set some configuration in “/etc/tomcat6/server.xml”:
[listing omitted]
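The relevant change is raising maxThreads on the HTTP connector; the other attributes below are just the usual Tomcat defaults:

```xml
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="10000"
           connectionTimeout="20000"
           redirectPort="8443" />
```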
I have set maxThreads to 10000 because I want to index documents through the REST API with a Python script using async HTTP requests, to avoid losing too much time indexing data (and I'm almost sure the bottleneck here is Python and not Solr).
As previously said, most of the queries look for domains and IP addresses across the full document content, which results in really heavy queries (and performance problems), so the first action I took was to create new fields containing just the domain-looking strings and the IP addresses, tying queries down to the potentially valuable info.
To extract domains, emails and similar strings, Solr already has a really powerful tokenizer called solr.UAX29URLEmailTokenizerFactory, so we only need to tell Solr to index the document text with this tokenizer in another field.
To tell Solr which field we want to copy and where, we have to create two new fields and specify the source and destination fields:
[listing omitted]
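The omitted schema.xml additions would look roughly like this (field and type names follow the ones used later in the post; the analyzer details are a sketch, not the exact original config):

```xml
<!-- fields used only for searching: indexed but not stored -->
<field name="text_UAX29"   type="text_uax29"   indexed="true" stored="false"/>
<field name="ip_addresses" type="ip_addresses" indexed="true" stored="false"/>

<!-- copy the raw document text into both of them at index time -->
<copyField source="text" dest="text_UAX29"/>
<copyField source="text" dest="ip_addresses"/>

<!-- field type backed by the URL/email aware UAX29 tokenizer -->
<fieldType name="text_uax29" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```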
We are going to use these fields only for searching, so we index them but do not store them (we already have the full document content in the “text” field). It's important to keep in mind that Solr copies fields before doing any kind of processing on the document.
As you may have noticed, we specified an as-yet-undeclared field type called “ip_addresses”; we are going to use solr.PatternTokenizerFactory with a regex that extracts IP addresses and CIDR network ranges (like 192.168.1.0/16):
[listing omitted]
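A sketch of that field type (the regex matches the simple dotted-quad / CIDR pattern described below):

```xml
<fieldType name="ip_addresses" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- group="0" emits the whole match (an IPv4 address or CIDR range) as a token -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(/\d{1,2})?"
               group="0"/>
  </analyzer>
</fieldType>
```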
It's a really simple regex and should be improved before using it in a production environment, for example to extract only certain IP addresses (not RFC1918, not bogons, with each octet validated, and so on), or you could even implement your own tokenizer by extending the existing ones, but it will do for our tests.
Now we can change queries from “text:*alejandronolla.com” to “text_UAX29:*alejandronolla.com” so they walk a much smaller subset of data, improving query times in a huge way.
I totally forgot to mention filtering out all tokens that are not an email or a URL after tokenizing with the UAX29 specification, so that only emails and URLs are kept. To do this we need to add a token filter to the “text_UAX29” fieldType:
[listing omitted]
In the “allowedtypes.txt” file we need to put <EMAIL> and <URL> (one per line) as the allowed token types, and we should change the IP address tokenizer with a small hack so it returns only IP addresses, or extend TokenFilterFactory to filter after the tokenizing process.
Really sorry, and apologies for any inconvenience.
Solr is a really powerful full-text search engine and, as such, it can perform several kinds of analysis on indexed data automatically. Obviously those analyses consume resources, so we are wasting CPU cycles and RAM if we are not going to use them.
One of these features is Solr's ability to boost some query results over others based on a certain “weight”. For example, two documents mentioning the keyword “solr” just once - one only a few words long and the other several thousand - will have different relevances for the Solr engine, the smaller one being more important. This is because of the term frequency-inverse document frequency (usually referred to as tf-idf) statistic: if the same keyword appears the same number of times, it represents a bigger percentage of the entire document in the smaller one.
Because we are not going to use this feature, we can disable it and save some resources by modifying the schema.xml file:
[listing omitted]
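The change boils down to adding omitNorms to the field definitions, for example (the field type name is whatever the schema already uses):

```xml
<field name="text" type="text_general" indexed="true" stored="true" omitNorms="true"/>
```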
By setting “omitNorms” to “true” we tell Solr not to care about length normalization or index-time boosting; you can check the wiki for more information.
Another feature we don't need right now is Solr's ability to find documents similar to a given one (a feature called MoreLikeThis). There are several approaches to this, such as comparing tf-idf values or, more accurately, representing each document as a vector (vector space model) and finding the nearest ones (Solr mixes both).
Because we are not going to use this feature, we can turn it off by specifying the following field options:
[listing omitted]
I have disabled them with the options “termVectors="false" termPositions="false" termOffsets="false"” and gained some performance.
If you want to know which field options to use based on your application's aims, take a look at the official wiki:
In natural language processing the term “stopwords” refers to words that should be removed before processing a text because of their uselessness. For example, when indexing a document with content like “Visit me at blog.alejandronolla.com” we don't care about the personal pronoun “me” or the preposition “at” (take a look at part-of-speech tagging); fewer indexed words means fewer resources used.
To avoid processing those words we need to tell Solr where the stopwords are located:
[listing omitted]
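The omitted analyzer change is the addition of a StopFilterFactory; a sketch consistent with the whitespace/lowercase analysis used elsewhere in the schema:

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- drop any token found in conf/stopwords.txt -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```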
We need a file called “stopwords.txt” containing these words in the “conf” directory of the specified core; stopword lists for several languages can be found in the example configuration shipped with the Solr package at “/PATH/TO/SOLR/CORE/conf/lang”:
[listing omitted]
Of course, we can also include as stopwords common words that don't give us any useful information, like dog, bread, ROI, APT and so on…
Although I haven't used stemming in Solr environments yet, it is possible to convert a given word to its morphological root through a stemming process:
[listing omitted]
Because we “reduce” words to their root, several words per document will probably end up sharing a stem, which results in a smaller index and another performance boost.
Depending on your application's data and workflow it can be really useful to cache the “n” most common queries/filters/documents and avoid running the same query over and over a few minutes apart; I'm sorry, but I haven't played around with this much, so to read more about it go to the wiki.
After applying the first two improvements I did some performance tests and comparisons; here are some figures for a “small” subset of about 300k documents:
|  | Original schema | Modified schema |
| --- | --- | --- |
| Indexing time | 95 minutes | 101 minutes |
| Index size | 555.12 MB | 789.8 MB |
| Field being queried | text | text_UAX29 |
| Worst query scenario | 84766 milliseconds | 52417 milliseconds |
| Worst query improvement | – | 38.2% faster |
As shown in the table above, the “worst” query I'm now performing (dozens of logical operators and wildcards) takes about 38% less time per hit and, in an application that performs hundreds of queries per hour, that is a great improvement without disrupting normal operation (looking for domains and IP addresses); on the other hand, indexing takes barely any longer and the index size increase is more than reasonable, so it's worth it.
I hope you liked it and can apply some of it to your needs; see you soon!
As its own website says: “Moloch is an open source, large scale IPv4 packet capturing (PCAP), indexing and database system. A simple web interface is provided for PCAP browsing, searching, and exporting. APIs are exposed that allow PCAP data and JSON-formatted session data to be downloaded directly.” It is very useful as a network forensics tool for analyzing captured traffic (Moloch can also index previously captured pcap files, as we will see) in case of a security incident or when investigating suspicious behaviour flagged, for example, by some kind of alert from our IDS.
Thanks to indexing pcaps with Elasticsearch, Moloch gives us the ability to perform almost real-time searches over dozens or hundreds of GB of captured network traffic, applying several filtering options along the way. It isn't as complete as Wireshark's filtering system, for example, but it will save us tons of work when filtering and visualizing traffic, and Moloch provides some features Wireshark lacks, like filtering by country or AS.
I'm sure I'm not the only one who would have loved to have Moloch when analyzing dozens of GB with tshark and Wireshark, particularly every time you apply a filter to show some kind of data…
To deploy a Moloch machine in an “all-in-one” setup I created a virtual machine with Ubuntu Server 12.10 64-bit and assigned it about 100GB of HDD, 16GB of RAM and 4 CPU cores; Moloch is a resource-hungry platform, so for more detailed information take a look at the hardware requirements.
The first step is updating the box, installing Java and cloning the GitHub repository:
[listing omitted]
Once the repo is cloned we must install at least one of its components: capture, viewer or elasticsearch. Because we are going to mess around a bit with Moloch to get an overview of its functionality and capabilities, we will take the shortest path and install Moloch through the provided bash script, which sets everything up on the same machine; if you prefer to install it manually or are going to build a distributed cluster, check “Building and Installing”:
[listing omitted]
Now the wizard asks us a few questions to configure Moloch (capturer, viewer and the Elasticsearch instance) and everything will be running in a few moments (Moloch is installed by default under “/data/moloch/”); we can then access the web interface at “https://MOLOCH_IP_ADDRESS:8005”:
As can be seen, Moloch has already started to index all traffic seen on eth0, including every request to the Moloch web interface. If we don't want this, we have to specify a capture filter in Berkeley Packet Filter (bpf) format in “/data/moloch/etc/config.ini”:
[listing omitted]
To change the Elasticsearch configuration and allow access from IP addresses other than the Moloch host itself (this could pose a security risk; using SSH tunneling would be a better approach), go to “/data/moloch/etc/elasticsearch.yml” and edit the network parameters (network.host); to view/change the Moloch configuration take a look at “/data/moloch/etc/config.ini”:
[listing omitted]
We need to shut down the Elasticsearch node and start it again, so here we go:
[listing omitted]
We can also start the viewer and the capturer from the same directory, with “/data/moloch/bin/run_viewer.sh” and “/data/moloch/bin/run_capture.sh” respectively.
Now we have access to the elasticsearch-head plugin, to check the Elasticsearch cluster health and manage it, at “https://MOLOCH_IP_ADDRESS:9200/_plugin/head/”:
To get some info indexed by Moloch within a few minutes we are going to run some light random nmap scans, keeping in mind the interface assigned to the virtual machine. If you want to use the virtual interface and launch the nmap scan from the Moloch box itself, you may need to change the bpf filter to “bpf=not port (9200 or 8005)” (this is far from the correct way, but it will do for a quick test).
[listing omitted]
If we take another look at the Moloch web interface we will now see some pretty info:
We can see more info about any session by clicking on the green plus icon:
A new dropdown appears with some interesting options, like downloading the pcap (for example, for deeper manual analysis with Wireshark), downloading the data in RAW format, and a set of links for quick filtering.
Let's click on the “User-Agent” link and then search to show only those indexed packets using the NSE user-agent; now you know who has scanned your network with nmap's HTTP scripts, in just a second ;).
Moloch also has a useful “stats” menu with real-time statistics about the traffic being captured and indexed:
To index previously captured traffic in pcap format we have to use “moloch-capture”, located at “/data/moloch/bin/moloch-capture”:
[listing omitted]
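The omitted invocation is roughly the following; the -r (read a pcap file) and -t (tag the resulting sessions) flags are as I recall them, so check moloch-capture's help output, and the file path is a placeholder:

```bash
/data/moloch/bin/moloch-capture -c /data/moloch/etc/config.ini -r /pcaps/ddos_sample.pcap -t ddos
```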
I'm going to index a sample of about 7.5GB from a DNS amplification DDoS attack I had to analyze and help mitigate some months ago; if you just want to quickly download some pcaps to play around with, NETRESEC has published a good list:
[listing omitted]
After some minutes I had already indexed a few million packets and could view them just by searching for the tag ddos (I have stripped out the map and some info so as not to disclose anything about the customer / attack):
Let's say we want to show every DNS datagram originating from port 53 on servers geolocated in Russia:
As can be seen, there were peaks of almost 60,000 packets per second (DNS answers), with an average of approximately 20,000 at regular intervals within this six-minute slot.
Moloch gives us the chance to visualize indexed traffic from a graph-theory point of view (the “Connections” tab), using hosts as nodes and connections (with or without port) as edges:
This is really useful to get an at-a-glance idea of the event being analyzed; in this case we can easily spot a few targets and thousands of hosts attacking them.
At the beginning of this post I said that Moloch has an API to query and get info about indexed pcaps and so on in JSON format. At this moment probably the best way to see which calls exist is to read the viewer code directly.
Here is an example of Python code that queries the Moloch API and shows some statistics:
[listing omitted]
This simple code will show something similar to this:
[listing omitted]
That's all for now; I hope you liked this and find it useful. I think Moloch is a really powerful tool that will become a must-have in network forensics, saving us countless hours when dealing with large amounts of network traffic.
See you soon!
Solr is a schema-based search solution (with dynamic field support) built upon Apache Lucene, providing full-text search capabilities, document processing, a REST API to fetch results in various formats like XML or JSON, etc. Solr gives us multiple options for document indexing: how to treat text, how to tokenize it, whether to convert it to lowercase automatically, building a distributed cluster, automatic duplicate document detection and so on.
There is plenty of material about how to install Solr, so I'm not going to cover it, just the specific core options for this quick'n'dirty solution. The first thing to do is to create the core config and data directories; in this case I created /opt/solr/pdfosint/ and /opt/solr/pdfosintdata/ to store the config and the document data respectively.
To set the schema up, just create the /opt/solr/pdfosint/conf/schema.xml file with the following content:
[listing omitted]
A quick review of the schema.xml config: I specified an id field that must be unique (a UUID), a text field to store the text itself, a timestamp set to the date the document is pushed into Solr, a version field to track the index version (used internally by Solr for replication and so on) and a dynamic field named attr_* to store any value provided by the parser that is not specified in the schema. Finally, I specified how to handle indexing and querying: for tokenizing I split on whitespace only (without caring about punctuation) and convert everything to lowercase. If you want to know more about text processing I would recommend Python Text Processing with NLTK 2.0 Cookbook as an introduction, Natural Language Processing with Python for more in-depth usage (both Python based) and the Natural Language Processing online course available on Coursera.
The next step is notifying Solr about the new core, just by adding it to /opt/solr/solr.xml:
[listing omitted]
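In the legacy solr.xml format this is a single <core> entry; a sketch using the directories created earlier:

```xml
<cores adminPath="/admin/cores">
  <!-- existing cores ... -->
  <core name="pdfosint"
        instanceDir="/opt/solr/pdfosint"
        dataDir="/opt/solr/pdfosintdata"/>
</cores>
```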
Now it only remains to provide Solr with binary document processing capabilities through a request handler, in this case only for the pdfosint core. To do this, create /opt/solr/pdfosint/solrconfig.xml (we can always copy the example provided with Solr and modify it as needed) and specify the request handler:
[listing omitted]
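A sketch of that request handler declaration, using the parameters discussed right below (the class name may differ between Solr versions):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
```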
A quick review of this: the class may change depending on the version and class names, fmap.content tells Solr to index the extracted text into a field called text, lowernames lowercases the field names of all processed documents, uprefix says how to handle parsed fields not defined in schema.xml (in this case, use the dynamic field with the attr_ prefix) and captureAttr indexes parsed attributes into separate fields. You can learn more about ExtractingRequestHandler here.
Now we have to install the libraries required for binary parsing and indexing; for this I created /opt/solr/extract/ and copied solr-cell-4.2.0.jar from the dist directory of the Solr distribution archive, and also copied into the same folder everything from contrib/extraction/lib/, again from the distribution archive.
Finally, add this line to /opt/solr/pdfosint/solrconfig.xml to specify where to load the libraries from:
[listing omitted]
To learn more about this process and get more recipes, I strongly recommend the Apache Solr 4 Cookbook.
Now we have an extracting and indexing handler at http://localhost:8080/solr/pdfosint/update/extract/ so it only remains to send the PDFs to Solr and analyze them. The easiest way, once they are downloaded (or maybe fetched from a meterpreter session? }:) ), is to send them to Solr with curl:
[listing omitted]
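A typical Solr Cell upload with curl looks like this (the document name is a placeholder; literal.id is just one way to provide the unique id):

```bash
curl "http://localhost:8080/solr/pdfosint/update/extract?literal.id=$(uuidgen)&commit=true" \
     -F "myfile=@confidential_report.pdf"
```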
After a while, depending on several factors like the machine specs and document sizes, we should have an index like this:
So now we try a query to find documents containing the phrase “internal use only” and bingo!:
It's important to keep in mind that Solr splits and processes words both before indexing and when running queries; to see how a phrase will be treated and indexed by Solr when submitted, we can run an analysis with the built-in interface:
I hope you find it useful and give it a try, see you soon!