I went through the forum and found a thread dating back to February about data mining, but it seemed to derail into something else entirely. I thought a new thread was in order.
I am running into a similar situation during testing that I would like to improve upon. I generally have a list of exploited boxes that I can actively connect to, browse, and execute arbitrary code on. The issue is that it is too labor intensive to manually rummage through shares and drives looking for relevant data, and I'd like an automated process. I'm looking for something more robust than the dir command or grep, because I'd like more customized results.
For example, my original idea was to have regex structures with logical relationships to one another. The presence of these relationships would imply a higher degree of confidence than searching for terms independently would. The hope is to reduce false positives and generate some sort of priority system. I was originally going to use Meterpreter and build an extension using the existing framework and client blob, but I'm running into some issues with that approach.
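To make the idea concrete, here's a rough Python sketch of the kind of scoring I have in mind. The rule names, patterns, and score weights here are just placeholders, not a finished design: each rule groups several regexes, and co-occurring matches in the same file earn a bonus over independent hits.

```python
import re

# Hypothetical rule set: each rule pairs regex patterns whose co-occurrence
# in a file suggests higher confidence than any single match alone.
RULES = [
    {
        "name": "credit_card_data",
        "patterns": [
            re.compile(r"\b(?:\d[ -]?){13,16}\b"),            # card-number-like digit runs
            re.compile(r"\b(?:visa|mastercard|amex)\b", re.I), # card brand names
            re.compile(r"\bexp(?:iry|iration)?\b", re.I),      # expiry-related terms
        ],
        "per_match": 1,   # score for each pattern that matches at all
        "combo_bonus": 3, # extra score when two or more patterns co-occur
    },
]

def score_text(text, rules=RULES):
    """Return {rule_name: score} for one file's text; higher = higher confidence."""
    results = {}
    for rule in rules:
        hits = sum(1 for p in rule["patterns"] if p.search(text))
        score = hits * rule["per_match"]
        if hits >= 2:  # the "logical relationship": patterns matching together
            score += rule["combo_bonus"]
        if score:
            results[rule["name"]] = score
    return results
```

Running this over every readable file on a box and sorting by score would give the priority list I'm after; a lone 16-digit number scores low, but a 16-digit number near "Visa" and "expiry" floats to the top.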
Has anyone come across a tool like this before? I've been known to reinvent the wheel, so I'm reaching out here before I start development. Does anyone see problems/concerns with this strategy?
Thanks a lot!
Maybe/maybe not related, but I was recently reading a review of Metasploit Express which mentioned that it had a "Loot" feature to collect a bunch of standard information from 'sploited boxes. The review was a bit fuzzy on the exact details, but perhaps this is something you could use or extend if you get Metasploit Express. It may not enable the type of specific granular control you seem to be after, but it might be worth examining.
I am failing to think of an instance where mass data mining would be a legitimate part of a penetration test (at least from an ethical standpoint). I was hoping either the OP or someone else could shed some more light on this.
Aside from that, the phrase "data mining" isn't what I would use to describe seeking target data in a pen test. To me, data mining is more the process of retrieving previously unknown or undetermined data by combining information from unrelated databases. For example, combining a database of car sales in a geographic area with a database of preschool-age children in the same area to come up with a data set of families potentially looking to purchase new child car seats.
Stop the TSA now! Boycott the airlines.
Thank you very much Lupin and Thorn. I do remember, a few years back, when the term data mining started to become popular, mostly as a form of (usually illegal) information gathering for things such as marketing and profiling. I can see how the term can be used both ways, since it's so loosely defined. I also have a better understanding of where this fits into a pentest now, mainly according to the terms agreed upon with the client. I can see where a client may not understand how at risk their company is of losing mass amounts of private information.
Thank you for all the responses. I can tell that I need to be very careful with my wording here.
My situation is exactly like Lupin described. In particular, I am testing a very large number of hosts from inside the firewall (with express permission from the owner(s)). The goal is to assess the level of impact that a given compromised host can have. For example, if I can somehow exploit an Apache server, then that is clearly not desirable. The real impact comes from what I can obtain from that server, if anything, and any additional capabilities branching from that. If that server held PII, then it would have a higher impact and be more critical to address first.
I saw a few other questions that I will answer in separate posts.
Let me clarify what I meant by a list of exploited boxes, as I may not have considered the ambiguity. That list refers to boxes that have been identified through the vulnerability discovery phase. This comes after the network mapping phase; both are done under the watchful eye of the IT and IDS teams.
As for the term data mining, that was probably careless on my part. I hope we can get past the semantics though and try to come up with some ideas.
Last edited by lupin; 06-24-2010 at 12:00 AM. Reason: Merging...
First of all, some purists might argue that what you're asking isn't really about pen testing per se, but more about the goal after you've penetrated the system/network.
Personally, however, I happen to think that it's a very important piece of what we do. Finding a given vulnerability might impress someone in the IT department, and it might be rectified (someday) when time and money are available, but telling the CEO you found a vulnerability on port 173 on a server will make him start yawning in the middle of your presentation of findings.
On the other hand, getting some information that is vital to the company (e.g. customers' credit card numbers) is the kind of thing that makes C-level people sit up and take notice, and you can see them get heartburn right in front of you as they think about having to explain the potential loss to the board of directors. That's the kind of finding that will actually get things fixed.
However, my impression is that you have identified some potential vulnerabilities, but don't know exactly what you want to find.
What you need to find can only be answered by determining the goal, and that is determined by asking "what kind(s) of things can the client not afford to lose without disastrous consequences?" It may be one type of data, say, the big proprietary company secret (think of the formula for Coca-Cola), or multiple data types such as patient health data and/or patient credit cards, or it could be something other than data, such as taking over or disrupting the process control for a chemical plant.
Of course, once you've determined what the goal is, you have to ask, "where does it live?" After all, looking at a secretary's PC and reading her tweets about how drunk she got last weekend and what she did with the five sailors may be entertaining (look for pictures!), but it isn't going to help you track down spreadsheets with the CFO's projections for next year's secret plans for a potential stock split.
So ask yourself: are you looking at the CIO's workstation, or the workstation of an engineering team? Small servers running Windows or *nix? How about IBM iSeries or even AS/400s? (Yes, there are still AS/400s out there holding a lot of data...) Or SCADA PLCs and RTUs?
Now that you've got those questions answered, you can determine what tools (if any) you can use. It may be a matter of using a commercial tool such as Tripwire; you may be able to just do a simple command-line wildcard search for something as simple as a particular file type; or perhaps you'll need to craft some custom packets using Scapy to make an RTU turn off a pump.
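As a trivial illustration of the wildcard-search end of that spectrum, here's a short Python sketch that walks a share or drive and flags files by name pattern. The default patterns are just examples; you'd swap in whatever matches the data the engagement actually cares about.

```python
import fnmatch
import os

def find_files(root, patterns=("*.xls*", "*.doc*", "*.pdf")):
    """Walk a directory tree and yield paths whose names match any pattern.

    The default patterns target common document types purely as an example;
    tailor them to the client's goal data (e.g. "*payroll*", "*.mdb").
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so BUDGET.XLSX is still caught.
            if any(fnmatch.fnmatch(name.lower(), pat) for pat in patterns):
                yield os.path.join(dirpath, name)
```

It's no smarter than a `dir /s *.xls*`, but once it's in a script you can layer content inspection or scoring on top of the same walk.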
Once you answer those questions: "What is the goal?" and "Where does the data live that we need to find to achieve the goal?", you can start to determine what tools you'll need. But until you have some direction, searching for any useful data will be akin to searching for a black cat in a cellar at midnight without a flashlight.
Last edited by Thorn; 06-24-2010 at 12:24 PM. Reason: Typos; cleaned up some lines.