Machine Learning & Analytics - More Than I Can Chew?

Deruvian

Lord Nagafen Raider
I often come to FoH/RR when making big decisions to get a reality check. I find myself once again in need of advice and was hoping to lean on the collective wisdom of you all.

I'm a first-year MBA student. I worked in finance for a number of years, can code reasonably well and know SQL backwards and forwards. I really enjoyed working with data in my past life and want to transition into a role where I manage data scientists or something similar. There seems to be a ton of demand for managers with hard data skills, so I have a leg up on nearly all of my classmates in this respect. With respect to ML, I have had classes covering the basics, but they were all superficial MBA courses. What I am thinking of pursuing, and why I am writing this, is a path that will get my hands dirty.

I want to understand, at least partially, the math that drives these ML algorithms. I don't want to be the person managing others who has no idea how to do their job. I do not have a background in linear algebra and I haven't seen calculus in ten years. I can cross enroll during my MBA and fill these holes from the university's math department, but I don't have any conception of how far down the rabbit hole I will need to go, or even if this is a feasible/useful idea. This stuff is all really fascinating to me and I think that will help me grind through a lot of it, but I am at a point now where I am about to enroll for Linear Algebra next semester and wanted to step back and examine.

Am I setting myself up for failure here? Are there other things I should be considering? Classes I should be considering? I'm also teaching myself Python, but that isn't something that requires the investment or brainpower that the above does. Any related comments, suggestions, anything would be much appreciated.
 

iannis

Musty Nester
I know absolutely nothing about it. My guess is that you take some courses with a focus so that you can get a rounded understanding of core principles. Without an intensive academic focus, and the experience of the work itself, you're probably not going to really understand what the team is working on in a specific way on day one anyway. So what you want right now are those core principles which will prepare you for the ability to understand the specifics of the work. It'll be constant on the job training.

Do they hire MBAs to head the department directly, or do they promote from the bottom rung of their current engineers? "Well Wally doesn't seem to produce much... but he does understand the work and couldn't be worse than nothing." I've known engineers in other fields who say "that's how it works. I don't wanna be a fucking department head". But I have no idea if logic/computer science engineers share that culture.
 

Palum

what Suineg set it to
Learn statistics. The hard math is only partially relevant; if you are smart enough to understand why you get X from Y, there's no reason to memorize it all. You will learn the relevant stuff with a stats focus.

Data science (specifically big data) is using old programming concepts on new hardware to accomplish advanced statistics. You need to know how you get the data and how to leverage and interpret the stats.

I've done a few large modeling projects, using some standard industry tools, R and some custom stuff. The math is just cut and paste if you understand it. The most important part is understanding the analytics, which is the business logic. All the math in the world can't tell you that a specific thing is correlative but not causal and you need to ignore it. That takes deep understanding of the subject matter, data collection methods and statistics.
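
A toy illustration of the correlative-but-not-causal point, as a rough sketch only. Everything here is made up for the example: `hidden` stands in for some unmeasured driver, and `x` and `y` are two metrics that both depend on it but not on each other. The raw correlation looks strong, and it mostly disappears once the hidden driver is controlled for. In real work you usually can't see `hidden`, which is why the subject-matter knowledge matters.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(target, driver):
    """What is left of `target` after removing a straight-line fit on `driver`."""
    n = len(target)
    mean_t, mean_d = sum(target) / n, sum(driver) / n
    slope = (sum((d - mean_d) * (t - mean_t) for d, t in zip(driver, target))
             / sum((d - mean_d) ** 2 for d in driver))
    return [(t - mean_t) - slope * (d - mean_d) for t, d in zip(target, driver)]

# Synthetic data (assumption): x and y never influence each other,
# they only share the same hidden driver plus independent noise.
rng = random.Random(0)
n = 2000
hidden = [rng.gauss(0, 1) for _ in range(n)]
x = [h + 0.5 * rng.gauss(0, 1) for h in hidden]
y = [h + 0.5 * rng.gauss(0, 1) for h in hidden]

raw = pearson(x, y)
controlled = pearson(residuals(x, hidden), residuals(y, hidden))
print(f"raw correlation of x and y: {raw:.2f}")
print(f"after controlling for the hidden driver: {controlled:.2f}")
```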

In short, there's nothing wrong with linear algebra, multi D, differential equations, but without the heavy stats classes you'll be just as worthless.
 

Tenks

Bronze Knight of the Realm
I'm somewhat confused. There are tons, TONS of ML options out there. Like Palum said, most run on top of a big data infrastructure to chew through tons of statistics to accurately predict things. But you don't really need to know the internals because you're not on the project. You're just using it.
 

CnCGOD_sl

shitlord
The big one I see coming up is Spark MLlib. Generally, analytics/data science folks write R and Python (with SPSS being on the way out). Statistics depth is huge; take everything you can find.
 

Tenks

Bronze Knight of the Realm
I've heard rumblings around here of using Spark ML as a way to try to automatically scrub some of our incoming data. Basically, we get data that follows a standard format, but the content isn't always right. So all incoming records are given an error level (0-3): if the error level is 0, the record can be ingested; if it is 1, it goes through an automatic clean-up process (which has to be added to manually if we start seeing recurrences of the same error); and 2-3 get rejected. We can see where the error is, but we're not sure how to fix it without human intervention and knowledge, and we'd like to try to machine-learn that process.

I don't think I'll be on that project but it sounds pretty cool
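
For anyone trying to picture the triage described above, here is a rough sketch of just the routing step in plain Python. The record fields (`id`, `error_level`, `error_code`) and the queue names are invented for illustration, and nothing here is Spark. The part they want to machine-learn (how to actually fix level 2-3 records) is not shown; the last helper only counts which rejection reasons keep coming back, which is roughly where a human would currently decide to write a new clean-up rule.

```python
from collections import Counter

# Error level -> handling path, following the description in the post.
ROUTES = {0: "ingest", 1: "auto_cleanup", 2: "reject", 3: "reject"}

def route(record):
    """Return the handling path for one record based on its error level."""
    level = record["error_level"]
    if level not in ROUTES:
        raise ValueError(f"unknown error level: {level!r}")
    return ROUTES[level]

def triage(records):
    """Split records into one queue per handling path."""
    queues = {"ingest": [], "auto_cleanup": [], "reject": []}
    for record in records:
        queues[route(record)].append(record)
    return queues

def recurring_rejections(rejected, min_count=2):
    """Count rejection reasons that show up at least `min_count` times."""
    counts = Counter(r["error_code"] for r in rejected)
    return {code: n for code, n in counts.items() if n >= min_count}

# Hypothetical sample records (assumption: this shape is made up).
records = [
    {"id": 1, "error_level": 0, "error_code": None},
    {"id": 2, "error_level": 1, "error_code": "bad_date"},
    {"id": 3, "error_level": 2, "error_code": "missing_key"},
    {"id": 4, "error_level": 3, "error_code": "missing_key"},
    {"id": 5, "error_level": 2, "error_code": "bad_unit"},
]

queues = triage(records)
print({name: [r["id"] for r in q] for name, q in queues.items()})
print(recurring_rejections(queues["reject"]))
```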
 

Deruvian

Lord Nagafen Raider
Good advice as always, guys. Most of the MBA internship and post MBA positions that I have encountered in this space are targeted at STEM undergrads and begin as practitioners to some degree. Many want or expect some level of competency in constructing or digesting mathematical models. I think 'fake it till you make it' will rule the day for some time.

Spark keeps popping up as something worth looking into. I have a few books on Hadoop that I'm working through, but I do wonder if it's too low-level considering the variety of more user-friendly options built on top of it. Do you guys have any experience with Amazon Redshift? I had a chat with an analyst at a local startup and he made it sound pretty nice, although it doesn't seem like it can scale too well at the high end.
 

Tenks

Bronze Knight of the Realm
Hadoop and Spark are really two separate things. Generally when people talk about Hadoop they mean Map/Reduce and the framework around that. Spark does not use Map/Reduce. It is completely independent. Map/Reduce now runs on a process container called YARN, which divorces the running of parallel processes from the Map/Reduce framework. That allows things like Spark to exist, which leverage the parallelism of Hadoop but don't use Map/Reduce.
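
If the Map/Reduce model itself is unfamiliar, it can be sketched in a few lines of plain Python. This is a single-process toy for the programming model only, not Hadoop's or Spark's API. The function names and the two sample lines are made up for the example. On a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data between them.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input line into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

# Made-up input; each line plays the role of one input split.
lines = ["spark runs on yarn", "mapreduce runs on yarn"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```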
 

CnCGOD_sl

shitlord
Spark also runs on storage layers other than HDFS, such as Cassandra and Amazon S3, or even in local mode. There is a newer framework called Flink that also shows big promise for HDFS workloads. If you want to go to the bleeding edge, stream processing frameworks for big data analytics are components of Flink, Spark and others. They let you work on individual events or micro batches to make decisions on inflowing data (for example, applying MLlib results that were computed on the whole data set to data at ingest, in RAM, before it lands on spinning rust).
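
To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It is not Spark Streaming or Flink code. The `score` function is a stand-in for a model trained offline on the full data set, and the `THRESHOLD` value and the event fields are invented for the example. The point is only the shape: events arrive as a stream, get grouped into small batches, and each batch is scored in memory before anything is written out.

```python
from itertools import islice

def micro_batches(events, size):
    """Yield lists of up to `size` events from any iterable, in arrival order."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Stand-in for a model fitted offline on the whole data set (assumption).
THRESHOLD = 100.0

def score(event):
    """Decide what to do with one event using the precomputed threshold."""
    return "flag" if event["amount"] > THRESHOLD else "ok"

def process(events, size=3):
    """Score a stream batch by batch and collect (id, decision) pairs."""
    decisions = []
    for batch in micro_batches(events, size):
        decisions.extend((event["id"], score(event)) for event in batch)
    return decisions

# Hypothetical incoming events, produced lazily like a real stream would be.
amounts = [20.0, 150.0, 99.0, 300.0, 5.0]
stream = ({"id": i, "amount": a} for i, a in enumerate(amounts))

decisions = process(stream)
print(decisions)
```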

Spark also has some great GUI tools for notebooking and visualization, but they aren't commercial offerings like traditional BI tools yet. Zeppelin and Jupyter are the current favorites for that work.

Most big commercial tools you'll see at traditional enterprises can't handle modern data analytics but may be the only options at those companies. They tend to be based on EDW platforms such as Oracle, Teradata and IBM.