uptime conference
talk 1: bridget k overview notes in phone notes Bridg opening :/ Max chaos, intro two Laura bell @lady_nerd Conservation of complexity Recap presen, w tweets Dns prob lol Auto create twitter list of all people who tweet keyterm during day range Codewithtarget.com Cliche, excited about the part don’t have to people Mark McBride tweet Viking the saga of the system “Documentation isn’t something that anyone wants to do” :( “Start by having health checks that aren’t sprained” “Health checks should be able to have unambiguous answers” Cap system, the basic basics overview…
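Sketch for later (mine, not from the talk): one way to read “unambiguous answers” is that a health check only ever returns OK or FAIL, and anything odd (an exception, a non-boolean result) counts as FAIL. `Health` and `run_check` are made-up names for illustration.

```python
from enum import Enum
from typing import Callable


class Health(Enum):
    OK = "ok"
    FAIL = "fail"


def run_check(probe: Callable[[], object]) -> Health:
    """Collapse every probe outcome into exactly OK or FAIL.

    Only a literal True counts as healthy. Exceptions and any other
    return value are treated as FAIL, so callers never see a "maybe".
    """
    try:
        return Health.OK if probe() is True else Health.FAIL
    except Exception:
        return Health.FAIL


if __name__ == "__main__":
    print(run_check(lambda: True))    # healthy probe
    print(run_check(lambda: 1 / 0))   # probe that blows up
```

The strict `is True` is deliberate: a probe returning "yes" or 1 is ambiguous, and the point of the quote is not to guess.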
Uptime conf: talk 2: 24 August 2017: nell shamrell, chef limited wifi https://www.producthunt.com/topics/website https://www.producthunt.com/posts/slack-app-directory-2 https://www.reddit.com/r/Slack/comments/5brtsm/is_the_stacktodo_bot_dead/ http://signup.team/blog CLI design: http://whatisthor.com/ http://click.pocoo.org/5/ http://catb.org/esr/writings/taoup/html/ch11s06.html https://www.juliandunn.net/2016/08/09/designing-great-command-line-user-experiences/ i – habitat.sh wifi :/ 12m/row * 2 est 300 people in aud ? habitat pitch? encrypted gossip, monitor in ring leader/follower. auto-elect. leader gets write requests, followers get read requests. house left for theater nerds ;) update strategy “rocker” container containers in prod - learning cliff (coming from dev) “black boxes causes security issues” (unpack this) find green hair twitter person from strangeloop, SF sf party plans? docker-compose demos video haptic jerk fail sigh multi AZ just a big habitat pitch
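Sketch for later (mine, not Habitat's API): the “leader gets write requests, followers get read requests” rule as a toy router. `Ring` and `route` are hypothetical names; election and gossip are left out entirely.

```python
import itertools


class Ring:
    """Toy leader/follower topology: one leader, zero or more followers."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = list(followers)
        # Round-robin over followers; fall back to the leader if there are none.
        self._readers = itertools.cycle(self.followers or [self.leader])

    def route(self, op):
        """Return the node that should serve a 'read' or 'write' request."""
        if op == "write":
            return self.leader
        if op == "read":
            return next(self._readers)
        raise ValueError(f"unknown operation: {op!r}")


if __name__ == "__main__":
    ring = Ring("node-a", ["node-b", "node-c"])
    for op in ["write", "read", "read", "read"]:
        print(op, "->", ring.route(op))
```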
todo blog: security tools you could put in your pipeline
Sigje talk 3 Binary.protect.io Medium.com/wardleymaps Fear management Anxiety Team state as part of risk assessment Company Dora, jez humble at Sea urchin analogy Intro about kids Regaining inspiration Grog from red-eye Blm chetlin :) :)
in line for lunch: Momswhocode 5/mo for a year, courses Callback women ^
Lightning talks: Sean bartolin Loud doors Lightning talk finance - makeup joke
Chef Rob wash hands
talk 4: CEO https://www.heptio.com/ https://chris.bolin.co/offline/ SRE “serious engineer, badass engineer” sigh very generic presentation so far provision and run a cassandra cluster presenter’s calendar oops- plus slack operations functions “people are not generalists” optimize mttr instead of mttf (second presentation to say this) kubernetes is so totally the new hotness. TODO google “how to get started with kubernetes” use ksonnet for better yaml wrangling “if things get weird” reproducibility… “data locked into one location- “ have replicas Maybe it’s just that I’m groggy, but I feel like I could have given several of these talks. I want a harder conference. late-binding all dependencies managed through this control channels Linkerd service meshes improve personal blog when you deploy a container, it’s hermetically sealed todo lunchtime walkabout todo post-dinner board games or walkaround? “I’ve had hundreds of people try to sit down and draw their architecture for me” deep federation - ? groggggy
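Sketch for later (mine, not from the slides): the usual steady-state formula behind the mttr/mttf point is availability = MTTF / (MTTF + MTTR). The hour figures below are invented for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    base = availability(1000, 1.0)         # made-up baseline
    faster_fix = availability(1000, 0.5)   # halve time to recover
    rarer_fail = availability(2000, 1.0)   # double time between failures
    print(f"baseline      {base:.6f}")
    print(f"half the MTTR {faster_fix:.6f}")
    print(f"twice the MTTF {rarer_fail:.6f}")
```

Under this simple model, halving MTTR and doubling MTTF give the same availability; which one is cheaper to achieve is a separate question that these notes don't settle.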
@cmcluck twitter cncf.io play kube: kubernetes.io
talk 5: karaoke alcohol guy, PJ, loggly? runs mhprompt.org mental health in tech event http://enigmaathome.net/ “volumen, what does that mean” lake erie moon discussions, played golf drank coffee https://en.wikipedia.org/wiki/Yo_(app) slides :/ I miss the morning slides with black women stock photography 2.0 I mentioned bash testing to someone in the hall posted bash testing links give me the confidence of a mediocre white man shard analytics? shards and silos are very different, grr -mi “can I run an entire website running just redis?” “by the way you can” “there is no golden hammer. you need to go back to our team and communicate”
loooong break, 25 min
finance lightning talk slides https://github.com/phillipgreenii/talks/blob/master/playing-with-fire-personal-finance-with-single-responsibility-principle-and-cron-jobs/README.md
Scrutinizing the Scrutiny - Jason Hand
pitch from post email company
Speaker 6: Jason Hand, victorops I wish someone had said “what “is” uptime he looks different from his twitter bio oreilly post-incident reviews mi - purposefully doxing between talks mi todo- harvest post-incident reviews? Post software dev postmortems? radar recommendations? prezi :) “there is never a root cause” always a combination of factors systems constantly in states of partial failover degradation field guide to understanding human error - book, recommended people cannot be the root cause we have to build systems around how people work, # (mi) not want people to work like the system says” https://smile.amazon.com/Field-Guide-Understanding-Human-Error-ebook/dp/B00BL0OZ0E/ref=mt_kindle?encoding=UTF8&me= https://victorops.com/oreilly-post-incident-review/ wins best slides mi - acid throat? timeline an “account of” what happened- promote openness and descriptiveness very nice graphics prezi and narrative are amazon book IDs the same for different people? https://read.amazon.com/?asin=B00BL0OZ0E really nice in-detail from VictorOps https://smile.amazon.com/dp/B00AZRBLHO https://itrevolution.com/book/the-phoenix-project/ this is the first talk that I would recommend that my teammates maybe watch- mostly Saleem “how do we avoid those shoulder-tapping executives?” SNC- do we have a status page? must be PUBLIC - todo recommend paid(?) options? for each API sr vs jr grr, cathy jr grr “dinner with their wives” I hear that as a micro-sexism ¯\_(ツ)_/¯ not enough to call out (most days) but enough to feel it hit as it goes by. state of devops report 2016 - read and myTW post our cost of downtime - average purchase amount per minute: mbff vs order-api and dependencies, vs store stuff SNC- we need to work more closely w Brett “it was someone in support, not EVEN someone in engineering who could DO something about it” ugh dude what should and should not page? 
- todo “when everybody is on call, nobody is on call” universal/interactive runbook https://runbook.io/ basic checklist- responders have access? notice issues via customer support, clear escalation path- ask each human- “what do you do if?” for a scenario- a friend pings you, you see a tweet… you see a weird splunk graph Downtime in exchange for innovation “When everybody’s on call, nobody’s on call.” @jasonhand #uptimeconf
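Sketch for later (mine): the “cost of downtime = average purchase amount per minute” note from this talk as arithmetic. The function name and the dollar figures are made up; real numbers would come from our own order data.

```python
def downtime_cost(revenue_per_minute: float, minutes_down: float) -> float:
    """Estimated revenue lost while a service is down."""
    if revenue_per_minute < 0 or minutes_down < 0:
        raise ValueError("inputs must be non-negative")
    return revenue_per_minute * minutes_down


if __name__ == "__main__":
    # Hypothetical per-service revenue rates, in dollars per minute.
    rates = {"mbff": 120.0, "order-api": 300.0}
    outage_minutes = 15
    for service, rate in rates.items():
        print(f"{service}: ${downtime_cost(rate, outage_minutes):,.2f} lost")
```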
last thursday talk Joyent @bcantrill loud and enthusiastic, no new news yet zebras tweets abdominal pain, medical advice lol what I hope that you have permission zebras are left medical history children of doctors, stitches on the kitchen table “trollbait into a stupor” (amazon’s principles- bias for action) I should really post less tshirt stuff on #sparkly observability of system conf, JJ zookeeper “threw up in my mouth” was actually 2012 “getting it running is hard” http://www.pamelasdiner.com/location/oakland/ old notes: “she works at google” 7 concurrency patterns in 7 weeks - add to read list? callbackhell.com “One goroutine per pixel, which is stupid and you should never do, and it took 20 seconds.” #fractals http://areyoufuckingcoding.me/ netchannels – back to conf- bcantrill omg such fast “the Ms all turn into Bs” (meg to billion?) I am looking at the source code that they lost, right now- I can feel it. it’s warm to the touch firmware “The firmware is not going to page you. I could- or I could just not do this [the firmware says]” Raft preferred over zookeeper ZFS mis-changed polarity of the head jtags on spindle. prevent actually destroying drive. 550 ms outlier. time for drive to reboot. (OMG) flashtastrophy fear lol SSD failure- so much that can fail- Joyent and Sun haven’t had serious problems- overengineered the heck out of them. latency outliers “when i conceived of this talk, I had a conga line of broken firmware” DRAM DIMM correctable vs uncorrectable- sent to firmware. 
Firmware First “firmware first is the wrong model for error handling” trump jokes “there were not mistakes on both sides” “we added a feature called cloaking- you are confessing a crime to me right now?” “that’s very cute you think the chassis could be immune” (this is why “cute” is a terrible word when applied to me) “if you tell it to reset itself, it ran out of memory such that it didn’t know who it was” NIC network interface card LACP link aggregation control protocol MLAG software new modes too complicated- don’t always trade complicated for available. “you as firmware, a blow against humanity- start here!” blast radius lol “it would constantly chuck all of its arc tables and “ ddos the system as a hobby omg this is the best, re-watch yes Jepsen shoutout yass “firmware is its own jepsen, its own chaos monkey” it’s a fact which is also a myth- google velcro motherboards reward complete understanding, not merely resolution! we need OSS firmware- that’s the post-singularity rapture https://twitter.com/scanlime
tomorrow: 930am, doors at 830 mi- walkaround before talk?
– DAY 2 (friday) Uptime thoughts: “Firmware First” https://www.flashrouters.com/learn/router-basics/benefits-of-open-source-firmware https://en.wikipedia.org/wiki/List_of_router_firmware_projects “I wouldn’t get any laptop that Netflix can’t stream on” (as reasoning to not get a linux laptop) I enjoy listening to the gentle chatter of geeks- alfred, flash, firmware…
talk1: “I derailed up to 77 businesses that day” threatstack some very company-pitch-ish talks TODO: read thru the links that cr sent yesterday https://pete.wtf/vasa/ “how does a security company stay secure? Well, some of our customers are other security companies” “new phone who dis” is this origin? things good to do? appropriation/usage/fun? photo of linux syscalls link sax 0mq early scalability TODO me read src of rabbitmq? TODO write work consulting workshop “I don’t want to shit on the work of these people, it just did not work, for us , at that time, with our existing infrastructure” “sometimes it is just good to know with what you know (esp when youre in a hurry)” “kafka was way too new in 2014 for us to depend our entire livelihood and stack on it” TODO list of things to proof-of-concept because it’s important to be able to choose the right tool without having to choose the tool that you know best- know enough that you can choose the right answer! rabbitmq does not do network partitions well “a series of servers surrounded by network partitions” twitter handle from every presenter- some of them on every slide 3 yr - solid longevity how do you make RMQ highly available? by not clustering it! “data on the queue longer than milliseconds is an alertable condition for us” Rack Aware (AZ) writes - if one AWS ec2 zone goes away…. this talk is teaching me things, yay kafka- lets you re-play transactions time series data is so cool buy, then build? don’t build it if you .. 
if you don’t have the time, there are ways to buy your way out of this” “very large infra across multiple public clouds” how good you are now vs how much I can learn from you talking about what you have done before- related but not the same (mi) collectd- one of many similar tools people sitting on the floor in the balcony, looks like good places tho money vs time librato collectd -> write_http -> (logo) blue square w two rounded corners, a hole in the middle, a dot the size of the hole right after the square collectd -> write_graphite -> graphite “write it into collectd” ? can run two instances of collectd open source https://prezi.com/developers/open-source-thanks/ hw-cookbooks (chef cookbook) jason dixon wrote synthesize- “get a host up really quick” https://github.com/JohnMcLear/ep_slideshow https://alternativeto.net/software/prezi/?license=opensource carbon c relay words I don’t know, this is great! c based implementation of carbon relay, higher performing. (rewrite in golang?) consuming 40k metrics per second, good amount of runway.. i3 2xk “fancy IPE really fast discs thing” get devs to own applications “want your devs to ops? 
build consumable services” “You want to know if your web server is running netcat or mining bitcoin!” auditd logfile format - he dislikes it: not all lines in keyvalue format, events can span multiple lines events can be out of order he doesn’t like any current OSS security software (whaaat) slackhq/go-audit you still have to consume the events… ossec.github.io http://strut.io/ osquery.io - from facebook, kinda like sql queries (no change over time) draios/falco - requires kernel modules “I’m kinda showing my neckbeard… I don’t like kernel modules, I’ve been burned by them, not allowed in my environments” if the attacker can see the alerting rules on the host, they can get around them SIEM, SIM, SEM - terms meaning sec event management SEM- real time analysis, take immediate action SIM- long term trending, make auditors happy they do dogfood wooden picture of Inception put watches on keyfiles, get notified if it changes catch poor operational practices to catch issues before breach put visibility into chat- devs can claim the event (can even add a slackbot to notice claim and escalate if not) slack- “is this you?” if not.. then…. :) DuoAuthentication “if you see a db dump on a server (which should not be there) do you know who to go to?” - great concrete example analyze VPN connection logs (texas vs taiwan for a user) “if you capture the data, you can do cool things with it later” “trying to catch your engineers…” O.o “safe access to production” “understand why humans have to log into servers, and automate it away” threatstack/ts-ldap threatstack/authkeys github.com/threatstack/deputize integrates with ldap, pagerduty - when you go on call, your ldap group changes and you get the right access WOW bastion host, etc there is a big checklist buried in here somewhere - includes bastion, logging netcat, alerts, dynamic permissions no jet lag problems, just lack of sleep problems jason(josh?) 
green shirt, back-up for oncall, marketing team, has jr devs, some bootcamp after being in fashion, randy (manager/hire-r_) https://osquery.io/ https://ossec.github.io/ https://collectd.org/
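Sketch for later (mine, not how threatstack or auditd do it): the “put watches on keyfiles, get notified if it changes” idea from the talk above, as a polling version that compares SHA-256 snapshots. `snapshot` and `changed` are hypothetical helpers; a real setup would likely use kernel-level watches and ship the result to chat.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths):
    """Map each existing file path to its fingerprint; missing files are skipped."""
    return {str(p): fingerprint(Path(p)) for p in paths if Path(p).is_file()}


def changed(before, after):
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))


if __name__ == "__main__":
    # Hypothetical watch list; adjust to whatever counts as a key file.
    watched = ["/etc/passwd", "/etc/ssh/sshd_config"]
    baseline = snapshot(watched)
    # ... later, e.g. from a cron job ...
    for path in changed(baseline, snapshot(watched)):
        print(f"ALERT: {path} changed")
```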
friday talk 2: J Paul Reed :) @devopshiq sysadmin stranded on island, bury fiber, backhoe comes up sponsor: UPMC enterprises - Mohindra- blue shirt :) Stitchfix has a similar to https://github.com/threatstack/deputize tool but with a victorops plugin green shirt Jason(?) says that stitchfix has not bad gender ratio- reach out to (other dude) directly and mention Uptime chat. sponsor “traditional life sciences and critical decision support” EMR systems… I mentioned to him- their recruiters failed pretty hard- a list of hard questions for recruiters research task: think of an event from your past (meetup attended) and look up when&where it was, using your own archives / any methods you have (and do a write-up of how you did it- like “first I searched my email”) TODO Chicago trip, coordinate with Winans, get IIT / middle school tour? “this is a true technology company”
J Paul Reed success/failure - no one knows yet… what is success for us? MS candidate in Human factors & system safety https://github.com/threatstack/deputize for my team (from prev talk) Chef public postmortem google hangout, blog posts jj and hotel for lunch? Find chetlin? @devopshaikus Petnet pet feeder outage Workshop: think of ways this company can fail (make some example company profiles- finance, pet feeder, hamburger-sale) if you don’t know a thing (like … then make it up, state your assumption, and keep going. If the question is “how does this work, I need to know so I can answer the question- for now the answer is “choose what would make sense, and keep going) “emergent behavior that we couldn’t reason about when designing the system” I say- thoughtworks and people are like “yeah!!! great alumni!” 8 hours of printing- first 4 minutes of the incident “normal accidents” I want TW stickers… can print? Find svg, make a print run… pass out in office yass inspirational conference, CHECK winning crew rotation in high reliability orgs, active learning, decentralized, active review. (cross-checking) “red teaming” yep is good everything is domain terms, from “red teaming” to “organization” “I have friends at X” Human chaos monkey- give me your laptop, take a four day weekend whaaat hardware chaos monkey. give windows, go work incident. deal with oncall on a different laptop, see if it works / practice it aaaaaa Analogy with fire trucks- they take away the truck and gear, refill truck, coil hoses- time to get ready for next incident (they CAN go direct to next fire, but it’s better to have everything back in place…) jpaulreed.com/blame-aware-postmortems I prefer less-high-level talks… TMate - tmux with two people at the same time. chatops yes of course is chatops no longer as big, or just so used all the time? book “no asshole rule” what happens when people make mistakes? 
“total cost of asshole” 160k/year (seems low but ok) TODO for workshop- give failure scenarios- how do you handle it? person X pushes their ssh key (junior vs lateral- two scenarios) chaos day- team can “cry humble” and pause/stop netflix can opt out of chaos monkey for 4 or 6 months… plane analogies outcome bias nice video playing working and audio “that was a horrible deployment but no one wants to hear about it because hey it got deployed” “counterfactual language” is “why didn’t you notice x/ do z” counterfactual language- it didn’t happen, and we are talking about it.. (learn more??) rhyme-as-reason effect, if it rhymes you will like/believe it more the audio experiment- can understand the distorted version after hearing the real “we have 8 million action items from one incident” sloppy incident response - “main engineer who worked the incident, she’s on vacation” delaying retros makes it less good. Rasmussen triangle, “exploring the discretionary space” “incidents, not accidents” saying the basics over and over, because it’s hard to internalize this stuff, and some people don’t know, and it’s important to know. Amazon- blameless description, learned a lot- did not just stop at “human error” “Your incident immune system” gets better the more you use it my notes are about to get a lot crappier if I run out of power Bino tool laptop dying I wince when she messes up, representing more than yourself, for right or wrong- at least I can feel it in my head chatbots are great, visible auth “this is the slide to photograph” rainbow github shirt If I were sneaky and worked there, I would be putting her up to this so hard I hope that these people watching don’t me@jam.fish “allowing me to speak”
Seth Vargo - Consul as monitoring service Barclay is using Consul! TODO Linda - Seth from Breadcrumb it is a given that your internal state is such that most other people do not usually have to care, but big events stick up out of the coverability- marriage, death- part of the purpose of leave vacation is so that that spike won’t spread? consul service discovery intentional learning hour every other day Mon Tues Fri image: image broken silo “what monitors the monitoring the monitors are monitoring?” give a twitter image reference to the photo credit for the article I took it from dark theme for google docs? brag TW’s response to employee death- alltw email, grief counselors. transparency and support.
beerops talk gabriel, zaar, indigo tool suite tools I haven’t heard of!! “sledgehammer” tool “unattended sledgehammer” - first thing after machine is racked “system had grown organically over time” “lolwhoops” cat lol “the indigo code was a lot less adorable than this” “this tangle of hope and code” could do a conf like- warning I will shake your hand and give you a sticker human single points of failure chetlin likes TWU concept, likes teaching TODO pittsburgh office? “a lot of ops is oh god you got paged at 3am, just duct tape it together and go back to sleep” 400 line method that did 7k things no mocking or stubbing nothing other than copy paste pray critical path for infra provisioning if broken, no new servers “Just like you can’t sprinkle security or performance on something at the end of the process..” “Not all of these are best practice, but we don’t always have the ability to burn it to the ground and you have to use least-worst practices” “[Indigo sweeper] turned out to be one of those things where- if it doesn’t work, nothing works.” #UptimeConf better solution- be on the data center team? adding flags! prod was still hard-coded but new version could win!! “the api was not fancy enough to have version; if it did, I’m sure it would have been hardcoded to 0.1” Deployinator “I would forget a piece and then things would catch on fire” deployinator!! re-watch this talk omg, esp for new devs? https://codeascraft.com/2016/02/22/putting-the-dev-in-devops-bringing-software-engineering-to-operations-infrastructure-tooling/ “Sometimes what you have is a tangled mess with maybe a kitten in it” publish slides? this- good for rewatching optional parameters!! 
yass rating system for conf talks for me… polished-ness, full-of-content-ness, newness to me “campsite rule of code- leave it better than you found it!” (YAS) knife-spork dance (deploy process) etsy indigo suite lost voice discussing with bcantrill “having some heated discussions with bcantrill last night” “some stories dont have a happy ending, sometimes there is just yaks and sadness” https://github.com/etsy/deployinator https://codeascraft.com/2015/02/20/re-introducing-deployinator-now-as-a-gem/ fast-loading website are great at conferences “involve other people as much as I could so that I was not just another SPOF, more people can fix tools, teach other people.” “If only you understand it, only you can be paged at 2am when it goes wrong”
airport “Your incident immune system” gets better the more you use it #uptimeconf @jpaulreed Delaying a postmortem/retro greatly reduces its quality (because humans always always forget). “within 72 hours!” @jpaulreed #uptimeconf