Keeping A Bigeye On The Data Quality Market
49 min

Summary

One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines, it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and which traps to avoid, then this episode is definitely worth a listen.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management.
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
  • Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
  • Your host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your views on what attributes you consider when defining data quality?
  • You use the term "data semantics" – can you elaborate on what that means?
  • What are the driving factors that contribute to the presence or lack of data quality in an organization or data platform?
  • Why do you think now is the right time to focus on data quality as an industry?
  • What are you building at Bigeye and how did it get started?
  • How does Bigeye help teams understand and manage their data quality?
  • What is the difference between existing data quality approaches and data observability?
    • What do you see as the tradeoffs for the approach that you are taking at Bigeye?
  • What are the most common data quality issues that you’ve seen and what are some more interesting ones that you wouldn’t expect?
  • Where do you see Bigeye fitting into the data management landscape? What are alternatives to Bigeye?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen Bigeye being used?
    • What are some of the most interesting homegrown approaches that you have seen?
  • What have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building the Bigeye platform and business?
  • What are the biggest trends you’re following in data quality management?
  • When is Bigeye the wrong choice?
  • What do you see in store for the future of Bigeye?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
