ML on AWS: Read from Your Favorite AWS Data Sources into a Pandas DataFrame Using AWS Data Wrangler
AI-generated image using DALL-E mini (the prompt was “Pandas in the Cloud”). Learn about DALL-E here.
Note: I have received no compensation for writing this piece.
Machine learning now powers applications across the world, and those applications generate a lot of data. That data needs to be stored, processed and analyzed before it yields useful information. Amazon Web Services (AWS) provides a wide variety of services for storing, processing and analyzing your application’s data. In this blog post, we will focus on how you can use AWS Data Wrangler to read data from your favorite AWS data sources into a pandas DataFrame.
AWS Data Wrangler is an open-source Python initiative from AWS Professional Services that extends the power of the pandas library to AWS by connecting DataFrames with AWS data-related services. It provides easy integration with Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, QuickSight, Chime, CloudWatch Logs, DynamoDB, EMR, Secrets Manager, PostgreSQL, MySQL, SQL Server and S3 (Parquet, CSV, JSON and Excel). Built on top of other open-source projects like pandas, Apache Arrow and Boto3, it offers abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses and databases.
In this article, I guide you through installing AWS Data Wrangler and using it to load data stored in Amazon S3, Amazon RDS, Amazon Athena and Amazon Redshift into memory as a pandas DataFrame.
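To make the goal concrete, here is a minimal sketch of what reading from each of these four sources can look like. It relies on the `awswrangler` functions `s3.read_parquet`, `athena.read_sql_query`, `redshift.connect` / `redshift.read_sql_query` and `mysql.connect` / `mysql.read_sql_query`. The helper names, the `READERS` mapping, the choice of MySQL as the RDS engine and the use of AWS Glue connection names are assumptions made for illustration, not something prescribed by the library. The library is imported lazily inside each helper so that the module can be loaded even before `awswrangler` is installed.

```python
from typing import Any, Callable, Dict


def read_s3_parquet(path: str) -> Any:
    """Read Parquet file(s) under an S3 path into a pandas DataFrame."""
    import awswrangler as wr  # lazy import: only needed when a read runs
    return wr.s3.read_parquet(path=path)


def read_athena(sql: str, database: str) -> Any:
    """Run a SQL query on Athena and return the result as a DataFrame."""
    import awswrangler as wr
    return wr.athena.read_sql_query(sql=sql, database=database)


def read_redshift(sql: str, glue_connection: str) -> Any:
    """Query Redshift through a Glue Catalog connection (assumed to exist)."""
    import awswrangler as wr
    con = wr.redshift.connect(connection=glue_connection)
    try:
        return wr.redshift.read_sql_query(sql=sql, con=con)
    finally:
        con.close()


def read_rds_mysql(sql: str, glue_connection: str) -> Any:
    """Query an RDS MySQL database (engine choice is an assumption)."""
    import awswrangler as wr
    con = wr.mysql.connect(connection=glue_connection)
    try:
        return wr.mysql.read_sql_query(sql=sql, con=con)
    finally:
        con.close()


# Hypothetical lookup table so callers can pick a source by name.
READERS: Dict[str, Callable[..., Any]] = {
    "s3": read_s3_parquet,
    "athena": read_athena,
    "redshift": read_redshift,
    "rds": read_rds_mysql,
}


def load_dataframe(source: str, **kwargs: Any) -> Any:
    """Dispatch to the reader registered for `source`."""
    if source not in READERS:
        raise ValueError(
            f"Unknown source {source!r}; expected one of {sorted(READERS)}"
        )
    return READERS[source](**kwargs)
```

With valid AWS credentials and a real bucket, a call such as `load_dataframe("s3", path="s3://my-bucket/my-prefix/")` would return a DataFrame; the bucket and prefix here are placeholders.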
Installing AWS Data Wrangler
AWS Data Wrangler runs on Python 3.10 and on several platforms (AWS Lambda, AWS Glue Python Shell, EMR, EC2, on-premises, Amazon SageMaker, local, etc.).
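Before going further, it can help to confirm which interpreter you are on and whether the library is already available. The snippet below is a small sketch using only the standard library. The function names are hypothetical, and the hint it prints assumes installation from PyPI with `pip install awswrangler`.

```python
import importlib.util
import sys


def awswrangler_installed() -> bool:
    """Return True if the awswrangler package can be found for import."""
    return importlib.util.find_spec("awswrangler") is not None


def describe_environment() -> str:
    """Summarize the running Python version and the library's status."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    if awswrangler_installed():
        status = "installed"
    else:
        status = "missing (try: pip install awswrangler)"
    return f"Python {version}, awswrangler {status}"


print(describe_environment())
```

Running this inside the environment you plan to use (a notebook, a Glue Python Shell job, a local virtual environment) shows whether an install step is still needed there.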
Some good practices to follow for the installation options below:
- Use new and isolated Virtual Environments for each project…