Wednesday, November 30, 2022

New — Amazon SageMaker Data Wrangler Supports SaaS Applications as Data Sources

Data fuels machine learning. In machine learning, data preparation is the process of transforming raw data into a format that is suitable for further processing and analysis. The common process for data preparation starts with collecting data, then cleaning it, labeling it, and finally validating and visualizing it. Getting the data right with high quality can often be a complex and time-consuming process.

This is why customers who build machine learning (ML) workloads on AWS appreciate the ability of Amazon SageMaker Data Wrangler. With SageMaker Data Wrangler, customers can simplify the process of data preparation and complete the required processes of the data preparation workflow on a single visual interface. Amazon SageMaker Data Wrangler helps to reduce the time it takes to aggregate and prepare data for ML.

However, due to the proliferation of data, customers generally have data spread out into multiple systems, including external software-as-a-service (SaaS) applications like SAP OData for manufacturing data, Salesforce for customer pipeline, and Google Analytics for web application data. To solve business problems using ML, customers have to bring all of these data sources together. They currently have to build their own solution or use third-party solutions to ingest data into Amazon S3 or Amazon Redshift. These solutions can be complex to set up and not cost-effective.

Introducing Amazon SageMaker Data Wrangler Supports SaaS Applications as Data Sources
I’m happy to share that starting today, you can aggregate external SaaS application data for ML in Amazon SageMaker Data Wrangler to prepare data for ML. With this feature, you can use more than 40 SaaS applications as data sources via Amazon AppFlow and have these data available on Amazon SageMaker Data Wrangler. Once the data sources are registered in AWS Glue Data Catalog by AppFlow, you can browse tables and schemas from these data sources using Data Wrangler SQL explorer. This feature provides seamless data integration between SaaS applications and SageMaker Data Wrangler using Amazon AppFlow.

Here is a quick preview of this new feature:

This new feature of Amazon SageMaker Data Wrangler works by using integration with Amazon AppFlow, a fully managed integration service that enables you to securely exchange data between SaaS applications and AWS services. With Amazon AppFlow, you can establish bidirectional data integration between SaaS applications, such as Salesforce, SAP, and Amplitude and all supported services, into your Amazon S3 or Amazon Redshift.

Then, with Amazon AppFlow, you can catalog the data in AWS Glue Data Catalog. This is a new feature where with Amazon AppFlow, you can create an integration with AWS Glue Data Catalog for Amazon S3 destination connector. With this new integration, customers can catalog SaaS data applications into AWS Glue Data Catalog with a few clicks, directly from the Amazon AppFlow Flow configuration, without the need to run any crawlers.

Once you’ve established a flow and inserted it into the AWS Glue Data Catalog, you can use this data inside the Amazon SageMaker Data Wrangler. Then, you can do the data preparation as you usually do. You can write Amazon Athena queries to preview data, join data from multiple sources, or import data to prepare for ML model training.

With this feature, you need to do a few simple steps to perform seamless data integration between SaaS applications into Amazon SageMaker Data Wrangler via Amazon AppFlow. This integration supports more than 40 SaaS applications, and for a complete list of supported applications, please check the Supported source and destination applications documentation.

Get Started with Amazon SageMaker Data Wrangler Support for Amazon AppFlow
Let’s see how this feature works in detail. In my scenario, I need to get data from Salesforce, and do the data preparation using Amazon SageMaker Data Wrangler.

To start using this feature, the first thing I need to do is to create a flow in Amazon AppFlow that registers the data source into the AWS Glue Data Catalog. I already have an existing connection with my Salesforce account, and all I need now is to create a flow.

One important thing to note is that to make SaaS application data available in Amazon SageMaker Data Wrangler, I need to create a flow with Amazon S3 as the destination. Then, I need to enable Create a Data Catalog table in the AWS Glue Data Catalog settings. This option will automatically catalog my Salesforce data into AWS Glue Data Catalog.

On this page, I need to select a user role with the required AWS Glue Data Catalog permissions and define the database name and the table name prefix. In addition, in this section, I can define the data format preference, be it in JSON, CSV, or Apache Parquet formats, and filename preference if I want to add a timestamp into the file name section.

To learn more about how to register SaaS data in Amazon AppFlow and AWS Glue Data Catalog, you can read Cataloging the data output from an Amazon AppFlow flow documentation page.

Once I’ve finished registering SaaS data, I need to make sure the IAM role can view the data sources in Data Wrangler from AppFlow. Here is an example of a policy in the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
} 

By enabling data cataloging with AWS Glue Data Catalog, from this point on, Amazon SageMaker Data Wrangler will be able to automatically discover this new data source and I can browse tables and schema using the Data Wrangler SQL Explorer.

Now it’s time to switch to the Amazon SageMaker Data Wrangler dashboard then select Connect to data sources.

On the following page, I need to Create connection and select the data source I want to import. In this section, I can see all the available connections for me to use. Here I see the Salesforce connection is already available for me to use.

If I would like to add additional data sources, I can see a list of external SaaS applications that I can integrate into the Set up new data sources section. To learn how to recognize external SaaS applications as data sources, I can learn more with the select How to enable access.

Now I will import datasets and select the Salesforce connection.

On the next page, I can define connection settings and import data from Salesforce. When I’m done with this configuration, I select Connect.

On the following page, I see my Salesforce data that I already configured with Amazon AppFlow and AWS Glue Data Catalog called appflowdatasourcedb. I can also see a table preview and schema for me to review if this is the data I need.

Then, I start building my dataset using this data by performing SQL queries inside the SageMaker Data Wrangler SQL Explorer. Then, I select Import query.

Then, I define a name for my dataset.

At this point, I can start doing the data preparation process. I can navigate to the Analysis tab to run the data insight report. The analysis will provide me with a report on the data quality issues and what transform I need to use next to fix the issues based on the ML problem I want to predict. To learn more about how to use the data analysis feature, see Accelerate data preparation with data quality and insights in the Amazon SageMaker Data Wrangler blog post.

In my case, there are several columns I don’t need, and I need to drop these columns. I select Add step.

One feature I like is that Amazon SageMaker Data Wrangler provides numerous ML data transforms. It helps me to streamline the process of cleaning, transforming and feature engineering my data in one dashboard. For more about what SageMaker Data Wrangler provides for transformation data, please read this Transform Data documentation page.

In this list, I select Manage columns.

Then, in the Transform section, I select the Drop column option. Then, I select a few columns that I don’t need.

Once I’m done, the columns I don’t need are removed and the Drop column data preparation step I just created is listed in the Add step section.

I can also see the visual of my data flow inside the Amazon SageMaker Data Wrangler. In this example, my data flow is quite basic. But when my data preparation process becomes complex, this visual view makes it easy for me to see all the data preparation steps.

From this point on, I can do what I require with my Salesforce data. For example, I can export data directly to Amazon S3 by selecting Export to and choosing Amazon S3 from the Add destination menu. In my case, I specify Data Wrangler to store the data in Amazon S3 after it has processed it by selecting Add destination and then Amazon S3.

Amazon SageMaker Data Wrangler provides me flexibility to automate the same data preparation flow using scheduled jobs. I can also automate feature engineering with SageMaker Pipelines (via Jupyter Notebook) and SageMaker Feature Store (via Jupyter Notebook), and deploy to Inference end point with SageMaker Inference Pipeline (via Jupyter Notebook).

Things to Know
Related news – This feature will make it easy for you to do data aggregation and preparation with Amazon SageMaker Data Wrangler. As this feature is an integration with Amazon AppFlow and also AWS Glue Data Catalog, you might want to learn more on Amazon AppFlow now supports AWS Glue Data Catalog integration and provides enhanced data preparation page.

Availability – Amazon SageMaker Data Wrangler supports SaaS applications as data sources available in all the Regions currently supported by Amazon AppFlow.

Pricing – There is no additional cost to use SaaS applications supports in Amazon SageMaker Data Wrangler, but there is a cost to running Amazon AppFlow to get the data in Amazon SageMaker Data Wrangler.

Visit Import Data From Software as a Service (SaaS) Platforms documentation page to learn more about this feature, and follow the getting started guide to start data aggregating and preparing SaaS applications data with Amazon SageMaker Data Wrangler.

Happy building!
Donnie



from AWS News Blog https://ift.tt/pDTqVwF
via IFTTT

Announcing Additional Data Connectors for Amazon AppFlow

Gathering insights from data is a more effective process if that data isn’t fragmented across multiple systems and data stores, whether on premises or in the cloud. Amazon AppFlow provides bidirectional data integration between on-premises systems and applications, SaaS applications, and AWS services. It helps customers break down data silos using a low- or no-code, cost-effective solution that’s easy to reconfigure in minutes as business needs change.

Today, we’re pleased to announce the addition of 22 new data connectors for Amazon AppFlow, including:

  • Marketing connectors (e.g., Facebook Ads, Google Ads, Instagram Ads, LinkedIn Ads).
  • Connectors for customer service and engagement (e.g., MailChimp, Sendgrid, Zendesk Sell or Chat, and more).
  • Business operations (Stripe, QuickBooks Online, and GitHub).

In total, Amazon AppFlow now supports over 50 integrations with various different SaaS applications and AWS services. This growing set of connectors can be combined to enable you to achieve 360 visibility across the data your organization generates. For instance, you could combine CRM (Salesforce), e-commerce (Stripe), and customer service (ServiceNow, Zendesk) data to build integrated analytics and predictive modeling that can guide your next best offer decisions and more. Using web (Google Analytics v4) and social surfaces (Facebook Ads, Instagram Ads) allows you to build comprehensive reporting for your marketing investments, helping you understand how customers are engaging with your brand. Or, sync ERP data (SAP S/4HANA) with custom order management applications running on AWS. For more information on the current range of connectors and integrations, visit the Amazon AppFlow integrations page.

Datasource connectors for Amazon AppFlow

Amazon AppFlow and AWS Glue Data Catalog
Amazon AppFlow has also recently announced integration with the AWS Glue Data Catalog to automate the preparation and registration of your SaaS data into the AWS Glue Data Catalog. Previously, customers using Amazon AppFlow to store data from supported SaaS applications into Amazon Simple Storage Service (Amazon S3) had to manually create and run AWS Glue Crawlers to make their data available to other AWS services such as Amazon Athena, Amazon SageMaker, or Amazon QuickSight. With this new integration, you can populate AWS Glue Data Catalog with a few clicks directly from the Amazon AppFlow configuration without the need to run any crawlers.

To simplify data preparation and improve query performance when using analytics engines such as Amazon Athena, Amazon AppFlow also now enables you to organize your data into partitioned folders in Amazon S3. Amazon AppFlow also automates the aggregation of records into files that are optimized to the size you specify. This increases performance by reducing processing overhead and improving parallelism.

You can find more information on the AWS Glue Data Catalog integration in the recent What’s New post.

Getting Started with Amazon AppFlow
Visit the Amazon AppFlow product page to learn more about the service and view all the available integrations. To help you get started, there’s also a variety of videos and demos available and some sample integrations available on GitHub. And finally, should you need a custom integration, try the Amazon AppFlow Connector SDK, detailed in the Amazon AppFlow documentation. The SDK enables you to build your own connectors to securely transfer data between your custom endpoint, application, or other cloud service to and from Amazon AppFlow‘s library of managed SaaS and AWS connectors.

— Steve

from AWS News Blog https://ift.tt/Ca4HSrh
via IFTTT

New – Redesigned UI for Amazon SageMaker Studio

Today, I’m excited to announce a new, redesigned user interface (UI) for Amazon SageMaker Studio.

SageMaker Studio provides a single, web-based visual interface where you can perform all machine learning (ML) development steps with a comprehensive set of ML tools. For example, you can prepare data using SageMaker Data Wrangler, build ML models with fully managed Jupyter notebooks, and deploy models using SageMaker’s multi-model endpoints.

Introducing the Redesigned UI for Amazon SageMaker Studio
The redesigned UI makes it easier for you to discover and get started with the ML tools in SageMaker Studio. One highlight of the new UI includes a redesigned navigation menu with links to SageMaker capabilities that follow the typical ML development workflow from preparing data to building, training, and deploying ML models.

We also added new dynamic landing pages for each of the navigation menu items. These landing pages will refresh automatically to show the ML resources relevant for the tool, such as clusters, feature groups, experiments, and model endpoints, as you create or update them. On each of these pages, you can also find links to videos, tutorials, blogs, or additional documentation, to help you get started with the corresponding ML tool in SageMaker Studio.

The new SageMaker Studio Home page gives you one-click access to common tasks and workflows. From here, you can also open the redesigned Launcher with quick links to some of the most frequent tasks, such as creating a new notebook, opening a code console, or opening an image terminal.

Let me give you a whirlwind tour of the redesigned UI.

New Navigation Menu
The new left navigation menu in SageMaker Studio now helps you discover and navigate to the right tools for each step in your ML development workflow. The menu offers clear entry points to key ML tasks, such as data preparation, experimentation, model building, and deployments. The menu also provides shortcuts to quick start solutions and helpful content to accelerate your work in SageMaker Studio.

Amazon SageMaker Studio - New Navigation Menu

New Landing Pages for SageMaker Features and Capabilities
The new left navigation menu groups relevant tools together. For example, if you click on Data, you can now see the relevant SageMaker capabilities for your data preparation tasks. From here, you can prepare your data with SageMaker Data Wrangler, create and store ML features with SageMaker Feature Store, or manage Amazon EMR clusters for large-scale data processing.

If you click on Data Wrangler, the new landing page opens. These landing pages are designed to help you get started more easily. You can find a brief introduction to the tool and links to additional resources, such as videos, tutorials, or blogs.

Amazon SageMaker Studio - New Feature Landing Pages

Similar landing pages exist for the other navigation menu items. For example, with one click on AutoML, you can now see your existing SageMaker Autopilot experiments or get started by creating a new one.

Amazon SageMaker Studio - New AutoML Landing Page

New SageMaker Studio Home Page
We also added a new SageMaker Studio Home page with tooltips on key controls in the UI.

The Home page includes a list of Quick actions for common tasks, such as Open Launcher to create notebooks and other resources. Import & prepare data visually takes you to SageMaker Data Wrangler and helps you get started with your data preparation tasks. You can open the new Getting Started notebook or find additional resources, such as documentation and tutorials.

The Prebuilt and automated solutions help you get started quickly with prebuilt solutions, pretrained open-source models, and AutoML.

In Workflows and tasks, you find a list of relevant tasks for each step in your ML development workflow that take you to the right tool for the job. For example, Store, manage, and retrieve features takes you to SageMaker Feature Store and opens the feature catalog. Similarly, View all experiments takes you to SageMaker Experiments and opens the experiments list view.

In Quick start solutions, you can find pretrained vision, text, and tabular models, notebooks, and end-to-end solutions for common use cases.

Amazon SageMaker Studio - New Home Page

New Getting Started Notebook
SageMaker Studio now includes a new Getting Started notebook that walks you through the basics of how to use SageMaker Studio. If you are a first-time user of SageMaker Studio, this is the perfect starting place. The notebook covers everything from the fundamentals of JupyterLab to a practical walkthrough of training an ML model. The notebook also provides detailed insight into SageMaker-specific functionality, resources, and tools.

New SageMaker Studio Launcher
The Launcher is designed to help you invoke JupyterLab actions and has been optimized to give you quick access to the most frequent tasks, such as creating a notebook, opening a code console, or opening an image terminal. In the same step, you can also choose the image, kernel, instance type, or startup script as needed. 

Amazon SageMaker Studio - New Launcher

Now Available
The redesigned Amazon SageMaker Studio UI is now available in all AWS Regions where SageMaker Studio is available. The redesigned UI is supported by SageMaker Studio domains running on JupyterLab 3. For instructions on how to update the JupyterLab version, see View and update the JupyterLab version of an app from the console.

Give the new user experience a try, and let us know what you think through the purple Feedback widget in SageMaker Studio, or through your usual AWS support contacts.

Start building your ML projects with Amazon SageMaker Studio today!

— Antje



from AWS News Blog https://ift.tt/JLbH4aj
via IFTTT

New — Amazon Athena for Apache Spark

When Jeff Barr first announced Amazon Athena in 2016, it changed my perspective on interacting with data. With Amazon Athena, I can interact with my data in just a few steps—starting from creating a table in Athena, loading data using connectors, and querying using the ANSI SQL standard.

Over time, various industries, such as financial services, healthcare, and retail, have needed to run more complex analyses for a variety of formats and sizes of data. To facilitate complex data analysis, organizations adopted Apache Spark. Apache Spark is a popular, open-source, distributed processing system designed to run fast analytics workloads for data of any size.

However, building the infrastructure to run Apache Spark for interactive applications is not easy. Customers need to provision, configure, and maintain the infrastructure on top of the applications. Not to mention performing optimal tuning resources to avoid slow application starts and suffering from idle costs.

Introducing Amazon Athena for Apache Spark
Today, I’m pleased to announce Amazon Athena for Apache Spark. With this feature, we can run Apache Spark workloads, use Jupyter Notebook as the interface to perform data processing on Athena, and programmatically interact with Spark applications using Athena APIs. We can start Apache Spark in under a second without having to manually provision the infrastructure.

Here’s a quick preview:

Quick preview of Amazon Athena for Apache Spark

How It Works
Since Amazon Athena for Apache Spark runs serverless, this benefits customers in performing interactive data exploration to gain insights without the need to provision and maintain resources to run Apache Spark. With this feature, customers can now build Apache Spark applications using the notebook experience directly from the Athena console or programmatically using APIs.

The following figure explains how this feature works:

How Amazon Athena for Apache Spark works

On the Athena console, you can now run notebooks and run Spark applications with Python using Jupyter notebooks. In this Jupyter notebook, customers can query data from various sources and perform multiple calculations and data visualizations using Spark applications without context switching.

Amazon Athena integrates with AWS Glue Data Catalog, which helps customers to work with any data source in AWS Glue Data Catalog, including data in Amazon S3. This opens possibilities for customers in building applications to analyze and visualize data to explore data to prepare data sets for machine learning pipelines.

As I demonstrated in the demo preview section, the initialization for the workgroup running the Apache Spark engine takes under a second to run resources for interactive workloads. To make this possible, Amazon Athena for Apache Spark uses Firecracker, a lightweight micro-virtual machine, which allows for instant startup time and eliminates the need to maintain warm pools of resources. This benefits customers who want to perform interactive data exploration to get insights without having to prepare resources to run Apache Spark.

Get Started with Amazon Athena for Apache Spark
Let’s see how we can use Amazon Athena for Apache Spark. In this post, I will explain step-by-step how to get started with this feature.

The first step is to create a workgroup. In the context of Athena, a workgroup helps us to separate workloads between users and applications.

To create a workgroup, from the Athena dashboard, select Create Workgroup.

Select Create Workgroup

On the next page, I give the name and description for this workgroup.

Creating a workgroup

On the same page, I can choose Apache Spark as the engine for Athena. In addition, I also need to specify a service role with appropriate permissions to be used inside a Jupyter notebook. Then, I check Turn on example notebook, which makes it easy for me to get started with Apache Spark inside Athena. I also have the option to encrypt Jupyter notebooks managed by Athena or use the key I have configured in AWS Key Management Service (AWS KMS).

After that, I need to define an Amazon Simple Storage Service (Amazon S3) bucket to store calculation results from the Jupyter notebook. Once I’m sure of all the configurations for this workgroup, I just have to select Create workgroup.

Configure Calculation Results Settings

Now, I can see the workgroup already created in Athena.

Select newly created workgroup

To see the details of this workgroup, I can select the link from the workgroup. Since I also checked the Turn on example notebook when creating this workgroup, I have a Jupyter notebook to help me get started. Amazon Athena also provides flexibility for me to import existing notebooks that I can upload from my laptop with Import file or create new notebooks from scratch by selecting Create notebook.

Example notebook is available in the workgroup

When I select the Jupyter notebook example, I can start building my Apache Spark application.

When I run a Jupyter notebook, it automatically creates a session in the workgroup. Subsequently, each time I run a calculation inside the Jupyter notebook, all results will be recorded in the session. This way, Athena provides me with full information to review each calculation by selecting Calculation ID, which took me to the Calculation details page. Here, I can review the Code and also Results for the calculation.

Review code and results of a calculation

In the session, I can adjust the Coordinator size and Executor size, with 1 data processing unit (DPU) by default. A DPU consists of 4 vCPU and 16 GB of RAM. Changing to a larger DPU allows me to process tasks faster if I have complex calculations.

Configuring session parameters

Programmatic API Access
In addition to using the Athena console, I can also use programmatic access to interact with the Spark application inside Athena. For example, I can create a workgroup with the create-work-group command, start a notebook with create-notebook, and run a notebook session with start-session.

Using programmatic access is useful when I need to execute commands such as building reports or computing data without having to open the Jupyter notebook.

With my Jupyter notebook that I’ve created before, I can start a session by running the following command with the AWS CLI:

$> aws athena start-session \
    --work-group <WORKGROUP_NAME>\
    --engine-configuration '{"CoordinatorDpuSize": 1, "MaxConcurrentDpus":20, "DefaultExecutorDpuSize": 1, "AdditionalConfigs":{"NotebookId":"<NOTEBOOK_ID>"}}'
    --notebook-version "Jupyter 1"
    --description "Starting session from CLI"

{
    "SessionId":"<SESSION_ID>",
    "State":"CREATED"
}

Then, I can run a calculation using the start-calculation-execution API.

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>"
    --description "Demo"
    --code-block "print(5+6)"

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}

In addition to using code inline, with the --code-block flag, I can also pass input from a Python file using the following command:

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>"
    --description "Demo"
    --code-block file://<PYTHON FILE>

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}

Pricing and Availability
Amazon Athena for Apache Spark is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). To use this feature, you are charged based on the amount of compute usage defined by the data processing unit or DPU per hour. For more information see our pricing page here.

To get started with this feature, see Amazon Athena for Apache Spark to learn more from the documentation, understand the pricing, and follow the step-by-step walkthrough.

Happy building,

Donnie



from AWS News Blog https://ift.tt/25h0tZO
via IFTTT

Tuesday, November 29, 2022

New – Amazon Redshift Integration with Apache Spark

Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Spark application developers working in Amazon EMR, Amazon SageMaker, and AWS Glue often use third-party Apache Spark connectors that allow them to read and write the data with Amazon Redshift. These third-party connectors are not regularly maintained, supported, or tested with various versions of Spark for production.

Today we are announcing the general availability of Amazon Redshift integration for Apache Spark, which makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, enabling customers to open up the data warehouse for a broader set of AWS analytics and machine learning (ML) solutions.

With Amazon Redshift integration for Apache Spark, you can get started in seconds and effortlessly build Apache Spark applications in a variety of languages, such as Java, Scala, and Python.

Your applications can read from and write to your Amazon Redshift data warehouse without compromising on the performance of the applications or transactional consistency of the data, as well as performance improvements with pushdown optimizations.

Amazon Redshift integration for Apache Spark builds on an existing open source connector project and enhances it for performance and security, helping customers gain up to 10x faster application performance. We thank the original contributors on the project who collaborated with us to make this happen. As we make further enhancements we will continue to contribute back into the open source project.

Getting Started with Spark Connector for Amazon Redshift
To get started, you can go to AWS analytics and ML services, use data frame or Spark SQL code in a Spark job or Notebook to connect to the Amazon Redshift data warehouse, and start running queries in seconds.

In this launch, Amazon EMR 6.9, EMR Serverless, and AWS Glue 4.0 come with the pre-packaged connector and JDBC driver, and you can just start writing code. EMR 6.9 provides a sample notebook, and EMR Serverless provides a sample Spark Job too.

First, you should set AWS Identity and Access Management (AWS IAM) authentication between Redshift and Spark, between Amazon Simple Storage Service (Amazon S3) and Spark, and between Redshift and Amazon S3. The following diagram describes the authentication between Amazon S3, Redshift, the Spark driver, and Spark executors.

For more information, see Identity and access management in Amazon Redshift in the AWS documentation.

Amazon EMR
If you already have an Amazon Redshift data warehouse and the data available, you can create the database user and provide the right level of grants to the database user. To use this with Amazon EMR, you need to upgrade to the latest version of the Amazon EMR 6.9 that has the packaged spark-redshift connector. Select the emr-6.9.0 release when you create an EMR cluster on Amazon EC2.

You can use EMR Serverless to create your Spark application using the emr-6.9.0 release to run your workload.

EMR Studio also provides an example Jupyter Notebook configured to connect to an Amazon Redshift Serverless endpoint leveraging sample data that you can use to get started quickly.

Here is a Scalar example to build your applications both with Spark Dataframe and Spark SQL. Use IAM-based credentials for connecting to Redshift and use IAM role for unloading and loading data from S3.

// Create the JDBC connection URL and define the Redshift context
val jdbcURL = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
val rsOptions = Map (
  "url" -> jdbcURL,
  "tempdir" -> tempS3Dir, 
  "aws_iam_role" -> roleARN,
  )
// Reference the sales table from Redshift 
val sales_df = spark
  .read 
  .format("io.github.spark_redshift_community.spark.redshift") 
  .options(rsOptions) 
  .option("dbtable", "sales") 
  .load() 
sales_df.createOrReplaceTempView("sales") 
// Reference the date table from Redshift using Data Frame 
sales_df.join(date_df, sales_df("dateid") === date_df("dateid"))
  .where(col("caldate") === "2008-01-05")
  .groupBy().sum("qtysold")
  .select(col("sum(qtysold)"))
  .show() 

If Amazon Redshift and Amazon EMR are in different VPCs, you have to configure VPC peering or enable cross-VPC access. Assuming both Amazon Redshift and Amazon EMR are in the same virtual private cloud (VPC), you can create a Spark job or Notebook and connect to the Amazon Redshift data warehouse and write Spark code to use the Amazon Redshift connector.

To learn more, see Use Spark on Amazon Redshift with a connector in the AWS documentation.

AWS Glue
When you use AWS Glue 4.0, the spark-redshift connector is available both as a source and target. In Glue Studio, you can use a visual ETL job to read or write to a Redshift data warehouse simply by selecting a Redshift connection to use within a built-in Redshift source or target node.

The Redshift connection contains Redshift connection details along with the credentials needed to access Redshift with the proper permissions.

To get started, choose Jobs in the left menu of the Glue Studio console. Using either of the Visual modes, you can easily add and edit a source or target node and define a range of transformations on the data without writing any code.

Choose Create and you can easily add and edit a source, target node, and the transform node in the job diagram. At this time, you will choose Amazon Redshift as Source and Target.

Once completed, the Glue job can be executed on Glue for the Apache Spark engine, which will automatically use the latest spark-redshift connector.

The following Python script shows an example job to read and write to Redshift with dynamicframe using the spark-redshift connector.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("================ DynamicFrame Read ===============")
url = "jdbc:redshift://<RedshiftEndpoint>:<Port>/dev"
read_options = {
    "url": url,
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "include_column_list": "false"
}

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=read_options
) 

print("================ DynamicFrame Write ===============")

write_options = {
    "url": url,
    "dbtable": dbtable,
    "user": "awsuser",
    "password": "Password1",
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "DbUser": "awsuser"
}

print("================ dyf write result: check redshift table ===============")
redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=redshift_read,
    connection_type="redshift",
    connection_options=write_options
)

When you set up your job detail, you can only use the Glue 4.0 – Supports spark 3.3 Python 3 version for this integration.

To learn more, see Creating ETL jobs with AWS Glue Studio and Using connectors and connections with AWS Glue Studio in the AWS documentation.

Gaining the Best Performance
In the Amazon Redshift integration for Apache Spark, the Spark connector automatically applies predicate and query pushdown to optimize for performance. You can gain performance improvement by using the default Parquet format for the connector used for unloading with this integration.

As the following sample code shows, the Spark connector will turn the supported function into a SQL query and run the query in Amazon Redshift.

import sqlContext.implicits._val
sample= sqlContext.read
.format("io.github.spark_redshift_community.spark.redshift")
.option("url",jdbcURL )
.option("tempdir", tempS3Dir)
.option("unload_s3_format", "PARQUET")
.option("dbtable", "event")
.load()

// Create temporary views for data frames created earlier so they can be accessed via Spark SQL
sales_df.createOrReplaceTempView("sales")
date_df.createOrReplaceTempView("date")
// Show the total sales on a given date using Spark SQL API
spark.sql(
"""SELECT sum(qtysold)
| FROM sales, date
| WHERE sales.dateid = date.dateid
| AND caldate = '2008-01-05'""".stripMargin).show()

Amazon Redshift integration for Apache Spark adds pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from the Redshift data warehouse to the consuming Spark application, thereby improving performance.

Available Now
The Amazon Redshift integration for Apache Spark is now available in all Regions that support Amazon EMR 6.9, AWS Glue 4.0, and Amazon Redshift. You can start using the feature directly from EMR 6.9 and Glue Studio 4.0 with the new Spark 3.3.0 version.

Give it a try, and please send us feedback either in the AWS re:Post for Amazon Redshift or through your usual AWS support contacts.

Channy



from AWS News Blog https://ift.tt/oHLOzyd
via IFTTT

Preview: Amazon OpenSearch Serverless – Run Search and Analytics Workloads without Managing Clusters

Most AWS analytics services have compelling serverless offerings that make it even easier for customers to analyze vast amounts of data without having to configure, scale, or manage the underlying infrastructure.

Along with other serverless analytics, such as Amazon QuickSight for business intelligence and AWS Glue for data integration, we have introduced Amazon EMR Serverless, Amazon MSK Serverless, and Amazon Redshift Serverless this year.

Today, we announce the preview release of a new serverless option for Amazon OpenSearch Service that makes it easy for customers to run large-scale search and analytics workloads without managing clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads, eliminating the need to configure and optimize clusters.

With Amazon OpenSearch Serverless, you do not need to account for factors that are hard to know in advance, such as the frequency and complexity of queries or the volume of data expected to be analyzed. Instead of managing infrastructure, you can focus on using OpenSearch for exploring and deriving insights from your data. You can also get started using familiar APIs to load and query data and use OpenSearch Dashboards for interactive data analysis and visualization.

Configure Your OpenSearch Serverless Collection
To get started with Amazon OpenSearch Serverless, you create a Collection via the AWS Management Console, AWS Command-Line Interface (AWS CLI), or AWS API.

Before the launch of OpenSearch Serverless, you created a managed cluster, specifying instance types, counts, and storage options, and then managed the lifecycle and shard strategy for indices within that cluster. With OpenSearch Serverless, you create a Collection, which manages a group of indices that work together to support a specific workload. You no longer need to specify the hardware or manage the indices directly.

To create an OpenSearch Serverless collection and secure data, set up Encryption policies to assign AWS KMS keys to one or more collections and attach Network policies to collections to control the access from specified VPCs and public IP addresses.

To create an encryption policy, choose Encryption policies in the left navigation pane and Create encryption policy. Encryption at rest secures the indices within your collection. For each collection, AWS KMS generates a unique, symmetric encryption key. Encryption policies are the optimal way to manage AWS KMS keys across multiple collections. You can define the target collection name or a prefix that automatically applies the encryption settings from this policy to the collection.

In order for users to access a collection, choose Network policies in the left navigation pane and Create network policy. Network policies determine whether your collection is accessible over the internet from public networks or whether it must be accessed through OpenSearch Serverless–managed VPC endpoints.

You can define multiple rules for each collection, either the Public or VPC, as a recommended option for the Access Type. If you select a public option, you can access the collection from OpenSearch Dashboards.

Also, you can configure access for OpenSearch Dashboards and the OpenSearch endpoint. For the Resource type, enable both Access to OpenSearch endpoints and Access to OpenSearch Dashboards. In both input boxes, select the Collection Name property and your collection name or prefix.

Finally, to create an OpenSearch Serverless collection, choose Create collection in the home page or choose Collections in the left navigation pane and choose Create collection.

Input your collection name, description, and collection type, either Time series or Search by your data type.

  • Time series – The log analytics segment that focuses on analyzing large volumes of semistructured, machine-generated data in real time for operational, security, user behavior, and business insights.
  • Search – Full-text search that powers applications in your internal networks (content management systems, legal documents) and internet-facing applications such as e-commerce website search and content search.

When you choose Create, a collection typically takes less than a minute to initialize.

Upload and Search Data in Your Collection
Before uploading and searching data in your collection, configure the IAM policy to access the actual data within a collection. Choose Data access policies in the left navigation pane and Create data access policy.

You can apply multiple policies simultaneously to the same resource. Each policy contains a set of rules. Each rule has a resource (collection or index), permissions for the resource, and a list of principals (IAM users, role ARNs, or SAML identities).

Here is a sample policy that provides a single user the minimum permissions required to create an index in your collection, index some data, and search for it. Replace the principal ARN with the ARN of the account that you’ll use to sign in to OpenSearch Dashboards.

[
  {
    "Rules": [
      {
        "ResourceType": "index",
        "Resource": [
          "index/books/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:ReadDocument",
          "aoss:UpdateIndex",
          "aoss:DeleteIndex",
          "aoss:WriteDocument"
        ]
      }
    ],
    "Principal": [
      "arn:aws:iam::123456789012:user/admin"
    ]
  }
]

Now, you can upload data to an OpenSearch Serverless collection using Postman or curl. You can also use Dev Tools within the OpenSearch Dashboards console. Choose OpenSearch Dashboards on the detail page of your collection.

Sign in to OpenSearch Dashboards using the AWS access and secret keys for the principal that you specified in your data access policy. Within OpenSearch Dashboards, open the left navigation menu and choose Dev Tools.

To create a single index called books-index, run PUT books-index, and index your first single document into books-index.

You can also query search data in Dev Tools.

GET books_index/_search
{
    "query": {
    "simple_query_string": {
    "query": "Jeff",
    "fields": ["author"]
    } 
  }
}

In the case of time-series data, you can ingest data with all of the streaming ingestion options, such as native OpenSearch streaming APIs, Amazon Kinesis Data Firehose, AWS Glue, and a wide range of open-source streaming ingestion pipelines like Logstash, FluentBit, Fluentd, and Data Prepper.

In addition, you can snapshot your data from a managed cluster on OpenSearch Service and restore it to your collection, making it easy to migrate your workloads. Once your data is in your collection, you can then query it using your favorite OpenSearch client and interactively analyze and visualize your data using OpenSearch Dashboards.

Things to Know
Here are a couple of things to keep in mind about additional features and considerations when you choose Amazon OpenSearch Serverless:

  • SAML Authentications – You can use your existing identity provider to offer single sign-on (SSO) for the OpenSearch Dashboards endpoints of OpenSearch Serverless SAML authentication lets you use third-party identity providers to sign in to OpenSearch Dashboards to index and search data. OpenSearch Serverless supports providers that use the SAML 2.0 standard, such as Okta, Keycloak, Active Directory Federation Services, and Auth0.
  • Private VPC Endpoints – You can use AWS PrivateLink to create a private connection between your VPC and OpenSearch Serverless. You can access your collections as if they were in your VPC without the use of an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. To create an interface endpoint, choose VPC endpoints in the left navigation pane of OpenSearch Service.
  • Managed Clusters – You may prefer to use an option of Amazon OpenSearch Service’s managed clusters in scenarios where you need tight control over cluster configuration or specific customizations. For example, your workloads may need custom plugins that run best on accelerated computing instances and need more control on configuration such as data sharding strategy. You can choose either provisioned instances or serverless according to the requirements of your workload.

Join the Preview
The preview release of Amazon OpenSearch Serverless is now available in the US East (N. Virginia, Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo). With OpenSearch Serverless, there are no upfront costs, and you pay only for the data that is ingest and the queries you run. For pricing details, see the OpenSearch Service pricing page. To learn more, visit the Amazon OpenSearch Service User Guide.

We want to hear more feedback during the preview. Please send feedback to AWS re:Post for Amazon OpenSearch Service or through your usual AWS support contacts.

Channy



from AWS News Blog https://ift.tt/mPAnqeE
via IFTTT