The official home of the Presto distributed SQL query engine for big data

Overview

Presto is a distributed SQL query engine for big data.

See the User Manual for deployment instructions and end user documentation.

Requirements

  • Mac OS X or Linux
  • Java 8 Update 151 or higher (8u151+), 64-bit. Both Oracle JDK and OpenJDK are supported.
  • Maven 3.3.9+ (for building)
  • Python 2.4+ (for running with the launcher script)

Building Presto

Presto is a standard Maven project. Simply run the following command from the project root directory:

./mvnw clean install

On the first build, Maven will download all the dependencies from the internet and cache them in the local repository (~/.m2/repository), which can take a considerable amount of time. Subsequent builds will be faster.

Presto has a comprehensive set of unit tests that can take several minutes to run. You can disable the tests when building:

./mvnw clean install -DskipTests

Running Presto in your IDE

Overview

After building Presto for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Presto is a standard Maven project, you can import it into your IDE using the root pom.xml file. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

  • Open the File menu and select Project Structure
  • In the SDKs section, ensure that a 1.8 JDK is selected (create one if none exist)
  • In the Project section, ensure the Project language level is set to 8.0 as Presto makes use of several Java 8 language features

Presto comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:

  • Main Class: com.facebook.presto.server.PrestoServer
  • VM Options: -ea -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -Xmx2G -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties
  • Working directory: $MODULE_DIR$
  • Use classpath of module: presto-main

The working directory should be the presto-main subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.
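
These options point the server at the sample configuration under presto-main. The exact contents of the checked-in files may differ, but a minimal single-node development config.properties typically looks like this sketch:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080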

Additionally, the Hive plugin must be configured with the location of your Hive metastore Thrift service. Add the following to the list of VM options, replacing localhost:9083 with the correct host and port (or use the value below if you do not have a Hive metastore):

-Dhive.metastore.uri=thrift://localhost:9083

Using SOCKS for Hive or HDFS

If your Hive metastore or HDFS cluster is not directly accessible from your local machine, you can use SSH port forwarding to access it. Set up a dynamic SOCKS proxy with SSH listening on local port 1080:

ssh -v -N -D 1080 server

Then add the following to the list of VM options:

-Dhive.metastore.thrift.client.socks-proxy=localhost:1080

Running the CLI

Start the CLI to connect to the server and run SQL queries:

presto-cli/target/presto-cli-*-executable.jar
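
With no arguments, the CLI connects to a server on localhost:8080, which matches the development configuration above. To point it at another server, or to set a default catalog and schema, pass the standard flags (the hive/default pair below comes from the sample configuration):

presto-cli/target/presto-cli-*-executable.jar --server localhost:8080 --catalog hive --schema default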

Run a query to see the nodes in the cluster:

SELECT * FROM system.runtime.nodes;

In the sample configuration, the Hive connector is mounted in the hive catalog, so you can run the following query to show the tables in the Hive database default:

SHOW TABLES FROM hive.default;
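
The usual metadata statements are also handy for verifying the setup; for example, listing the mounted catalogs and the schemas in the hive catalog:

SHOW CATALOGS;
SHOW SCHEMAS FROM hive;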

Code Style

We recommend you use IntelliJ as your IDE. The code style template for the project can be found in the codestyle repository along with our general programming and Java guidelines. In addition to those, you should also adhere to the following:

  • Alphabetize sections in the documentation source files (both in table of contents files and other regular documentation files). In general, alphabetize methods/variables/sections if such ordering already exists in the surrounding code.
  • When appropriate, use the Java 8 stream API. However, note that the stream implementation does not perform well, so avoid using it in inner loops or otherwise performance-sensitive sections.
  • Categorize errors when throwing exceptions. For example, PrestoException takes an error code as an argument, PrestoException(HIVE_TOO_MANY_OPEN_PARTITIONS). This categorization lets you generate reports so you can monitor the frequency of various failures (see the sketch after this list).
  • Ensure that all files have the appropriate license header; you can generate the license by running mvn license:format.
  • Consider using String formatting (printf-style formatting using the Java Formatter class): format("Session property %s is invalid: %s", name, value) (note that format() should always be statically imported). Sometimes, if you only need to append something, consider using the + operator.
  • Avoid using the ternary operator except for trivial expressions.
  • Use an assertion from Airlift's Assertions class if there is one that covers your case rather than writing the assertion by hand. Over time we may move to more fluent assertions like AssertJ.
  • When writing a Git commit message, follow these guidelines.
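
For illustration, here is a minimal sketch combining the error-categorization and string-formatting conventions above. The class and validation logic are hypothetical; INVALID_SESSION_PROPERTY is assumed to be one of the error codes defined in StandardErrorCode:

import com.facebook.presto.spi.PrestoException;

import static com.facebook.presto.spi.StandardErrorCode.INVALID_SESSION_PROPERTY;
import static java.lang.String.format;

final class SessionPropertyChecks
{
    private SessionPropertyChecks() {}

    // Hypothetical helper: reject an invalid session property value
    static void checkProperty(String name, String value, boolean valid)
    {
        if (!valid) {
            // The error code categorizes the failure so its frequency can be reported
            throw new PrestoException(INVALID_SESSION_PROPERTY,
                    format("Session property %s is invalid: %s", name, value));
        }
    }
}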

Building the Web UI

The Presto Web UI is composed of several React components and is written in JSX and ES6. This source code is compiled and packaged into browser-compatible JavaScript, which is then checked in to the Presto source code (in the dist folder). You must have Node.js and Yarn installed to execute these commands. To update this folder after making changes, simply run:

yarn --cwd presto-main/src/main/resources/webapp/src install

If no JavaScript dependencies have changed (i.e., no changes to package.json), it is faster to run:

yarn --cwd presto-main/src/main/resources/webapp/src run package

To simplify iteration, you can also run in watch mode, which automatically re-compiles when changes to source files are detected:

yarn --cwd presto-main/src/main/resources/webapp/src run watch

To iterate quickly, simply re-build the project in IntelliJ after packaging is complete. Project resources will be hot-reloaded and changes are reflected on browser refresh.

Release Notes

When authoring a pull request, the PR description should include its relevant release notes. Follow the Release Notes Guidelines when authoring release notes.

Issues
  • Fix optimized parquet reader complex hive types processing

    • Fix reading of repeated fields when a Parquet file consists of multiple pages, so that a field can begin on one page and end on the next.

    • Support reading empty arrays.

    • Determine null values of optional fields.

    • Add tests for Hive complex types: arrays, maps, and structs.

    • Rewrite tests to read Parquet files consisting of multiple pages.

    • Add TestDataWritableWriter with a patch for empty arrays and empty maps; the bug https://issues.apache.org/jira/browse/HIVE-13632 is already fixed in the current Hive version, so Presto should be able to read empty arrays too.

    CLA Signed 
    opened by kgalieva 77
  • Add Apache Accumulo connector and documentation

    A Presto connector for Apache Accumulo. See the RST in presto-docs for more details. I've completed the CLA.

    CLA Signed 
    opened by adamjshook 60
  • Presto ranger integration with row-level filters and column masking

    This PR includes #8980 and #10996, and also adds row-level filtering.

    CLA Signed stale 
    opened by cryptoe 51
  • GeoSpatial functions and optimization in Presto

    CLA Signed changes-requested 
    opened by zhenxiao 47
  • Add support for prepared statements in JDBC driver

    I'm using presto-jdbc-0.66-SNAPSHOT.jar and trying to execute a Presto query against presto-server from my Java application.

    The sample code below, using a JDBC Statement, works well.

        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        Connection connection = DriverManager.getConnection("jdbc:presto://192.168.33.33:8080/hive/default", "hive", "hive");
    
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery("SHOW TABLES");
        while(rs.next()) {
            System.out.println(rs.getString(1));
        }
    

    However, using a JDBC PreparedStatement throws an exception. Does presto-jdbc not yet support PreparedStatement? Here's my test code and exception info.

    Test Code :

        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        Connection connection = DriverManager.getConnection("jdbc:presto://192.168.33.33:8080/hive/default", "hive", "hive");
    
        PreparedStatement ps = connection.prepareStatement("SHOW TABLES");
        ResultSet rs = ps.executeQuery();
        while(rs.next()) {
            System.out.println(rs.getString(1));
        }
    

    Exception Info :

        java.lang.UnsupportedOperationException: PreparedStatement
            at com.facebook.presto.jdbc.PrestoPreparedStatement.<init>(PrestoPreparedStatement.java:44)
            at com.facebook.presto.jdbc.PrestoConnection.prepareStatement(PrestoConnection.java:93)
            at com.nsouls.frescott.hive.mapper.PrestoConnectionTest.testShowTable(PrestoConnectionTest.java:37)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
            at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
            at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
            at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
            at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
            at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
            at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
            at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
            at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
            at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:88)
            at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
            at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
            at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
            at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
            at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
            at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
            at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
            at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
            at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
            at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
            at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
            at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
            at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
    
    opened by felika 46
  • Implement EXPLAIN ANALYZE

    This should work similarly to PostgreSQL (http://www.postgresql.org/docs/9.4/static/sql-explain.html): execute the query, record stats, and then render the stats along with the plan. A first pass at implementing this could be to render similarly to EXPLAIN (TYPE DISTRIBUTED) with the stage and operator stats inserted.

    enhancement 
    opened by cberner 42
  • Add support for query pushdown to S3 using S3 Select

    This change will allow Presto users to improve the performance of their queries using S3SelectPushdown. It pushes down projections and predicate evaluations to S3. As a result, Presto doesn't need to download full S3 objects; only the data required to answer the user's query is returned to Presto, thereby improving performance.

    S3SelectPushdown Technical Document: S3SelectPushdown.pdf

    This PR is a continuation of https://github.com/prestodb/presto/pull/11033.

    CLA Signed 
    opened by same3r 42
  • Performance Regressions in Presto 0.206?

    I was recently benchmarking Presto 0.206 against 0.172. The tests were run on Parquet datasets stored on S3.

    We found that while Presto 0.206 was generally faster on smaller datasets, there were some significant performance regressions on larger datasets. The CPU time reported by EXPLAIN ANALYZE was lower in 0.206 than in 0.172, but the wall time was much longer.

    This possibly indicates either stragglers or some sort of scheduling bug that adversely affects parallelism. Note that the concurrency settings like task.concurrency are the same in both clusters.

    For instance, on the TPCH scale 1000 dataset, query #7 slowed down by a factor of 2x in wall time. The query was:

    SELECT supp_nation,
           cust_nation,
           l_year,
           sum(volume) AS revenue
    FROM
      (SELECT n1.n_name AS supp_nation,
              n2.n_name AS cust_nation,
              substr(l_shipdate, 1, 4) AS l_year,
              l_extendedprice * (1 - l_discount) AS volume
       FROM lineitem_parq,
            orders_parq,
            customer_parq,
            supplier_parq,
            nation_parq n1,
            nation_parq n2
       WHERE s_suppkey = l_suppkey
         AND o_orderkey = l_orderkey
         AND c_custkey = o_custkey
         AND s_nationkey = n1.n_nationkey
         AND c_nationkey = n2.n_nationkey
         AND ((n1.n_name = 'KENYA'
               AND n2.n_name = 'PERU')
              OR (n1.n_name = 'PERU'
                  AND n2.n_name = 'KENYA'))
         AND l_shipdate BETWEEN '1995-01-01' AND '1996-12-31' ) AS shipping
    GROUP BY supp_nation,
             cust_nation,
             l_year
    ORDER BY supp_nation,
             cust_nation,
             l_year;
    

    I compared the output of EXPLAIN ANALYZE from both versions of Presto and could not find anything that would explain this. Here are some observations:

    • The CPU time reported by each stage was usually lower in 0.206. This probably rules out operator performance regressions.
    • Some of the leaf stages were using ScanProject in 0.172, but they use ScanFilterProject in 0.206. This actually reduces the output rows and leads to drastically lower CPU usage in upper stages of the query tree. This is a big improvement and should have led to faster query processing.

    References

    • Explain analyze from 0.206 - https://gist.github.com/anoopj/40eea820c1c310dff72139d495ac98b0
    • Explain analyze from 0.172 - https://gist.github.com/anoopj/01985fe0ad298dad4c22b1444e1f1e21
    opened by anoopj 39
  • Feature explain analyze v2

    CLA Signed 
    opened by sopel39 38
  • Prune Nested Fields for Parquet Columns

    Read only the necessary fields for Parquet nested columns. Currently, Presto will read all the fields in a struct for Parquet columns, e.g.

    select s.a, s.b
    from t
    

    If it is a Parquet file with struct column s: {a int, b double, c long, d float}, current Presto will read a, b, c, and d from s and output just a and b.

    For columnar storage formats such as Parquet or ORC, we could do better by reading only the necessary fields. In the previous example, read just {a int, b double} from s, skipping the other fields to save IO.

    This patch introduces an optional NestedFields in ColumnHandle. When optimizing the plan, the PruneNestedColumns optimizer will visit expressions and put candidate nested fields into the ColumnHandle. When scanning Parquet files, the record reader can use NestedFields to specify the necessary fields for Parquet files.

    This has a dependency on @jxiang's https://github.com/prestodb/presto/pull/4714, which gives us the flexibility to specify metastore schemas differently from Parquet file schemas.

    @dain @martint @electrum @cberner @erichwang any comments are appreciated

    CLA Signed 
    opened by zhenxiao 36
  • source code

    Why are there no commits in the Presto source code?

    opened by ziyangx 0
  • Bump lodash from 4.17.10 to 4.17.21 in /presto-main/src/main/resources/webapp/src

    Bumps lodash from 4.17.10 to 4.17.21.

    Commits
    • f299b52 Bump to v4.17.21
    • c4847eb Improve performance of toNumber, trim and trimEnd on large input strings
    • 3469357 Prevent command injection through _.template's variable option
    • ded9bc6 Bump to v4.17.20.
    • 63150ef Documentation fixes.
    • 00f0f62 test.js: Remove trailing comma.
    • 846e434 Temporarily use a custom fork of lodash-cli.
    • 5d046f3 Re-enable Travis tests on 4.17 branch.
    • aa816b3 Remove /npm-package.
    • d7fbc52 Bump to v4.17.19
    • Additional commits viewable in compare view
    Maintainer changes

    This version was pushed to npm by bnjmnt4n, a new releaser for lodash since your current version.


    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies javascript 
    opened by dependabot[bot] 1
  • Short-circuit TupleDomain columnWiseUnion and intersect

    == NO RELEASE NOTE ==
    
    opened by zhenxiao 0
  • [doc] fix typo

    fix typo

    opened by linlinnn 0
  • Revert "Fix dynamic pruning for null keys in hive partition"

    == NO RELEASE NOTE ==
    
    opened by kewang1024 0
  • Adding support to java.time in HiveMetadataFactory

    Test plan - (Please fill in how you tested your changes)

    Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines. Don't forget to follow our attribution guidelines for any code copied from other projects.

    Fill in the release notes towards the bottom of the PR description. See Release Notes Guidelines for details.

    == RELEASE NOTES ==
    
    General Changes
    * HiveMetadataFactory's constructor accepts a Joda DateTimeZone. Add another constructor that accepts java.util.TimeZone so that dependent projects can transition from Joda-Time to java.util.
    
    

    If release note is NOT required, use:

    == NO RELEASE NOTE ==
    
    opened by pavithranrao 1
  • Add Weibull CDF and inverse Weibull CDF

    Part of #15798 (Presto Hackathon)

    PR: #15820

    cc: @rongrong

    opened by jsawruk 0
  • Add documentation for Glue Catalog support in Hive

    Cherry pick of https://github.com/trinodb/trino/pull/3689 and https://github.com/vincentpoon/prestosql/commit/c1ac9ac257bc5a07f32a359c7aac6735fd6ef69f

    Co-authored-by: Philippe Gagnon [email protected] Co-authored-by: Ashhar Hasan [email protected]

    == NO RELEASE NOTE ==
    
    opened by v-jizhang 0
  • Add Cauchy CDF and inverse Cauchy CDF

    Part of #15798 (Presto Hackathon)

    PR: #15818

    cc: @rongrong

    opened by jsawruk 0
  • Add hidden column $file_modified_time and $filesize in Hive

    Fixes #16054

    == RELEASE NOTES ==
    
    Hive Changes
    * Add hidden column ``$file_size``.
    * Add hidden column ``$file_modified_time``.
    
    
    opened by agrawalreetika 0