{"id":24816,"date":"2021-04-08T14:51:21","date_gmt":"2021-04-08T22:51:21","guid":{"rendered":"https:\/\/www.spokeo.com\/compass\/?p=24816"},"modified":"2022-09-02T10:37:27","modified_gmt":"2022-09-02T18:37:27","slug":"whats-the-fastest-way-to-store-intermediate-results-in-spark","status":"publish","type":"post","link":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/","title":{"rendered":"What&#8217;s the fastest way to store intermediate results in Spark?"},"content":{"rendered":"\n<p>TL;DR: Sometimes, writing an intermediate result to disk and then reading it back again is faster than checkpointing or unoptimized caching!<\/p>\n\n\n\n<p>At Spokeo, we use Spark on Amazon EMR to build data from our data lake into documents that can be consumed by our backend services and eventually shown in the frontend. Therefore, many of our ETL jobs are concerned with producing one result at the end of a script. Getting to the end may involve many heavy calculations, e.g. windowing on id, before joining back to the main dataframe. In some cases, the main dataframe is later split into two with filters, then something expensive is done to the smaller dataframe before we union them back again.<\/p>\n\n\n\n<p>For our purposes, storing intermediate results is desirable for two reasons.&nbsp;<\/p>\n\n\n\n<p>First &#8211; storing the intermediate result improves fault tolerance. If we spend a lot of time and work doing a complex transformation and something downstream of it fails, storing the intermediate result prevents us from having to start again.&nbsp;<\/p>\n\n\n\n<p>Second &#8211; some methods of storing intermediate results truncate the lineage of the data transformation. This can help prevent applications from crashing due to memory exhaustion. 
This is a risk with applications that process a lot of data through complex and expensive transformations.<\/p>\n\n\n\n<p>Methods that come immediately to mind are `cache` and `checkpoint`. Let\u2019s explore the different usages in the context of PySpark ETL.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Cache and Checkpoint Variations<\/h2>\n\n\n\n<p>Cache: <a href=\"https:\/\/medium.com\/swlh\/caching-spark-dataframe-how-when-79a8c13254c0\">Quote:<\/a> &#8220;If you\u2019re executing multiple actions on the same DataFrame then cache it.&#8221; Common advice is to follow `df.cache()` with an action such as `df.count()` to trigger caching right away. Can we still trigger the cache with only one action at the very end of the script?<\/p>\n\n\n\n<p>Cache and count: The intuition behind this is that counting a dataframe forces all of its contents into memory. The same intuition applies to calling `df.show()`, except that `df.show()` may cache only a fraction of the dataframe, whereas `df.count()` caches 100%.<\/p>\n\n\n\n<p>Checkpoint: `checkpoint` is used to truncate logical plans. It\u2019s useful when the logical plan becomes very large, e.g. when iterative unions cause out-of-memory errors (<a href=\"https:\/\/changhsinlee.com\/pyspark-dataframe-basics\/\">ref<\/a>). This is expected to be slower than cache because it commits the result to disk. There is an RDD checkpoint API (<a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/pyspark.html#pyspark.RDD.checkpoint\">ref<\/a>) and a DataFrame checkpoint API (<a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/pyspark.sql.html?highlight=dateformat#pyspark.sql.DataFrame.cache\">ref<\/a>). 
We will try the latter.<\/p>\n\n\n\n<p>Cache and checkpoint: There is a warning on the RDD checkpoint API: \u201cIt is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.\u201d So we will try cache and checkpoint together, and look out for signs of skipped recomputation.<\/p>\n\n\n\n<p>Writing a dataframe and then reading it back (using parquet storage): In theory this produces an extremely clean truncation of the logical plan, but it may be less performant because it writes to disk.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Experiment<\/h1>\n\n\n\n<p>We will try these five approaches, plus a control, in our crazy ETL script below. Here is our flow:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Do something expensive first (self-join)<\/li><li>Store the intermediate layer with different methods<\/li><li>Split the dataframe with filters<\/li><li>Union them back and write the result<\/li><\/ol>\n\n\n\n<p>We will run this locally in pyspark 2.4.4, inspect SparkUI, and run each method 20 times to compare performance. 
We will take measurements in pyspark 3.0.1.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"551\" src=\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image.png?resize=786%2C551&#038;ssl=1\" alt=\"\" class=\"wp-image-24817\" srcset=\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image.png?w=786&amp;ssl=1 786w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image.png?resize=300%2C210&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image.png?resize=768%2C538&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image.png?resize=585%2C410&amp;ssl=1 585w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"805\" src=\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-1.png?resize=786%2C805&#038;ssl=1\" alt=\"\" class=\"wp-image-24818\" srcset=\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-1.png?w=786&amp;ssl=1 786w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-1.png?resize=293%2C300&amp;ssl=1 293w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-1.png?resize=768%2C787&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-1.png?resize=585%2C599&amp;ssl=1 585w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><\/figure>\n\n\n\n<p>Performance is measured like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"415\" src=\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-2.png?resize=785%2C415&#038;ssl=1\" alt=\"\" class=\"wp-image-24819\" srcset=\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-2.png?w=785&amp;ssl=1 785w, 
https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-2.png?resize=300%2C159&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-2.png?resize=768%2C406&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/image-2.png?resize=585%2C309&amp;ssl=1 585w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\">Results<\/h1>\n\n\n\n<h3 class=\"wp-block-heading\">SparkUI Observations<\/h3>\n\n\n\n<p>Posting every screenshot here would be too crowded, so we describe the differences in words and include only the screenshots of interest.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test Control<\/h3>\n\n\n\n<p>In `test_control` we see filter pushdown: the filters are applied before each self join. So we get four filters, two self joins, and one final union.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test Cache<\/h3>\n\n\n\n<p>In `test_cache` we see a plan that follows our code more closely, performing the self join first, caching the result second (see green dot), then applying the filters before the final union.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test Cache and Count<\/h3>\n\n\n\n<p>In `test_cache_and_count` we see the same logical and physical plans as for `test_cache`. But here we have two jobs! Job0 to count &#8211; it includes one more stage to aggregate compared to Job0 in `test_cache`; Job1 to parquet &#8211; it resumes work from the green dot in the last stage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test Checkpoint<\/h3>\n\n\n\n<p>In `test_checkpoint` we see the same parsed logical plan but a shorter Optimized Logical Plan and Physical Plan. 
It takes three jobs to complete: Job0 checkpoints the self join; Job1 checkpoints again but skips the first two stages; Job2 writes to parquet and runs from beginning to end, as in `test_control`, without skipping any stages.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test Cache and Checkpoint<\/h3>\n\n\n\n<p>In `test_cache_and_checkpoint` we see the same plans as for `test_cache`, except that we have three jobs here: Job0 checkpoints the self join; Job1 checkpoints again but now resumes work from the green dot; Job2 writes to parquet and also resumes work from the green dot.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Test Write and Read<\/h3>\n\n\n\n<p>In `test_write_and_read` we see that the final output plan reflects only work from the intermediate result. Job0 writes the intermediate result, Job1 reads it back, and Job2 writes the final parquet.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/us8VZpshheeSmdzUaVDXRoSF9xAB88xzDTxTKSPkebsPWkg-cPF7rWTsDkWjXe-AjqErnT0vlalUgXHkMCNeINvBgBcgmxZ3AHaJmphy_G7-C6xPjZmOUwtzUEYGfzUNNfH6be23\" alt=\"\"\/><\/figure>\n\n\n\n<p>Figure 1: Control DAG<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/zSoV4aF5s9Gamx0LxFx5CP2R1jqpiHVDC7NlUJA-BfWo-H4z9OnlSanz00IkYB7zjh6v8E6GzcxSqFpvmTUDLVWpfpryn7dxfJG7J2Cm6O88-9kUHSdeQfTkmRb5wrk09dHLPkVe\" alt=\"\"\/><\/figure>\n\n\n\n<p>Figure 2: Cache Job DAG<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/f_CRtZWlZesjHxY3-pZ197uYrK4K3p1JdFe8g7Cslt1jCje_-mx_yrm2E2wJSzeoBPW6rtZNQmQYon86Ws53-Q2kg2Pu_qT-NYvQYuB6nF1B9VXr-I5Lf9FhgEQss5cYTgZMhzSS\" alt=\"\"\/><\/figure>\n\n\n\n<p>Figure 3: Checkpoint Job0 DAG<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" 
src=\"https:\/\/lh4.googleusercontent.com\/ika5AO26Y86vmO9xyQD2mZjloB4MJK1SYBPQEzYgGb283lje3lo5lze_82DT7IH3JRh0imUvgt5iC5BeOEkxaHoObEw_zp76Ohsyt2fZ7TYMxbW2K4ovyVKaAjjP_t_2uPdCUTPB\" alt=\"\"\/><\/figure>\n\n\n\n<p>Figure 4: Checkpoint Job1 DAG<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/p15tuwyonafAZH8VfdhhpIrA0QDeROu-pYYU9-Fc3OJeJvQT1OJVpCYstctA34jBuNtXplEKSQa4P_G78lDXrkLAigOZbb_VXnpHfLHVHgvNdS7UHIHT-SYCGyHT16TLFDbJj_AO\" alt=\"\"\/><\/figure>\n\n\n\n<p>Figure 5: Checkpoint Job2 DAG<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Performance<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/dcmr0C5rP9Q1mx5a4cxkBqGjVlIdd-4Ebpi9NuvuY8tbu0y3ZFzbwVWDicYqBm50xVM3Yb5v8OZ0lBuZTY7Zuye71xFpzXJle-cqwOvPGex4eQuezRufzFgu1Tw5Ey5bb2UyIzrm\" alt=\"\"\/><\/figure>\n\n\n\n<p>Figure 6: Time to finish by intermediate storage method in Spark 2.4.4 and 3.0.1<\/p>\n\n\n\n<p>Figure 6 shows performance per test group in seconds; each box represents 20 measurements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Discussion<\/h2>\n\n\n\n<p>Filter pushdown helped `test_control` achieve the best time. This means that even though it\u2019s good to store products from a long compute for faster discovery, we still need to be careful about introducing stop points, as they block Catalyst optimizations (any of those listed <a href=\"https:\/\/jaceklaskowski.gitbooks.io\/mastering-spark-sql\/content\/spark-sql-Optimizer.html#operatorOptimizationRuleSet\">here<\/a>) across the general flow. We should not insert a cache point here to fork the data.<\/p>\n\n\n\n<p>Simple `cache` performs a bit better than `cache and count`. This dispels the need to count right away after caching. 
The green dot in the `cache` DAG confirms that the intermediate result is saved to memory and reused.<\/p>\n\n\n\n<p>`write and read` performs comparably to `cache`! Note that `cache` here means `persist(StorageLevel.MEMORY_AND_DISK)`; see the pyspark 2.4.4 <a href=\"https:\/\/spark.apache.org\/docs\/2.4.4\/api\/python\/pyspark.sql.html#pyspark.sql.DataFrame.cache\">ref<\/a>. That `cache` does not do better here suggests there is room for memory tuning; see the <a href=\"https:\/\/spark.apache.org\/docs\/2.4.1\/tuning.html#memory-tuning\">guide<\/a>. Tuning options include using the Kryo serializer (highly recommended) and using serialized caching, e.g. MEMORY_AND_DISK_SER, to reduce memory footprint and GC pressure. The same guide says of garbage collection (<a href=\"https:\/\/spark.apache.org\/docs\/2.4.1\/tuning.html#garbage-collection-tuning)\">here<\/a>): \u201cwhen Java needs to evict old objects to make room for new ones, it will need to trace through all your Java objects and find the unused ones.\u201d<\/p>\n\n\n\n<p>`write and read` via parquet is a good alternative to using `df.checkpoint`. It is faster, and it lets us control how we read (hdfs or s3) and write. We can avoid rework in a subsequent run by using `df.write.mode(\"ignore\")`! Another benefit is that it relieves memory pressure from processes that may have memory leaks (e.g. pandas udfs). Note that we do need to avoid special characters in column names when storing to parquet.<\/p>\n\n\n\n<p>`checkpoint` results confirm the recomputation that takes place. Recomputation is reduced in `cache and checkpoint`, where work is resumed from the green dot. However, neither checkpoint method breaks the lineage as cleanly as `write and read`. 
Lineage is still preserved in both parsed logical plans.<\/p>\n\n\n\n<p>To fully enjoy the benefits of storing intermediate results in a way that writes to disk, it is important to use a cluster architecture with strong rack locality for storage (for example, using a core instance type in EMR that has attached storage).<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Takeaways<\/h1>\n\n\n\n<p>Here are the key takeaways:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Insert intermediate points with caution, as they may block Catalyst optimizations<\/li><li>Look for opportunities to apply filters early when introducing intermediate points<\/li><li>Invest time in memory tuning when using cache, else use write and read via parquet<\/li><li>For our purposes, write and read is as good as, if not better than, checkpoint<\/li><\/ol>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR: Sometimes, writing an intermediate result to disk and then reading it back again is faster than checkpointing or unoptimized caching! At Spokeo, we use Spark on Amazon EMR to&hellip;<\/p>\n","protected":false},"author":88,"featured_media":24820,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[722],"tags":[592],"class_list":["post-24816","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-blog","tag-recruiting"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>What&#039;s the fastest way to store intermediate results in Spark? 
| Spokeo<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What&#039;s the fastest way to store intermediate results in Spark? | Spokeo\" \/>\n<meta property=\"og:description\" content=\"TL;DR: Sometimes, writing an intermediate result to disk and then reading it back again is faster than checkpointing or unoptimized caching! At Spokeo, we use Spark on Amazon EMR to&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/\" \/>\n<meta property=\"og:site_name\" content=\"The Compass Blog | Digital Identity and People Search | Spokeo\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Spokeo\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-04-08T22:51:21+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-09-02T18:37:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"800\" \/>\n\t<meta property=\"og:image:height\" content=\"411\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Olivia Tighe\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Spokeo\" \/>\n<meta name=\"twitter:site\" content=\"@Spokeo\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Olivia Tighe\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/\",\"url\":\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/\",\"name\":\"What's the fastest way to store intermediate results in Spark? | Spokeo\",\"isPartOf\":{\"@id\":\"https:\/\/www.spokeo.com\/compass\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1\",\"datePublished\":\"2021-04-08T22:51:21+00:00\",\"dateModified\":\"2022-09-02T18:37:27+00:00\",\"author\":{\"@id\":\"https:\/\/www.spokeo.com\/compass\/#\/schema\/person\/79de1b18e01fb71637ea971e65b66b34\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/#primaryimage\",\"url\":\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1\",\"width\":800,\"height\":411},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.spokeo.com\/compass\/#website\",\"url\":\"https:\/\/www.spokeo.com\/compass\/\",\"name\":\"The Compass Blog | 
Digital Identity and People Search | Spokeo\",\"description\":\"The official Spokeo blog covers topics such as digital identity, consumer protection and privacy, how to avoid scams and catfishing, and more.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.spokeo.com\/compass\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.spokeo.com\/compass\/#\/schema\/person\/79de1b18e01fb71637ea971e65b66b34\",\"name\":\"Olivia Tighe\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.spokeo.com\/compass\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5147448fc308e61aecaa6782b45b69738d5ed842652b02bffc24ca6e0a7a1911?s=96&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5147448fc308e61aecaa6782b45b69738d5ed842652b02bffc24ca6e0a7a1911?s=96&r=g\",\"caption\":\"Olivia Tighe\"},\"url\":\"https:\/\/www.spokeo.com\/compass\/author\/olivia\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"What's the fastest way to store intermediate results in Spark? | Spokeo","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/","og_locale":"en_US","og_type":"article","og_title":"What's the fastest way to store intermediate results in Spark? | Spokeo","og_description":"TL;DR: Sometimes, writing an intermediate result to disk and then reading it back again is faster than checkpointing or unoptimized caching! 
At Spokeo, we use Spark on Amazon EMR to&hellip;","og_url":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/","og_site_name":"The Compass Blog | Digital Identity and People Search | Spokeo","article_publisher":"https:\/\/www.facebook.com\/Spokeo\/","article_published_time":"2021-04-08T22:51:21+00:00","article_modified_time":"2022-09-02T18:37:27+00:00","og_image":[{"width":800,"height":411,"url":"https:\/\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg","type":"image\/jpeg"}],"author":"Olivia Tighe","twitter_card":"summary_large_image","twitter_creator":"@Spokeo","twitter_site":"@Spokeo","twitter_misc":{"Written by":"Olivia Tighe","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/","url":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/","name":"What's the fastest way to store intermediate results in Spark? 
| Spokeo","isPartOf":{"@id":"https:\/\/www.spokeo.com\/compass\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/#primaryimage"},"image":{"@id":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1","datePublished":"2021-04-08T22:51:21+00:00","dateModified":"2022-09-02T18:37:27+00:00","author":{"@id":"https:\/\/www.spokeo.com\/compass\/#\/schema\/person\/79de1b18e01fb71637ea971e65b66b34"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.spokeo.com\/compass\/whats-the-fastest-way-to-store-intermediate-results-in-spark\/#primaryimage","url":"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1","contentUrl":"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1","width":800,"height":411},{"@type":"WebSite","@id":"https:\/\/www.spokeo.com\/compass\/#website","url":"https:\/\/www.spokeo.com\/compass\/","name":"The Compass Blog | Digital Identity and People Search | Spokeo","description":"The official Spokeo blog covers topics such as digital identity, consumer protection and privacy, how to avoid scams and catfishing, and 
more.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.spokeo.com\/compass\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.spokeo.com\/compass\/#\/schema\/person\/79de1b18e01fb71637ea971e65b66b34","name":"Olivia Tighe","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.spokeo.com\/compass\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5147448fc308e61aecaa6782b45b69738d5ed842652b02bffc24ca6e0a7a1911?s=96&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5147448fc308e61aecaa6782b45b69738d5ed842652b02bffc24ca6e0a7a1911?s=96&r=g","caption":"Olivia Tighe"},"url":"https:\/\/www.spokeo.com\/compass\/author\/olivia\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.spokeo.com\/compass\/image\/Grey-3-2260x1160-1-scaled.jpg?fit=800%2C411&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p8V62u-6sg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/posts\/24816","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/users\/88"}],"replies":[{"embeddable":true,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/comments?post=24816"}],"version-history":[{"count":2,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/posts\/24816\/revisions"}],"predecessor-version":[{"id":24822,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/posts\/24816\/revisions\/24822"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/media\/24820"}],"wp:attachment":[{"href":"htt
ps:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/media?parent=24816"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/categories?post=24816"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.spokeo.com\/compass\/wp-json\/wp\/v2\/tags?post=24816"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}