
hadoop - How can we solve the two-sum problem as a big data problem leveraging MapReduce or Spark?

Assume the list/array of numbers lives in a very large data file and we need to find the pairs that sum to a particular number 'k'. I know how to solve this with the usual in-memory data structures, but I cannot think of an approach that leverages Hadoop MapReduce or Spark in particular.

Assume a file containing 1, 2, 3, 6, 7, 7, 8, 9. My thought process: load the data into a dataframe and add one more column holding the complement, i.e. if i <= k/2 then k - i else i. The dataframe for the above data then looks like:
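For illustration, here is a rough sketch of building that complement column with the Spark Java API; k = 10 and the column name "complement" are just assumed example choices, since the question leaves k unspecified:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

long k = 10;  // assumed example target; the question leaves k unspecified
SparkSession spark = SparkSession
        .builder()
        .appName("Complement column sketch")
        .master("local[*]")
        .getOrCreate();
// The numbers from the example above, as a one-column dataframe.
Dataset<Row> numbers = spark
        .createDataset(Arrays.asList(1L, 2L, 3L, 6L, 7L, 7L, 8L, 9L), Encoders.LONG())
        .toDF("number");
// The extra column described above: if i <= k/2 then k - i else i.
Dataset<Row> withComplement = numbers.withColumn("complement",
        when(col("number").leq(k / 2), lit(k).minus(col("number")))
                .otherwise(col("number")));
withComplement.show();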



1 Answer


Imagine you have a file named numbers.txt like the following:

10
5
8
7
3
6
9
11
3
1

You can achieve your goal like this:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

int desiredSum = 15;
SparkSession spark = SparkSession
        .builder()
        .appName("My App")
        .master("local[*]")
        .getOrCreate();
// Read each line of the file as text and cast it to a long column named "number".
Dataset<Row> numbers = spark
        .read()
        .text("numbers.txt")
        .withColumnRenamed("value", "number")
        .withColumn("number", col("number").cast(DataTypes.LongType));
numbers.createOrReplaceTempView("myTable");
// Self-join the table on "first.number + second.number = desiredSum";
// the "first.number <= second.number" condition keeps each unordered pair only once.
spark.sql("select first.number, second.number as number_2 from myTable first inner join myTable second on first.number + second.number = " + desiredSum + " where first.number <= second.number").show();



+------+--------+
|number|number_2|
+------+--------+
|     5|      10|
|     7|       8|
|     6|       9|
+------+--------+

Or, if the data is small, you can achieve your goal using a Cartesian product in Spark like this:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

int desiredSum = 15;
SparkSession spark = SparkSession
        .builder()
        .appName("My App")
        .master("local[*]")
        .getOrCreate();
Dataset<Row> numbers = spark
        .read()
        .text("numbers.txt")
        .withColumnRenamed("value", "number")
        .withColumn("number", col("number").cast(DataTypes.LongType));
// Pair every number with every other number, keeping each unordered pair only once.
Dataset<Row> joined = numbers.crossJoin(numbers.withColumnRenamed("number", "number_2")).filter("number <= number_2");
// UDF that adds the two columns, then keep only the pairs that hit the target sum.
UserDefinedFunction sumUdf = udf((UDF2<Long, Long, Object>) Long::sum, DataTypes.LongType);
joined = joined.withColumn("sum", sumUdf.apply(col("number"), col("number_2"))).filter("sum = " + desiredSum);
joined.show();

which results in this:

+------+--------+---+
|number|number_2|sum|
+------+--------+---+
|     5|      10| 15|
|     7|       8| 15|
|     6|       9| 15|
+------+--------+---+


**Take into account the time and space complexity when you use a cross join.**
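If the input is too large for a cross join (or for a self-join on a non-equality condition), one possible alternative is the complement idea sketched in the question: precompute desiredSum - number for every row and join on plain equality, which Spark can execute as an ordinary equi-join rather than a Cartesian product. A rough sketch along those lines, reading the same numbers.txt; the "complement" column name is my own choice:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

long desiredSum = 15;
SparkSession spark = SparkSession
        .builder()
        .appName("Two-sum equi-join sketch")
        .master("local[*]")
        .getOrCreate();
Dataset<Row> numbers = spark
        .read()
        .text("numbers.txt")
        .withColumnRenamed("value", "number")
        .withColumn("number", col("number").cast(DataTypes.LongType));
// Precompute each number's complement so the pairing condition becomes a plain
// equality, which Spark can run as a shuffle/hash join instead of a cross join.
Dataset<Row> complements = numbers
        .withColumn("complement", lit(desiredSum).minus(col("number")))
        .withColumnRenamed("number", "number_2");
numbers.join(complements, col("number").equalTo(col("complement")))
        .filter(col("number").leq(col("number_2")))   // keep each unordered pair once
        .select("number", "number_2")
        .show();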
