How to load a text file into a Spark RDD and convert it to a Spark DataFrame?
First, let's look at the input file. Use the path where the file is saved on your machine; on mine it is /home/mehaa/family.csv.
>>> import os
>>> os.system("cat /home/mehaa/family.csv")
>>> rdd=sc.textFile('/home/mehaa/family.csv')
>>> rdd.collect()
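Each element of the result is one line of the file as a single string:
['102,Gokula,37,Mother', '103,Mehaa,5,Daughter', '104,Rithihaa,2,Daughter']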
Now we need to split each record on the comma delimiter.
>>> rdd1=rdd.map(lambda x:x.split(','))
>>> rdd1.collect()
[['102', 'Gokula', '37', 'Mother'], ['103', 'Mehaa', '5', 'Daughter'], ['104', 'Rithihaa', '2', 'Daughter']]
As you can see from the output above, we have 4 columns and 3 rows.
We need to provide meaningful column names for these.
We must import the Row class to create named columns from the RDD.
>>> from pyspark.sql import Row
>>> rdd2=rdd1.map(lambda x:Row(id=x[0],name=x[1],age=int(x[2]),reln=x[3]))
>>> df=rdd2.toDF()
>>> df.show()
+---+--------+---+--------+
| id| name|age| reln|
+---+--------+---+--------+
|102| Gokula| 37| Mother|
|103| Mehaa| 5|Daughter|
|104|Rithihaa| 2|Daughter|
+---+--------+---+--------+
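Because age was built with int(x[2]), Spark infers a numeric type for that column while the rest remain strings. You can verify this with printSchema(); the output below is a sketch of what you should see, not taken from the original session:
>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- reln: string (nullable = true)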
We can also create a DataFrame from the same RDD of Rows using the createDataFrame method.
>>> df1=spark.createDataFrame(rdd2)
>>> df1.show()
+---+--------+---+--------+
| id| name|age| reln|
+---+--------+---+--------+
|102| Gokula| 37| Mother|
|103| Mehaa| 5|Daughter|
|104|Rithihaa| 2|Daughter|
+---+--------+---+--------+
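As an aside, for a plain CSV file you can skip the RDD steps entirely and let Spark's DataFrame reader parse the file. Here is a minimal sketch, assuming the same path, no header row, and an explicitly defined schema (the df2 name is just for illustration):
>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> schema=StructType([
...     StructField('id', StringType(), True),
...     StructField('name', StringType(), True),
...     StructField('age', IntegerType(), True),
...     StructField('reln', StringType(), True)])
>>> df2=spark.read.csv('/home/mehaa/family.csv', schema=schema)
>>> df2.show()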