Big Data Expertise: Data Engineering and Architecture @ BNP Paribas Cardif Insurance

Experiences

Big Data Expertise: Data Engineering and Architecture

BNP Paribas Cardif Insurance

November 2014 to December 2015

Freelancer

Nanterre

France

Development, Analysis, Administration, Architecture:
- Innovating and assisting the Datascientists who develop their business Use Cases with Machine Learning by using the benefits provided by Big Data technologies such as Hadoop with Spark, with DataFrames and Machine Learning algorithms (e.g of UCs: Anti-Fraud detection, churn, appetence, text analysis for classification).
Recommendation system:
- Implementation of a Proof Of Concept consisting of a real time recommendation system to recommend insurance products on new clients's sales receipt when they go through the tills of hypermarkets.
- Spark Machine Learning Pipeline for learning in batch mode: Supervised learning with the Alternative Least Square of Spark-ML (matrix factorization algorithm, using the vector of users and the rank parameter for dimensions reduction).Unsupervised learning with the K-Means algorithm of Spark-ML in order to find clusters of similarities in term of purchase behavior.
- Hadoop HDFS for storing logs coming from the tills in order to process later feature engineering and learning algorithms. Hadoop YARN for Spark and Kafka clusters (for processing and computing in batch mode and in real time mode).
- Real time with Kafka and Spark Streaming for predictions (evaluations of sales receipts coming from the tills of supermarkets).
- Spark Python programming. Developing a Python-Scala bridge in order to improve the performances of UDF (User Defined Functions). Linux shell programming for packaging and deployment in production environment. Jupyter Notebook and Eclipse-PyDev IDE used as development environments.
PageRank:
- Understanding of the PageRank algorithm and developing it with Spark RDDs in order to make a pedagogical presentation of Graph Processing for the Datascientists and to convince them to use Spark GraphX coupled with a Graph database like Neo4J.
- Implementation of Proof Of Concept based on Spark RDDs (to present the PageRank algorithm) and Spark GraphX (for PageRank, Connected Components, and Triangle Counting). Tehnos: Spark / Hadoop YARN, Hadoop HDFS, Python, Scala, Jupyter Notebook.
POC with Spark DataFrames (SQL) and Spark Machine Learning:
- Implementation of Proof Of Concepts with Spark DataFrames and SQL, and Spark Machine Learning (by using objects like Transformer and Estimator for Pipelines, Evaluator, CrossValidator).
- Developments by using ML algorithms like the Linear and Logistic Regressions, Random Forest, Neural Networks and ALS for recommendations.
- Presentation of these works to the DataScientists in order to explain them how from their Jupyter environment they can develop both with Python Scikit-Learn and Pandas and with Spark ML and DataFrames.
- Technos: Spark ML, Spark DF and SQL, Jupyter Notebook, Python, Pandas.
Bench project – Pig Hive vs Spark:
- Implementation of a bench in order to compare performances between Pig Hive and Spark SQL on a five nodes Hadoop V2.6 cluster, based on a Left- Outer Join query with tables containing retail data.
- Technos: Shell Linux programming for packaging and deployment of the project on different Hadoop-Spark clusters, Spark DataFrames, Python.
Development of Csv2Hive:
- Development of an injector named Csv2Hive available on GitHub https://github.com/enahwe/Csv2Hive.
- This tool infers dynamicaly the schema from big CSV files containing lot of columns; this tool enables quick automatic injections of external data to feed Hive metastore and Hadoop HDFS.Technos: 95% of Linux Shell scripting, 5% of Python.
Miscelaneous tasks:
- Administration of a Cloudera 4 nodes cluster (Hadoop 2.3.0-cdh5.0.2) for using mainly Pig and Hive.
- Installation and administration of a 5 nodes Cloudera cluster (Hadoop 2.6.0- cdh5.4.4), Sizing for each node: 4 cores 2.6 Ghz, 96 GB Mem, 1 TB Disk.
- Installation of Spark on YARN with Anaconda on each Hadoop node.
- Configuration of Jupyter Notebook for Spark on YARN, to allow DataScientists to discover Spark in Hadoop cluster.
- In charge of feeding of business data towards the HadoopDataLake (hence Csv2Hive).
- Developed MapReduce jobs in Java (e.g: inverted index).
Machine-Learning Challenge (Retail domain):
- Multi-categorization for CDiscount company ( challenge on https://www.datascience.net/fr/challenge/20/details ). Developed a program with more than 500 Multinomial-NaiveBayes models, Stemming, Stratified sampling and Mutual Information.

Company website

https://www.cardif.fr/

Votre navigateur est obsolète !

Mathis ROSSIGNOL

Big Data Consultant: Architect, Data Engineer, Certified in Data Science

Big Data Expertise: Data Engineering and Architecture