How do I fix "Found input variables with inconsistent numbers of samples"?

I'm training a RandomForest.

Here is the code:

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                  test_size=0.8, 
                                  random_state=241)

RFC = RandomForestClassifier(n_estimators=37, random_state=241)
RFC.fit(X_train, y_train)

scor_test = []
for predict in RFC.predict_proba(X_test):
    x_scor = log_loss(y_test, predict)
    scor_test.apend(x_scor)

After running the last block, I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-152-01347a72f1da> in <module>
      1 scor_test = []
      2 for predict in RFC.predict_proba(X_test):
----> 3     x_scor = log_loss(y_test, predict)
      4     scor_test.apend(x_scor)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
   1762     """
   1763     y_pred = check_array(y_pred, ensure_2d=False)
-> 1764     check_consistent_length(y_pred, y_true, sample_weight)
   1765 
   1766     lb = LabelBinarizer()

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    233     if len(uniques) > 1:
    234         raise ValueError("Found input variables with inconsistent numbers of"
--> 235                          " samples: %r" % [int(l) for l in lengths])
    236 
    237 

ValueError: Found input variables with inconsistent numbers of samples: [2, 3001]

Where did I go wrong?

Additional information:

y_test.shape - (3001,)
RFC.predict_proba(X_test).shape - (3001, 2)

Could the problem be the matrix dimensions?
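
Yes, the shapes explain it. The loop iterates over the rows of RFC.predict_proba(X_test), so each predict is a single row of shape (2,), which log_loss then compares against all 3001 labels in y_test, hence [2, 3001] (the scor_test.apend typo would be the next error; it should be append). A minimal sketch of the fix is to score the whole probability matrix in one call:

# log_loss expects the full (n_samples, n_classes) probability matrix,
# not one row at a time.
probas = RFC.predict_proba(X_test)      # shape (3001, 2)
score_test = log_loss(y_test, probas)   # one log-loss for the whole test set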

Fairly new to Python, but I'm building out my first RF model based on some classification data. I've converted all of the labels into int64 numerical data and loaded them into X and Y as numpy arrays, but I'm hitting an error when I try to train the model.

Here is what my arrays look like:

>>> X = np.array([[df.tran_cityname, df.tran_signupos, df.tran_signupchannel, df.tran_vmake, df.tran_vmodel, df.tran_vyear]])

>>> Y = np.array(df['completed_trip_status'].values.tolist())

>>> X
array([[[   1,    1,    2,    3,    1,    1,    1,    1,    1,    3,    1,
            3,    1,    1,    1,    1,    2,    1,    3,    1,    3,    3,
            2,    3,    3,    1,    1,    1,    1],
        [   0,    5,    5,    1,    1,    1,    2,    2,    0,    2,    2,
            3,    1,    2,    5,    5,    2,    1,    2,    2,    2,    2,
            2,    4,    3,    5,    1,    0,    1],
        [   2,    2,    1,    3,    3,    3,    2,    3,    3,    2,    3,
            2,    3,    2,    2,    3,    2,    2,    1,    1,    2,    1,
            2,    2,    1,    2,    3,    1,    1],
        [   0,    0,    0,   42,   17,    8,   42,    0,    0,    0,   22,
            0,   22,    0,    0,   42,    0,    0,    0,    0,   11,    0,
            0,    0,    0,    0,   28,   17,   18],
        [   0,    0,    0,   70,  291,   88,  234,    0,    0,    0,  222,
            0,  222,    0,    0,  234,    0,    0,    0,    0,   89,    0,
            0,    0,    0,    0,   40,  291,  131],
        [   0,    0,    0, 2016, 2016, 2006, 2014,    0,    0,    0, 2015,
            0, 2015,    0,    0, 2015,    0,    0,    0,    0, 2015,    0,
            0,    0,    0,    0, 2016, 2016, 2010]]])

>>> Y
array(['NO', 'NO', 'NO', 'YES', 'NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'NO',
       'NO', 'YES', 'NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO', 'NO',
       'NO', 'NO', 'NO', 'NO', 'NO', 'NO', 'NO'], 
      dtype='|S3')

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 2039, in train_test_split
    arrays = indexable(*arrays)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 206, in indexable
    check_consistent_length(*result)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 29]

Describe the bug

Trying to use fit_params with CalibratedClassifierCV in v1.1, but the fit parameters fail validation when they are passed through to the classifier.

  • I have 1000 rows.
  • I split them into train and validation parts, 800 and 200 rows respectively.
  • The validation part is passed to the eval_set parameter in fit_params, and I fit on the train part of 800 rows.
  • The train part is used for learning, and the optimization uses an inner cross-validation with n_splits=5, i.e. each fold has 160 rows (800/5 = 160).
  • Finally, I receive ValueError: Found input variables with inconsistent numbers of samples: [640, 1]; 640 appears to be 4/5 of the data, i.e. the sub-train part of the inner CV that is evaluated on the remaining 1/5, since we have 5 folds.

What am I missing here? Where is my mistake?

See details below.

Steps/Code to Reproduce

# Description
# This code generates pseudo-data for this test. PyTorch is needed.
# If you install dependencies into the environment from a requirements file,
# the command below also pulls in PyTorch:
# pip install -r requirements.txt -f https://download.pytorch.org/whl/cu111/torch_stable.html

import random
import numpy as np
import pandas as pd
from datetime import datetime
from typing import List, Dict, Any

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, LabelBinarizer, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import KFold, GroupKFold, GridSearchCV, train_test_split
from sklearn.calibration import CalibratedClassifierCV 

from pytorch_tabnet.tab_model import TabNetClassifier

import gc
import torch
torch.cuda.empty_cache()

# Generate random data: 20 features, id, label
df = pd.DataFrame()
size = 1000
df[f'id'] = [k for k in range(size)]
for c in range(1,11):
    df[f'feature{c}_float'] = [random.uniform(-100,100) for k in range(size)]
df[f'feature{c}_int'] = [random.randrange(0, 1000, 10) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.randrange(-100, 100, 10) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.randrange(2015, 2020, 1) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.choice([-1, 1, np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_int'] = [random.choice([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["a", "b", "c", "d", "e", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["red", "blue", "green", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["yes", "no", "neutral", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["animal", "human", np.nan]) for k in range(size)]; c+=1
df[f'feature{c}_cat'] = [random.choice(["male", "female", "N/A", np.nan]) for k in range(size)]; c+=1
for col in range(15,20):
    df[f'feature{col}_cat'] = df[f'feature{col}_cat'].astype('category') # set type for categorical features
df['label'] = [random.choice([-1, 0, 1]) for k in range(size)]
model_features = set(df.drop(columns=['id','label']).columns)
#model_features = set(df.drop(columns=['label']).columns)
    
def make_model_pipeline(model_class, categoricals: List[str], numericals: List[str],
                        drops: List[str], model_parameters: Dict[str, Any]) -> Pipeline:
    
    model_preprocessing = ("preprocessing", ColumnTransformer([
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant')), 
                          ('oenc', OrdinalEncoder(handle_unknown ='use_encoded_value',unknown_value = -1))
                          ]), categoricals),
        ('num', Pipeline([("scaler", RobustScaler()), 
                          ("imputer", SimpleImputer(strategy="median"))
                          ]), numericals),
        ('drop', 'drop', drops),
    ], remainder='drop'))
    
    calibrated_classifier = ("calibrated_classifier", CalibratedClassifierCV(
        base_estimator=model_class(**model_parameters), method='isotonic', cv=5))
    
    pipeline = Pipeline([model_preprocessing, calibrated_classifier])
    
    return pipeline

x_train, y_train = df.drop(columns='label'), df['label']

# Features
drop = sorted(set(x_train.columns) - set(x_train[model_features].columns))
cat = sorted(x_train[model_features].select_dtypes(include=['category']).columns)
num = sorted(set(x_train[model_features].columns) - set(cat))
use_features = sorted(set(cat).union(set(num)) - set(drop))

# Folds yearly
year = x_train["feature12_int"] # year
year_cv = GroupKFold(n_splits=year.nunique())

# Make pipeline
model_class = TabNetClassifier
model_parameters = {
    'n_d': 16, 'n_a': 16,
    'n_steps': 5, 
    'n_independent': 2, 
    'n_shared': 2, 
    'clip_value': 2.0, 
    'gamma': 1.5, 
    'lambda_sparse': 0.01
    }
param_grid = {
    'n_steps': [3,5], 
    'momentum': [0.3, 0.5]
    }  

opt_pipeline = make_model_pipeline(model_class, cat, num, drop,
                                   {k: v for k, v in model_parameters.items() if k not in param_grid})
opt_pipeline[1].base_estimator.set_params(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=0.02, weight_decay = 1e-5),
    scheduler_params = {"gamma": 0.95, "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR, 
    epsilon=1e-15 
)

param_grid = {'calibrated_classifier__base_estimator__n_steps': [3, 5]}
param_grid = {'calibrated_classifier__base_estimator__momentum': [0.3, 0.5]}  # note: this overrides the line above

# Data split
print(f"nAll data: {x_train.shape} {y_train.shape}")
x_train_prep, x_valid_prep, y_train_prep, y_valid_prep = train_test_split(x_train, y_train, 
                                                                          test_size=0.20, 
                                                                          random_state=123)

# Preprocessing for eval_set
le = LabelEncoder() 
le.fit(y_train_prep)
y_train_prep, y_valid_prep = le.transform(y_train_prep), le.transform(y_valid_prep)

scc = opt_pipeline.get_params()['preprocessing'].transformers[0][1].named_steps['imputer']
scc.fit(x_train_prep[cat])
oenc = opt_pipeline.get_params()['preprocessing'].transformers[0][1].named_steps['oenc']

sc = opt_pipeline.get_params()['preprocessing'].transformers[1][1].named_steps['scaler']
sc.fit(x_train_prep[num])

imp = opt_pipeline.get_params()['preprocessing'].transformers[1][1].named_steps['imputer']
imp.fit(x_train_prep[num])

def preprocessing(x_t, x_v, cat, num, sc, scc, oenc, imp):
    # Preprocessing manually to have train/valid split for ANN
    def prep(prep, data, variables):
        df = pd.DataFrame(prep.transform(data[variables]),
                          columns=data[variables].columns,
                          index=data[variables].index).values
        return df
    # For train and validation
    x_t[cat] = prep(scc, x_t, cat)
    oenc.fit(x_t[cat])
    x_t[cat] = prep(oenc, x_t, cat)
    x_t[num] = prep(sc, x_t, num)
    x_t[num] = prep(imp, x_t, num)
    x_v[cat] = prep(scc, x_v, cat)
    x_v[cat] = prep(oenc, x_v, cat)
    x_v[num] = prep(sc, x_v, num)
    x_v[num] = prep(imp, x_v, num)
    return x_t, x_v

x_train_prep, x_valid_prep = preprocessing(
    x_train_prep,
    x_valid_prep,
    cat, num,
    sc, scc, oenc, imp
    )

# Find best params on whole dataset
# NOTE: inner_cv_params is not defined anywhere above; an assumed value is
# filled in here. n_splits=5 matches the "All the 10 fits failed" below
# (2 candidates x 5 folds).
inner_cv_params = {'n_splits': 5, 'shuffle': True, 'random_state': 123}
model = GridSearchCV(estimator=opt_pipeline,
                     param_grid=param_grid,
                     cv=KFold(**inner_cv_params),
                     scoring='balanced_accuracy',
                     refit=False,
                     verbose=2)
fit_params = {}
fit_params['calibrated_classifier__eval_set']=[(x_valid_prep[use_features].values,y_valid_prep)]
fit_params['calibrated_classifier__eval_name']=['valid']
fit_params['calibrated_classifier__max_epochs']=100
fit_params['calibrated_classifier__patience']=10
fit_params['calibrated_classifier__batch_size']=32 
fit_params['calibrated_classifier__virtual_batch_size']=16 
fit_params['calibrated_classifier__drop_last']=False
#fit_params['calibrated_classifier__weights']=np.ones([y_train_prep.size]) / y_train_prep.size
model.fit(x_train_prep, y_train_prep, **fit_params) # ----> errors here.
# Fit model with best params
best_model_parameters = {k.split("__")[-1]: v for k, v in model.best_params_.items()}
pipeline = make_model_pipeline(model_class, sorted(set(cat) - set(drop)), 
                                sorted(set(num) - set(drop)), [], best_model_parameters)
pipeline.fit(x_train_prep, y_train_prep, **fit_params)

Expected Results

No error is expected; just a smooth learning process.

Actual Results

Traceback (most recent call last):
  File "/home/kabartay/sklearn_v1.1_test.py", line 203, in <module>
    model.fit(x_train_prep, y_train_prep, **fit_params)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 875, in fit
    self._run_search(evaluate_candidates)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1375, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 852, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 367, in _warn_or_raise_about_fit_failures
    raise ValueError(all_fits_failed_message)
ValueError: 
All the 10 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/calibration.py", line 283, in fit
    check_consistent_length(y, sample_aligned_params)
  File "/home/anaconda3/envs/sklearn11/lib/python3.9/site-packages/sklearn/utils/validation.py", line 383, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [640, 1]

The check that raises the error is here:
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L365
and it is called from here:
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/calibration.py#L283
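
For reference, a minimal sketch that isolates the failure, assuming only the 1.1 behavior shown in the traceback above; LogisticRegression stands in for TabNetClassifier, since the check fires before the base estimator ever sees the parameter:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(640, 4)), rng.integers(0, 2, size=640)
X_val, y_val = rng.normal(size=(160, 4)), rng.integers(0, 2, size=160)

clf = CalibratedClassifierCV(base_estimator=LogisticRegression(), cv=5)

# CalibratedClassifierCV.fit treats every entry of **fit_params as
# sample-aligned and runs check_consistent_length(y, param) on each one,
# so a length-1 eval_set list fails against the 640 training rows:
# ValueError: Found input variables with inconsistent numbers of samples: [640, 1]
clf.fit(X, y, eval_set=[(X_val, y_val)])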

Versions

System:
    python: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0]
executable: /home/utilisateur/anaconda3/envs/sklearn11/bin/python3
   machine: Linux-5.13.0-41-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.1.0
          pip: 21.2.4
   setuptools: 49.2.0
        numpy: 1.21.0
        scipy: 1.8.0
       Cython: None
       pandas: 1.1.5
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-5bebc122.3.13.dev.so
        version: 0.3.13.dev
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/utilisateur/anaconda3/envs/sklearn11/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12

Answer by Zahir Ballard

You are running into that error because your X and Y don’t have the same length (which is what train_test_split requires), i.e., X.shape[0] != Y.shape[0]. Given your current code:

>>> X.shape
(1, 6, 29)
>>> Y.shape
(29,)
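
A sketch of one fix, reusing the dataframe and names from the question above: build X with samples as rows, so that X.shape[0] == Y.shape[0].

import numpy as np

# Stack the six feature columns into (n_samples, n_features); the extra pair
# of brackets in the original produced shape (1, 6, 29).
X = np.column_stack([df.tran_cityname, df.tran_signupos, df.tran_signupchannel,
                     df.tran_vmake, df.tran_vmodel, df.tran_vyear])

# X.shape is now (29, 6), matching Y.shape == (29,), so the split works:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)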

Answer by Gordon Harmon


I am trying to create a Machine Learning model using the LinearRegression model, but I am getting the error below.

import pandas as pd
data = pd.read_csv('db.csv')
x = data['TV']
y = data['Sales']
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x,y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/user/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 512, in fit
    y_numeric=True, multi_output=True)
  File "/user/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 531, in check_X_y
    check_consistent_length(X, y)
  File "/user/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 1000]
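
The usual fix, reusing the names from the snippet above: scikit-learn expects X to be 2-D with shape (n_samples, n_features), while data['TV'] is a 1-D Series.

# Either form gives x an (n_samples, 1) shape that fit() accepts:
x = data[['TV']]                        # one-column DataFrame
# x = data['TV'].values.reshape(-1, 1)  # or reshape the numpy array
model = LinearRegression()
model.fit(x, y)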

Answer by Augustus Beltran

This works fine for me. Before reshaping, make sure the arrays are numpy arrays.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.asarray([1994., 1995., 1996., 1997., 1998., 1999.])
y = np.asarray([1.2, 2.3, 3.4, 4.5, 5.6, 6.7])

clf = LinearRegression()
clf.fit(X.reshape(-1, 1), y)

# predict also expects a 2-D array of shape (n_samples, n_features)
clf.predict([[1997]])
# Output: array([ 4.5])

clf.predict([[2001]])
# Output: array([ 8.9])

Answer by Kaisley Leblanc

The correct format is:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

The question: "The length of x and y are exactly the same. However, when I use train_test_split to split them, they somehow become different. Why? I am so desperate now." The asker had written:

x_train, y_train, x_test, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

I'm trying to do linear regression, but when I use train_test_split to create a set of training data, it goes wrong. When I eliminate the function and only use the original data without splitting, it works fine. I don't understand why the error comes up or how to fix it. The code and the original data set look like this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
x = visual_data.loc[:,('Relative Humidity AVG', 'Solar Radiation AVG', 'Temperature AVG', 'Wind Speed Daily AVG')]  # loc selects rows/columns by label; iloc selects by integer position
y = visual_data.loc[:,('pecentage_of_success')]
print(x.shape)
print(y.shape)
x_train, y_train, x_test, y_test = train_test_split(x,y,test_size=0.2,random_state = 0)
print(x_train.shape)
print(y_train.shape)
linreg = LinearRegression()
model = linreg.fit(x_train,y_train)


Result:
(464, 4)
(464,)
(371, 4)
(93, 4)
---> 13 model = linreg.fit(x_train,y_train)
ValueError: Found input variables with inconsistent numbers of samples: [371, 93]
When I skip the split and fit on the original data, the same code runs fine:

linreg = LinearRegression()
model = linreg.fit(x, y)
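
As noted above, train_test_split returns the splits in the order (x_train, x_test, y_train, y_test). With the swapped unpacking, y_train actually received x_test (93 rows), which is why fit saw [371, 93]. A minimal sketch of the corrected split:

# Unpack in the order train_test_split actually returns:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)   # shapes now align: (371, 4) and (371,)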


Answer by Magnolia Jordan


# Importing Libraries 
import numpy as np   
import pandas as pd  
  
# Import dataset 
dataset = pd.read_csv("../output.tsv", delimiter = '\t')


# library to clean data 
import re  
  
# Natural Language Tool Kit 
import nltk  
  
nltk.download('stopwords') 
  
# to remove stopword 
from nltk.corpus import stopwords 
  
# for Stemming propose  
from nltk.stem.porter import PorterStemmer 
  
# Initialize empty array 
# to append clean text  
corpus = []  
  
# only the first 5 of the 1000 review rows are cleaned here
for i in range(0, 5):  
      
    # column : "Review", row ith 
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) 
    
      
    # convert all cases to lower cases 
    review = review.lower()  
      
    # split to array(default delimiter is " ") 
    review = review.split()  
      
    # creating PorterStemmer object to 
    # take main stem of each word 
    ps = PorterStemmer()  
      
    # loop for stemming each word 
    # in string array at ith row     
    review = [ps.stem(word) for word in review 
                if not word in set(stopwords.words('english'))]  
                  
    # rejoin all string array elements 
    # to create back into a string 
    review = ' '.join(review)   
      
    # append each string to create 
    # array of clean text  
    corpus.append(review)

# Creating the Bag of Words model 
from sklearn.feature_extraction.text import CountVectorizer 
  
# "max_features" caps the vocabulary size;
# experiment with this attribute to get better results
cv = CountVectorizer(max_features = 9)  
  
# X contains the vectorized corpus (independent variables)
X = cv.fit_transform(corpus).toarray()  
  
# y contains answers if review 
# is positive or negative 
y = dataset.iloc[:, 1].values 

# Splitting the dataset into 
# the Training set and Test set 
from sklearn.model_selection import train_test_split


dataset.dropna(inplace=True)
print(X.shape)
print(y.shape)


# experiment with "test_size" 
# to get better results 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print(X_train.shape)
print(y_train.shape)
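
A note on the shapes printed above: X is built from only the 5 cleaned reviews, while y takes the label of every row in the dataset, so their sample counts differ (the dropna call after building X can desync them further). A sketch, assuming the names from the snippet above, that cleans every row so the counts match:

# Build the corpus from every review so X and y end up with the same length.
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(0, len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]).lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values           # len(y) == len(corpus) now

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)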

Answer by Foster Adams

Running a random forest algorithm with one variable. How can I fix the error it throws?

ValueError: Found input variables with inconsistent numbers of samples: [143, 426]

#split the data set into independent (X) and dependent (Y) data sets
X = df.iloc[:,2:31].values
Y = df.iloc[:,1].values

#split the data qet into 75% training and 25% testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

#scale the data (feature scaling)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_train = sc.fit_transform(X_test)

#Using Logistic Regression Algorithm to the Training Set

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)

and the shape of X_train, Y_train:

X_train.shape
(143, 29)
Y_train.shape
(426,)
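
The mismatch comes from the scaling step: the second fit_transform call overwrites X_train with the scaled test set (143 rows), while Y_train still has 426 rows. A sketch of the usual pattern, reusing the names above:

# Fit the scaler on the training data only, then transform both sets.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)    # transform only; reuse the training statistics

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)  # (426, 29) against (426,) now aligns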

The following concerns a linear regression run in Python on a given set of data, using two chosen columns, X1 and Y1. On executing the code, the same error was encountered, even though the number of samples in X_train and Y_train was the same, as shown in the output. The error generally appears when X and Y have different numbers of samples; here the cause was the shape of the data instead. The data had been converted into flat numpy arrays, but it needed to be in matrix form to be passed to fit. Reading the data into matrices and transposing them while doing the train/test split gives fit input of shape (n_samples, 1), after which the ValueError no longer appears.
